digitalmars.D - Error: 4invalid UTF-8 sequence

reply jicman <jicman_member pathlink.com> writes:
Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."

Let's say that I am working with data that contains names with accented
characters from all over the world, and they are giving me problems, e.g.:

...
...
0 forms took 0.397589 sec || Avg forms/sec = 5.34942
----------------------------------------------------------------
--  725
Counting forms for yrajau (Rajau, Yannis)               --
Application :       Qty   Deleted      Left     Total
Distribute :         1         0         1       840
Total Forms :         1         0         0      2461
1 forms took 0.327413 sec || Avg forms/sec = 5.34778
----------------------------------------------------------------
--  726
Counting forms for CGiunta (Giunta, Cosmo A)            --
Application :       Qty   Deleted      Left     Total
Distribute :         6         0         6       846
Total Forms :         6         0         0      2467
6 forms took 0.589351 sec || Avg forms/sec = 5.35397
----------------------------------------------------------------
--  727
Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence

...
...

So, I need to be able to change that character in order to print it.  The
character causing the problem is an "é", which we have already figured out how
to save.  But I have lots of data that has some of these characters, and it's
causing problems for writefln.  Any ideas how to change a non-UTF-8 string to a
UTF-8 string?

thanks.

Going to bed.  Worked on this for too long.

josé
Feb 28 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

How are you saving it? In what format/encoding?

 But, I have lots of data that has some of these characters and it's
 causing problems for writefln.  Any ideas how to change a non-UTF-8
 string to a UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

So, to do this you load the data you've saved into a byte[] or ubyte[], then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result.

If you cannot write/find a function, ask here; someone will either have one or write one, most likely.

Where's Arcane Jill when we need her?

Regan
Mar 01 2005
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
Regan Heath wrote:

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding. So, to do this you load the data you've saved into a byte[] or ubyte[] then write (or find) a function that converts from your encoding into utf-8, utf-16 or utf-32, call that, and print the result. If you cannot write/find a function, ask here, someone will either have one, or write one, most likely.

There is no support in the D language or libraries for legacy encodings, but I provided three different methods: latin-1 cast, lookup or libiconv.

1) http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
   (see the "8-bit encodings" section for sample code)

2) http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
   http://www.algonet.se/~afb/d/mapping.zip

3) http://www.algonet.se/~afb/d/libiconv.d
   http://www.gnu.org/software/libiconv/
   (has a lot of different encodings)

I suggest "ubyte[]", to avoid any issues with signs when converting.

Got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way.

--anders
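
A minimal sketch of the first method listed above (the Latin-1 cast). This is not the code from the linked pages. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point, so bytes below 0x80 copy through unchanged and the rest become two-byte UTF-8 sequences. The function name `latin1ToUtf8` is made up for illustration, and the code assumes a current D compiler; the 2005-era compiler used in this thread would need small changes, such as `char[]` in place of `string`.

```d
import std.stdio : writeln;

/// Converts ISO-8859-1 (Latin-1) bytes to a UTF-8 string.
/// Hypothetical helper, not part of Phobos.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (ubyte b; input)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // "jos" followed by 0xE9, which is 'é' in ISO-8859-1.
    immutable(ubyte)[] raw = [0x6A, 0x6F, 0x73, 0xE9];
    writeln(latin1ToUtf8(raw));
}
```

Taking `ubyte[]` rather than `char[]` follows the suggestion above: the input is treated as plain unsigned bytes, and only the return value is claimed to be UTF-8.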
Mar 01 2005
parent jicman <jicman_member pathlink.com> writes:
Thanks.


Mar 01 2005
prev sibling parent reply jicman <jicman_member pathlink.com> writes:
In article <opsmy79sem23k2f5 ally>, Regan Heath says...
 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

How are you saving it? In what format/encoding?

I don't save it. Software using IE as a client allows for data entry, and that's how josé was entered. I am just dumping lots of XML from that server, and it always breaks on josé.
 But, I have lots of data that has some of these characters and it's
 causing problems for writefln.  Any ideas how to change a non-UTF-8
 string to a UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

But I didn't. It must be WindoZE, or Windows, as others call it. There are two ways of entering an é on the computer:
1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
2. having two keyboards on your system and changing keyboards when needed.
So, to do this you load the data you've saved into a byte[] or ubyte[]  
then write (or find) a function that converts from your encoding into  
utf-8, utf-16 or utf-32, call that, and print the result.

Yeah, I was thinking that I may have to do this, or something... :-)
If you cannot write/find a function, ask here, someone will either have  
one, or write one, most likely.

Where's Arcane Jill when we need her?

Yeah, where is she?

thanks.

jic
Mar 01 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

How are you saving it? in what format/encoding?

I don't save it. Software using IE as a client allows for data entry, and that's how josé was entered. I am just dumping lots of XML from that server, and it always breaks on josé.

Then the question is "What encoding does it save the character data in?"
 But, I have lots of data that has some of these characters and it's
 causing problems for writefln.  Any ideas how to change a non-UTF-8
 string to a UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As this isn't working, I assume you've saved it in another encoding.

But I didn't. It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem AFAICS. A program (the "software using IE as client") has saved the data in a certain encoding. You're reading that data into a char[] and then printing it with writef, which finds an invalid UTF-8 sequence because the data isn't UTF-8 encoded; it's something else.
 There are two ways of entering an é on the computer.
 1. Using the ALT key + 130 on the number keys on the right side of the keyboard
 or having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é', the program you enter it into has _lots_ of different options as to how to encode it. UTF-8 is the option you need it to take, or you need to transcode from the option it uses to UTF-8.

Regan
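
To make the point about encoding options concrete, here is a small sketch showing the bytes that three common encodings use for the same 'é', and which of them pass a UTF-8 validity check. The helper name `isValidUtf8` is invented for this example, the code assumes a current D compiler with `std.utf.validate`, and the choice of code page 437 for ALT+130 is an assumption about a typical Western DOS/Windows setup.

```d
import std.stdio : writefln;
import std.utf : validate, UTFException;

/// Returns true when `bytes` form well-formed UTF-8.
/// Hypothetical helper built on std.utf.validate.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // The same character, 'é' (U+00E9), as three encodings store it.
    immutable(ubyte)[] cp437  = [0x82];       // DOS code page 437 (ALT+130)
    immutable(ubyte)[] latin1 = [0xE9];       // ISO-8859-1
    immutable(ubyte)[] utf8   = [0xC3, 0xA9]; // UTF-8

    writefln("CP437  is valid UTF-8: %s", isValidUtf8(cp437));
    writefln("Latin1 is valid UTF-8: %s", isValidUtf8(latin1));
    writefln("UTF-8  is valid UTF-8: %s", isValidUtf8(utf8));
}
```

Only the last byte sequence is well-formed UTF-8. The other two are single bytes that UTF-8 reserves for multi-byte sequences, which is why printing them as if they were UTF-8 fails.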
Mar 01 2005
parent reply jicman <jicman_member pathlink.com> writes:
In article <opsmzb2xhn23k2f5 ally>, Regan Heath says...
Then the question is "What encoding does it save the character data in?"

Here is a response from the server:

HTTP/1.1 200 OK
Date: Tue, 01 Mar 2005 22:19:06 GMT
Server: FlowPort Web Server/FlowPort 2.2.1.88 created 6/3/03 4:07 AM
MIME-version: 1.0
Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?>
[blah- clip -blah]
<UserInfo>
<UserName>jcabrera</UserName>
<LastName>cabrera</LastName>
<FirstName>josError: 4invalid UTF-8 sequence

So, it's iso-8859-1. Maybe I could do my post and accept only UTF-8. That could work.
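
The encoding was read off the XML declaration by eye here; a program could do the same before deciding how to convert. The following is only a rough sketch under stated assumptions: `declaredEncoding` is a made-up name, it handles only the double-quoted form of the attribute, and it assumes a current D compiler. A real XML parser would be more robust.

```d
import std.stdio : writeln;
import std.string : indexOf;

/// Returns the value of encoding="..." from an XML declaration,
/// or null if none is found. Hypothetical helper; double quotes only.
string declaredEncoding(const(char)[] prolog)
{
    enum key = `encoding="`;
    immutable start = prolog.indexOf(key);
    if (start < 0)
        return null;

    // Position of the first character of the attribute value.
    immutable begin = cast(size_t) start + key.length;
    immutable len = prolog[begin .. $].indexOf('"');
    if (len < 0)
        return null;

    return prolog[begin .. begin + cast(size_t) len].idup;
}

void main()
{
    // The declaration from the server response quoted above.
    writeln(declaredEncoding(`<?xml version="1.0" encoding="iso-8859-1"?>`));
}
```

The declaration itself is plain ASCII, so it can be inspected safely before the rest of the document is treated as text in any particular encoding.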

again, thanks.
Mar 01 2005
parent reply Anders F Björklund <afb algonet.se> writes:
jicman wrote:

 So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
 could work.

You're in luck then. It's by far the simplest to convert to UTF...

--anders
Mar 01 2005
parent jicman <jicman_member pathlink.com> writes:
Anders F Björklund says...

 You're in luck then. It's by far the simplest to convert to UTF...

I don't have time right now... (time constraint!), but I did come up with this little function for anyone out there to use, for a quick "print patching":

import std.ctype;

// Replaces every non-ASCII char with '+' so the result always prints.
char[] CheckForUTF8(char[] name)
{
  char[] outStr = null;
  foreach (char c; name)
    if (std.ctype.isascii(c) > 0)
      outStr ~= c;     // plain ASCII is copied as-is
    else
      outStr ~= "+";   // anything else becomes a placeholder
  return outStr;
}

It will replace the offending character with a + and allow printing. :-) Hey, I didn't say it was pretty. :-) It just allows me to print. So, now the output looks like:

----------------------------------------------------------------
--  6
Counting forms for jcabrera (cabrera, jos+ isa+as)         --
Application :       Qty   Deleted      Left     Total
DocumentToken :      2589         0      2589      2596
Distribute :         7         0         7        19
Total Forms :      2596         0         0      2615
2596 forms took 29.1392 sec || Avg forms/sec = 87.989
----------------------------------------------------------------

Pretty, uh? :-)

thanks for all the help and info.

jic
Mar 01 2005