digitalmars.D - Error: 4invalid UTF-8 sequence

jicman (33/33) Feb 28 2005 Greetings! And sorry about the revisit of "Error: 4invalid UTF-8 sequen...

Regan Heath (12/46) Mar 01 2005 How are you saving it? in what format/encoding?

Anders F Björklund (15/28) Mar 01 2005 There is no support in the D language or libraries for legacy encodings,

jicman (3/31) Mar 01 2005 Thanks.

jicman (12/61) Mar 01 2005 I don't save it. A software using IE as client allows for data entry an...

Regan Heath (13/75) Mar 01 2005 Then the question is "What encoding does it save the character data in?"

jicman (16/93) Mar 01 2005 Here is a response from the server:

Anders F Björklund (3/5) Mar 01 2005 You're in luck then. It's by far the simplest to convert to UTF...

jicman (28/33) Mar 01 2005 I don't have time right now... (time constraint!), but did came up with ...

jicman <jicman_member pathlink.com> writes:

Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8 sequence."

Let's say that I am working with data that contains names with accented
characters from all over the world, and they are giving me problems, e.g.:

...
...
0 forms took 0.397589 sec || Avg forms/sec = 5.34942
----------------------------------------------------------------
--  725
Counting forms for yrajau (Rajau, Yannis)               --
Application :       Qty   Deleted      Left     Total
Distribute :         1         0         1       840
Total Forms :         1         0         0      2461
1 forms took 0.327413 sec || Avg forms/sec = 5.34778
----------------------------------------------------------------
--  726
Counting forms for CGiunta (Giunta, Cosmo A)            --
Application :       Qty   Deleted      Left     Total
Distribute :         6         0         6       846
Total Forms :         6         0         0      2467
6 forms took 0.589351 sec || Avg forms/sec = 5.35397
----------------------------------------------------------------
--  727
Counting forms for JCabrera (Cabrera, JosError: 4invalid UTF-8 sequence

...
...

So, I need to be able to change that character in order to print it.  The
character causing the problem is an "é", which we have already figured out how to
save.  But I have lots of data that has some of these characters, and it's
causing problems for writefln.  Any ideas on how to change a non-UTF-8 string to a
UTF-8 string?

thanks.

Going to bed.  Worked on this for too long.

josé

Feb 28 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8  
 sequence."

 Let's say that I am working with data that contains names with accented
 characters from all over the world, and they are giving me problems, e.g.:

 [log output snipped]

 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

How are you saving it? in what format/encoding?

 But I have lots of data that has some of these characters, and it's
 causing problems for writefln.  Any ideas on how to change a non-UTF-8
 string to a UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As  
this isn't working, I assume you've saved it in another encoding.

So, to do this you load the data you've saved into a byte[] or ubyte[]  
then write (or find) a function that converts from your encoding into  
utf-8, utf-16 or utf-32, call that, and print the result.

If you cannot write/find a function, ask here, someone will either have  
one, or write one, most likely.

Where's Arcane Jill when we need her?

Regan

Mar 01 2005

Anders F Björklund <afb algonet.se> writes:

Regan Heath wrote:

 But I have lots of data that has some of these characters, and it's
 causing problems for writefln.  Any ideas on how to change a non-UTF-8
 string to a UTF-8 string?

 
 If you had saved it in utf-8, you could simply load it and print it. As  
 this isn't working, I assume you've saved it in another encoding.
 
 So, to do this you load the data you've saved into a byte[] or ubyte[]  
 then write (or find) a function that converts from your encoding into  
 utf-8, utf-16 or utf-32, call that, and print the result.
 
 If you cannot write/find a function, ask here, someone will either have  
 one, or write one, most likely.

There is no support in the D language or libraries for legacy encodings,
but I provided three different methods: latin-1 cast, lookup or libiconv

1)
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
(see the "8-bit encodings" section for sample code)

2)
http://www.algonet.se/~afb/d/mapping.d (wchar[256] lookup tables)
http://www.algonet.se/~afb/d/mapping.zip

3)
http://www.algonet.se/~afb/d/libiconv.d
http://www.gnu.org/software/libiconv/ (has a lot of different encodings)

I suggest "ubyte[]", to avoid any issues with signs when converting.
Got my tables from http://www.unicode.org/Public/MAPPINGS/, by the way
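
For anyone reading along, the lookup-table idea in (2) can be sketched roughly as below. This is not the code from mapping.d. The names makeTable and decodeWithTable are made up for illustration, and the syntax assumes a current D compiler with std.utf.toUTF8 rather than the 2005 one. The table is just the ISO-8859-1 identity mapping with one entry patched (0x80 as the euro sign, as in Windows-1252) to show where another code page would differ. A real table would be filled from the unicode.org mapping files.

```d
import std.utf : toUTF8;

/// Builds a 256-entry byte-to-UTF-16 table (hypothetical helper).
/// Starts from the ISO-8859-1 identity mapping, then patches one entry
/// as an example of a code page that differs from Latin-1.
wchar[256] makeTable()
{
    wchar[256] table;
    foreach (i; 0 .. 256)
        table[i] = cast(wchar) i;   // Latin-1 byte value == code point
    table[0x80] = '\u20AC';         // illustrative override: euro sign
    return table;
}

/// Maps every input byte through the table, then re-encodes as UTF-8.
string decodeWithTable(const(ubyte)[] input, const wchar[256] table)
{
    auto wide = new wchar[input.length];
    foreach (i, b; input)
        wide[i] = table[b];
    return toUTF8(wide);
}
```

The same function then works for any single-byte encoding; only the table changes.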

--anders

Mar 01 2005

jicman <jicman_member pathlink.com> writes:

Thanks.


Mar 01 2005

jicman <jicman_member pathlink.com> writes:

In article <opsmy79sem23k2f5 ally>, Regan Heath says...
On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8  
 sequence."

 Let's say that I am working with data that contains names with accented
 characters from all over the world, and they are giving me problems, e.g.:

 [log output snipped]

 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

How are you saving it? in what format/encoding?

I don't save it.  Software that uses IE as its client allows for data entry, and
that's how "josé" was entered.  I am just dumping lots of XML from that server, and
it always breaks on "josé".

 But I have lots of data that has some of these characters, and it's
 causing problems for writefln.  Any ideas on how to change a non-UTF-8
 string to a UTF-8 string?

If you had saved it in utf-8, you could simply load it and print it. As  
this isn't working, I assume you've saved it in another encoding.

But I didn't.  It must be WindoZE or Windows, as others call it.  There are two
ways of entering an é on the computer:
1. Using the ALT key + 130 on the number keys on the right side of the keyboard, or
2. Having two keyboards on your system and changing keyboards when needed.

So, to do this you load the data you've saved into a byte[] or ubyte[]  
then write (or find) a function that converts from your encoding into  
utf-8, utf-16 or utf-32, call that, and print the result.

Yeah, I was thinking that I may have to do this, or something... :-)

If you cannot write/find a function, ask here, someone will either have  
one, or write one, most likely.

Where's Arcane Jill when we need her?

Yeah, where is she?

thanks.

jic

Mar 01 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 In article <opsmy79sem23k2f5 ally>, Regan Heath says...
 On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
 <jicman_member pathlink.com> wrote:
 Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8
 sequence."

 Let's say that I am working with data that contains names with accented
 characters from all over the world, and they are giving me problems, e.g.:

 [log output snipped]

 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

 How are you saving it? in what format/encoding?

 I don't save it.  Software that uses IE as its client allows for data entry,
 and that's how "josé" was entered.  I am just dumping lots of XML from that
 server, and it always breaks on "josé".

Then the question is "What encoding does it save the character data in?"

 But I have lots of data that has some of these characters, and it's
 causing problems for writefln.  Any ideas on how to change a non-UTF-8
 string to a UTF-8 string?

 If you had saved it in utf-8, you could simply load it and print it. As
 this isn't working, I assume you've saved it in another encoding.

 But I didn't.  It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem AFAICS.

A program (the "software using IE as client") has saved the data in a
certain encoding.

You're reading that data into a char[] and then printing it with writef,
which finds an invalid UTF-8 sequence because the data isn't UTF-8
encoded, it's something else.
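
To make that concrete: in ISO-8859-1 an é is the single byte 0xE9, but in UTF-8 a byte of 0xE9 announces a three-byte sequence and must be followed by two continuation bytes. In the dump it is followed by ordinary ASCII, so the decoder gives up. Below is a minimal sketch that checks a buffer before printing it. It assumes a current D compiler and Phobos (std.utf.validate and UTFException), not the 2005 library, and isValidUtf8 is a hypothetical helper name.

```d
import std.utf : validate, UTFException;

/// Returns true if the bytes form well-formed UTF-8 (hypothetical helper).
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}
```

A false result means the data needs transcoding before it goes anywhere near writef.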

 There are two ways of entering an é on the computer:
 1. Using the ALT key + 130 on the number keys on the right side of the
 keyboard, or
 2. Having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é' the program you enter it into has _lots_
of different options as to how to encode it. UTF-8 is the option you need  
it to take, or, you need to transcode from the option it uses, to UTF-8.

Regan

Mar 01 2005

jicman <jicman_member pathlink.com> writes:

In article <opsmzb2xhn23k2f5 ally>, Regan Heath says...
On Tue, 1 Mar 2005 21:35:27 +0000 (UTC), jicman  
<jicman_member pathlink.com> wrote:
 In article <opsmy79sem23k2f5 ally>, Regan Heath says...
 On Tue, 1 Mar 2005 06:33:34 +0000 (UTC), jicman
 <jicman_member pathlink.com> wrote:
 Greetings!  And sorry about the revisit of "Error: 4invalid UTF-8
 sequence."

 Let's say that I am working with data that contains names with accented
 characters from all over the world, and they are giving me problems, e.g.:

 [log output snipped]

 So, I need to be able to change that character in order to print it.  The
 character causing the problem is an "é", which we have already figured out
 how to save.

 How are you saving it? in what format/encoding?

 I don't save it.  Software that uses IE as its client allows for data entry,
 and that's how "josé" was entered.  I am just dumping lots of XML from that
 server, and it always breaks on "josé".

Then the question is "What encoding does it save the character data in?"

Here is a response from the server:

HTTP/1.1 200 OK
Date: Tue, 01 Mar 2005 22:19:06 GMT
Server: FlowPort Web Server/FlowPort 2.2.1.88 created 6/3/03 4:07 AM
MIME-version: 1.0
Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?>

[blah- clip -blah]

<UserInfo>
<UserName>jcabrera</UserName>
<LastName>cabrera</LastName>
<FirstName>josError: 4invalid UTF-8 sequence


So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
could work.
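
Since the HTTP Content-Type header above carries no charset, the XML declaration is the only place the encoding is stated. The declaration itself is plain ASCII, so it can be read before deciding how to decode the rest. Here is a rough sketch, assuming a current D compiler and std.string.indexOf. declaredEncoding is a hypothetical helper and only handles a declaration at the very start of the data, which is all this server response needs.

```d
import std.string : indexOf;

/// Returns the encoding named in a leading XML declaration,
/// or null if none is found (hypothetical helper).
string declaredEncoding(const(ubyte)[] data)
{
    auto text = cast(const(char)[]) data;

    // Only look inside the declaration, which ends at the first "?>".
    auto declEnd = text.indexOf("?>");
    if (declEnd < 0)
        return null;
    auto decl = text[0 .. declEnd];

    enum key = "encoding=";
    auto pos = decl.indexOf(key);
    if (pos < 0 || pos + key.length >= decl.length)
        return null;

    // The value is wrapped in single or double quotes.
    auto rest = decl[pos + key.length .. $];
    char quote = rest[0];
    if (quote != '"' && quote != '\'')
        return null;
    auto close = rest[1 .. $].indexOf(quote);
    if (close < 0)
        return null;
    return rest[1 .. 1 + close].idup;
}
```

For the response above this would give "iso-8859-1", which tells the program which conversion to apply before printing.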


 But I have lots of data that has some of these characters, and it's
 causing problems for writefln.  Any ideas on how to change a non-UTF-8
 string to a UTF-8 string?

 If you had saved it in utf-8, you could simply load it and print it. As
 this isn't working, I assume you've saved it in another encoding.

 But I didn't.  It must be WindoZE or Windows, as others call it.

Windows has nothing to do with the problem AFAICS.

A program (the "software using IE as client") has saved the data in a
certain encoding.

You're reading that data into a char[] and then printing it with writef,
which finds an invalid UTF-8 sequence because the data isn't UTF-8
encoded, it's something else.

 There are two ways of entering an é on the computer:
 1. Using the ALT key + 130 on the number keys on the right side of the
 keyboard, or
 2. Having two keyboards on your system and changing keyboards when needed.

Sure, and when you enter that 'é' the program you enter it into has _lots_
of different options as to how to encode it. UTF-8 is the option you need  
it to take, or, you need to transcode from the option it uses, to UTF-8.

Regan

again, thanks.

Mar 01 2005

Anders F Björklund <afb algonet.se> writes:

jicman wrote:

 So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
 could work.

You're in luck then. It's by far the simplest to convert to UTF...
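
The reason it is simple: every ISO-8859-1 byte value is the same number as its Unicode code point (U+0000 to U+00FF), so bytes below 0x80 pass through unchanged and each byte from 0x80 up becomes exactly two UTF-8 bytes. A minimal sketch, assuming a current D compiler (with the 2005 compiler the return type would be char[] and there is no idup). latin1ToUtf8 is a hypothetical name, not a library function.

```d
/// Converts ISO-8859-1 bytes to a UTF-8 string (hypothetical helper).
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (ubyte b; input)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in both encodings.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080..U+00FF: two-byte form 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}
```

With this, the é byte 0xE9 becomes the pair 0xC3 0xA9, which is valid UTF-8 and prints without the error.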

--anders

Mar 01 2005

jicman <jicman_member pathlink.com> writes:

Anders F Björklund says...
jicman wrote:

 So, it's iso-8859-1.  Maybe I could do my post and accept only UTF-8.  That
 could work.

You're in luck then. It's by far the simplest to convert to UTF...

--anders

I don't have time right now... (time constraint!), but I did come up with this
little function for anyone out there to use, for a quick "print patching":

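import std.ctype;

// Quick "print patching": replaces every non-ASCII byte with '+'.
char[] CheckForUTF8(char[] name)
{
    char[] outStr = null;
    foreach (char c; name)
    {
        if (std.ctype.isascii(c) > 0)
            outStr ~= c;      // plain ASCII passes through unchanged
        else
            outStr ~= "+";    // any byte above 0x7F becomes '+'
    }
    return outStr;
}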

It will replace the offending character with a + and allow printing. :-)  Hey, I
didn't say it was pretty. :-)  It just allows me to print.  So, now the output
looks like:


----------------------------------------------------------------
--    6
Counting forms for jcabrera (cabrera, jos+ isa+as)      --
Application :       Qty   Deleted      Left     Total
DocumentToken :      2589         0      2589      2596
Distribute :         7         0         7        19
Total Forms :      2596         0         0      2615
2596 forms took 29.1392 sec || Avg forms/sec =  87.989

----------------------------------------------------------------

Pretty, uh? :-)

thanks for all the help and info.

jic

Mar 01 2005
