digitalmars.D.learn - Need to do some "dirty" UTF-8 handling


"Nick Sabalausky" <a a.a> writes:

Sometimes I need to bring data into a string, and need to be able to treat 
it as an actual "string", but don't actually care if the entire thing is 
technically valid UTF-8 or not, don't care if invalid bytes don't get 
preserved right, and can't have any utf exceptions being thrown regardless 
of the input. Yea, I know that's sloppy, but sometimes that's good enough 
and proper handling may be far more trouble than what's needed. (For 
example: Processing HTML from arbitrary URLs. It's pretty much guaranteed 
you'll come across stuff that's wrong or even has the encoding type 
improperly set. But it's usually more important for the process to succeed 
than for it to be perfectly accurate.)

Far as I can tell, this seems to currently be impossible with Phobos (unless 
you're *extremely* meticulous about watching what your entire codebase does 
with the data), which is a major pain when such a need arises.

Anyone have a good workaround? For instance, maybe a function that'll take 
in a byte array and convert *all* invalid UTF-8 sequences to a user-selected 
valid character?

Jun 25 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a a.a> wrote:

 Anyone have a good workaround? For instance, maybe a function that'll  
 take
 in a byte array and convert *all* invalid UTF-8 sequences to a  
 user-selected
 valid character?

I tend to do this a lot, for various reasons. In my experience, most of the
string-handling functions in Phobos work just fine with strings containing
invalid UTF-8. You can generally use your intuition about whether a function
needs to look at individual characters inside the string. Note, though, that
there's currently a bug in D2/Phobos (6064) which causes std.array.join (and
possibly other functions) to treat strings as something that can't be joined
by plain concatenation, so it does a character-by-character copy instead
(which is both needlessly inefficient and will choke on invalid UTF-8).

When I really need to pass arbitrary data through string-handling  
functions, I use these functions:

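import std.utf : toUTF8;

/// Convert any data to valid UTF-8, so D's string functions can properly
/// work on it. Each input byte becomes the code point with the same value.
string rawToUTF8(string s)
{
	dstring d;
	foreach (char c; s) // iterates code units; nothing is decoded here
		d ~= c;
	return toUTF8(d);
}

/// Inverse of rawToUTF8: turn each code point back into a single byte.
string UTF8ToRaw(string r)
{
	string s;
	foreach (dchar c; r)
	{
		assert(c < '\u0100');
		// Append one raw byte; a plain `s ~= c` would re-encode c as UTF-8.
		s ~= cast(char) c;
	}
	return s;
}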

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )
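
A minimal sketch of the intended roundtrip, assuming the two functions behave as described above (every byte is widened to the code point of the same value, then narrowed back). The functions are repeated so the example stands alone; isWellFormed is a hypothetical helper added only for the demonstration.

```d
import std.utf : toUTF8, validate, UTFException;

/// Widen every byte to the code point with the same value.
string rawToUTF8(string s)
{
    dstring d;
    foreach (char c; s)   // code units only, so malformed input cannot throw
        d ~= c;
    return toUTF8(d);
}

/// Narrow every code point back to a single byte.
string UTF8ToRaw(string r)
{
    string s;
    foreach (dchar c; r)  // decoding is safe: r is the output of rawToUTF8
    {
        assert(c < 0x100);
        s ~= cast(char) c;
    }
    return s;
}

/// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Note the trade-off: every byte at or above 0x80 is widened, so text that was already valid UTF-8 gets double-encoded until UTF8ToRaw is applied again.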

Of course, it would be nicer to convert only the INVALID UTF-8 sequences.
According to Wikipedia, the code points U+DC80..U+DCFF (lone low surrogates,
which are not valid in well-formed Unicode text) are often used to encode
invalid bytes. I'd guess that a proper implementation needs to guarantee that
a roundtrip always returns the same data as the input, so it would also have
to "escape" any occurrences of those escape code points in the input.
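
The escaping idea could be sketched roughly as follows. This is an illustration under assumptions, not an existing Phobos facility: decodeEscaped and encodeEscaped are hypothetical names, the result is kept as a dchar[] because lone surrogates cannot be stored in a valid UTF-8 string, and the sketch relies on std.utf.decode throwing UTFException for every malformed sequence. If decode also rejects UTF-8-encoded surrogates in the input (as it appears to), those bytes are escaped one by one as well, which would cover the "escape the escapes" concern.

```d
import std.utf : decode, encode, UTFException;

enum dchar escapeBase = 0xDC00;

/// Decode bytes into code points. Each byte that does not begin a valid
/// UTF-8 sequence is mapped to U+DC80..U+DCFF (escapeBase + byte value).
dchar[] decodeEscaped(const(ubyte)[] bytes)
{
    auto s = cast(const(char)[]) bytes;
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t next = i;          // work on a copy so i is untouched on failure
        try
        {
            result ~= decode(s, next);
            i = next;
        }
        catch (UTFException)
        {
            // An undecodable byte is always >= 0x80, so this lands in DC80..DCFF.
            result ~= cast(dchar)(escapeBase + cast(ubyte) s[i]);
            ++i;
        }
    }
    return result;
}

/// Inverse of decodeEscaped: escape code points become their original byte,
/// everything else is encoded as ordinary UTF-8.
ubyte[] encodeEscaped(const(dchar)[] text)
{
    ubyte[] result;
    foreach (dchar c; text)
    {
        if (c >= 0xDC80 && c <= 0xDCFF)
        {
            result ~= cast(ubyte)(c - escapeBase);
        }
        else
        {
            char[4] buf;
            immutable n = encode(buf, c);
            result ~= cast(const(ubyte)[]) buf[0 .. n];
        }
    }
    return result;
}
```

Catching exceptions per bad byte is slow on heavily damaged input; the sketch favors brevity over speed.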

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Jun 25 2011

"Nick Sabalausky" <a a.a> writes:

"Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message 
news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
 string s;
 foreach (dchar c; r)

That doesn't throw on an invalid sequence?

Jun 25 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Sat, 25 Jun 2011 23:17:37 +0300, Nick Sabalausky <a a.a> wrote:

 "Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message
 news:op.vxmuvzqbtuzx1w cybershadow.mshome.net...
 string s;
 foreach (dchar c; r)

 That doesn't throw on an invalid sequence?

You use rawToUTF8 to convert an arbitrary array of chars to valid UTF-8.  
You use UTF8ToRaw to convert the output of rawToUTF8 back to the original  
string.

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Jun 26 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-06-25 02:00, Nick Sabalausky wrote:
 Sometimes I need to bring data into a string, and need to be able to treat
 it as an actual "string", but don't actually care if the entire thing is
 technically valid UTF-8 or not, don't care if invalid bytes don't get
 preserved right, and can't have any utf exceptions being thrown regardless
 of the input. Yea, I know that's sloppy, but sometimes that's good enough
 and proper handling may be far more trouble than what's needed. (For
 example: Processing HTML from arbitrary URLs. It's pretty much guaranteed
 you'll come across stuff that's wrong or even has the encoding type
 improperly set. But it's usually more important for the process to succeed
 than for it to be perfectly accurate.)
 
 Far as I can tell, this seems to currently be impossible with Phobos
 (unless you're *extremely* meticulous about watching what your entire
 codebase does with the data), which is a major pain when such a need
 arises.
 
 Anyone have a good workaround? For instance, maybe a function that'll take
 in a byte array and convert *all* invalid UTF-8 sequences to a
 user-selected valid character?

Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats 
it as a string instead of an array of bytes _must_ treat it as UTF-8 since it 
has to decode to determine what the characters are. So, I don't think that 
there's really any way around that. A string must be valid UTF-8. But if you 
really don't care about the string's contents, then you can just cast it to an 
array of ubyte and plenty of functions will work with it - nothing terribly 
string specific of course, but I don't see how you could possibly expect to do 
much string-specific with invalid data anyway.
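
As a rough illustration of the byte-array route, std.string.representation gives the ubyte view of a string without a cast, and the generic range algorithms then compare bytes without ever decoding. The helper name indexOfBytes is hypothetical.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;

/// Byte-wise substring search. Nothing is decoded, so malformed UTF-8 in
/// either argument cannot throw. Returns the byte offset, or -1 if absent.
ptrdiff_t indexOfBytes(string haystack, string needle)
{
    return haystack.representation.countUntil(needle.representation);
}
```

Wrapping the conversion in small helpers like this keeps the casts out of the calling code, at the cost of writing one wrapper per operation.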

- Jonathan M Davis

Jun 25 2011

"Nick Sabalausky" <a a.a> writes:

"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.1214.1309008317.14074.digitalmars-d-learn puremagic.com...
 On 2011-06-25 02:00, Nick Sabalausky wrote:
 Sometimes I need to bring data into a string, and need to be able to 
 treat
 it as an actual "string", but don't actually care if the entire thing is
 technically valid UTF-8 or not, don't care if invalid bytes don't get
 preserved right, and can't have any utf exceptions being thrown 
 regardless
 of the input. Yea, I know that's sloppy, but sometimes that's good enough
 and proper handling may be far more trouble than what's needed. (For
 example: Processing HTML from arbitrary URLs. It's pretty much guaranteed
 you'll come across stuff that's wrong or even has the encoding type
 improperly set. But it's usually more important for the process to 
 succeed
 than for it to be perfectly accurate.)

 Far as I can tell, this seems to currently be impossible with Phobos
 (unless you're *extremely* meticulous about watching what your entire
 codebase does with the data), which is a major pain when such a need
 arises.

 Anyone have a good workaround? For instance, maybe a function that'll 
 take
 in a byte array and convert *all* invalid UTF-8 sequences to a
 user-selected valid character?

 Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually 
 treats
 it as a string instead of an array of bytes _must_ treat it as UTF-8 since 
 it
 has to decode to determine what the characters are. So, I don't think that
 there's really any way around that. A string must be valid UTF-8. But if 
 you
 really don't care about the string's contents, then you can just cast it 
 to an
 array of ubyte and plenty of functions will work with it - nothing 
 terribly
 string specific of course, but I don't see how you could possibly expect 
 to do
 much string-specific with invalid data anyway.

Using immutable(ubyte)[] just causes an enormous amount of type-related 
problems, largely involving the need to throw around a bunch of casts 
absolutely everywhere, including every single time any of the byte arrays 
needs to come in contact with an actual string (for instance, a string 
literal, for comparing, searching or anything else). It might be the 
"correct" thing, but in many cases (anything that doesn't need to be 
perfect, or can't realistically be perfect) it's far more trouble than it's 
actually worth.

Like I said, "For instance, maybe a function that'll take in a byte array 
and convert *all* invalid UTF-8 sequences to a user-selected valid 
character?" In such a case, *there would be no invalid data* in the actual 
string.

Jun 25 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I had a similar requirement some time ago. I had to copy and modify the
Phobos function std.utf.decode for a custom text editor, because the
function throws when it finds an invalid code point, and throwing is way
too slow for my needs. I'm actually displaying invalid code points with
special marks (just like Scintilla), so I need decoding to work as fast as
possible.

The new function simply replaces throwing exceptions with flagging a boolean.
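
The modified function itself is not shown in the thread. A hand-written decoder in that spirit might look like the sketch below; decodeFlagged and its exact behavior on bad input (consume one byte, return its value) are assumptions made for illustration.

```d
/// Decode one code point of s starting at index i, without throwing.
/// On success: ok = true, i moves past the sequence, the code point is returned.
/// On malformed input: ok = false, i moves one byte, that byte's value is returned.
dchar decodeFlagged(const(char)[] s, ref size_t i, out bool ok)
    pure nothrow @nogc @safe
{
    // Smallest code point that legitimately needs 2, 3 or 4 bytes;
    // anything lower is an overlong encoding.
    static immutable uint[5] minValue = [0, 0, 0x80, 0x800, 0x1_0000];

    immutable uint b0 = s[i];
    if (b0 < 0x80)
    {
        ok = true;
        ++i;
        return cast(dchar) b0;
    }

    size_t len = 0;
    uint c = 0;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; c = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; c = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; c = b0 & 0x07; }

    // len == 0 means b0 cannot start a sequence; also reject truncation.
    bool good = len != 0 && len <= s.length - i;
    for (size_t k = 1; good && k < len; ++k)
    {
        immutable uint b = s[i + k];
        if ((b & 0xC0) != 0x80)
            good = false;             // not a continuation byte
        else
            c = (c << 6) | (b & 0x3F);
    }
    // Reject overlong forms, values past U+10FFFF and surrogates.
    if (good && (c < minValue[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)))
        good = false;

    if (!good)
    {
        ++i;                          // ok is already false: 'out' resets it
        return cast(dchar) b0;
    }
    ok = true;
    i += len;
    return cast(dchar) c;
}
```

A caller such as an editor's rendering loop can then draw a special mark whenever ok comes back false and simply continue with the next byte.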

Jun 25 2011

"Nick Sabalausky" <a a.a> writes:

"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.

 The new function simply replaces throwing exceptions with flagging a 
 boolean.

I think I may end up doing something like that :/

I was hoping to be able to do something vaguely sensible like this:

string newStr;
foreach(dchar dc; str)
{
    if(isValidDchar(dc))
        newStr ~= dc;
    else
        newStr ~= 'X';
}
str = newStr;

But that just blows up in my face.
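
The loop above fails because foreach over a string with a dchar loop variable decodes as it goes, so the exception is thrown before isValidDchar ever sees anything. One possible workaround, sketched here with a hypothetical helper name, is to call std.utf.decode manually and catch the failure one byte at a time. It assumes decode throws UTFException for every malformed sequence and that the replacement is itself a valid code point.

```d
import std.utf : decode, UTFException;

/// Copy s, substituting `replacement` for every byte that does not begin a
/// valid UTF-8 sequence. The result is always well-formed UTF-8, provided
/// `replacement` is a valid code point.
string replaceInvalid(const(char)[] s, dchar replacement = 'X')
{
    string result;
    size_t i = 0;
    while (i < s.length)
    {
        size_t next = i;            // decode advances a copy of the index
        try
        {
            immutable dchar c = decode(s, next);
            result ~= c;            // re-encodes c as UTF-8
            i = next;
        }
        catch (UTFException)
        {
            result ~= replacement;  // one replacement per undecodable byte
            ++i;
        }
    }
    return result;
}
```

This trades speed for simplicity: input with many bad bytes pays for one exception each.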

Jun 25 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>  wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.

 The new function simply replaces throwing exceptions with flagging a
 boolean.

 I think I may end up doing something like that :/

 I was hoping to be able to do something vaguely sensible like this:

 string newStr;
 foreach(dchar dc; str)
 {
      if(isValidDchar(dc))
          newStr ~= dc;
      else
          newStr ~= 'X';
 }
 str = newStr;

 But that just blows up in my face.

std.encoding to the rescue?
It looks like a well established module that was forgotten for some reason.

And here I'm wondering what a function named sanitize could do :)

-- 
Dmitry Olshansky

Jun 25 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-06-25 15:17, Dmitry Olshansky wrote:
 On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>  wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.
 
 The new function simply replaces throwing exceptions with flagging a
 boolean.

 
 I think I may end up doing something like that :/
 
 I was hoping to be able to do something vaguely sensible like this:
 
 string newStr;
 foreach(dchar dc; str)
 {
 
      if(isValidDchar(dc))
      
          newStr ~= dc;
      
      else
      
          newStr ~= 'X';
 
 }
 str = newStr;
 
 But that just blows up in my face.

 
 std.encoding to the rescue?
 It looks like a well established module that was forgotten for some reason.

It's also likely going away. It was an experiment of sorts which Andrei 
considers a failure. We need something to replace it, but as I understand it, 
it doesn't solve all of the problems that it's supposed to, and those it does 
solve, it doesn't necessarily solve in the best way. So, an improved 
replacement is going to need to be devised, but I wouldn't expect std.encoding 
to stick around in the long run.

- Jonathan M Davis

Jun 25 2011

"Nick Sabalausky" <a a.a> writes:

"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message 
news:iu5n32$2vjd$1 digitalmars.com...
 On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>  wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.

 The new function simply replaces throwing exceptions with flagging a
 boolean.

 I think I may end up doing something like that :/

 I was hoping to be able to do something vaguely sensible like this:

 string newStr;
 foreach(dchar dc; str)
 {
      if(isValidDchar(dc))
          newStr ~= dc;
      else
          newStr ~= 'X';
 }
 str = newStr;

 But that just blows up in my face.

 std.encoding to the rescue?
 It looks like a well established module that was forgotten for some 
 reason.

 And here I'm wondering what a function named sanitize could do :)

Ahh, I didn't even notice that module.

Even if it's imperfect and goes away, it looks like it'll at least get the 
job done for me. And the encoding conversions should even give me an easy 
way to save at least some of the invalid chars (which wasn't really a 
requirement of mine, but it'll still be nice).

Jun 25 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26.06.2011 3:25, Nick Sabalausky wrote:
 "Dmitry Olshansky"<dmitry.olsh gmail.com>  wrote in message
 news:iu5n32$2vjd$1 digitalmars.com...
 On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>   wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.

 The new function simply replaces throwing exceptions with flagging a
 boolean.

 I think I may end up doing something like that :/

 I was hoping to be able to do something vaguely sensible like this:

 string newStr;
 foreach(dchar dc; str)
 {
       if(isValidDchar(dc))
           newStr ~= dc;
       else
           newStr ~= 'X';
 }
 str = newStr;

 But that just blows up in my face.

 std.encoding to the rescue?
 It looks like a well established module that was forgotten for some
 reason.

 And here I'm wondering what a function named sanitize could do :)

 Ahh, I didn't even notice that module.

Same here. Just a couple of days(!) ago I somehow managed to find decode in
the wrong place (in std.encoding instead of std.utf). It looked useful, but
I had never heard of it. Seriously, how many totally irrelevant old modules
do we have around here? (hint: std.gregorian!)

 Even if it's imperfect and goes away, it looks like it'll at least get the
 job done for me. And the encoding conversions should even give me an easy
 way to save at least some of the invalid chars (which wasn't really a
 requirement of mine, but it'll still be nice).

Yeah, given the amount of necessary work in the Phobos realm it could
hang around for quite some time ;)

-- 
Dmitry Olshansky

Jun 25 2011

"Nick Sabalausky" <a a.a> writes:

"Dmitry Olshansky" <dmitry.olsh gmail.com> wrote in message 
news:iu5tan$ets$1 digitalmars.com...
 On 26.06.2011 3:25, Nick Sabalausky wrote:
 "Dmitry Olshansky"<dmitry.olsh gmail.com>  wrote in message
 news:iu5n32$2vjd$1 digitalmars.com...
 On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>   wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.

 The new function simply replaces throwing exceptions with flagging a
 boolean.

 I think I may end up doing something like that :/

 I was hoping to be able to do something vaguely sensible like this:

 string newStr;
 foreach(dchar dc; str)
 {
       if(isValidDchar(dc))
           newStr ~= dc;
       else
           newStr ~= 'X';
 }
 str = newStr;

 But that just blows up in my face.

 std.encoding to the rescue?
 It looks like a well established module that was forgotten for some
 reason.

 And here I'm wondering what a function named sanitize could do :)

 Ahh, I didn't even notice that module.

 Same here, It's just a couple of days(!) ago I somehow managed to find 
 decode in the wrong place (in std.encoding  instead of std.utf). And it 
 looked useful, but I never heard about it. Seriously, how many totally 
 irrelevant old modules we have around here? (hint: std.gregorian!)
 Even if it's imperfect and goes away, it looks like it'll at least get 
 the
 job done for me. And the encoding conversions should even give me an easy
 way to save at least some of the invalid chars (which wasn't really a
 requirement of mine, but it'll still be nice).

 Yeah, given the amount of necessary work in the Phobos realm it could hang 
 around for quite sometime ;)

Yea, and even when it does go, I can just copy it and include it manually 
(although it'll probably need some work once typedef goes away).

This seems to get the job done well enough for me, and even manages to save 
some of the intended chars:

// With std.utf and std.encoding imported
// (newer Phobos releases spell the exception UTFException):
string src = ...;
bool valid=true;
try
    validate(src);
catch(UtfException e)
    valid=false;

if(!valid)
{
    auto tmpStr = sanitize( cast(Windows1252String) src );
    transcode(tmpStr, src);
}
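
The same fallback can be packaged as a function that avoids the exception by asking std.encoding directly whether the input is valid. This is a sketch under assumptions: toValidUTF8 is a hypothetical name, and treating invalid input as Windows-1252 is a guess about the data, not something the bytes themselves can confirm.

```d
import std.encoding : isValid, sanitize, transcode, Windows1252String;

/// Return src unchanged if it is already valid UTF-8. Otherwise reinterpret
/// the bytes as Windows-1252, replace anything that encoding does not define,
/// and transcode the result to UTF-8.
string toValidUTF8(string src)
{
    if (isValid(src))
        return src;

    auto tmp = sanitize(cast(Windows1252String) src);
    string result;
    transcode(tmp, result);
    return result;
}
```

Because the whole string is reinterpreted once a single bad byte is found, any valid multi-byte UTF-8 sequences elsewhere in the same input come out as mojibake, which matches the "good enough" goal stated at the top of the thread.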

Jun 25 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On 2011-06-25 17:04, Dmitry Olshansky wrote:
 On 26.06.2011 3:25, Nick Sabalausky wrote:
 "Dmitry Olshansky"<dmitry.olsh gmail.com>  wrote in message
 news:iu5n32$2vjd$1 digitalmars.com...
 
 On 26.06.2011 1:49, Nick Sabalausky wrote:
 "Andrej Mitrovic"<andrej.mitrovich gmail.com>   wrote in message
 news:mailman.1215.1309019944.14074.digitalmars-d-learn puremagic.com...
 
 I've had a similar requirement some time ago. I've had to copy and
 modify the phobos function std.utf.decode for a custom text editor
 because the function throws when it finds an invalid code point. This
 is way too slow for my needs. I'm actually displaying invalid code
 points with special marks (just like Scintilla), so I need decoding to
 work as fast as possible.
 
 The new function simply replaces throwing exceptions with flagging a
 boolean.

 
 I think I may end up doing something like that :/
 
 I was hoping to be able to do something vaguely sensible like this:
 
 string newStr;
 foreach(dchar dc; str)
 {
 
       if(isValidDchar(dc))
       
           newStr ~= dc;
       
       else
       
           newStr ~= 'X';
 
 }
 str = newStr;
 
 But that just blows up in my face.

 
 std.encoding to the rescue?
 It looks like a well established module that was forgotten for some
 reason.
 
 And here I'm wondering what a function named sanitize could do :)

 
 Ahh, I didn't even notice that module.

 
 Same here, It's just a couple of days(!) ago I somehow managed to find
 decode in the wrong place (in std.encoding  instead of std.utf). And it
 looked useful, but I never heard about it. Seriously, how many totally
 irrelevant old modules we have around here? (hint: std.gregorian!)
 
 Even if it's imperfect and goes away, it looks like it'll at least get
 the job done for me. And the encoding conversions should even give me an
 easy way to save at least some of the invalid chars (which wasn't really
 a requirement of mine, but it'll still be nice).

 
 Yeah, given the amount of necessary work in the Phobos realm it could
 hang around for quite sometime ;)

Oh, it'll probably be around for a while. It'll take time before a replacement 
is devised. After all, std.stream is still around, isn't it? And there's 
supposedly an actual plan for implementing its replacement. There's no such 
plan for std.encoding. I just thought that I should point out that it's likely 
to be replaced at some point (hopefully with something much better).

- Jonathan M Davis

Jun 25 2011
