digitalmars.D - Regex and UTF-8

Andrea Fontana <advmail katamail.com> writes:

I built a data access layer in C++. This layer works with MongoDB, where
strings are always encoded as UTF-8. I've ported this layer to D using
SWIG. Strings are printed correctly on the console, but when I use std.regex
it sometimes throws an exception:

core.exception.UnicodeException@src/rt/util/utf.d(290): invalid UTF-8
sequence

The byte sequence (for better understanding) is:
[83, 195, 179, 32]

And the string was "Sò " (with accented o and a space)

I'm not a UTF expert, so is this a wrong UTF-8 encoding or is it a bug in
utf.d?

Nov 18 2011
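
For anyone who wants to check the byte sequence from the question themselves, here is a minimal sketch, assuming a current Phobos where `std.utf.validate` is available. The helper name `isValidUtf8` is made up for this illustration and is not part of any library. One detail worth noting: 195, 179 is 0xC3 0xB3, which decodes to U+00F3 (ó), whereas ò would be 0xC3 0xB2. Either way, the four bytes are well-formed UTF-8.

```d
import std.utf : validate, UTFException;

// Hypothetical helper (not a Phobos function): returns true when the
// given bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on malformed input.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // The byte sequence quoted in the question.
    ubyte[] bytes = [83, 195, 179, 32];
    assert(isValidUtf8(bytes));

    // 195, 179 is 0xC3 0xB3, i.e. U+00F3.
    assert(cast(const(char)[]) bytes == "S\u00F3 ");

    // A lead byte without its continuation byte is rejected.
    ubyte[] truncated = [83, 195];
    assert(!isValidUtf8(truncated));
}
```

If the check passes for the data coming out of the SWIG layer, the input is probably not at fault and the problem is more likely on the library side.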

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 18.11.2011 17:58, Andrea Fontana wrote:
 I built a data access layer in C++. This layer works with MongoDB, where
 strings are always encoded as UTF-8. I've ported this layer to D using
 SWIG. Strings are printed correctly on the console, but when I use std.regex
 it sometimes throws an exception:

 core.exception.UnicodeException@src/rt/util/utf.d(290): invalid
 UTF-8 sequence

 The byte sequence (for better understanding) is:
 [83, 195, 179, 32]

 And the string was "Sò " (with accented o and a space)

 I'm not a UTF expert, so is this a wrong UTF-8 encoding or is it a bug in
 utf.d?

Which version of std.regex are you using - the one from git master or
the one in the latest release?
If it's the former then I'm willing to look into this on the weekend,
if you can provide a string + pattern pair that fails like this.


-- 
Dmitry Olshansky

Nov 18 2011

Andrea Fontana <advmail katamail.com> writes:

It seems related to toLower too...

Here the line with exception:

s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();

Where s is a string with that sequence...

Using dmd 2.056


Nov 18 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 18.11.2011 21:07, Andrea Fontana wrote:
 It seems related to toLower too...

 Here the line with exception:

 s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();

 Where s is a string with that sequence...

 Using dmd 2.056

You mean one of the prepackaged zips|debs|etc. from the website? Those ship
the old regex, which, I have to admit, is not that good with Unicode. In
that case ... well, you are somewhat out of luck until the next release.

That's when the brand new regex engine is coming, provided I figure out a
mysterious FreeBSD|OSX issue (sigh). Unfortunately, I've been very busy
recently, though maybe this weekend I'll finally work something out.

I just tested it with my version on win32 ... well, it hits one of the
asserts (it should have been an exception, ouch!), but the fix was easy.
It's all about the \. escape: inside [] a plain . already works as a simple
'.' char, so it's just wrong to escape it inside a character class (some
engines do allow this, though it's confusing as hell).
After the fix it outputs stuff like this:
std.regex.RegexException@std\regex.d(1939): invalid escape sequence
Pattern with error: `[^"a-zA-Z0-9àòèéìù\.` <--HERE-- `]`

After changing \. --> . it works for me with s = "Sò  ", no exceptions.
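
For reference, the corrected line could look roughly like the sketch below. It assumes a current Phobos, where `replaceAll` takes the place of `replace` with the "g" flag (it is not available in dmd 2.056), and the function name `sanitize` is made up for this illustration. The only change to the pattern itself is the unescaped `.` inside the character class.

```d
import std.regex : regex, replaceAll;
import std.uni : toLower;

// Hypothetical wrapper around the one-liner from the thread: every
// character outside the allowed set becomes a space, then the result
// is lowercased.
string sanitize(string s)
{
    // Inside [] the dot is a literal '.', so it needs no backslash.
    auto re = regex(`[^"a-zA-Z0-9àòèéìù.]`);
    return replaceAll(s, re, " ").toLower();
}

void main()
{
    // The test string from the post: nothing to replace except spaces.
    assert(sanitize("Sò  ") == "sò  ");

    // ',' and '?' are outside the class and turn into spaces.
    assert(sanitize("Sò, Perché?") == "sò  perché ");
}
```

The source file has to be saved as UTF-8 for the accented letters in the pattern to be read correctly.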

Bottom line:
Thanks, as this uncovered a serious issue, i.e. a misjudged assert on wrong
escapes in character classes.
Second, if you are on win32/linux you might want to try the fresh GitHub version.
And stay tuned for the next release, which should fix most of the regex issues
once and for all.

-- 
Dmitry Olshansky

Nov 18 2011
