digitalmars.D - Regex and UTF-8

reply Andrea Fontana <advmail katamail.com> writes:

I built a data access layer in C++. This layer works with MongoDB, where
strings are always encoded as UTF-8. I've ported this layer to D using
SWIG. The string prints correctly on the console, but when I use std.regex
it sometimes throws an exception:

core.exception.UnicodeException@src/rt/util/utf.d(290): invalid UTF-8
sequence

The byte sequence (for better understanding) is:
[83, 195, 179, 32]

And the string was "Sò " (with an accented o and a space)

I'm not a UTF expert, so is this an invalid UTF-8 encoding, or is it a bug in
utf.d?
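
The encoding half of this question can be checked without std.regex at all. The following is a minimal sketch, assuming a recent Phobos in which `std.utf.validate` throws `UTFException` on malformed input; `isValidUtf8` is a hypothetical helper name, not part of any library. As listed, the bytes 195, 179 (0xC3 0xB3) encode U+00F3 (ó), while ò (U+00F2) would be 195, 178; either pair is well-formed UTF-8.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true when `bytes` is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() walks the whole string and throws on the first bad sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // The byte sequence quoted in the post: 'S', 0xC3 0xB3, ' '.
    immutable ubyte[] reported = [83, 195, 179, 32];
    writeln(isValidUtf8(reported));

    // For contrast: a lead byte (0xC3) with no continuation byte after it.
    immutable ubyte[] truncated = [83, 195, 32];
    writeln(isValidUtf8(truncated));
}
```

If the reported bytes pass this check, the input is fine and the exception more likely comes from how the string is processed afterwards.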
Nov 18 2011
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 18.11.2011 17:58, Andrea Fontana wrote:
 I built a data access layer in C++. This layer works with MongoDB, where
 strings are always encoded as UTF-8. I've ported this layer to D using
 SWIG. The string prints correctly on the console, but when I use std.regex
 it sometimes throws an exception:

 core.exception.UnicodeException@src/rt/util/utf.d(290): invalid
 UTF-8 sequence

 The byte sequence (for better understanding) is:
 [83, 195, 179, 32]

 And the string was "Sò " (with an accented o and a space)

 I'm not a UTF expert, so is this an invalid UTF-8 encoding, or is it a bug in
 utf.d?

Which version of std.regex are you using: the one from git master or the one in the latest release? If it's the former, then I'm willing to look into this over the weekend, if you can get hold of a string + pattern pair that fails like this.

-- Dmitry Olshansky
Nov 18 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 18.11.2011 21:07, Andrea Fontana wrote:
 It seems related to toLower too...

 Here the line with exception:

 s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();

 Where s is a string with that sequence...

 Using dmd 2.056

You mean one of the prepackaged zips|debs|etc. from the website? It uses the old regex, which, I have to admit, is not that good with Unicode. Then ... well, you are somewhat out of luck until the next release. That's where the brand new regex engine is coming, provided I figure out a mysterious FreeBSD|OSX issue (sigh). Unfortunately, I was very busy recently, though maybe this weekend I'll finally work something out.

I just tested it with my version on win32 ... well, it hits one of the asserts (it should have been an exception, ouch!), but the fix was easy. It's all about the '.', which works as a plain '.' char inside []; it's just wrong to escape it inside a character class (some engines do allow this, though it's confusing as hell). After that fix it outputs stuff like this:

 std.regex.RegexException@std\regex.d(1939): invalid escape sequence
 Pattern with error: `[^"a-zA-Z0-9àòèéìù\.` <--HERE-- `]`

After changing \. --> . it does work for me with s = "Sò ", no exceptions.

Bottom line: thanks, as I uncovered a serious issue, i.e. a misjudged assert on wrong escapes in character classes. Second, if you are on win32/linux you might want to try the fresh GitHub version. And stay tuned for the next release, which should fix most of the regex issues once and for all.
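
The suggested change (writing `.` instead of `\.` inside the character class) can be sketched as follows. This assumes a recent Phobos, where `std.regex.replaceAll` plays the role of `replace` with the "g" flag; `normalize` is a hypothetical helper name used only for illustration, and this is not the exact code tested in the thread.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;
import std.uni : toLower;

/// Hypothetical helper: replaces every character outside the allowed set
/// with a space, then lower-cases the result.
string normalize(string s)
{
    // Inside [...] a '.' is already a literal dot, so no backslash is used.
    auto unwanted = regex(`[^"a-zA-Z0-9àòèéìù.]`);
    return replaceAll(s, unwanted, " ").toLower();
}

void main()
{
    // The string from the original report: 'S', accented o, space.
    writeln(normalize("Sò "));
}
```

Compiling the pattern on every call keeps the sketch short; in real code the compiled regex would normally be built once and reused.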
-- Dmitry Olshansky
Nov 18 2011
prev sibling parent Andrea Fontana <advmail katamail.com> writes:

It seems related to toLower too...

Here is the line that throws the exception:

s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();

Where s is a string with that sequence...

Using dmd 2.056

Nov 18 2011