digitalmars.D - Regex and utf8

reply Roman Balitskiy <realis_toleroATtoleroDOTorg_fake fake.com> writes:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I use
dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tryed upcomming gdc 0.25 with
the same results.

	if (auto m = std.regexp.search(`ab&#1078;def`, `[&#1078;]`))   // Here is
cyrillic letter 'je'
		writefln("%s[%s]%s", m.pre, m.match(0), m.post);
Jul 20 2008
next sibling parent "Koroskin Denis" <2korden gmail.com> writes:
On Sun, 20 Jul 2008 22:50:06 +0400, Roman Balitskiy  
<realis_toleroATtoleroDOTorg_fake fake.com> wrote:

 When I try to parse cyrillic text I get "Error: 4invalid UTF-8  
 sequence". I use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tryed  
 upcomming gdc 0.25 with the same results.

 	if (auto m = std.regexp.search(`ab&#1078;def`, `[&#1078;]`))   // Here  
 is cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);

Try removing the square brackets; the following code sample works for me:

	import std.stdio;
	import std.regexp;

	void main()
	{
		if (auto m = std.regexp.search("abÖdef", "Ö"))
		{
			writefln("%s[%s]%s", m.pre, m.match(0), m.post);
		}
	}

I don't know whether it's a bug or not; most probably it is. But since the Phobos console output is not Unicode-aware, you won't see "ab[Ö]def" as expected but rather something like "ab[º´]def" (my output; it might differ under other locale settings).

By contrast, the Tango console I/O is more Unicode-friendly:

	import tango.text.Regex;
	import tango.io.Stdout;

	void main()
	{
		foreach (m; Regex("Ö").search("abÖdef"))
		{
			Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
		}
	}

Hope that helps.
Jul 20 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Roman Balitskiy wrote:
 When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I
use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tryed upcomming gdc 0.25
with the same results.
 
 	if (auto m = std.regexp.search(`ab&#1078;def`, `[&#1078;]`))   // Here is
cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);
 

The back quotes are for wysiwyg strings, and the UTF translation doesn't happen. Try using "" strings instead.
Jul 20 2008
parent reply "Koroskin Denis" <2korden gmail.com> writes:
On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright  =

<newshound1 digitalmars.com> wrote:

 Roman Balitskiy wrote:
 When I try to parse cyrillic text I get "Error: 4invalid UTF-8  =


 sequence". I use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have  =


 tryed upcomming gdc 0.25 with the same results.
  	if (auto m =3D std.regexp.search(`ab&#1078;def`, `[&#1078;]`))   //=


 Here is cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);

The back quotes are for wysiwyg strings, and the UTF translation doesn=

 happen. Try using "" strings instead.

Nope, it doesn't help. However, removing square brackets does.
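
A minimal sketch of why switching literal styles would not be expected to change anything here, assuming the source file is saved as UTF-8: a character typed directly into either kind of literal is stored as the same UTF-8 code units, and only escape sequences are treated differently. The checks below are an illustration, not part of the original report.

```d
import std.stdio;

void main()
{
    // A character typed directly is stored identically in both literal styles.
    assert(`ж` == "ж");
    assert("ж".length == 2);        // U+0436 occupies two UTF-8 code units

    // Only escape sequences differ: "" translates them, back quotes do not.
    assert("\u0436" == "ж");
    assert(`\u0436`.length == 6);   // six literal characters: \ u 0 4 3 6

    writefln("all checks passed");
}
```

So the pattern handed to the regex engine is byte-for-byte the same with either quoting style, which is consistent with the error persisting.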
Jul 20 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Koroskin Denis wrote:
 On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright 
 <newshound1 digitalmars.com> wrote:
 
 Roman Balitskiy wrote:
 When I try to parse cyrillic text I get "Error: 4invalid UTF-8 
 sequence". I use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have 
 tryed upcomming gdc 0.25 with the same results.
      if (auto m = std.regexp.search(`ab&#1078;def`, `[&#1078;]`))   
 // Here is cyrillic letter 'je'
         writefln("%s[%s]%s", m.pre, m.match(0), m.post);

The back quotes are for wysiwyg strings, and the UTF translation doesn't happen. Try using "" strings instead.

Nope, it doesn't help. However, removing square brackets does.

That's a bug with the regex engine, then. Who wants to put it in bugzilla? <g>
Jul 20 2008
Roman Balitskiy <realis_toleroATtoleroDOTorg_fake fake.com> writes:
Walter Bright Wrote:

 That's a bug with the regex engine, then. Who wants to put it in bugzilla? <g>

Is there any progress towards a fix for that bug?
Aug 13 2008