digitalmars.D - Regex and utf8


Roman Balitskiy <realis_toleroATtoleroDOTorg_fake fake.com> writes:

When I try to parse Cyrillic text I get "Error: 4invalid UTF-8 sequence". I use
dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have tried the upcoming gdc 0.25
with the same results.


cyrillic letter 'je'
		writefln("%s[%s]%s", m.pre, m.match(0), m.post);

Jul 20 2008

"Koroskin Denis" <2korden gmail.com> writes:

On Sun, 20 Jul 2008 22:50:06 +0400, Roman Balitskiy  
<realis_toleroATtoleroDOTorg_fake fake.com> wrote:

 When I try to parse Cyrillic text I get "Error: 4invalid UTF-8
 sequence". I use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have tried
 the upcoming gdc 0.25 with the same results.


 is cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);

Try removing the square brackets; the following code sample works for me:

import std.stdio;
import std.regexp;

void main()
{
     if (auto m = std.regexp.search("ab�def", "�")) {
         writefln("%s[%s]%s", m.pre, m.match(0), m.post);
     }
}

I don't know if it's a bug or not, most probably it is.

But since the Phobos console output is not Unicode-aware, you won't see
"ab[�]def" as expected but rather something like "ab[��]def" (that is my
output; it might differ with other locale settings).

By contrast, the Tango console I/O is more Unicode-friendly:

import tango.text.Regex;
import tango.io.Stdout;

void main()
{
     foreach(m; Regex("�").search("ab�def")) {
         Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
     }
}

Hope that helps.

Jul 20 2008

Walter Bright <newshound1 digitalmars.com> writes:

Roman Balitskiy wrote:
 When I try to parse Cyrillic text I get "Error: 4invalid UTF-8 sequence". I
 use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have tried the upcoming
 gdc 0.25 with the same results.
 

cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);
 


The back quotes are for wysiwyg strings, and the UTF translation doesn't 
happen. Try using "" strings instead.

Jul 20 2008

"Koroskin Denis" <2korden gmail.com> writes:

On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright
<newshound1 digitalmars.com> wrote:

 Roman Balitskiy wrote:
 When I try to parse Cyrillic text I get "Error: 4invalid UTF-8
 sequence". I use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have
 tried the upcoming gdc 0.25 with the same results.

 Here is cyrillic letter 'je'
 		writefln("%s[%s]%s", m.pre, m.match(0), m.post);

 The back quotes are for wysiwyg strings, and the UTF translation doesn't
 happen. Try using "" strings instead.

Nope, it doesn't help. However, removing square brackets does.
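
A possible reason the switch makes no difference: when the letter is typed literally into a UTF-8 source file, both literal kinds should end up holding the same bytes, and they differ only in whether escape sequences are processed. A minimal sketch of that, assuming a current D2 compiler rather than dmd 1.030, and using U+0435 purely as a stand-in letter (the exact character from the original sample is an assumption here):

```d
void main()
{
    // Double-quoted strings process escapes: \u0435 becomes one code point,
    // which is stored as two UTF-8 code units.
    assert("\u0435".length == 2);

    // Wysiwyg (backquoted) strings keep the six characters exactly as written.
    assert(`\u0435`.length == 6);

    // A letter typed literally is stored identically in both literal kinds.
    assert(`е` == "е");
    assert(`е` == "\u0435");
}
```

So the choice of quotes would only matter if the pattern used a \u escape, not a literal letter.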

Jul 20 2008

Walter Bright <newshound1 digitalmars.com> writes:

Koroskin Denis wrote:
 On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright 
 <newshound1 digitalmars.com> wrote:
 
 Roman Balitskiy wrote:
 When I try to parse Cyrillic text I get "Error: 4invalid UTF-8
 sequence". I use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have
 tried the upcoming gdc 0.25 with the same results.

 // Here is cyrillic letter 'je'
         writefln("%s[%s]%s", m.pre, m.match(0), m.post);


 The back quotes are for wysiwyg strings, and the UTF translation 
 doesn't happen. Try using "" strings instead.

 
 Nope, it doesn't help. However, removing square brackets does.

That's a bug with the regex engine, then. Who wants to put it in 
bugzilla? <g>
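
For anyone filing the report, here is a rough illustration of how an "invalid UTF-8 sequence" error can arise in general. It is not a diagnosis of std.regexp; nothing in this thread confirms where the engine goes wrong. A Cyrillic letter occupies two UTF-8 code units, so any code that handles the pattern one char at a time can end up with half a character. The sketch assumes current D2 Phobos (std.utf.validate and UTFException), the helper name isValidUtf8 is hypothetical, and U+0435 is an assumed stand-in letter:

```d
import std.stdio : writefln;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "ab" + one Cyrillic letter (U+0435, an assumed stand-in) + "def"
    string text = "ab\u0435def";

    // .length counts UTF-8 code units, not characters.
    writefln("code units: %s", text.length);

    // Slicing both code units of the letter gives valid UTF-8 ...
    writefln("whole letter valid: %s", isValidUtf8(text[2 .. 4]));

    // ... while slicing only the first code unit does not.
    writefln("half letter valid:  %s", isValidUtf8(text[2 .. 3]));
}
```

If the character-class code stores or compares individual code units, something like the last case could be what triggers the exception, but that is a guess until someone reads the engine source.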

Jul 20 2008

Roman Balitskiy <realis_toleroATtoleroDOTorg_fake fake.com> writes:

Walter Bright Wrote:

 When I try to parse Cyrillic text I get "Error: 4invalid UTF-8
 sequence". I use dmd 1.030 on Ubuntu 8.04 with a UTF-8 locale. I have
 tried the upcoming gdc 0.25 with the same results.

 // Here is cyrillic letter 'je'
         writefln("%s[%s]%s", m.pre, m.match(0), m.post);

 That's a bug with the regex engine, then. Who wants to put it in 
 bugzilla? <g>

Is there any progress towards a fix for that bug?

Aug 13 2008
