digitalmars.D - Regex and utf8
- Roman Balitskiy (3/3) Jul 20 2008 When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence"...
- Koroskin Denis (25/31) Jul 20 2008 Try removing braces, the following code sample works for me:
- Walter Bright (3/8) Jul 20 2008 The back quotes are for wysiwyg strings, and the UTF translation doesn't...
- Koroskin Denis (5/15) Jul 20 2008 Nope, it doesn't help. However, removing square brackets does.
- Walter Bright (3/20) Jul 20 2008 That's a bug with the regex engine, then. Who wants to put it in
- Roman Balitskiy (2/11) Aug 13 2008 Is there any progress towards a fix for that bug?
When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I use
dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried the upcoming gdc 0.25 with
the same results.
cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
Jul 20 2008
On Sun, 20 Jul 2008 22:50:06 +0400, Roman Balitskiy
<realis_toleroATtoleroDOTorg_fake fake.com> wrote:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8
sequence". I use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried
the upcoming gdc 0.25 with the same results.
is cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
Try removing the square brackets; the following code sample works for me:
import std.stdio;
import std.regexp;
void main()
{
    // Pattern is the bare character, with no [] character class around it.
    if (auto m = std.regexp.search("abÖdef", "Ö")) {
        writefln("%s[%s]%s", m.pre, m.match(0), m.post);
    }
}
I don't know whether it's a bug, but most probably it is.
But since the Phobos console is not Unicode aware, you won't see "ab[Ö]def" as
expected but rather something like "ab[º´]def" (my output; it might be
different with other locale settings).
By contrast, the Tango console I/O is more Unicode-friendly:
import tango.text.Regex;
import tango.io.Stdout;
void main()
{
    foreach(m; Regex("Ö").search("abÖdef")) {
        Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
    }
}
Hope that helps.
Jul 20 2008
Roman Balitskiy wrote:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I
use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried the upcoming gdc 0.25
with the same results.
cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
The back quotes are for wysiwyg strings, and the UTF translation doesn't
happen. Try using "" strings instead.
Jul 20 2008
On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright
<newshound1 digitalmars.com> wrote:
Roman Balitskiy wrote:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8
sequence". I use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried
the upcoming gdc 0.25 with the same results.
Here is cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
The back quotes are for wysiwyg strings, and the UTF translation doesn't
happen. Try using "" strings instead.
Nope, it doesn't help. However, removing square brackets does.
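A small sketch that may explain why switching quote styles makes no difference here. It assumes a UTF-8 source file and a current D2 compiler (so `string` rather than the D1 `char[]` used in this thread); the variable names are made up for illustration. A character typed directly into the source is stored as the same UTF-8 code units in both kinds of literal. Only escape sequences are treated differently.

```d
import std.stdio;

void main()
{
    string wysiwyg = `ј`;      // back-quoted (wysiwyg) literal, U+0458 typed directly
    string quoted  = "ј";      // ordinary double-quoted literal, same character
    string escaped = "\u0458"; // escape sequence, only processed in "" strings

    // All three hold the same UTF-8 code units.
    assert(wysiwyg == quoted && quoted == escaped);
    assert(wysiwyg.length == 2);

    // Inside back quotes the escape is left as six literal characters.
    assert(`\u0458`.length == 6);

    // Print the raw code units of the literal in hex.
    writefln("%(%02X %)", cast(const(ubyte)[]) wysiwyg);
}
```

If that holds, the regex engine receives identical bytes either way, which fits the observation that only removing the square brackets changes the outcome.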
Jul 20 2008
Koroskin Denis wrote:
On Sun, 20 Jul 2008 23:45:34 +0400, Walter Bright
<newshound1 digitalmars.com> wrote:
Roman Balitskiy wrote:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I
use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried the upcoming gdc 0.25
with the same results.
// Here is cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
The back quotes are for wysiwyg strings, and the UTF translation doesn't
happen. Try using "" strings instead.
Nope, it doesn't help. However, removing square brackets does.
That's a bug with the regex engine, then. Who wants to put it in bugzilla? <g>
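For comparison, a minimal sketch of the same kind of match written against the `std.regex` module of present-day D2 Phobos. This is not the D1 `std.regexp` module discussed in this thread, and it says nothing about whether or how the reported bug was fixed there. It puts a non-ASCII letter inside a character class, which is the construct reported as failing above; the sample strings are made up for illustration.

```d
import std.regex;
import std.stdio;

void main()
{
    // A character class containing a single non-ASCII letter.
    auto m = matchFirst("abÖdef", regex("[Ö]"));
    assert(!m.empty);
    assert(m.pre == "ab" && m.hit == "Ö" && m.post == "def");
    writefln("%s[%s]%s", m.pre, m.hit, m.post);

    // The same with cyrillic letter 'je' (U+0458) inside the class.
    auto c = matchFirst("маја", regex("[ј]"));
    assert(!c.empty);
    assert(c.pre == "ма" && c.hit == "ј" && c.post == "а");
    writefln("%s[%s]%s", c.pre, c.hit, c.post);
}
```

Here `m.hit` plays the role that `m.match(0)` has in the `std.regexp` samples.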
Jul 20 2008
Walter Bright Wrote:
When I try to parse cyrillic text I get "Error: 4invalid UTF-8 sequence". I
use dmd 1.030 on Ubuntu 8.04 with utf8 locale. I have tried the upcoming gdc 0.25
with the same results.
// Here is cyrillic letter 'je'
writefln("%s[%s]%s", m.pre, m.match(0), m.post);
That's a bug with the regex engine, then. Who wants to put it in bugzilla? <g>
Is there any progress towards a fix for that bug?
Aug 13 2008