digitalmars.D.learn - std.regex character consumption

petevik38 yahoo.com.au writes:

I've been running into a few problems with regular expressions in D. One
of the issues I've had recently is matching strings with non-ASCII
characters. As an example:

    auto re = regex( `(.*)\.txt`, "i" );
    re.printProgram();
    auto m = match( "bà.txt", re );
    writefln( "'%s'", m.captures[1] );

When I run this I get the following error:

dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116
120] around index 0
printProgram()
  0: 	REparen len=1 n=0, pc=>10
  9: 	REanystar
 10: 	REistring x4, '.txt'
 19: 	REend
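
For reference, the numbers in the error message can be checked against the raw UTF-8 code units of the input string. The following is only a minimal sketch, assuming a current Phobos where `std.string.representation` is available and a source file saved as UTF-8; it is not taken from the original post:

```d
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string s = "bà.txt";

    // 'à' (U+00E0) is encoded in UTF-8 as two code units: 0xC3 0xA0 (195 160).
    writeln(s.representation);          // [98, 195, 160, 46, 116, 120, 116]

    // The sequence in the error message starts at index 2, i.e. at the
    // second code unit of 'à' rather than at the start of a code point.
    writeln(s.representation[2 .. 6]);  // [160, 46, 116, 120]
}
```

The reported sequence begins with 160, which is the continuation byte of 'à', so the decoder appears to have been handed an index in the middle of that character.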

While investigating the cause, I noticed that during execution of many
of the regex instructions (e.g. REanystar), the source is advanced with:

                src++;

However in other cases (REanychar), it is advanced with:

                src += std.utf.stride(input, src);

I found that by replacing the src++ in the REanystar code with stride,
the code worked as expected. Although I can't claim to have a solid
understanding of the code, it seems to me that most of the cases of
src++ should be using stride instead.

Is this correct, or have I made some silly mistake and got completely
the wrong end of the stick?
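
The difference between the two ways of advancing can be sketched outside the regex engine. This is an illustration only: `stepByCodeUnit` and `stepByStride` are hypothetical helper names, not functions from std.regex, and the sketch assumes the current `std.utf.stride(str, index)` overload:

```d
import std.stdio : writeln;
import std.utf : stride;

/// Indices visited when advancing with src++ (one code unit at a time).
size_t[] stepByCodeUnit(string input)
{
    size_t[] visited;
    for (size_t src = 0; src < input.length; src++)
        visited ~= src;
    return visited;
}

/// Indices visited when advancing with stride (one code point at a time).
size_t[] stepByStride(string input)
{
    size_t[] visited;
    for (size_t src = 0; src < input.length; src += stride(input, src))
        visited ~= src;
    return visited;
}

void main()
{
    string s = "bà.txt";
    writeln(stepByCodeUnit(s)); // [0, 1, 2, 3, 4, 5, 6]
    writeln(stepByStride(s));   // [0, 1, 3, 4, 5, 6]
}
```

Only the src++ loop stops at index 2, which lies inside the two-byte encoding of 'à'.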

Oct 08 2010

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, October 08, 2010 14:13:36 petevik38 yahoo.com.au wrote:
 I've been running into a few problems with regular expressions in D. One
 of the issues I've had recently is matching strings with non ascii
 characters. As an example:

     auto re = regex( `(.*)\.txt`, "i" );
     re.printProgram();
     auto m = match( "bà.txt", re );
     writefln( "'%s'", m.captures[1] );

 When I run this I get the following error:

 dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116
 120] around index 0
 printProgram()
   0: 	REparen len=1 n=0, pc=>10
   9: 	REanystar
  10: 	REistring x4, '.txt'
  19: 	REend

 While investigating the cause, I noticed that during execution of many
 of the regex instructions (e.g. REanystar), the source is advanced with:

                 src++;

 However in other cases (REanychar), it is advanced with:

                 src += std.utf.stride(input, src);

 I found that by replacing the code REanystar with stride, the code
 worked as expected. Although I can't claim to have a solid understanding
 of the code, it seems to me that most of the cases of src++ should be
 using stride instead.

 Is this correct, or have I made some silly mistake and got completely
 the wrong end of the stick?

Well, without looking at the code, I can't say for certain what's going on,
but using ++ with chars or wchars is definitely wrong in virtually all cases.
stride() will actually go to the next code point, while ++ will just go to
the next code unit, which could be in the middle of a code point.

- Jonathan M Davis
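
What "in the middle of a code point" means in practice can be checked with `std.utf.decode`, which throws a `UTFException` when asked to decode from a continuation byte. The following is a hedged sketch, not code from the thread; `startsCodePoint` is a hypothetical helper introduced only for illustration:

```d
import std.stdio : writeln;
import std.utf : decode, UTFException;

/// Returns true if a complete code point can be decoded starting at `index`.
/// (Hypothetical helper, for illustration only.)
bool startsCodePoint(string s, size_t index)
{
    try
    {
        decode(s, index); // advances the local copy of index past the code point
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "bà.txt";
    writeln(startsCodePoint(s, 1)); // true:  index 1 is the lead byte 0xC3
    writeln(startsCodePoint(s, 2)); // false: index 2 is the continuation byte 0xA0
}
```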

Oct 08 2010
