digitalmars.D - Regex performance

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

This might be worth looking into. Dmitry?

http://jblewitt.com/blog/?p=462


Andrei

Mar 24 2012

"Kapps" <opantm2+spam gmail.com> writes:

On Saturday, 24 March 2012 at 23:06:54 UTC, Andrei Alexandrescu 
wrote:
 This might be worth looking into. Dmitry?

 http://jblewitt.com/blog/?p=462


 Andrei

A difference of that amount is likely expecting something like 
regex("Blah") to not have to create a new regex struct each time, 
something which I'm guessing Ruby does (as do other standard 
libraries like .NET).

Mar 24 2012

"Nick Sabalausky" <a a.a> writes:

"Kapps" <opantm2+spam gmail.com> wrote in message 
news:yudtvjsuhhimrhqaixos forum.dlang.org...
 On Saturday, 24 March 2012 at 23:06:54 UTC, Andrei Alexandrescu wrote:
 This might be worth looking into. Dmitry?

 http://jblewitt.com/blog/?p=462


 Andrei

 A difference of that amount is likely expecting something like 
 regex("Blah") to not have to create a new regex struct each time, 
 something which I'm guessing Ruby does (as do other standard libraries 
 like .NET).

Yea, I agree that's what it sounds like. I tried to post a response, but I'm 
just getting this result (and yes, this is with JS enabled):

--------------------------------------------
Asirra validation failed!
ticket = start ASIRRAVALIDATION ir=cd ir  data=

  start DEBUG ir=cd ir  data=exceptions.Exception: invalid ticket formatend 



  Fail
  exceptions.Exception: invalid ticket format
--------------------------------------------

If it's working for anyone else, maybe you could post it for me?

--------------------------------------------
A few things on the D version:

- Make sure you're using a recent version of DMD. The regex engine was 
overhauled fairly recently (I forget exactly which version, but the latest, 
2.058 definitely has it, along with some bugfixes.)

- Make sure you're using "std.regex", not the deprecated "std.regexp".

- It sounds like this may be your main problem: Make sure you're not 
re-creating the same regex multiple times:

// Bad:
foreach(str; strings)
{
    auto result = match(str, regex("abc.*def"));
}

// Good:
auto myRegex = regex("abc.*def");
foreach(str; strings)
{
    auto result = match(str, myRegex);
}

Some regex engines cache the regex, but D's doesn't ATM. I think that'll 
likely get fixed though.

- Even better yet, if your regex string is a literal (or otherwise known or 
computable at compile-time) as above, use the compile-time version instead:

auto myRegex = ctRegex!"abc.*def";
// [...same 'foreach' loop as before...]
--------------------------------------------
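The advice above can be put together into one small program. This is only a sketch: the helper names `countMatches` and `countMatchesCt` and the sample strings are made up for illustration, and it assumes a Phobos recent enough to ship the overhauled `std.regex` with `ctRegex`. Both helpers build the pattern once, outside the loop.

```d
import std.regex : ctRegex, match, regex;
import std.stdio : writeln;

// Hypothetical helper: compiles the pattern once at run time,
// then reuses it for every input string.
size_t countMatches(string[] lines)
{
    auto re = regex(`abc.*def`);
    size_t hits = 0;
    foreach (line; lines)
    {
        if (!match(line, re).empty)
            ++hits;
    }
    return hits;
}

// Hypothetical helper: same loop, but the pattern is a literal,
// so the matcher can be generated at compile time.
size_t countMatchesCt(string[] lines)
{
    auto re = ctRegex!(`abc.*def`);
    size_t hits = 0;
    foreach (line; lines)
    {
        if (!match(line, re).empty)
            ++hits;
    }
    return hits;
}

void main()
{
    auto lines = ["abc123def", "nothing here", "xxabcdefxx"];
    writeln(countMatches(lines));
    writeln(countMatchesCt(lines));
}
```

Note that `ctRegex` moves work into compilation, so build time and compiler memory use may go up noticeably for larger patterns.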

Mar 24 2012

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:jklomm$2seb$1 digitalmars.com...
 Yea, I agree that's what it sounds like. I tried to post a response, but 
 I'm just getting this result (and yes, this is with JS enabled):

 --------------------------------------------
 Asirra validation failed!
 ticket = start ASIRRAVALIDATION ir=cd ir  data=

  start DEBUG ir=cd ir  data=exceptions.Exception: invalid ticket formatend 



  Fail
  exceptions.Exception: invalid ticket format
 --------------------------------------------
 If it's working for anyone else, maybe you could post it for me?

Erm, the formatting got fucked up:

--------------------------------------------
Asirra validation failed!


ticket =
start ASIRRAVALIDATION ir=
cd ir  data=

start RESULT ir=1
cd ir 1 data=Fail

cd ir 0 data=

start DEBUG ir=
cd ir  data=exceptions.Exception: invalid ticket format

cd ir 0 data=



XML:

  Fail
  exceptions.Exception: invalid ticket format
--------------------------------------------

Mar 24 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25.03.2012 4:26, Nick Sabalausky wrote:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:jklomm$2seb$1 digitalmars.com...
 Yea, I agree that's what it sounds like. I tried to post a response, but
 I'm just getting this result (and yes, this is with JS enabled):

 --------------------------------------------
 Asirra validation failed!
 ticket = start ASIRRAVALIDATION ir=cd ir  data=

   start DEBUG ir=cd ir  data=exceptions.Exception: invalid ticket formatend



   Fail
   exceptions.Exception: invalid ticket format
 --------------------------------------------
 If it's working for anyone


No luck here as well.
I'm really curious about that guy's actual problem though.
First bet is, of course, non-cached regex or regexp.

-- 
Dmitry Olshansky

Mar 24 2012

"James Blewitt" <jim jblewitt.com> writes:

Hello everyone,

I'm the author of the blog post.

First of all, thanks so much for the interest in my problem.  I 
had no idea that the D community was so active (a fact that 
pleases me greatly).

A quick update.  I've written a small benchmark based on my real 
code and I'm now getting *significantly* better performance from 
my D code.

I'm currently trying to figure out what I'm doing differently in 
my original program.  At this point I am assuming that I have an 
error in my code which causes the D program to do much more work 
than its Ruby counterpart (although I am currently unable to find 
it).

When I know more I will let you know.

James Blewitt

Mar 25 2012

"Jay Norwood" <jayn prismnet.com> writes:

On Sunday, 25 March 2012 at 16:31:40 UTC, James Blewitt wrote:
 I'm currently trying to figure out what I'm doing differently 
 in my original program.  At this point I am assuming that I 
 have an error in my code which causes the D program to do much 
 more work than its Ruby counterpart (although I am currently 
 unable to find it).

 When I know more I will let you know.

 James Blewitt

That was the same type of thing I was seeing with very simple 
regex expressions. The regex was on the order of 30 times slower 
than hand code for finding words in strings.  The ctRegex is on 
the order of 13x slower than hand code.  The times below are from 
parallel processing on 100MB of text files, just finding the word 
boundaries.  I uploaded the tests to 
https://github.com/jnorwood/wc_test
I believe in all these cases the files are being cached by the 
os, since I was able to see the same measurements from a ramdisk 
done with imdisk.  So in these cases the file reads are about 
30ms of the result. The rest is cpu time, finding the words.

  This is with default 7 threads

finished wcp_wcPointer! time: 98 ms
finished wcp_wcCtRegex! time: 1300 ms
finished wcp_wcRegex! time: 2946 ms
finished wcp_wcRegex2! time: 2687 ms
finished wcp_wcSlices! time: 157 ms
finished wcp_wcStdAscii! time: 225 ms


This is processing the same data with 1 thread

finished wcp_wcPointer! time: 188 ms
finished wcp_wcCtRegex! time: 2219 ms
finished wcp_wcRegex! time: 5951 ms
finished wcp_wcRegex2! time: 5502 ms
finished wcp_wcSlices! time: 318 ms
finished wcp_wcStdAscii! time: 446 ms

And this is processing the same data with 13 threads

finished wcp_wcPointer! time: 93 ms
finished wcp_wcCtRegex! time: 1110 ms
finished wcp_wcRegex! time: 2531 ms
finished wcp_wcRegex2! time: 2321 ms
finished wcp_wcSlices! time: 136 ms
finished wcp_wcStdAscii! time: 200 ms

The only change in the program that is uploaded is to add the 
suggested
defaultPoolThreads(13);
at the start of main to change the ThreadPool default thread 
count.
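For readers who have not opened the repository, the comparison being timed is roughly of the following shape. This is a minimal sketch and not the code from wc_test: the function names are invented, a "word" is assumed to be a run of ASCII letters, digits, or underscores, and `matchAll` comes from current Phobos rather than the 2.058 release discussed in this thread.

```d
import std.ascii : isAlphaNum;
import std.range : walkLength;
import std.regex : matchAll, regex;
import std.stdio : writeln;

// Hand-coded scan: one pass over the code units, counting
// transitions from "not in a word" to "in a word".
size_t countWordsByHand(string text)
{
    size_t words = 0;
    bool inWord = false;
    foreach (char c; text)
    {
        immutable isWordChar = isAlphaNum(c) || c == '_';
        if (isWordChar && !inWord)
            ++words;
        inWord = isWordChar;
    }
    return words;
}

// Regex version: the engine also records where each match starts
// and ends, which the hand-coded loop never has to do.
size_t countWordsByRegex(string text)
{
    auto re = regex(`[a-zA-Z0-9_]+`);
    return matchAll(text, re).walkLength;
}

void main()
{
    auto text = "hello, world  foo_bar 42";
    writeln(countWordsByHand(text));
    writeln(countWordsByRegex(text));
}
```

Both functions should agree on the count; only the amount of work done per character differs.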

Mar 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26.03.2012 20:00, Jay Norwood wrote:
 On Sunday, 25 March 2012 at 16:31:40 UTC, James Blewitt wrote:
 I'm currently trying to figure out what I'm doing differently in my
 original program. At this point I am assuming that I have an error in
 my code which causes the D program to do much more work than its Ruby
 counterpart (although I am currently unable to find it).

 When I know more I will let you know.

 James Blewitt

 That was the same type of thing I was seeing with very simple regex
 expressions. The regex was on the order of 30 times slower than hand
 code for finding words in strings.

This is a sad fact of life: a general tool can't beat highly 
specialized code. Ideally it can be on par though. Even in the best 
case ctRegex has to do a lot of things a simple == '\n' doesn't do, like 
storing the boundaries of each match. That's something to keep in mind.

By the way, regex does a fine job on (semi-)fixed strings of length >= 
3-4, often easily beating plain find/indexOf. I haven't tested the 
Boyer-Moore version of find; that should be faster than regex for sure.
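To make the fixed-string case concrete, here are the two approaches side by side. This is a sketch under assumptions: the function names and the pattern `needle` are placeholders, `matchFirst` is the current Phobos API rather than the 2.058 one, and no timing claim is made here; which one wins should be measured on real data.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;
import std.string : indexOf;

// Plain substring search; returns -1 when the needle is absent.
ptrdiff_t findWithIndexOf(string haystack)
{
    return haystack.indexOf("needle");
}

// Regex search for the same fixed string; the length of the text
// before the match is the offset of the match.
ptrdiff_t findWithRegex(string haystack)
{
    auto re = regex(`needle`);
    auto m = matchFirst(haystack, re);
    return m.empty ? -1 : cast(ptrdiff_t) m.pre.length;
}

void main()
{
    auto haystack = "hay hay needle hay";
    writeln(findWithIndexOf(haystack));
    writeln(findWithRegex(haystack));
}
```

In a real benchmark the `regex(...)` call would be hoisted out of the function so that it is compiled only once.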

 The ctRegex is on the order of 13x
 slower than hand code. The times below are from parallel processing on
 100MB of text files, just finding the word boundaries. I uploaded the
 tests to https://github.com/jnorwood/wc_test
 I believe in all these cases the files are being cached by the os, since
 I was able to see the same measurements from a ramdisk done with imdisk.
 So in these cases the file reads are about 30ms of the result. The rest
 is cpu time, finding the words.

 This is with default 7 threads

 finished wcp_wcPointer! time: 98 ms
 finished wcp_wcCtRegex! time: 1300 ms
 finished wcp_wcRegex! time: 2946 ms
 finished wcp_wcRegex2! time: 2687 ms
 finished wcp_wcSlices! time: 157 ms
 finished wcp_wcStdAscii! time: 225 ms


 This is processing the same data with 1 thread

 finished wcp_wcPointer! time: 188 ms
 finished wcp_wcCtRegex! time: 2219 ms
 finished wcp_wcRegex! time: 5951 ms
 finished wcp_wcRegex2! time: 5502 ms
 finished wcp_wcSlices! time: 318 ms
 finished wcp_wcStdAscii! time: 446 ms

 And this is processing the same data with 13 threads

 finished wcp_wcPointer! time: 93 ms
 finished wcp_wcCtRegex! time: 1110 ms
 finished wcp_wcRegex! time: 2531 ms
 finished wcp_wcRegex2! time: 2321 ms
 finished wcp_wcSlices! time: 136 ms
 finished wcp_wcStdAscii! time: 200 ms

 The only change in the program that is uploaded is to add the suggested
 defaultPoolThreads(13);
 at the start of main to change the ThreadPool default thread count.


-- 
Dmitry Olshansky

Mar 26 2012

"James Blewitt" <jim jblewitt.com> writes:

Hello everybody,

Thanks once again for the interest in my problem.  I have posted 
the details and source code that recreates (at least for me) the 
poor performance.
I didn't know how to post the code to the forum, so I posted it 
to my blog instead (see post update):

http://jblewitt.com/blog/?p=462

Again, if I'm doing something stupid in my code (which is 
possible) then I apologise in advance.

I'll take a look at the ctRegex as soon as I can.

Regards,
James

Mar 26 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 27.03.2012 0:27, James Blewitt wrote:
 Hello everybody,

 Thanks once again for the interest in my problem. I have posted the
 details and source code that recreates (at least for me) the poor
 performance.
 I didn't know how to post the code to the forum, so I posted it to my
 blog instead (see post update):

 http://jblewitt.com/blog/?p=462

 Again, if I'm doing something stupid in my code (which is possible) then
 I apologise in advance.

No need to apologize, but you are using 2.054, which is unfashionable :) 
More importantly, 2.054 contains an old and rusty version of std.regex; the 
new version was included in 2.057+.
BTW, the current release is 2.058.

 I'll take a look at the ctRegex as soon as I can.

Yup, just update compiler+phobos.

 Regards,
 James


-- 
Dmitry Olshansky

Mar 26 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/26/2012 02:41 PM, Dmitry Olshansky wrote:
 On 27.03.2012 0:27, James Blewitt wrote:
 Hello everybody,

 Thanks once again for the interest in my problem. I have posted the
 details and source code that recreates (at least for me) the poor
 performance.
 I didn't know how to post the code to the forum, so I posted it to my
 blog instead (see post update):

 http://jblewitt.com/blog/?p=462

 Again, if I'm doing something stupid in my code (which is possible) then
 I apologise in advance.

 No need to apologize, but you are using 2.054, which is unfashionable :)
 More importantly, 2.054 contains an old and rusty version of std.regex; the
 new version was included in 2.057+.
 BTW, the current release is 2.058.

 I'll take a look at the ctRegex as soon as I can.

 Yup, just update compiler+phobos.

 Regards,
 James


My unofficial results comparing 2.056 to 2.058 on 64 bits:

shakespeare.txt, 2.056 -> 1868 msecs
shakespeare.txt, 2.058 ->  632 msecs

data.csv, 2.056 -> 51953 msecs
data.csv, 2.058 ->  1329 msecs

That last line is pretty impressive. :)

Ali

Mar 26 2012

James Miller <james aatch.net> writes:

On 27 March 2012 11:05, Ali Çehreli <acehreli yahoo.com> wrote:
 My unofficial results comparing 2.056 to 2.058 on 64 bits:

 shakespeare.txt, 2.056 -> 1868 msecs
 shakespeare.txt, 2.058 ->  632 msecs

 data.csv, 2.056 -> 51953 msecs
 data.csv, 2.058 ->  1329 msecs

 That last line is pretty impressive. :)

Dmitry did impressive work over those few versions of Phobos/DMD. The
performance is even more impressive when you consider that std.regex
supports things like named matching and lookbehind, which often slow
down a regex (and also, technically, kind of remove the "regular" from
the name "regular expression").

--
James Miller

Mar 26 2012

"Jesse Phillips" <Jessekphillips+D gmail.com> writes:

On Monday, 26 March 2012 at 22:05:34 UTC, Ali Çehreli wrote:
 My unofficial results comparing 2.056 to 2.058 on 64 bits:

 shakespeare.txt, 2.056 -> 1868 msecs
 shakespeare.txt, 2.058 ->  632 msecs

 data.csv, 2.056 -> 51953 msecs
 data.csv, 2.058 ->  1329 msecs

 That last line is pretty impressive. :)

 Ali

Unofficial 2.056/2.058/Ruby 1.9.3 Windows 32bit data.csv:



data.csv, 2.056 -> 76351 msecs
data.csv, 2.058 ->  2573 msecs
data.csv, 1.9.3 ->  9170 msecs

Also I had to modify line 48 of the ruby file not knowing what 
I'm doing:



Couldn't build it with ctRegex (Some Error, then ran out of 
memory).

Mar 26 2012

"James Blewitt" <jim jblewitt.com> writes:

Great!

Thanks for the support everyone.  What a performance jump between 
v2.054 and v2.058!

James

Mar 27 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/27/12 1:57 AM, James Blewitt wrote:
 Great!

 Thanks for the support everyone. What a performance jump between v2.054
 and v2.058!

 James

Hi James -- you may want to link this discussion from your blog.

Cheers,

Andrei

Mar 27 2012
