digitalmars.D - std.regex performance


"Jesse Phillips" <jessekphillips+D gmail.com> writes:

I've finally moved to the new regex for some real code. I'm seeing 
a major change in performance when checking if a large number of 
words contain a digit.

The english.dic file contains 134,950 entries

With
2.056: 0.22sec
2.058: 7.65sec

I don't expect a fix for this to make it into 2.058, as the 
regression likely dates from 2.057.

--------
import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
   auto name = "english.dic";
   foreach(w; std.file.readText(name).toLower.splitLines)
      model[w] += 1;

   foreach(w; std.string.split(readText(name)))
      if(!match(w, regex(r"\d")).empty)
      {}
}

Feb 08 2012

David Nadlinger <see klickverbot.at> writes:

On 2/8/12 10:44 PM, Jesse Phillips wrote:
> foreach(w; std.string.split(readText(name)))
>    if(!match(w, regex(r"\d")).empty)
>    {}
> }

Could it be that you are rebuilding the regex engine on every iteration 
here?

David

Feb 08 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 2/8/12, David Nadlinger <see klickverbot.at> wrote:
> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>> foreach(w; std.string.split(readText(name)))
>>    if(!match(w, regex(r"\d")).empty)
>>    {}
>> }
>
> Could it be that you are rebuilding the regex engine on every iteration
> here?

Yup. This one is a steady 80 msecs (ctRegex seems to be a bit slower
in this case for some reason):

import std.file;
import std.datetime;
import std.regex;
import std.stdio;
import std.string; // needed for the fully qualified std.string.split below

private int[string] model;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto rg = regex(r"\d");
    foreach (w; std.string.split(readText("uk.dic")))
        if (!match(w, rg).empty) { }
    writeln(sw.peek.msecs);
}
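
For comparison, here is a minimal sketch of the ctRegex variant mentioned above. The helper name containsDigit is made up for illustration; only ctRegex and match come from std.regex. Whether it beats the run-time engine depends on the compiler version and the pattern, as the timing remark above suggests.

```d
import std.regex;

// Hypothetical helper (not from the thread): true if `w` contains a digit.
bool containsDigit(string w)
{
    // ctRegex processes the pattern at compile time, so no engine
    // has to be constructed from the pattern string at run time.
    auto rg = ctRegex!(r"\d");
    return !match(w, rg).empty;
}
```

It would be called once per word inside the loop, in place of the match(w, rg) test.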

Feb 08 2012

bearophile <bearophileHUGS lycos.com> writes:

David Nadlinger:

> Could it be that you are rebuilding the regex engine on every iteration 
> here?

This isn't the first time I've seen this problem.
CPython caches the RE engines to avoid this problem, so after the first time
you use it, it doesn't create the same engine again and again.
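
A rough sketch of what such a cache could look like on the D side, assuming a plain associative array keyed by the pattern string. cachedRegex is a hypothetical helper, not something std.regex provides, and it ignores regex flags for brevity.

```d
import std.regex;

// Hypothetical helper, not part of std.regex: returns a compiled engine
// for `pattern`, building it only the first time the pattern is seen.
Regex!char cachedRegex(string pattern)
{
    // Function-local static, so each thread gets its own cache.
    static Regex!char[string] cache;

    if (auto hit = pattern in cache)
        return *hit;            // already built, reuse it

    auto re = regex(pattern);   // the expensive step, done once per pattern
    cache[pattern] = re;
    return re;
}
```

With this, match(w, cachedRegex(r"\d")) inside a loop pays for an associative array lookup per iteration instead of a full engine build.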

Another solution is to change the API, and use a long and ugly function name
like buildREengine() that makes it clear to the user that it's better to pull
it out of loops :-)

Bye,
bearophile

Feb 08 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/08/2012 11:46 PM, bearophile wrote:
> David Nadlinger:
>
>> Could it be that you are rebuilding the regex engine on every iteration
>> here?
>
> This isn't the first time I've seen this problem.
> CPython caches the RE engines to avoid this problem, so after the first time
> you use it, it doesn't create the same engine again and again.
>
> Another solution is to change the API, and use a long and ugly function name
> like buildREengine() that makes it clear to the user that it's better to pull
> it out of loops :-)
>
> Bye,
> bearophile

The compiler should do loop invariant code motion for pure functions.
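
To illustrate the idea, here is a small sketch of the transformation by hand. All three function names are invented for the example, and expensiveSetup merely stands in for a costly pure call such as building an engine. Whether a given compiler actually performs this hoisting is not something the sketch demonstrates.

```d
// Hypothetical stand-in for an expensive, side-effect-free computation.
int expensiveSetup(int seed) pure
{
    int acc = seed;
    foreach (i; 0 .. 1000)
        acc = (acc * 31 + i) % 1009;
    return acc;
}

// Naive form: the call's argument never changes, yet it runs every iteration.
int sumNaive(const int[] data, int seed) pure
{
    int total = 0;
    foreach (x; data)
        total += x + expensiveSetup(seed);
    return total;
}

// Hoisted form: the result loop-invariant code motion would aim for.
int sumHoisted(const int[] data, int seed) pure
{
    immutable k = expensiveSetup(seed); // computed once, before the loop
    int total = 0;
    foreach (x; data)
        total += x + k;
    return total;
}
```

Because expensiveSetup is pure and its argument is invariant, both forms return the same value; only the amount of repeated work differs.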

Feb 08 2012

bearophile <bearophileHUGS lycos.com> writes:

Timon Gehr:

> The compiler should do loop invariant code motion for pure functions.

Right, that too. But in DMD 2.058alpha this doesn't compile, regex() isn't pure:


import std.regex;
void main() pure {
    auto re = regex(r"\d");
}

Bye,
bearophile

Feb 08 2012

"Jesse Phillips" <jessekphillips+D gmail.com> writes:

On Wednesday, 8 February 2012 at 22:21:35 UTC, David Nadlinger 
wrote:
> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>> foreach(w; std.string.split(readText(name)))
>>    if(!match(w, regex(r"\d")).empty)
>>    {}
>> }
>
> Could it be that you are rebuilding the regex engine on every 
> iteration here?
>
> David

That is the case. The older regex apparently cached the last 
regex. I will be more careful in the future.

Feb 08 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 09.02.2012 3:35, Jesse Phillips wrote:
> On Wednesday, 8 February 2012 at 22:21:35 UTC, David Nadlinger wrote:
>> On 2/8/12 10:44 PM, Jesse Phillips wrote:
>>> foreach(w; std.string.split(readText(name)))
>>>    if(!match(w, regex(r"\d")).empty)
>>>    {}
>>> }
>>
>> Could it be that you are rebuilding the regex engine on every
>> iteration here?
>>
>> David
>
> That is the case. The older regex apparently cached the last regex. I will
> be more careful in the future.

I suggest filing this as an enhancement request, as the new std.regex 
should have been backwards compatible.

-- 
Dmitry Olshansky

Feb 09 2012

"Jesse Phillips" <jessekphillips+D gmail.com> writes:

> I suggest filing this as an enhancement request, as the new 
> std.regex should have been backwards compatible.

http://d.puremagic.com/issues/show_bug.cgi?id=7471

I redid the timings with mingw using time, and I find this strange:

$ time ./test2.058.exe

real    0m55.500s
user    0m0.031s
sys     0m0.000s

If I am reading the time output correctly, doesn't that mean the 
computer spends almost a minute not running my program, maybe doing IO?

2.056 shows the same pattern, and actually takes longer in user time:

real    0m0.860s
user    0m0.047s

Feb 09 2012

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Here are some runs on win32 with -release -O -inline (I just generated
134,950 duplicate words):

2.054: 82 msecs (467KB exe)
2.055: 77 msecs (505KB exe)
2.056: 84 msecs (1095KB exe)
2.057: 3380 msecs (1179KB exe)
2.058: 3373 msecs (630KB exe)

Compile times are different too (timeit dmd -release -O -inline test.d):
2.054: Elapsed Time:     0:00:01.281
2.055: Elapsed Time:     0:00:01.500
2.056: same
2.057: Elapsed Time:     0:00:06.296
2.058: Elapsed Time:     0:00:08.093

Feb 08 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Wed, 08 Feb 2012 22:44:25 +0100, Jesse Phillips  
<jessekphillips+D gmail.com> wrote:

> I've finally moved to the new regex for some real code. I'm seeing a  
> major change in performance when checking if a large number of words  
> contain a digit.
>
> The english.dic file contains 134,950 entries
>
> With
> 2.056: 0.22sec
> 2.058: 7.65sec
>
> I don't expect a fix for this to make it into 2.058, as the  
> regression likely dates from 2.057.
>
> --------
> import std.file;
> import std.string;
> import std.datetime;
> import std.regex;
>
> private int[string] model;
>
> void main() {
>    auto name = "english.dic";
>    foreach(w; std.file.readText(name).toLower.splitLines)
>       model[w] += 1;
>
>    foreach(w; std.string.split(readText(name)))
>       if(!match(w, regex(r"\d")).empty)
>       {}
> }

There are some more performance issues.
D has a nice built-in profiler (dmd -profile) to find such issues.

----------
import std.algorithm, std.stdio, std.string, std.path, std.regex;

private int[string] model;

int main(string[] args)
{
     if (args.length != 2)
     {
         std.stdio.stderr.writefln("usage: %s <file>",
             std.path.baseName(args[0]));
         return 1;
     }

     auto re = std.regex.regex(r"\d");
     foreach(line; std.stdio.File(args[1], "r").byLine())
     {
         // Bug 6791: splitter is UTF-8 unsafe
         foreach(w; std.algorithm.splitter(line))
         {
             if(!std.regex.match(w, re).empty)
             {
             }
         }

         std.string.toLowerInPlace(line);
         model[line.idup] += 1;
     }

     return 0;
}

Feb 08 2012
