digitalmars.D.bugs - [Issue 7471] New: Improve performance of std.regex
http://d.puremagic.com/issues/show_bug.cgi?id=7471
Summary: Improve performance of std.regex
Product: D
Version: D2
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Phobos
AssignedTo: nobody puremagic.com
ReportedBy: Jesse.K.Phillips+D gmail.com
09:27:58 PST ---
The previous implementation is said to do some caching of the last used engine.
english.dic is 134,950 entries for these timings.
Test code
----------
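import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
    auto name = "english.dic";

    // Build a word-frequency table from the lower-cased dictionary lines.
    foreach(w; std.file.readText(name).toLower.splitLines)
        model[w] += 1;

    // Deliberately compile both regexes on every iteration, so that a
    // single-entry engine cache cannot help.
    foreach(w; std.string.split(readText(name)))
        if(!match(w, regex(r"\d")).empty)
        {}
        else if(!match(w, regex(r"\W")).empty)
        {}
}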
-------
I'm trying to avoid the caching here, but still see better performance in
2.056. These timings were taken with mingw on Windows. I find it odd that
user time is fast while real time is the slow part; does mingw have access
to the proper information?
$ time ./test2.056.exe
real 0m0.860s
user 0m0.047s
sys 0m0.000s
$ time ./test2.058.exe
real 0m55.500s
user 0m0.031s
sys 0m0.000s
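
Since the user/real split reported by the shell looks unreliable here, one way to cross-check is to measure wall-clock time inside the program. The following is only a sketch: it assumes a recent Phobos that provides `std.datetime.stopwatch` (the 2012-era releases discussed in this report exposed `StopWatch` directly from `std.datetime`), and `countDigitWords` and the three-word list are hypothetical stand-ins for the real test and dictionary.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex;
import std.stdio;

// Hypothetical helper: counts the words that contain a digit.
// The regex is compiled inside the loop on purpose, as in the test above.
size_t countDigitWords(string[] words)
{
    size_t hits;
    foreach (w; words)
        if (!match(w, regex(r"\d")).empty)
            ++hits;
    return hits;
}

void main()
{
    // Placeholder input; the report used the contents of english.dic.
    auto words = ["apple", "route66", "can't"];

    auto sw = StopWatch(AutoStart.yes); // starts timing immediately
    auto hits = countDigitWords(words);
    sw.stop();

    writeln(hits, " word(s) with a digit, ",
            sw.peek.total!"msecs", " ms wall-clock");
}
```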
--
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 09 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
Dmitry Olshansky <dmitry.olsh gmail.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |dmitry.olsh gmail.com
11:14:52 PST ---
I'm willing to investigate the issue. Can you attach english.dic file?
Feb 24 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
dawg dawgfoto.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |dawg dawgfoto.de
You are compiling two different regexes. So a single entry cache will only
solve part of your problem.
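
To illustrate the point about cache size, here is a minimal sketch of a cache keyed by pattern string, so that alternating between two patterns still compiles each of them only once. This is not how std.regex caches internally; `cachedRegex`, `Re` and `cache` are hypothetical names introduced for the example.

```d
import std.regex;
import std.stdio;

alias Re = Regex!char;

// Hypothetical cache: one compiled regex per distinct pattern string.
// Module-level variables are thread-local in D, so no locking is needed here.
Re[string] cache;

Re cachedRegex(string pattern)
{
    if (auto p = pattern in cache)
        return *p;              // already compiled: reuse it
    auto re = regex(pattern);   // compile on first use only
    cache[pattern] = re;
    return re;
}

void main()
{
    foreach (w; ["apple", "route66", "can't"])
    {
        if (!match(w, cachedRegex(r"\d")).empty)
            writeln(w, ": contains a digit");
        else if (!match(w, cachedRegex(r"\W")).empty)
            writeln(w, ": contains a non-word character");
    }
    writeln("patterns compiled: ", cache.length);
}
```

The lookup itself still costs something on every iteration, so this only reduces the compilation cost; it does not remove the per-iteration overhead.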
Feb 24 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
Jesse Phillips <Jesse.K.Phillips+D gmail.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |Jesse.K.Phillips+D gmail.com
18:02:05 PST ---
The exact file isn't important, and I can't get it right now, but you could
grab a similar one from http://www.winedt.org/Dict/
I realize that the example given avoids the benefit of single-entry caching,
but 2.056 still performs better on it, and that is probably worth working
towards.
Feb 24 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
02:22:02 PST ---
Profiling shows that about 99% of the time is spent in the GC, ouch.
What's at work here is that the new regex engine is more costly to create and
allocates a bunch of structures on the heap. The biggest of them, e.g. the
Tries, are cached, but the others are not.
I think I'll spend some time on introducing more caching and probably seek out
some GC-unfriendly stuff in the parser.
Still, I should point out that \d and \W in the new engine are Unicode-aware
and correspond to MUCH broader character classes than in the previous engine
(that belongs in the ddocs somewhere).
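
A small probe can make the character-class remark concrete. This is a sketch only: `hasMatch` is a hypothetical helper, and the program prints what the installed engine does rather than assuming a result for `\d`. The explicit `[0-9]` class is shown as the ASCII-only alternative.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: true if `pattern` matches anywhere in `input`.
bool hasMatch(string input, string pattern)
{
    return !match(input, regex(pattern)).empty;
}

void main()
{
    // U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside ASCII.
    string arabicThree = "\u0663";

    writeln(`\d    on "3":    `, hasMatch("3", r"\d"));
    writeln(`[0-9] on "3":    `, hasMatch("3", "[0-9]"));
    writeln(`\d    on U+0663: `, hasMatch(arabicThree, r"\d"));
    writeln(`[0-9] on U+0663: `, hasMatch(arabicThree, "[0-9]"));
}
```

If only ASCII digits are of interest, spelling the class out as `[0-9]` avoids depending on the engine's Unicode tables.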
Feb 26 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
06:32:30 PST ---
Anyway, how do 2.056 and 2.058 compare when you don't create regex objects
inside a tight loop?
It is a strange thing to do under any circumstances: even with N-slot caching
you pay some extra on each iteration to look up and copy out the compiled
regex you need.
I'm dreaming that one day the compiler can just see it's a loop invariant
and move it out for you. Hm.. that could happen sometime soon: if 'regex' is
pure and its result is immutable, the compiler would have the guarantees it
needs to go ahead and optimize.
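
The manual version of that loop-invariant hoisting might look like the sketch below. `classify` and the sample words are hypothetical; the only point is that both patterns are compiled once, before the loop, rather than on every iteration.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: counts words containing a digit ([0]) and, among the
// rest, words containing a non-word character ([1]).
size_t[2] classify(string[] words)
{
    // Compiled once, outside the loop.
    auto digit = regex(r"\d");
    auto nonWord = regex(r"\W");

    size_t[2] counts;
    foreach (w; words)
    {
        if (!match(w, digit).empty)
            ++counts[0];
        else if (!match(w, nonWord).empty)
            ++counts[1];
    }
    return counts;
}

void main()
{
    auto counts = classify(["apple", "route66", "can't"]);
    writeln("with digit: ", counts[0],
            ", with non-word char: ", counts[1]);
}
```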
Feb 26 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471
18:03:06 PST ---
After moving the regex outside the loop, along with (I think) some other
changes, performance improved immensely. Declaring them as module variables
didn't seem to gain anything more.
I didn't have time to play with it much further; it was acceptable, though I
hope to do more with regex and just need to watch out for tight loops.
Feb 26 2012








