www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - FReD is ready for review

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
FReD ( Fast Regular expressions for D) is a GSOC project proposed as 
source level compatible replacement for std.regex. Needless to say it is 
also superior in many ways ;)

Source + docs + goodies packaged here:
https://github.com/downloads/blackwhale/FReD/FReD.zip

Goodies include popular regex-dna benchmark, by my measures we should be 
at least #3 among single threaded solutions. Though I used 64bit VM 
linux box, on win32 sadly it runs out of memory on the same machine 
(false pointers?).

Speaking of source there are few artifacts that should end up in std.uni 
and not in std.regex:
CodepointSet and CodepointTrie, and unicode property tables.
I'd rather have them all accessible to user, but their interface is 
admittedly clunky, I'm open to ideas on better API.

Notable caveats:
	- backreferences allowed only to locally unique groups, e.g. 
\b(\w+)\b\1 is allowed, same w/o any of \b - not.
	- replace with delegeate now takes Captures!R where R is e.g. string. 
This is due to having few types of engine and RegexMatch!X respectively. 
Previous signature allowed a certain measure of abuse e.g. you could 
have done a couple of popFronts on it matcher(!) in this delegate. 
However in most cases the code inside can be left as is.

Otherwise it should work with existing code after replacing imports.

P.S. I'm taking a seat somewhere in review queue, and busying myself 
with newspaper :)

-- 
Dmitry Olshansky
Sep 13 2011
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
Hi, thanks for your hard work!

On Wed, 14 Sep 2011 00:04:08 +0300, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 https://github.com/downloads/blackwhale/FReD/FReD.zip

Doesn't seem to compile on the latest DMD, due to switch case fallthrough.
 Goodies include popular regex-dna benchmark, by my measures we should be  
 at least #3 among single threaded solutions.

This is awesome, but if it's not a secret, who are the first two, and how far behind is FReD? Does this regard regular or compile-time regex?
 Though I used 64bit VM linux box, on win32 sadly it runs out of memory  
 on the same machine (false pointers?).

If I'll find time to port my memory debugger to D2 before someone else figures it out, I'll have a look at this. -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.09.2011 1:18, Vladimir Panteleev wrote:
 Hi, thanks for your hard work!

 On Wed, 14 Sep 2011 00:04:08 +0300, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 https://github.com/downloads/blackwhale/FReD/FReD.zip

Doesn't seem to compile on the latest DMD, due to switch case fallthrough.

Arrrrgh! On serious note it is fixed in some recent pull request, but that was hopelessly late for getting in this release.
 Goodies include popular regex-dna benchmark, by my measures we should
 be at least #3 among single threaded solutions.

This is awesome, but if it's not a secret, who are the first two, and how far behind is FReD? Does this regard regular or compile-time regex?

Namely RE2 and irregex. On my machine RE2 is ~5 sec while FReD does the job in ~7sec. As for irregex, ... uhm-oh I haven't actually tested it (failed to just compile) but it beats RE2 in this test by some ~30%. I'm judging by this page, RE2 I mentioned is "C++ GNU g++ #2" but with OpenMP declarations stripped: http://shootout.alioth.debian.org/u32/performance.php?test=regexdna Damn looks like we could be #4... these things change too fast :( Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.
 Though I used 64bit VM linux box, on win32 sadly it runs out of memory
 on the same machine (false pointers?).

If I'll find time to port my memory debugger to D2 before someone else figures it out, I'll have a look at this.

allocating large buffers in replace loop. Probably a better solution to this is alternative replace function that doesn't allocate on each call but uses a OutputRange. -- Dmitry Olshansky
Sep 13 2011
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.09.2011 1:42, Vladimir Panteleev wrote:
 On Wed, 14 Sep 2011 00:37:28 +0300, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 Anyway, the rest of competition is not even close, and I still haven't
 spent all of my tricks.

That's really cool... I noticed that d_dna.d doesn't use regexCt, would that make any difference or am I overestimating the advantage of compile-time codegen?

In this particular case it may very well much faster actually. The thing about this benchmark is that it exposes all the nasty constant overhead there is in the engine. On the other hand C-T regex doesn't have most of "what do we do next" overhead so it should be faster. However when it comes to some serious business like matching charsets and doing ambiguous control flow it starts losing to more superior design of the default engine. -- Dmitry Olshansky
Sep 13 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.09.2011 1:55, Vladimir Panteleev wrote:
 On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 However when it comes to some serious business like matching charsets
 and doing ambiguous control flow it starts losing to more superior
 design of the default engine.

Ah, okay. I thought compile-time regex was the same as the default engine, but without the runtime parsing and bytecode interpretation(?).

Let me explain a bit. The default as chosen is fairly advanced NFA engine which features: - never backtracks and consequently decodes each UTF8/16/32 character only once; - as side effect doesn't support fully combinatorial backreferences (why would someone use 'em anyway); - guaranteed O(N*M) time on input with M being "size" of pattern; This also opens up matching directly on buffered streams... but back to the point. There is also a slower traditional backtracking engine, it still faster then many other engines out there but it ebbs out on some patterns like .*.*.*Z in text with no 'Z', precisely because of combinatorial nature of backtracking. Then C-T is built on top of backtracking, just because it was much easier this way :) It's indeed replacing bytecode interpretations with hardcoded code which is exceptionally fast, but the prime time consumer in real patterns is usually this memoize/return. C-T regex is still a bit experimental though and going to be improved over time. Last but not least there is also completely dumb (as in dead simple and fast) kickstart engine that does all of heavy lifting iff pattern starts with something more or less constant (e.g. it does 50% of job in regex-dna). And you can just parse at compile time and use default run-time engine for matching: enum r = regex("whatever"); -- Dmitry Olshansky
Sep 13 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/13/11 4:37 PM, Dmitry Olshansky wrote:
 OK in this particular benchmark, we are slighlty behind Google inc. :)
 Namely RE2 and irregex.
 On my machine RE2 is ~5 sec while FReD does the job in ~7sec.
 As for irregex, ... uhm-oh I haven't actually tested it (failed to just
 compile) but it beats RE2 in this test by some ~30%.

 I'm judging by this page, RE2 I mentioned is "C++ GNU g++ #2" but with
 OpenMP declarations stripped:
 http://shootout.alioth.debian.org/u32/performance.php?test=regexdna

 Damn looks like we could be #4... these things change too fast :(
 Anyway, the rest of competition is not even close, and I still haven't
 spent all of my tricks.

Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%. As regexen have a fairly standardized interface, speed is the most impactful differentiatior people look at. Andrei
Sep 13 2011
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.09.2011 2:44, Andrei Alexandrescu wrote:
 On 9/13/11 4:37 PM, Dmitry Olshansky wrote:
 OK in this particular benchmark, we are slighlty behind Google inc. :)
 Namely RE2 and irregex.
 On my machine RE2 is ~5 sec while FReD does the job in ~7sec.
 As for irregex, ... uhm-oh I haven't actually tested it (failed to just
 compile) but it beats RE2 in this test by some ~30%.

 I'm judging by this page, RE2 I mentioned is "C++ GNU g++ #2" but with
 OpenMP declarations stripped:
 http://shootout.alioth.debian.org/u32/performance.php?test=regexdna

 Damn looks like we could be #4... these things change too fast :(
 Anyway, the rest of competition is not even close, and I still haven't
 spent all of my tricks.

Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%.

Yes, sir! :) As regexen have a fairly standardized interface, speed is the
 most impactful differentiatior people look at.

Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory. --- Dmitry Olshansky
Sep 14 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/14/11 7:51 AM, Dmitry Olshansky wrote:
 On 14.09.2011 2:44, Andrei Alexandrescu wrote:
 On 9/13/11 4:37 PM, Dmitry Olshansky wrote:
 OK in this particular benchmark, we are slighlty behind Google inc. :)
 Namely RE2 and irregex.
 On my machine RE2 is ~5 sec while FReD does the job in ~7sec.
 As for irregex, ... uhm-oh I haven't actually tested it (failed to just
 compile) but it beats RE2 in this test by some ~30%.

 I'm judging by this page, RE2 I mentioned is "C++ GNU g++ #2" but with
 OpenMP declarations stripped:
 http://shootout.alioth.debian.org/u32/performance.php?test=regexdna

 Damn looks like we could be #4... these things change too fast :(
 Anyway, the rest of competition is not even close, and I still haven't
 spent all of my tricks.

Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%.

Yes, sir! :) As regexen have a fairly standardized interface, speed is the
 most impactful differentiatior people look at.

Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory. --- Dmitry Olshansky

Great, thanks. Could you please email me privately some more information about the benchmark and your test code? I'm considering using this in my upcoming talk at the Strange Loop conference. Thanks again, Andrei Andrei
Sep 14 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Dmitry Olshansky:

 Goodies include popular regex-dna benchmark, by my measures we should be 
 at least #3 among single threaded solutions.

This is rather good. The #1 is of JavaScript-V8 that has received a significant amount of engineering efforts.
 Though I used 64bit VM 
 linux box, on win32 sadly it runs out of memory on the same machine 
 (false pointers?).

This is not so good. Even if this problem is not solved soon, it will be good to know its real cause.
 P.S. I'm taking a seat somewhere in review queue, and busying myself 
 with newspaper :)

*Gives some fine white-golden tea* Bye, bearophile
Sep 13 2011
parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 13.09.2011, 23:22 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:

 Dmitry Olshansky:

 Goodies include popular regex-dna benchmark, by my measures we should be
 at least #3 among single threaded solutions.

This is rather good. The #1 is of JavaScript-V8 that has received a significant amount of engineering efforts.

I heard a Google engineer talk about that. In JavaScript regular expressions are often used for a wider range of operations on strings (e.g. splitting), since they were generally faster than doing the same in pure JavaScript. It is one of those few occasions where a JS developer can rely on native code in a library doing most of the work.
Sep 13 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wed, 14 Sep 2011 00:18:03 +0300, Vladimir Panteleev  
<vladimir thecybershadow.net> wrote:

 Doesn't seem to compile on the latest DMD, due to switch case  
 fallthrough.

Sorry, I had -wi enabled in my build environment. (Why would -wi cause fatal errors, though? Sounds like a bug.) -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 13, 2011 14:25 Vladimir Panteleev wrote:
 On Wed, 14 Sep 2011 00:18:03 +0300, Vladimir Panteleev
 
 <vladimir thecybershadow.net> wrote:
 Doesn't seem to compile on the latest DMD, due to switch case
 fallthrough.

Sorry, I had -wi enabled in my build environment. (Why would -wi cause fatal errors, though? Sounds like a bug.)

-wi shouldn't cause fatal errors, but anything going into Phobos needs to build with -w. So, if the proposed changes don't build with -w, that needs to be fixed. - Jonathan M Davis
Sep 13 2011
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:
 Speaking of source there are few artifacts that should end up in std.uni
 and not in std.regex:
 CodepointSet and CodepointTrie, and unicode property tables.
 I'd rather have them all accessible to user, but their interface is
 admittedly clunky, I'm open to ideas on better API.

I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex. If the unicode stuff is entirely an implementation detail of std.regex and is not in the API, then we can review the unicode stuff separately and make std.regex use the stuff in std.uni once it's in there, but if you're doing major unicode stuff, I think that that's significant enough to merit its own review. - Jonathan M Davis
Sep 13 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 14.09.2011 1:36, Jonathan M Davis wrote:
 On Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:
 Speaking of source there are few artifacts that should end up in std.uni
 and not in std.regex:
 CodepointSet and CodepointTrie, and unicode property tables.
 I'd rather have them all accessible to user, but their interface is
 admittedly clunky, I'm open to ideas on better API.

I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex.

Particularly one can't get reasonable performance w/o unicode incorporated as an implementation detail yet I feel there should be a more general interface to it. Otherwise everyone willing to tackle unicode would have to, for instance, duplicate all of unicode property tables, and that's some 100Kb we are looking at.
 If the unicode stuff is entirely an implementation detail of std.regex and is
 not in the API, then we can review the unicode stuff separately and make
 std.regex use the stuff in std.uni once it's in there, but if you're doing
 major unicode stuff, I think that that's significant enough to merit its own
 review.

It could be done in two steps, if I was to choose which way to go I'd first folded in std.regex then moved all its hidden unicode stuff to std.uni. It's mostly about character classification and about doing it fast. -- Dmitry Olshansky
Sep 13 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wed, 14 Sep 2011 00:37:28 +0300, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 Anyway, the rest of competition is not even close, and I still haven't  
 spent all of my tricks.

That's really cool... I noticed that d_dna.d doesn't use regexCt, would that make any difference or am I overestimating the advantage of compile-time codegen? -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 However when it comes to some serious business like matching charsets  
 and doing ambiguous control flow it starts losing to more superior  
 design of the default engine.

Ah, okay. I thought compile-time regex was the same as the default engine, but without the runtime parsing and bytecode interpretation(?). -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 13, 2011 14:52 Dmitry Olshansky wrote:
 On 14.09.2011 1:36, Jonathan M Davis wrote:
 On Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:
 Speaking of source there are few artifacts that should end up in std.uni
 and not in std.regex:
 CodepointSet and CodepointTrie, and unicode property tables.
 I'd rather have them all accessible to user, but their interface is
 admittedly clunky, I'm open to ideas on better API.

I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex.

It doesn't affect API at all, but an important implementation detail. Particularly one can't get reasonable performance w/o unicode incorporated as an implementation detail yet I feel there should be a more general interface to it. Otherwise everyone willing to tackle unicode would have to, for instance, duplicate all of unicode property tables, and that's some 100Kb we are looking at.
 If the unicode stuff is entirely an implementation detail of std.regex
 and is not in the API, then we can review the unicode stuff separately
 and make std.regex use the stuff in std.uni once it's in there, but if
 you're doing major unicode stuff, I think that that's significant enough
 to merit its own review.

It could be done in two steps, if I was to choose which way to go I'd first folded in std.regex then moved all its hidden unicode stuff to std.uni. It's mostly about character classification and about doing it fast.

As long as it doesn't affect the API, then we can review std.regex first, but we do definitely want to get any major unicode improvements into std.uni. And that will probably merit its own review. - Jonathan M Davis
Sep 13 2011
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 Well in this particular benchmark there is little I can do, some small  
 scale optimizations and such, I expect no more then 10%.
 The bigger picture is that our best bet here is static regex, the  
 competitors use some aggressive JIT so it's a fair play. I actually  
 received pull request this morning do just that  (thx, Martin) though  
 dmd uses so much memory it's not for mortals yet:

 I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9
 + 2 Gb of swap - 6 out of 9
 Come on, it's a sport!
 + 4 Gb of swap and after few restless minutes I got it working

 Fresh timings:
 FReD C-T  3.4 sec
 FReD R-T  6.6 sec
 RE2       4.8 sec
 V8 JS     3.7 sec

 Look ma, we are on top! :)

 So, by the end of day, D does kick ass, even though it's a bit of  
 Pyrrhic victory.

Even so, that is awesome. Another point to D for its metaprogramming. -- Simen
Sep 14 2011
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 14.09.2011, 17:10 Uhr, schrieb Simen Kjaeraas <simen.kjaras gmail.com>:

 On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky  
 <dmitry.olsh gmail.com> wrote:

 Well in this particular benchmark there is little I can do, some small  
 scale optimizations and such, I expect no more then 10%.
 The bigger picture is that our best bet here is static regex, the  
 competitors use some aggressive JIT so it's a fair play. I actually  
 received pull request this morning do just that  (thx, Martin) though  
 dmd uses so much memory it's not for mortals yet:

 I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9
 + 2 Gb of swap - 6 out of 9
 Come on, it's a sport!
 + 4 Gb of swap and after few restless minutes I got it working

 Fresh timings:
 FReD C-T  3.4 sec
 FReD R-T  6.6 sec
 RE2       4.8 sec
 V8 JS     3.7 sec

 Look ma, we are on top! :)

 So, by the end of day, D does kick ass, even though it's a bit of  
 Pyrrhic victory.

Even so, that is awesome. Another point to D for its metaprogramming.

Wait with the champaign until this doesn't use 8 GB RAM during compile any more ;) Other than that, congrats! This is really impressive. I think you do some great pioneer work, that will help other who try to embed simple domain specific languages.
Sep 14 2011
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Wed, 14 Sep 2011 17:36:25 +0200, Marco Leise <Marco.Leise gmx.de> wrote:

 Am 14.09.2011, 17:10 Uhr, schrieb Simen Kjaeraas  
 <simen.kjaras gmail.com>:

 On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky  
 <dmitry.olsh gmail.com> wrote:

 Well in this particular benchmark there is little I can do, some small  
 scale optimizations and such, I expect no more then 10%.
 The bigger picture is that our best bet here is static regex, the  
 competitors use some aggressive JIT so it's a fair play. I actually  
 received pull request this morning do just that  (thx, Martin) though  
 dmd uses so much memory it's not for mortals yet:

 I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9
 + 2 Gb of swap - 6 out of 9
 Come on, it's a sport!
 + 4 Gb of swap and after few restless minutes I got it working

 Fresh timings:
 FReD C-T  3.4 sec
 FReD R-T  6.6 sec
 RE2       4.8 sec
 V8 JS     3.7 sec

 Look ma, we are on top! :)

 So, by the end of day, D does kick ass, even though it's a bit of  
 Pyrrhic victory.

Even so, that is awesome. Another point to D for its metaprogramming.

Wait with the champaign until this doesn't use 8 GB RAM during compile any more ;) Other than that, congrats! This is really impressive. I think you do some great pioneer work, that will help other who try to embed simple domain specific languages.

It peaks at 4GB for me and takes about 20s to compile.
Sep 14 2011