www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Do we need faster regex?

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
So I’ve been working on rewind-regex trying to correct all of the 
decisions in the original engine that slowed it down, dropping 
some features that I knew I cannot implement efficiently 
(backreferences have to go).

So while I’m obsessed with simplicity and speed I thought I’d ask 
people if it was an issue and what they really want from gen2 
regex library.

—
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me
Dec 17 2023
next sibling parent reply Witold Baryluk <witold.baryluk gmail.com> writes:
On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).

 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.

 —
 Dmitry Olshansky
 CEO   Glowlabs
 https://olshansky.me
I never needed to use backreferences in last 25 years. Something that has similar performance and scaling like RE2, Plan 9 grep, or original awk (modulo unicode), would be the best. I somebody needs backreferences, they can use PCRE, or some 3rd party libraries.
Dec 17 2023
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Sunday, 17 December 2023 at 17:40:21 UTC, Witold Baryluk wrote:
 On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
 wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).

 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.

 —
 Dmitry Olshansky
 CEO   Glowlabs
 https://olshansky.me
I never needed to use backreferences in last 25 years. Something that has similar performance and scaling like RE2, Plan 9 grep, or original awk (modulo unicode), would be the best.
Yes, RE2 is the main contender, rewind-regex aims at basically the same feature set and better/same performance.
 I somebody needs backreferences, they can use PCRE, or some 3rd 
 party libraries.
Same thoughts here. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 17 2023
prev sibling next sibling parent reply Anonymouse <zorael gmail.com> writes:
On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
wrote:
 [...]
I really like regex as a thing, but I have had to drop it because of the increased compilation memory and time requirements that use of `std.regex` incurred. Maybe that can't be avoided, I don't know. I have not had that much use for backreferences, so to me the important part is that it doesn't require me to have 300+ Mb more RAM and an extra second or two on the wall clock to compile.
Dec 18 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 18/12/2023 10:20 PM, Anonymouse wrote:
 I really like regex as a thing, but I have had to drop it because of the 
 increased compilation memory and time requirements that use of 
 |std.regex| incurred. Maybe that can't be avoided, I don't know.
 
 I have not had that much use for backreferences, so to me the important 
 part is that it doesn't require me to have 300+ Mb more RAM and an extra 
 second or two on the wall clock to compile.
As long as you use std.regex at runtime this is no longer the case. https://github.com/dlang/phobos/pull/8699 https://github.com/dlang/phobos/pull/8698 Looks like I messed up and not got it into the changelog. Anyway 2.103 should be the release that have them in it.
Dec 18 2023
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via Digitalmars-d
wrote:
 So I’ve been working on rewind-regex trying to correct all of the
 decisions in the original engine that slowed it down, dropping some
 features that I knew I cannot implement efficiently (backreferences
 have to go).
 
 So while I’m obsessed with simplicity and speed I thought I’d ask
 people if it was an issue and what they really want from gen2 regex
 library.
[...] What I really want: - Reduce compile-time cost of `import std.regex;` to zero, or at least close enough it's no longer noticeable. - Automatic caching of fixed-string regexes, i.e., the equivalent of: struct Re(string ctKnownRe) { Regex!char re; shared static this() { re = regex(ctKnownRe); } Regex!char Re() { return re; } } void main() { string s; if (s.matchFirst(Re!`some\+pattern`)) { ... } // This should reuse the Regex instance from before: if (s.matchFirst(Re!`some\+pattern`)) { ... } } - Reasonably fast runtime performance. I don't really care if it's the top-of-the-line superfast regex matcher, even though that would be really nice. The primary pain points are the cost of import, and the need to manually write code for automatic caching of fixed runtime regexen. - Get rid of ctRegex -- it adds a huge compile-time cost with questionable runtime benefit. Unless there's a way to do this at compile-time that *doesn't* add like 5 seconds per regex to compile times. T -- That's not a bug; that's a feature!
Dec 18 2023
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
If you import std.regex and do nothing else:

sema1 61ms
sema2 101ms

std.internal.unicode_tables:
sema1 4ms
sema2 100ms

https://github.com/dlang/phobos/blob/master/std/internal/unicode_tables.d



Now for:

```d
auto ctr = ctRegex!(`^.*/([^/]+)/?$`);

// It works just like a normal regex:
auto c2 = matchFirst("foo/bar", ctr);   // First match found here, if any
assert(!c2.empty);   // Be sure to check if there is a match before 
examining contents!
assert(c2[1] == "bar");   // Captures is a range of submatches: 0 = full 
match.
```

toImpl:
sema3 125ms

parseSet:
Sema3 111ms

findAny (child of parseRegex and parseSet):

CTFE 51ms





While there are improvements to be had, I have already done all the big 
ones when I shaved off ~600ms earlier this year.

Gotta love time trace!
Dec 18 2023
parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Tue, Dec 19, 2023 at 06:34:04AM +1300, Richard (Rikki) Andrew Cattermole via
Digitalmars-d wrote:
 If you import std.regex and do nothing else:
 
 sema1 61ms
 sema2 101ms
 
 std.internal.unicode_tables:
 sema1 4ms
 sema2 100ms
[...]
 While there are improvements to be had, I have already done all the
 big ones when I shaved off ~600ms earlier this year.
 
 Gotta love time trace!
Cool, awesome stuff! T -- If blunt statements had a point, they wouldn't be blunt...
Dec 18 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Yeah basically std.regex is no longer the cause for importing std.regex 
slowdown.

Its stuff like std.conv and std.uni.
Dec 18 2023
parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) Andrew Cattermole via
Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.
I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time. There comes a point where repeatedly doing something at every compile just isn't worth it when the desired output could be autogenerated beforehand and saved as a straight .d file with hard-coded values. T -- Heads I win, tails you lose.
Dec 18 2023
next sibling parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
The tables are stored compressed.

We do that because otherwise the binary size of Phobos would grow by 
multiple megabytes (think 10, not 1).

This is why when you trigger usage of regex at CT it slows down a whole 
lot. One of my optimizations was to prevent this unless it was needed.

However there are a lot of templates in use in each of the tables, and 
these could all be swapped out for something else and I think that might 
shave off something close to 100ms.

The tradeoffs for the tables are good ones, but it does bite us a bit here.
Dec 18 2023
prev sibling next sibling parent Hipreme <msnmancini hotmail.com> writes:
On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:
 On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) 
 Andrew Cattermole via Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing 
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.
I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time. There comes a point where repeatedly doing something at every compile just isn't worth it when the desired output could be autogenerated beforehand and saved as a straight .d file with hard-coded values. T
std.conv slowdown comes from `std.conv.to!float` specifically. The other ones looks fine I guess. about rikki's statement, it can help, but whenever you import the values, it will still take a lot of time in compilation time, you can look at `core.sys.windows.uuid` for reference, most of the compilation time spent on any windows module is this one taking a lot, and when you look at it, it is only a lot of definitions. So, yes, I think it could be a lot better. One example I've done was to separate the complete generation in Metal and never import the file which actually does all the CTFE and mixin's. Think of this file like only important to the linker and not needed to be used. A .di file with only definitions could help even more, and let other thing implement it so only the symbols would be imported.
Dec 18 2023
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:
 On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) 
 Andrew Cattermole via Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing 
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.
I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time.
Far as I know there is not CTFE involved with the tables, but they are kind of huge. Perhaps doing .di files and enforce separate compilation may work here.
 There comes a point where repeatedly doing something at every 
 compile just isn't worth it when the desired output could be 
 autogenerated beforehand and saved as a straight .d file with 
 hard-coded values.
Which they are. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 20 2023
parent reply BoQsc <vaidas.boqsc gmail.com> writes:
The focus should always be on slower but more maintainable and 
simple implementation.

The efficiency and speed should be left for specific cases where 
it is needed and, again with as much clarity as possible.

Support and Features over efficiency and speed.

The question of "is anyone even using that?" is a wrong question.
If it's a common behaviour among implementations, it should exist.

If behaviour makes sense, it should exist and not avoided to be 
implemented just to gain some fraction of speed that can be 
gained in other ways or again, using more specific implementation 
for the job.

If you think that your implementation can be grown into 
supporting all crucial features, then it should be a good draft 
for other people to explore and complete implementation.

Else it should be a specific case implementation that could be 
selected if there is a need for speed but with the sacrifice of 
features.
Dec 26 2023
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Tuesday, 26 December 2023 at 15:58:03 UTC, BoQsc wrote:
 The focus should always be on slower but more maintainable and 
 simple implementation.
Agreed on simple, though humbly disagree on speed.
 The efficiency and speed should be left for specific cases 
 where it is needed and, again with as much clarity as possible.

 Support and Features over efficiency and speed.
Well we already have std.regex which is quite fast and feature rich, sadly simplicity is not an option if we are to support full ECMAScript regex language and basic level 1 unicode regex.
 The question of "is anyone even using that?" is a wrong 
 question.
 If it's a common behaviour among implementations, it should 
 exist.
RE2 specifically avoids complicated features that block design of a fast engine.
 If behaviour makes sense, it should exist and not avoided to be 
 implemented just to gain some fraction of speed that can be 
 gained in other ways or again, using more specific 
 implementation for the job.

 If you think that your implementation can be grown into 
 supporting all crucial features, then it should be a good draft 
 for other people to explore and complete implementation.
There are very few folks who want to develop regex engine I think. Encouraging collaboration is an interesting angle I did not account for.
 Else it should be a specific case implementation that could be 
 selected if there is a need for speed but with the sacrifice of 
 features.
Okay, I understand your points. For the most part I could summarize my point of view as follows. Building regex engine without regard for speed is not challenging for me, nor do I think that a simple slow engine could be gradually improved into simple fast engine. Speed is something you have to think of laying the first brick of whatever you are building, iff speed is desired. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 30 2023
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Monday, 18 December 2023 at 17:16:40 UTC, H. S. Teoh wrote:
 On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via 
 Digitalmars-d wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).
 
 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.
[...] What I really want: - Reduce compile-time cost of `import std.regex;` to zero, or at least close enough it's no longer noticeable. - Automatic caching of fixed-string regexes, i.e., the equivalent of: struct Re(string ctKnownRe) { Regex!char re; shared static this() { re = regex(ctKnownRe); } Regex!char Re() { return re; } }
A runtime cache should work, btw std.regex caches regexes (at least those passed as strings to match* family of functions).
 	void main() {
 		string s;
 		if (s.matchFirst(Re!`some\+pattern`)) {
 			...
 		}

 		// This should reuse the Regex instance from before:
 		if (s.matchFirst(Re!`some\+pattern`)) {
 			...
 		}
 	}
I'm thinking if it's worth it to intern patterns like that.
 - Reasonably fast runtime performance. I don't really care if 
 it's the
   top-of-the-line superfast regex matcher, even though that 
 would be
   really nice.  The primary pain points are the cost of import, 
 and the
   need to manually write code for automatic caching of fixed 
 runtime
   regexen.
 - Get rid of ctRegex -- it adds a huge compile-time cost with
   questionable runtime benefit. Unless there's a way to do this 
 at
   compile-time that *doesn't* add like 5 seconds per regex to 
 compile
   times.
Yup it's dropped, to be eventually replaced by JIT which is both better at compile-time and much more flexible at run-time. --- Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 18 2023
parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Mon, Dec 18, 2023 at 06:34:51PM +0000, Dmitry Olshansky via Digitalmars-d
wrote:
[...]
 A runtime cache should work, btw std.regex caches regexes (at least
 those passed as strings to match* family of functions).
Cool, didn't know that. :-) [...]
 - Get rid of ctRegex -- it adds a huge compile-time cost with
   questionable runtime benefit. Unless there's a way to do this at
   compile-time that *doesn't* add like 5 seconds per regex to
   compile times.
Yup it's dropped, to be eventually replaced by JIT which is both better at compile-time and much more flexible at run-time.
[...] Awesome stuff! T -- A mathematician is a device for turning coffee into theorems. -- P. Erdos
Dec 18 2023