www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Going on std.regex & std.uni bug-fixing hunt

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
It's been tough time on D front for me, going down from about ~1 
week of activity during July to ~2-3 days in August.

One thing I realised is that doing a new GC is going to be a long 
battle.
Before I'm too deep down this rabbit hole I decided to first 
address the long-standing backlog of issues of std.regex and 
std.uni.

My burndown list for std.regex:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr

Should get to about 5-6 enhacements.

Same for std.uni:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216641&query_format=advanced&resolution=---&short_desc=std.uni&short_desc_type=allwordssubstr

This should be actually 0.


Anything not showing up on these lists will not get fixed anytime 
soon.

The starting point for std.regex:
https://github.com/dlang/phobos/pull/5722
Sep 05 2017
next sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
wrote:
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.
Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps. (CJK characters typically have a width of two.) Character width is part of the unicode standard. Issue: https://issues.dlang.org/show_bug.cgi?id=17810 --Jon
Sep 05 2017
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Wednesday, 6 September 2017 at 05:23:51 UTC, Jon Degenhardt 
wrote:
 On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
 wrote:
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.
Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps.
Yeah, there is quite a few missing bits from std.uni. I didn't expect this one though. Hopefully we can add them on pay as go principle, so if you don't use it you don't pay for it.
 --Jon
Sep 05 2017
prev sibling parent reply Chad Joan <chadjoan gmail.com> writes:
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
wrote:
 It's been tough time on D front for me, going down from about 
 ~1 week of activity during July to ~2-3 days in August.

 One thing I realised is that doing a new GC is going to be a 
 long battle.
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.

 My burndown list for std.regex:

 https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
 ...
I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress: https://issues.dlang.org/show_bug.cgi?id=17820 It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things. ... There are other regex improvements I might be interested in, but I'm not sure I have time to make bug reports for them right now. I might be convinced to fast track them if someone wants to make legitimate effort towards fixing them, otherwise I'll eventually get around to writing the reports and/or making PRs someday. Examples: -- Calls to malloc in the CTFE path cause some regexes to fail at compile time. I suspect this happens due to the Captures (n > smallString) condition when the number of possible captures is greater than 3, but I haven't tested it (time consuming...). -- I remember being unable to iterate over named captures. But I'm not confident that I'm remembering this correctly, and I'm not sure if it's still true. -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times. -- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts. I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;) ... Dmitry, I appreciate your working towards making the regex module easier to work on. Thanks. ... I'm curious what you're thinking about when you mention something ambitious like writing a new GC :) (like this https://imgur.com/cWa4evD) I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector! But I doubt I'll ever have the opportunity to work on such things. I hope your GC attempt works out!
Sep 09 2017
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
 On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
 wrote:
 My burndown list for std.regex:

 https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
 ...
I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress: https://issues.dlang.org/show_bug.cgi?id=17820 It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.
Yeah, well known problem. Solution is to keep a bit of memory cached eg in TLS variable.
 ...

 There are other regex improvements I might be interested in, 
 but I'm not sure I have time to make bug reports for them right 
 now.  I might be convinced to fast track them if someone wants 
 to make legitimate effort towards fixing them, otherwise I'll 
 eventually get around to writing the reports and/or making PRs 
 someday.

 Examples:

 -- Calls to malloc in the CTFE path cause some regexes to fail 
 at compile time.  I suspect this happens due to the Captures (n
 smallString) condition when the number of possible captures
is greater than 3, but I haven't tested it (time consuming...).
Sholudn't be a problem, but please report an example.
 -- I remember being unable to iterate over named captures.  But 
 I'm not confident that I'm remembering this correctly, and I'm 
 not sure if it's still true.
Would be nice and simple enhancement.
 -- The Captures struct does not specify what value is returned 
 for submatches that were in the branch of an alternation that 
 wasn't taken or in a repetition that matched 0 or more than 1 
 times.
As every engine out there the value is "", empty string.
 -- The Captures struct does not seem to have a way to access 
 all of the strings matched by a submatch in repetition context, 
 not to mention nested repetition contexts.
Just like any other regex library.
 I'm not sure how much those mentions help without proper bug 
 reports, but at least I got it off my chest (for the time 
 being) without having to spend my whole weekend writing bug 
 reports ;)
Well they are warmly welcome shouldypu get to it.
 ...

 Dmitry, I appreciate your working towards making the regex 
 module easier to work on.  Thanks.

 ...

 I'm curious what you're thinking about when you mention 
 something ambitious like writing a new GC :)
 (like this https://imgur.com/cWa4evD)

 I can't help but fantasize about cheap ways to get GC 
 allocations to parallelize well and not end up writing an 
 entire generational collector!
ThreadCache can go a long way to help that.
 But I doubt I'll ever have the opportunity to work on such 
 things.  I hope your GC attempt works out!
Me too. It's won't be trivial effort though.
Sep 10 2017
parent reply Chad Joan <chadjoan gmail.com> writes:
On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky 
wrote:
 On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
 I was working on std.regex a bit myself, so I created this bug 
 report to capture some of the findings/progress:
 https://issues.dlang.org/show_bug.cgi?id=17820

 It seems like something you might be interested in, or might 
 even have a small chance of fixing in the course of other 
 things.
Yeah, well known problem. Solution is to keep a bit of memory cached eg in TLS variable.
Indeed. Is there another issue I can mark it as a duplicate of?
 [...]
 -- The Captures struct does not specify what value is returned 
 for submatches that were in the branch of an alternation that 
 wasn't taken or in a repetition that matched 0 or more than 1 
 times.
As every engine out there the value is "", empty string.
I usually don't refer to other libraries while using a library. If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon. This one seems pretty easy to fix though. I will probably make a documentation PR at some point.
 -- The Captures struct does not seem to have a way to access 
 all of the strings matched by a submatch in repetition 
 context, not to mention nested repetition contexts.
Just like any other regex library.
Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3 I actually have a strong interest in this. And not because I need to write regular expressions that extract lists of patterns all the time (well, it might've happened). More importantly: this would make it easier to integrate Phobos' regex engine into a parser generator framework. Current plans involve regular expression + parsing expression grammars. I'm pretty sure it is possible to mechanically convert a subset of PEGs into Regexes and gain some useful optimizations, but this requires granular control over regular expression captures to be able to extract the text matched by the original PEG symbols.
 I'm not sure how much those mentions help without proper bug 
 reports, but at least I got it off my chest (for the time 
 being) without having to spend my whole weekend writing bug 
 reports ;)
Well they are warmly welcome shouldypu get to it.
Thanks!
 ...

 Dmitry, I appreciate your working towards making the regex 
 module easier to work on.  Thanks.

 ...

 I'm curious what you're thinking about when you mention 
 something ambitious like writing a new GC :)
 (like this https://imgur.com/cWa4evD)

 I can't help but fantasize about cheap ways to get GC 
 allocations to parallelize well and not end up writing an 
 entire generational collector!
ThreadCache can go a long way to help that.
Google didn't help me with this one. Any chance I could get a link?
 But I doubt I'll ever have the opportunity to work on such 
 things.  I hope your GC attempt works out!
Me too. It's won't be trivial effort though.
Good luck!
Sep 10 2017
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:
 On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky 
 wrote:
 Yeah, well known problem. Solution is to keep a bit of memory 
 cached eg  in TLS variable.
Indeed. Is there another issue I can mark it as a duplicate of?
No it's just somthing that showed up in a number of benchmarks people posted, never got around to it. So just file it so I'll have an issue to tick once I fix this.
 [...]
 -- The Captures struct does not specify what value is 
 returned for submatches that were in the branch of an 
 alternation that wasn't taken or in a repetition that matched 
 0 or more than 1 times.
As every engine out there the value is "", empty string.
I usually don't refer to other libraries while using a library. If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon.
In many way we just copy ECMAScript regex semantics, with some extensions from Python and Perl.
 This one seems pretty easy to fix though.  I will probably make 
 a documentation PR at some point.
Please do.
 -- The Captures struct does not seem to have a way to access 
 all of the strings matched by a submatch in repetition 
 context, not to mention nested repetition contexts.
Just like any other regex library.
Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3
Horrible, horrible idea. Performance-wise that is.
 I actually have a strong interest in this.  And not because I 
 need to write regular expressions that extract lists of 
 patterns all the time (well, it might've happened).  More 
 importantly: this would make it easier to integrate Phobos' 
 regex engine into a parser generator framework.
No-no-no. Please don't.
 Current plans involve regular expression + parsing expression 
 grammars.  I'm pretty sure it is possible to mechanically 
 convert a subset of PEGs into Regexes and gain some useful 
 optimizations, but this requires granular control over regular 
 expression captures to be able to extract the text matched by 
 the original PEG symbols.
This is heavily misguided, but a common idea. PEGs are actually way simpler then regex. PEGs * and + have nothing to do with regex * and + qualifiers. In PEG [ab]*b will never match because [ab]* eats any sequence of a-s and b-s and never backtracks. In regex [ab]* will match b, ab, bb, .... because regex 'backtracks'. Thus one can easily see that PEGs constructs are trivial to implement, and provide quite a number of optimizations as well. I'd argue that PEG can be made to run faster in 'parse' scenario whereas nothing beats _simple_ regex in 'search' scenario.
 ThreadCache can go a long way to help that.
Google didn't help me with this one. Any chance I could get a link?
The manpage for jemalloc mentions it explicitly search for tcache. Also look through jemalloc and related papers, I think they all call it thread cache. https://linux.die.net/man/3/jemalloc In essene you keep a small cache of free memory in TLS to serve next allocations faster. Some suggest a per-cpu cache, which is bit trickier but also interestingly avoid contention.
 things.  I hope your GC attempt works out!
Me too. It's won't be trivial effort though.
Good luck!
Thanks!
Sep 10 2017