digitalmars.D - Going on std.regex & std.uni bug-fixing hunt

Dmitry Olshansky (17/17) Sep 05 2017 It's been tough time on D front for me, going down from about ~1

Jon Degenhardt (11/14) Sep 05 2017 Not intending to add to your workload, but I added an enhancement

Dmitry Olshansky (6/18) Sep 05 2017 Yeah, there is quite a few missing bits from std.uni. I didn't

Chad Joan (42/52) Sep 09 2017 I was working on std.regex a bit myself, so I created this bug

Dmitry Olshansky (10/60) Sep 10 2017 Yeah, well known problem. Solution is to keep a bit of memory

Chad Joan (25/73) Sep 10 2017 Indeed.

Dmitry Olshansky (28/79) Sep 10 2017 No it's just somthing that showed up in a number of benchmarks

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

It's been tough time on D front for me, going down from about ~1 
week of activity during July to ~2-3 days in August.

One thing I realised is that doing a new GC is going to be a long 
battle.
Before I'm too deep down this rabbit hole I decided to first 
address the long-standing backlog of issues of std.regex and 
std.uni.

My burndown list for std.regex:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr

Should get to about 5-6 enhacements.

Same for std.uni:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216641&query_format=advanced&resolution=---&short_desc=std.uni&short_desc_type=allwordssubstr

This should be actually 0.


Anything not showing up on these lists will not get fixed anytime 
soon.

The starting point for std.regex:
https://github.com/dlang/phobos/pull/5722

Sep 05 2017

Jon Degenhardt <jond noreply.com> writes:

On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
wrote:
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.

Not intending to add to your workload, but I added an enhancement 
request for something I hit ran into recently: Having a version 
of wcwidth/wcswidth functions available as part of unicode 
support. These functions calculate the width of a character in 
fixed-width fonts, they are useful in command-line apps. (CJK 
characters typically have a width of two.) Character width is 
part of the unicode standard. Issue: 
https://issues.dlang.org/show_bug.cgi?id=17810

--Jon

Sep 05 2017

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Wednesday, 6 September 2017 at 05:23:51 UTC, Jon Degenhardt 
wrote:
 On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
 wrote:
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.

 Not intending to add to your workload, but I added an 
 enhancement request for something I hit ran into recently: 
 Having a version of wcwidth/wcswidth functions available as 
 part of unicode support. These functions calculate the width of 
 a character in fixed-width fonts, they are useful in 
 command-line apps.

Yeah, there is quite a few missing bits from std.uni. I didn't 
expect this one though.

Hopefully we can add them on pay as go principle, so if you don't 
use it you don't pay for it.

 --Jon

Sep 05 2017

Chad Joan <chadjoan gmail.com> writes:

On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky 
wrote:
 It's been tough time on D front for me, going down from about 
 ~1 week of activity during July to ~2-3 days in August.

 One thing I realised is that doing a new GC is going to be a 
 long battle.
 Before I'm too deep down this rabbit hole I decided to first 
 address the long-standing backlog of issues of std.regex and 
 std.uni.

 My burndown list for std.regex:

 https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
 ...

I was working on std.regex a bit myself, so I created this bug 
report to capture some of the findings/progress:
https://issues.dlang.org/show_bug.cgi?id=17820

It seems like something you might be interested in, or might even 
have a small chance of fixing in the course of other things.

...

There are other regex improvements I might be interested in, but 
I'm not sure I have time to make bug reports for them right now.  
I might be convinced to fast track them if someone wants to make 
legitimate effort towards fixing them, otherwise I'll eventually 
get around to writing the reports and/or making PRs someday.

Examples:

-- Calls to malloc in the CTFE path cause some regexes to fail at 
compile time.  I suspect this happens due to the Captures (n > 
smallString) condition when the number of possible captures is 
greater than 3, but I haven't tested it (time consuming...).

-- I remember being unable to iterate over named captures.  But 
I'm not confident that I'm remembering this correctly, and I'm 
not sure if it's still true.

-- The Captures struct does not specify what value is returned 
for submatches that were in the branch of an alternation that 
wasn't taken or in a repetition that matched 0 or more than 1 
times.

-- The Captures struct does not seem to have a way to access all 
of the strings matched by a submatch in repetition context, not 
to mention nested repetition contexts.


I'm not sure how much those mentions help without proper bug 
reports, but at least I got it off my chest (for the time being) 
without having to spend my whole weekend writing bug reports ;)

...

Dmitry, I appreciate your working towards making the regex module 
easier to work on.  Thanks.

...

I'm curious what you're thinking about when you mention something 
ambitious like writing a new GC :)
(like this https://imgur.com/cWa4evD)

I can't help but fantasize about cheap ways to get GC allocations 
to parallelize well and not end up writing an entire generational 
collector!  But I doubt I'll ever have the opportunity to work on 
such things.  I hope your GC attempt works out!

Sep 09 2017

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky
wrote:
My burndown list for std.regex:

https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr
...

I was working on std.regex a bit myself, so I created this bug
report to capture some of the findings/progress:
https://issues.dlang.org/show_bug.cgi?id=17820

It seems like something you might be interested in, or might
even have a small chance of fixing in the course of other
things.

Yeah, well known problem. Solution is to keep a bit of memory
cached eg in TLS variable.

...

There are other regex improvements I might be interested in,
but I'm not sure I have time to make bug reports for them right
now. I might be convinced to fast track them if someone wants
to make legitimate effort towards fixing them, otherwise I'll
eventually get around to writing the reports and/or making PRs
someday.

Examples:

-- Calls to malloc in the CTFE path cause some regexes to fail
at compile time. I suspect this happens due to the Captures (n
smallString) condition when the number of possible captures

is greater than 3, but I haven't tested it (time consuming...).

Sholudn't be a problem, but please report an example.

-- I remember being unable to iterate over named captures. But
I'm not confident that I'm remembering this correctly, and I'm
not sure if it's still true.

Would be nice and simple enhancement.

-- The Captures struct does not specify what value is returned
for submatches that were in the branch of an alternation that
wasn't taken or in a repetition that matched 0 or more than 1
times.

As every engine out there the value is "", empty string.

-- The Captures struct does not seem to have a way to access
all of the strings matched by a submatch in repetition context,
not to mention nested repetition contexts.

Just like any other regex library.

I'm not sure how much those mentions help without proper bug
reports, but at least I got it off my chest (for the time
being) without having to spend my whole weekend writing bug
reports ;)

Well they are warmly welcome shouldypu get to it.

...

Dmitry, I appreciate your working towards making the regex
module easier to work on. Thanks.

...

I'm curious what you're thinking about when you mention
something ambitious like writing a new GC :)
(like this https://imgur.com/cWa4evD)

I can't help but fantasize about cheap ways to get GC
allocations to parallelize well and not end up writing an
entire generational collector!

ThreadCache can go a long way to help that.

But I doubt I'll ever have the opportunity to work on such
things. I hope your GC attempt works out!

Me too. It's won't be trivial effort though.

Sep 10 2017

Chad Joan <chadjoan gmail.com> writes:

On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky
wrote:
On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:
I was working on std.regex a bit myself, so I created this bug
report to capture some of the findings/progress:
https://issues.dlang.org/show_bug.cgi?id=17820

It seems like something you might be interested in, or might
even have a small chance of fixing in the course of other
things.

Yeah, well known problem. Solution is to keep a bit of memory
cached eg in TLS variable.

Indeed.

Is there another issue I can mark it as a duplicate of?

[...]
-- The Captures struct does not specify what value is returned
for submatches that were in the branch of an alternation that
wasn't taken or in a repetition that matched 0 or more than 1
times.

As every engine out there the value is "", empty string.

I usually don't refer to other libraries while using a library.
If an API doesn't define something, then it is, by definition,
undefined behavior, and thus quite undesirable to rely upon.

This one seems pretty easy to fix though. I will probably make a
documentation PR at some point.

-- The Captures struct does not seem to have a way to access
all of the strings matched by a submatch in repetition
context, not to mention nested repetition contexts.

Just like any other regex library.

Counterexample:
https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3

I actually have a strong interest in this. And not because I
need to write regular expressions that extract lists of patterns
all the time (well, it might've happened). More importantly:
this would make it easier to integrate Phobos' regex engine into
a parser generator framework. Current plans involve regular
expression + parsing expression grammars. I'm pretty sure it is
possible to mechanically convert a subset of PEGs into Regexes
and gain some useful optimizations, but this requires granular
control over regular expression captures to be able to extract
the text matched by the original PEG symbols.

I'm not sure how much those mentions help without proper bug
reports, but at least I got it off my chest (for the time
being) without having to spend my whole weekend writing bug
reports ;)

Well they are warmly welcome shouldypu get to it.

Thanks!

...

Dmitry, I appreciate your working towards making the regex
module easier to work on. Thanks.

...

I'm curious what you're thinking about when you mention
something ambitious like writing a new GC :)
(like this https://imgur.com/cWa4evD)

I can't help but fantasize about cheap ways to get GC
allocations to parallelize well and not end up writing an
entire generational collector!

ThreadCache can go a long way to help that.

Google didn't help me with this one. Any chance I could get a
link?

But I doubt I'll ever have the opportunity to work on such
things. I hope your GC attempt works out!

Me too. It's won't be trivial effort though.

Good luck!

Sep 10 2017

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:
On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky
wrote:
Yeah, well known problem. Solution is to keep a bit of memory
cached eg in TLS variable.

Indeed.

Is there another issue I can mark it as a duplicate of?

No it's just somthing that showed up in a number of benchmarks
people posted, never got around to it. So just file it so I'll
have an issue to tick once I fix this.

[...]
-- The Captures struct does not specify what value is
returned for submatches that were in the branch of an
alternation that wasn't taken or in a repetition that matched
0 or more than 1 times.

As every engine out there the value is "", empty string.

I usually don't refer to other libraries while using a library.
If an API doesn't define something, then it is, by definition,
undefined behavior, and thus quite undesirable to rely upon.

In many way we just copy ECMAScript regex semantics, with some
extensions from Python and Perl.

This one seems pretty easy to fix though. I will probably make
a documentation PR at some point.

Please do.
-- The Captures struct does not seem to have a way to access
all of the strings matched by a submatch in repetition
context, not to mention nested repetition contexts.

Just like any other regex library.

Counterexample:
https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3

Horrible, horrible idea. Performance-wise that is.

No-no-no. Please don't.

Current plans involve regular expression + parsing expression
grammars. I'm pretty sure it is possible to mechanically
convert a subset of PEGs into Regexes and gain some useful
optimizations, but this requires granular control over regular
expression captures to be able to extract the text matched by
the original PEG symbols.

This is heavily misguided, but a common idea. PEGs are actually
way simpler then regex. PEGs * and + have nothing to do with
regex * and + qualifiers.

In PEG [ab]*b will never match because [ab]* eats any sequence of
a-s and b-s and never backtracks.

In regex [ab]* will match b, ab, bb, .... because regex
'backtracks'.

Thus one can easily see that PEGs constructs are trivial to
implement, and provide quite a number of optimizations as well.

I'd argue that PEG can be made to run faster in 'parse' scenario
whereas nothing beats _simple_ regex in 'search' scenario.

ThreadCache can go a long way to help that.

Google didn't help me with this one. Any chance I could get a
link?

The manpage for jemalloc mentions it explicitly search for
tcache. Also look through jemalloc and related papers, I think
they all call it thread cache.

https://linux.die.net/man/3/jemalloc

In essene you keep a small cache of free memory in TLS to serve
next allocations faster. Some suggest a per-cpu cache, which is
bit trickier but also interestingly avoid contention.

things. I hope your GC attempt works out!

Me too. It's won't be trivial effort though.

Good luck!

Thanks!

Sep 10 2017

D Programming

C/C++ Programming

Other

digitalmars.D - Going on std.regex & std.uni bug-fixing hunt