www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - code review: splitIds from DConf '22 day 3: saving a sort and "getting

reply kdevel <kdevel vogtner.de> writes:
At DConf '22 day 3 Robert Schadek presented at around 07:22:00 in 
the YT video the function `splitIds`. Given an HTML page from 
bugzilla containing a list of issues `splitIds` aims at 
extracting all bug-ids referenced within a specific url context:

```
long [] splitIds (string page)
{
    enum re = ctRegex!(`"show_bug.cgi\?id=[0-9]+"`);
    auto m = page.matchAll (re);

    return m
       .filter!(it => it.length > 0)                 // what is 
this?
       .map!(it => it.front)                         // whole 
match, it[0]
       .map!(it => it.find!(isNumber))               // searches 
fist number
       .map!(it => it.until!(it => !it.isNumber ())) // last number
       .filter!(it => !it.empty)                     // again an 
empty check??? why?
       .map!(it => it.to!long ())
       .uniq                                         // .sort is 
missing. IMHO saving at the wrong things?
       .array;
}
```

`m` contains all matches. It is a "list of lists" as one would 
say in Perl. The "inner lists" contains as first element 
("`front`") the string which matches the whole pattern. So my 
first question is:

What is the purpose of the first filter call? Since the element 
of `m` is a match it cannot have a length of 0.

Then the code – and this took me some time to grasp – zooms into 
the match and tries to match the subexpression which starts with 
a number and ends with a number. But why does the author not use 
regular expression subexpression matching [1]? Is that really 
such an arcane discipline?

Next comes a filter call that seems to defend against the empty 
string. But there will be no empty string because `[0-9]+` only 
matches if there is at least one number.

Then there is something missing: Before the `uniq` call there is 
the assumption that bugzilla ordered the bugs already. What 
puzzles me is that on the one hand the author puts two find calls 
into the chain to guard against empty matches which cannot (?) 
happen but on the other hand ignores that the data may not be 
sorted. I mean the sortedness is nothing that the author of the 
function could control.

This is what my version of `splitIds` looks like:

```
long [] my_splitIds (string page)
{
    enum re = ctRegex!(`"show_bug.cgi\?id=([0-9]+)"`);
    auto m = page.matchAll (re);

    return m
       .map!(it => it[1]) // take the subexpression
       .map!(it => it.to!long ())
       .array
       .sort
       .uniq
       .array
       ;
}
```

For those who ask how one would write such a function in Perl:

```
sub splitIds ($)
{
    my $page = shift;
    my  ids = $page =~ m/"show_bug.cgi\?id=([0-9]+)"/g;


with <=>
}
```



[1] https://dlang.org/phobos/std_regex.html
Aug 04 2022
parent user1234 <user1234 12.de> writes:
On Thursday, 4 August 2022 at 13:18:40 UTC, kdevel wrote:
 At DConf '22 day 3 Robert Schadek presented at around 07:22:00 
 in the YT video the function `splitIds`. Given an HTML page 
 from bugzilla containing a list of issues `splitIds` aims at 
 extracting all bug-ids referenced within a specific url context:

 ```
 long [] splitIds (string page)
 {
    enum re = ctRegex!(`"show_bug.cgi\?id=[0-9]+"`);
    auto m = page.matchAll (re);

    return m
       .filter!(it => it.length > 0)                 // what is 
 this?
       .map!(it => it.front)                         // whole 
 match, it[0]
       .map!(it => it.find!(isNumber))               // searches 
 fist number
       .map!(it => it.until!(it => !it.isNumber ())) // last 
 number
       .filter!(it => !it.empty)                     // again an 
 empty check??? why?
       .map!(it => it.to!long ())
       .uniq                                         // .sort is 
 missing. IMHO saving at the wrong things?
       .array;
 }
 ```

 `m` contains all matches. It is a "list of lists" as one would 
 say in Perl. The "inner lists" contains as first element 
 ("`front`") the string which matches the whole pattern. So my 
 first question is:

 What is the purpose of the first filter call? Since the element 
 of `m` is a match it cannot have a length of 0.
 [...]
I think that the first one is to prevent to call `front()` on an empty range, excepted that according to the regex that should not happen. BTW I haven't washed the video but I suppose this is related to the migration of bugzilla to GH issues. I wonder why https://bugzilla.readthedocs.io/en/5.0/api/index.html#apis is not used instead.
Aug 05 2022