digitalmars.D - How is std.regex.replaceAllInto more efficient?

Shriramana Sharma (25/25) Oct 15 2015 Hello. The doc on std.regex.replace{First,All}Into reads:

Dmitry Olshansky (16/40) Oct 16 2015 It is dupped just because we are going to use it as _input_ for the

Shriramana Sharma (4/6) Oct 16 2015 That's much clear now, thanks!

Dmitry Olshansky (4/8) Oct 16 2015 Feel free to improve on the examples, I'd gladly pull such patches.

Shriramana Sharma (12/13) Oct 17 2015 How about this, using the "comifying" regex from the replaceAll example:

Dmitry Olshansky (8/19) Oct 17 2015 My problem with it is writeln/byLine b/c it can't be unittested.

Marc =?UTF-8?B?U2Now7x0eg==?= (5/25) Oct 17 2015 You can prefix it with `version(StdDdoc)`, then it will only
Shriramana Sharma (6/8) Oct 17 2015 That's true, but then the point was not to produce a valid unittest for ...

Dmitry Olshansky (6/12) Oct 17 2015 Listing code that is not routinely tested on each build means someday it...

Shriramana Sharma (26/29) Oct 18 2015 Hmmm... AFAICS the *Into function is most useful when you don't know how...

Dmitry Olshansky (12/39) Oct 18 2015 Looks good.

Shriramana Sharma (23/26) Oct 18 2015 Mmm no that throws an assertion failure, since the guarantee of the rege...

Dmitry Olshansky (10/34) Oct 18 2015 That would miss the zero. iota is half-open [) interval.

=?UTF-8?Q?S=c3=b6nke_Ludwig?= (8/19) Oct 19 2015 How about going one step further:

Shriramana Sharma (5/11) Oct 20 2015 Another option is to use KeepTerminator.yes. Using writeln() is shorter!...

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Hello. The doc on std.regex.replace{First,All}Into reads:

    A variation on replaceAll that instead of allocating a new string on 
each call outputs the result piece-wise to the sink. In particular this 
enables efficient construction of a final output incrementally.

    Example:

    //swap all 3 letter words and bring it back
    string text = "How are you doing?";
    auto sink = appender!(char[])();
    replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`));
    auto swapped = sink.data.dup; // make a copy explicitly
    assert(swapped == "woH era uoy doing?");
    sink.clear();
    replaceAllInto!(cap => retro(cap[0]))(sink, swapped, 
regex(`\b\w{3}\b`));
    assert(sink.data == text);

Now IIUC the code, there are only two calls to replaceAllInto, and after the 
first call the string is dupped, in which case there is an allocation 
nevertheless, so how does using *into make for efficiency?

Or is it that "each call" means each match of the regex? So there aren't 
just two calls, and that without the sink and *Into, each match of the regex 
will cause a new string allocation and this is what is being avoided?

If so, why can't the default implementation of replace{First,All} itself use 
an internal sink instead of needing the user to manually specify it?

Thanks!

Shriramana Sharma.

Oct 15 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 16-Oct-2015 04:50, Shriramana Sharma wrote:
 Hello. The doc on std.regex.replace{First,All}Into reads:

      A variation on replaceAll that instead of allocating a new string on
 each call outputs the result piece-wise to the sink. In particular this
 enables efficient construction of a final output incrementally.

      Example:

      //swap all 3 letter words and bring it back
      string text = "How are you doing?";
      auto sink = appender!(char[])();
      replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`));
      auto swapped = sink.data.dup; // make a copy explicitly
      assert(swapped == "woH era uoy doing?");
      sink.clear();
      replaceAllInto!(cap => retro(cap[0]))(sink, swapped,
 regex(`\b\w{3}\b`));
      assert(sink.data == text);

 Now IIUC the code, there are only two calls to replaceAllInto, and after the
 first call the string is dupped,

It is dupped just because we are going to use it as _input_ for the 
second call. If instead of swapped the second call would use some other 
source the intent of the example would be clearer. I agree it's not a 
great example.

 in which case there is an allocation
 nevertheless, so how does using *into make for efficiency?

Consider that the appender is being reused in 2 replaceAllInto calls, 
therefore the same allocated memory is reused. Otherwise one would have 
to allocate on each call to replace.

 Or is it that "each call" means each match of the regex? So there aren't
 just two calls, and that without the sink and *Into, each match of the regex
 will cause a new string allocation and this is what is being avoided?

No. Conceptually replace does replaceInto but creates a new appender 
each time. This is the overhead replaceInto means to avoid.

 If so, why can't the default implementation of replace{First,All} itself use
 an internal sink instead of needing the user to manually specify it?

It does. replaceFirst is replaceFirstInto + allocation of appender.
If you already have and appender you can reuse it. Even better if you 
were meaning to just print out the result you can pass it 
File("some-file").lockingTextWriter bypassing allocation completely.


-- 
Dmitry Olshansky

Oct 16 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Dmitry Olshansky wrote:

 No. Conceptually replace does replaceInto but creates a new appender
 each time. This is the overhead replaceInto means to avoid.

That's much clear now, thanks!

--

Oct 16 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 16-Oct-2015 18:23, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 No. Conceptually replace does replaceInto but creates a new appender
 each time. This is the overhead replaceInto means to avoid.

 That's much clear now, thanks!

Feel free to improve on the examples, I'd gladly pull such patches.

-- 
Dmitry Olshansky

Oct 16 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.

How about this, using the "comifying" regex from the replaceAll example:

    auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
    auto sink = appender!(char [])();
    foreach (line; stdin.byLine())
    {
        sink.clear();
        replaceAllInto(sink, line, re, ",");
        writeln(sink.data());
    }

--

Oct 17 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17-Oct-2015 13:06, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.

 How about this, using the "comifying" regex from the replaceAll example:

      auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
      auto sink = appender!(char [])();
      foreach (line; stdin.byLine())
      {
          sink.clear();
          replaceAllInto(sink, line, re, ",");
          writeln(sink.data());
      }

My problem with it is writeln/byLine b/c it can't be unittested. 
Documented unit-tests as examples are gold.

On the other hand you can post it to list of example shown on the first 
page of dlang.org because I think it's kind of cool.

Also a variation of this example can be put in std.regex synopsis.

-- 
Dmitry Olshansky

Oct 17 2015

Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:

On Saturday, 17 October 2015 at 10:54:19 UTC, Dmitry Olshansky 
wrote:
 On 17-Oct-2015 13:06, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such 
 patches.

 How about this, using the "comifying" regex from the 
 replaceAll example:

      auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
      auto sink = appender!(char [])();
      foreach (line; stdin.byLine())
      {
          sink.clear();
          replaceAllInto(sink, line, re, ",");
          writeln(sink.data());
      }

 My problem with it is writeln/byLine b/c it can't be 
 unittested. Documented unit-tests as examples are gold.

You can prefix it with `version(StdDdoc)`, then it will only 
appear in the documentation. It won't be really tested of course, 
but the compiler will at least do syntax checking on it, I guess.

Oct 17 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Dmitry Olshansky wrote:

 My problem with it is writeln/byLine b/c it can't be unittested.
 Documented unit-tests as examples are gold.

That's true, but then the point was not to produce a valid unittest for the 
*Into functions but a meaningful *example*, which will make clear to a user 
how this function is useful, no?

--

Oct 17 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 17-Oct-2015 16:43, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 My problem with it is writeln/byLine b/c it can't be unittested.
 Documented unit-tests as examples are gold.

 That's true, but then the point was not to produce a valid unittest for the
 *Into functions but a meaningful *example*, which will make clear to a user
 how this function is useful, no?

Listing code that is not routinely tested on each build means someday it 
may become broken. Anyway just issue a pull request, we can figure out 
the details in github discussion.

-- 
Dmitry Olshansky

Oct 17 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Dmitry Olshansky wrote:

 Listing code that is not routinely tested on each build means someday it
 may become broken. Anyway just issue a pull request, we can figure out
 the details in github discussion.

Hmmm... AFAICS the *Into function is most useful when you don't know how 
many items of input you are going to need to process and want to avoid an 
allocation at each item. The command-line byLine()/writeln() number 
delimiter makes a very sensible example there, so I would recommend you put 
that in *somewhere*, though it cannot be a unittest.

For a unittest, how about:

static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
auto sink = appender!(char [])();
enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19;
foreach (i; 0 .. 50)
{
    sink.clear();
    replaceAllInto(sink, text(uniform(min, max)), re, ",");
    // NOTE: size_t is an unsigned type, so it will wrap over when going 
below zero
    for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -= 
4)
        assert(sink.data[pos] == ',');
}

I didn't intentionally do that wrap-around trick, but I can't implicitly 
convert an unsigned type to a signed one either, and using to!long is 
equally as awkward, not to mention that it would not work for overly long 
strings (though that doesn't apply in the present case).

--

Oct 18 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 18-Oct-2015 17:26, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Listing code that is not routinely tested on each build means someday it
 may become broken. Anyway just issue a pull request, we can figure out
 the details in github discussion.

 Hmmm... AFAICS the *Into function is most useful when you don't know how
 many items of input you are going to need to process and want to avoid an
 allocation at each item. The command-line byLine()/writeln() number
 delimiter makes a very sensible example there, so I would recommend you put
 that in *somewhere*, though it cannot be a unittest.

Yes - e.g. add to std.regex synopsis to showcase this ability.

 For a unittest, how about:

Looks good.

 static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");

Just static

 auto sink = appender!(char [])();
 enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19;
 foreach (i; 0 .. 50)
 {
      sink.clear();
      replaceAllInto(sink, text(uniform(min, max)), re, ",");
      // NOTE: size_t is an unsigned type, so it will wrap over when going
 below zero
      for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -=
 4)
          assert(sink.data[pos] == ',');
 }

I guess the idiomatic way is:

foreach (pos; iota(0, sink.data.length, 4).retro)
{
...
}

 I didn't intentionally do that wrap-around trick, but I can't implicitly
 convert an unsigned type to a signed one either, and using to!long is
 equally as awkward, not to mention that it would not work for overly long
 strings (though that doesn't apply in the present case).

See above.

-- 
Dmitry Olshansky

Oct 18 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Dmitry Olshansky wrote:

 I guess the idiomatic way is:
 
 foreach (pos; iota(0, sink.data.length, 4).retro)

Mmm no that throws an assertion failure, since the guarantee of the regex is 
only to insert commas before every third digit *from the end*, but doing 
retro on an *ascending* iota fixes the end point at index zero, and start 
point at some multiple of four *from the beginning*, which isn't compatible 
with the regex's promise.

Thus the correct approach would be to do a *descending* iota:

        foreach (pos; iota(sink.data.length - 4, 0, -4))
            assert(sink.data[pos] == ',');

Pull: https://github.com/D-Programming-Language/phobos/pull/3731

I am not much experienced with GitHub pull requests, so I hope I did that 
correctly. IIUC after you pull, it will be back in my fork and its clone 
once I pull back from you, then I can delete my local branch, right?

BTW since I still didn't like that wraparound thing, (and it would fail in 
some extreme corner cases where sink.data.length is > size_t.max - 3 or 
such, I think) I thought of it much and arrived at the much more convoluted:

        for (size_t pos = sink.data.length; pos > 3; )
            { pos -= 4; assert(sink.data[pos] == ','); }

which is still not easy for one's mind to "wraparound"! So the descending 
iota solution we arrived above is the most elegant.

As I said, I'm still getting my "D-legs"...

--

Oct 18 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 18-Oct-2015 21:16, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 I guess the idiomatic way is:

 foreach (pos; iota(0, sink.data.length, 4).retro)

 Mmm no that throws an assertion failure, since the guarantee of the regex is
 only to insert commas before every third digit *from the end*, but doing
 retro on an *ascending* iota fixes the end point at index zero, and start
 point at some multiple of four *from the beginning*, which isn't compatible
 with the regex's promise.

 Thus the correct approach would be to do a *descending* iota:

          foreach (pos; iota(sink.data.length - 4, 0, -4))
              assert(sink.data[pos] == ',');

That would miss the zero. iota is half-open [) interval.
How about doing a bit of math and going for:

iota(sink.data.length % 4, sink.data.length, 4)

It really doesn't matter which way we go (forward/backward) but we must 
hit the right spots.


 Pull: https://github.com/D-Programming-Language/phobos/pull/3731

 I am not much experienced with GitHub pull requests, so I hope I did that
 correctly. IIUC after you pull, it will be back in my fork and its clone
 once I pull back from you, then I can delete my local branch, right?

Cool, will comment there.

 BTW since I still didn't like that wraparound thing, (and it would fail in
 some extreme corner cases where sink.data.length is > size_t.max - 3 or
 such, I think) I thought of it much and arrived at the much more convoluted:

          for (size_t pos = sink.data.length; pos > 3; )
              { pos -= 4; assert(sink.data[pos] == ','); }

 which is still not easy for one's mind to "wraparound"! So the descending
 iota solution we arrived above is the most elegant.

 As I said, I'm still getting my "D-legs"...

Welcome on board ;)

-- 
Dmitry Olshansky

Oct 18 2015

=?UTF-8?Q?S=c3=b6nke_Ludwig?= <sludwig outerproduct.org> writes:

Am 17.10.2015 um 12:06 schrieb Shriramana Sharma:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.

 How about this, using the "comifying" regex from the replaceAll example:

      auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
      auto sink = appender!(char [])();
      foreach (line; stdin.byLine())
      {
          sink.clear();
          replaceAllInto(sink, line, re, ",");
          writeln(sink.data());
      }

How about going one step further:

     auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
     foreach (line; stdin.byLine())
     {
         replaceAllInto(stdout.lockingTextWriter, line, re, ",");
         writeln();
     }

Oct 19 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Sönke Ludwig wrote:

 auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
 foreach (line; stdin.byLine())
 {
 replaceAllInto(stdout.lockingTextWriter, line, re, ",");
 writeln();
 }

Another option is to use KeepTerminator.yes. Using writeln() is shorter! :-)

So this one does avoids all the extra allocations due to use of *Into, then?

--

Oct 20 2015

D Programming

C/C++ Programming

Other

digitalmars.D - How is std.regex.replaceAllInto more efficient?