www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - How is std.regex.replaceAllInto more efficient?

reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Hello. The doc on std.regex.replace{First,All}Into reads:

    A variation on replaceAll that instead of allocating a new string on 
each call outputs the result piece-wise to the sink. In particular this 
enables efficient construction of a final output incrementally.

    Example:

    //swap all 3 letter words and bring it back
    string text = "How are you doing?";
    auto sink = appender!(char[])();
    replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`));
    auto swapped = sink.data.dup; // make a copy explicitly
    assert(swapped == "woH era uoy doing?");
    sink.clear();
    replaceAllInto!(cap => retro(cap[0]))(sink, swapped, 
regex(`\b\w{3}\b`));
    assert(sink.data == text);

Now IIUC the code, there are only two calls to replaceAllInto, and after the 
first call the string is dupped, in which case there is an allocation 
nevertheless, so how does using *into make for efficiency?

Or is it that "each call" means each match of the regex? So there aren't 
just two calls, and that without the sink and *Into, each match of the regex 
will cause a new string allocation and this is what is being avoided?

If so, why can't the default implementation of replace{First,All} itself use 
an internal sink instead of needing the user to manually specify it?

Thanks!

Shriramana Sharma.
Oct 15 2015
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 16-Oct-2015 04:50, Shriramana Sharma wrote:
 Hello. The doc on std.regex.replace{First,All}Into reads:

      A variation on replaceAll that instead of allocating a new string on
 each call outputs the result piece-wise to the sink. In particular this
 enables efficient construction of a final output incrementally.

      Example:

      //swap all 3 letter words and bring it back
      string text = "How are you doing?";
      auto sink = appender!(char[])();
      replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`));
      auto swapped = sink.data.dup; // make a copy explicitly
      assert(swapped == "woH era uoy doing?");
      sink.clear();
      replaceAllInto!(cap => retro(cap[0]))(sink, swapped,
 regex(`\b\w{3}\b`));
      assert(sink.data == text);

 Now IIUC the code, there are only two calls to replaceAllInto, and after the
 first call the string is dupped,
It is dupped just because we are going to use it as _input_ for the second call. If instead of swapped the second call would use some other source the intent of the example would be clearer. I agree it's not a great example.
 in which case there is an allocation
 nevertheless, so how does using *into make for efficiency?
Consider that the appender is being reused in 2 replaceAllInto calls, therefore the same allocated memory is reused. Otherwise one would have to allocate on each call to replace.
 Or is it that "each call" means each match of the regex? So there aren't
 just two calls, and that without the sink and *Into, each match of the regex
 will cause a new string allocation and this is what is being avoided?
No. Conceptually replace does replaceInto but creates a new appender each time. This is the overhead replaceInto means to avoid.
 If so, why can't the default implementation of replace{First,All} itself use
 an internal sink instead of needing the user to manually specify it?
It does. replaceFirst is replaceFirstInto + allocation of appender. If you already have and appender you can reuse it. Even better if you were meaning to just print out the result you can pass it File("some-file").lockingTextWriter bypassing allocation completely. -- Dmitry Olshansky
Oct 16 2015
parent reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Dmitry Olshansky wrote:

 No. Conceptually replace does replaceInto but creates a new appender
 each time. This is the overhead replaceInto means to avoid.
That's much clear now, thanks! --
Oct 16 2015
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 16-Oct-2015 18:23, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 No. Conceptually replace does replaceInto but creates a new appender
 each time. This is the overhead replaceInto means to avoid.
That's much clear now, thanks!
Feel free to improve on the examples, I'd gladly pull such patches. -- Dmitry Olshansky
Oct 16 2015
parent reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.
How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); } --
Oct 17 2015
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17-Oct-2015 13:06, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.
How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
My problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold. On the other hand you can post it to list of example shown on the first page of dlang.org because I think it's kind of cool. Also a variation of this example can be put in std.regex synopsis. -- Dmitry Olshansky
Oct 17 2015
next sibling parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Saturday, 17 October 2015 at 10:54:19 UTC, Dmitry Olshansky 
wrote:
 On 17-Oct-2015 13:06, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such 
 patches.
How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
My problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold.
You can prefix it with `version(StdDdoc)`, then it will only appear in the documentation. It won't be really tested of course, but the compiler will at least do syntax checking on it, I guess.
Oct 17 2015
prev sibling parent reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Dmitry Olshansky wrote:

 My problem with it is writeln/byLine b/c it can't be unittested.
 Documented unit-tests as examples are gold.
That's true, but then the point was not to produce a valid unittest for the *Into functions but a meaningful *example*, which will make clear to a user how this function is useful, no? --
Oct 17 2015
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 17-Oct-2015 16:43, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 My problem with it is writeln/byLine b/c it can't be unittested.
 Documented unit-tests as examples are gold.
That's true, but then the point was not to produce a valid unittest for the *Into functions but a meaningful *example*, which will make clear to a user how this function is useful, no?
Listing code that is not routinely tested on each build means someday it may become broken. Anyway just issue a pull request, we can figure out the details in github discussion. -- Dmitry Olshansky
Oct 17 2015
parent reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Dmitry Olshansky wrote:

 Listing code that is not routinely tested on each build means someday it
 may become broken. Anyway just issue a pull request, we can figure out
 the details in github discussion.
Hmmm... AFAICS the *Into function is most useful when you don't know how many items of input you are going to need to process and want to avoid an allocation at each item. The command-line byLine()/writeln() number delimiter makes a very sensible example there, so I would recommend you put that in *somewhere*, though it cannot be a unittest. For a unittest, how about: static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19; foreach (i; 0 .. 50) { sink.clear(); replaceAllInto(sink, text(uniform(min, max)), re, ","); // NOTE: size_t is an unsigned type, so it will wrap over when going below zero for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -= 4) assert(sink.data[pos] == ','); } I didn't intentionally do that wrap-around trick, but I can't implicitly convert an unsigned type to a signed one either, and using to!long is equally as awkward, not to mention that it would not work for overly long strings (though that doesn't apply in the present case). --
Oct 18 2015
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 18-Oct-2015 17:26, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 Listing code that is not routinely tested on each build means someday it
 may become broken. Anyway just issue a pull request, we can figure out
 the details in github discussion.
Hmmm... AFAICS the *Into function is most useful when you don't know how many items of input you are going to need to process and want to avoid an allocation at each item. The command-line byLine()/writeln() number delimiter makes a very sensible example there, so I would recommend you put that in *somewhere*, though it cannot be a unittest.
Yes - e.g. add to std.regex synopsis to showcase this ability.
 For a unittest, how about:
Looks good.
 static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
Just static
 auto sink = appender!(char [])();
 enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19;
 foreach (i; 0 .. 50)
 {
      sink.clear();
      replaceAllInto(sink, text(uniform(min, max)), re, ",");
      // NOTE: size_t is an unsigned type, so it will wrap over when going
 below zero
      for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -=
 4)
          assert(sink.data[pos] == ',');
 }
I guess the idiomatic way is: foreach (pos; iota(0, sink.data.length, 4).retro) { ... }
 I didn't intentionally do that wrap-around trick, but I can't implicitly
 convert an unsigned type to a signed one either, and using to!long is
 equally as awkward, not to mention that it would not work for overly long
 strings (though that doesn't apply in the present case).
See above. -- Dmitry Olshansky
Oct 18 2015
parent reply Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Dmitry Olshansky wrote:

 I guess the idiomatic way is:
 
 foreach (pos; iota(0, sink.data.length, 4).retro)
Mmm no that throws an assertion failure, since the guarantee of the regex is only to insert commas before every third digit *from the end*, but doing retro on an *ascending* iota fixes the end point at index zero, and start point at some multiple of four *from the beginning*, which isn't compatible with the regex's promise. Thus the correct approach would be to do a *descending* iota: foreach (pos; iota(sink.data.length - 4, 0, -4)) assert(sink.data[pos] == ','); Pull: https://github.com/D-Programming-Language/phobos/pull/3731 I am not much experienced with GitHub pull requests, so I hope I did that correctly. IIUC after you pull, it will be back in my fork and its clone once I pull back from you, then I can delete my local branch, right? BTW since I still didn't like that wraparound thing, (and it would fail in some extreme corner cases where sink.data.length is > size_t.max - 3 or such, I think) I thought of it much and arrived at the much more convoluted: for (size_t pos = sink.data.length; pos > 3; ) { pos -= 4; assert(sink.data[pos] == ','); } which is still not easy for one's mind to "wraparound"! So the descending iota solution we arrived above is the most elegant. As I said, I'm still getting my "D-legs"... --
Oct 18 2015
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 18-Oct-2015 21:16, Shriramana Sharma wrote:
 Dmitry Olshansky wrote:

 I guess the idiomatic way is:

 foreach (pos; iota(0, sink.data.length, 4).retro)
Mmm no that throws an assertion failure, since the guarantee of the regex is only to insert commas before every third digit *from the end*, but doing retro on an *ascending* iota fixes the end point at index zero, and start point at some multiple of four *from the beginning*, which isn't compatible with the regex's promise. Thus the correct approach would be to do a *descending* iota: foreach (pos; iota(sink.data.length - 4, 0, -4)) assert(sink.data[pos] == ',');
That would miss the zero. iota is half-open [) interval. How about doing a bit of math and going for: iota(sink.data.length % 4, sink.data.length, 4) It really doesn't matter which way we go (forward/backward) but we must hit the right spots.
 Pull: https://github.com/D-Programming-Language/phobos/pull/3731

 I am not much experienced with GitHub pull requests, so I hope I did that
 correctly. IIUC after you pull, it will be back in my fork and its clone
 once I pull back from you, then I can delete my local branch, right?
Cool, will comment there.
 BTW since I still didn't like that wraparound thing, (and it would fail in
 some extreme corner cases where sink.data.length is > size_t.max - 3 or
 such, I think) I thought of it much and arrived at the much more convoluted:

          for (size_t pos = sink.data.length; pos > 3; )
              { pos -= 4; assert(sink.data[pos] == ','); }

 which is still not easy for one's mind to "wraparound"! So the descending
 iota solution we arrived above is the most elegant.

 As I said, I'm still getting my "D-legs"...
Welcome on board ;) -- Dmitry Olshansky
Oct 18 2015
prev sibling parent reply =?UTF-8?Q?S=c3=b6nke_Ludwig?= <sludwig outerproduct.org> writes:
Am 17.10.2015 um 12:06 schrieb Shriramana Sharma:
 Dmitry Olshansky wrote:

 Feel free to improve on the examples, I'd gladly pull such patches.
How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
How about going one step further: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); foreach (line; stdin.byLine()) { replaceAllInto(stdout.lockingTextWriter, line, re, ","); writeln(); }
Oct 19 2015
parent Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:
Sönke Ludwig wrote:

 auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");
 foreach (line; stdin.byLine())
 {
 replaceAllInto(stdout.lockingTextWriter, line, re, ",");
 writeln();
 }
Another option is to use KeepTerminator.yes. Using writeln() is shorter! :-) So this one does avoids all the extra allocations due to use of *Into, then? --
Oct 20 2015