digitalmars.D.learn - Searching strings with indexOf vs countUntil

Anonymouse (43/43) May 25 2017 I was profiling my program with callgrind and saw that a lot of

Kagamin (6/6) May 25 2017 I would guess indexOf returns a value suitable for indexing,

Basile B. (3/9) May 25 2017 To get rid of decoding he can cast to ubyte[]. I would do that if
Anonymouse (13/21) May 25 2017 I see. I would have thought indexOf would be more keen to decode,

Stanislav Blinov (5/10) May 25 2017 There's a less ugly cast in std.string:

Anonymouse <asdf asdf.net> writes:

I was profiling my program with callgrind and saw that a lot of 
time was spent in countUntil (std.algorithm.searching) on 
strings. I had chosen to use it instead of indexOf (std.string), 
with the plain assumption that countUntil wouldn't decode, while 
indexOf would.

Comparing microbenchmarks between the two I see that countUntil 
is some 7x slower.


import std.stdio : writeln;
import std.string : indexOf;
import std.algorithm.searching : countUntil;
import std.datetime;

enum line = ":zorael!~NaN asdf.asdf.asdf PRIVMSG #d :Hello 
world!";
enum iterations = 1_000_000;

void main()
{
     StopWatch sw;

     sw.start();
     foreach (i; 0..iterations)
     {
         line.indexOf(" :");
     }
     sw.stop();

     immutable usecsIndexOf = sw.peek().usecs;
	
     sw.reset();
     sw.start();
     foreach (i; 0..iterations)
     {
         line.countUntil(" :");
     }
     sw.stop();

     immutable usecsCountUntil = sw.peek().usecs;

     writeln("indexOf:    ", usecsIndexOf.usecs);
     writeln("countUntil: ", usecsCountUntil.usecs);
     writeln(cast(double)usecsCountUntil/usecsIndexOf, "x");
}


https://dpaste.dzfl.pl/0319edb79ec8 gives the output:

     indexOf:    160 ms and 534 μs
     countUntil: 1 sec, 146 ms, and 842 μs
     7.14392x

What is the fundamental difference between the two? When should I 
ever use countUntil on strings?

May 25 2017

Kagamin <spam here.lot> writes:

I would guess indexOf returns a value suitable for indexing, 
therefore it counts code units, while countUntil counts range 
elements - code points in case of a string. Also number of code 
points is not suitable for indexing an utf8 string, it can be 
used to allocate a dstring, but not so much for anything else. 
What do you use the resulting value for?

May 25 2017

Basile B. <b2.temp gmx.com> writes:

On Thursday, 25 May 2017 at 11:55:21 UTC, Kagamin wrote:
 I would guess indexOf returns a value suitable for indexing, 
 therefore it counts code units, while countUntil counts range 
 elements - code points in case of a string. Also number of code 
 points is not suitable for indexing an utf8 string, it can be 
 used to allocate a dstring, but not so much for anything else. 
 What do you use the resulting value for?

To get rid of decoding he can cast to ubyte[]. I would do that if 
sure that the input is only made of ascii chars.

May 25 2017

Anonymouse <asdf asdf.net> writes:

On Thursday, 25 May 2017 at 11:55:21 UTC, Kagamin wrote:
 I would guess indexOf returns a value suitable for indexing, 
 therefore it counts code units, while countUntil counts range 
 elements - code points in case of a string. Also number of code 
 points is not suitable for indexing an utf8 string, it can be 
 used to allocate a dstring, but not so much for anything else. 
 What do you use the resulting value for?

I see. I would have thought indexOf would be more keen to decode, 
but that's bias talking.

The project is an IRC bot. I use indexOf/used countUntil to slice 
up strings into one part leading up to some separator (" :" in 
that example), and another into everything after it. See 
https://dpaste.dzfl.pl/326e450058c1.


On Thursday, 25 May 2017 at 12:46:43 UTC, Basile B. wrote:
 To get rid of decoding he can cast to ubyte[]. I would do that 
 if sure that the input is only made of ascii chars.

Part of the strings I'm working with can be assumed to be only 
ASCII, yes. indexOf only wants strings or char[]s, but 
interestingly if I use the same benchmark but have countUntil 
work on raw ubyte[]s, it is faster. See 
https://dpaste.dzfl.pl/2e095f7d18be.

May 25 2017

Stanislav Blinov <stanislav.blinov gmail.com> writes:

On Thursday, 25 May 2017 at 18:13:15 UTC, Anonymouse wrote:

 Part of the strings I'm working with can be assumed to be only 
 ASCII, yes. indexOf only wants strings or char[]s, but 
 interestingly if I use the same benchmark but have countUntil 
 work on raw ubyte[]s, it is faster. See 
 https://dpaste.dzfl.pl/2e095f7d18be.

There's a less ugly cast in std.string:

import std.string : representation;

//(cast(ubyte[])line).countUntil(cast(ubyte[])" :");
line.representation.countUntil(" :".representation);

May 25 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Searching strings with indexOf vs countUntil