digitalmars.D.announce - 4x faster strlen with 4 char sentinel

Jay Norwood (16/16) Jun 26 2016 After watching Andre's sentinel thing, I'm playing with strlen on

David Nadlinger (13/24) Jun 26 2016 Please keep general discussions like this off the announce list,

Jay Norwood (11/22) Jun 26 2016 This is more of just an observation that the choice of the single

chmike (2/2) Jun 26 2016 If you store the string size in a four byte field, you get a hard

chmike (10/12) Jun 26 2016 If you have the constrain to work with null terminated strings,

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= (9/12) Jun 26 2016 Not only to save space, some CPUs also had cheap incrementing

Jay Norwood (10/15) Jun 27 2016 This strlen2 doesn't require special alignment or casting of char

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= (12/15) Jun 27 2016 Yes, and the idea of speeding up strings by padding out with

Jay Norwood (14/21) Jun 27 2016 Your link's use of padding pads out with a variable number of

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= (17/22) Jun 27 2016 That's true, and it is fun to think about different string

Jay Norwood (18/27) Jun 27 2016 I doubt the external memory accesses are involved in these

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= (5/8) Jun 27 2016 But this is a rather atypical use case for zero terminated

Mike Parker (4/10) Jun 27 2016 The same reason float and double are default initialized to nan.

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= (17/25) Jun 27 2016 I incidentally just found my 1992 implementation for Motorola

Brad Roberts via Digitalmars-d-announce (7/31) Jun 27 2016 Yup.. there's a reason that many many hours have been spent optimizing

deadalnix (3/19) Jun 27 2016 If we were in interview, I'd ask you "what does this returns if

Jay Norwood (15/17) Jun 27 2016 I'd say use this one instead, to avoid negative size_t. It is

Jay Norwood (11/14) Jun 27 2016 oops. I see ... need to test for empty string.

Bauss (7/21) Jun 27 2016 Why not just do

deadalnix (2/19) Jun 28 2016 If we were in an interview, I would insist.

Sebastiaan Koppe (5/27) Jun 28 2016 Since no one is answering:

Jay Norwood (4/10) Jun 28 2016 Yes, the test checked for 0 length but not with a preceding 0. I

qznc (3/19) Jun 28 2016 Did you also compare to strlen from libc? I'd guess GNU libc uses

Jay Norwood (16/18) Jun 28 2016 I did test with the libc strlen, although the D libraries did not

Jay Norwood <jayn prismnet.com> writes:

After watching Andre's sentinel thing, I'm playing with strlen on 
char strings with 4 terminating 0s instead of a single one.  
Seems to work and is 4x faster compared to the runtime version.

nothrow pure size_t strlen2(const(char)* c) {
  if (c is null)
    return 0;
  size_t l=0;
  while (*c){ c+=4; l+=4;}
  while (*c==0){ c--; l--;}
  return l+1;
}

This is the timing of my test case, which I can post if anyone is 
interested.
strlen\Release>strlen
2738
681

Jun 26 2016

David Nadlinger <code klickverbot.at> writes:

Hi Jay,

On Sunday, 26 June 2016 at 16:40:08 UTC, Jay Norwood wrote:
 After watching Andre's sentinel thing, I'm playing with strlen 
 on char strings with 4 terminating 0s instead of a single one.  
 Seems to work and is 4x faster compared to the runtime version.

Please keep general discussions like this off the announce list, 
which would e.g. be suitable for announcing a fleshed out 
collection of high-performance string handling routines.

 nothrow pure size_t strlen2(const(char)* c) {
  if (c is null)
    return 0;
  size_t l=0;
  while (*c){ c+=4; l+=4;}
  while (*c==0){ c--; l--;}
  return l+1;
 }

A couple of quick hints:
  - This is not a correct implementation of strlen, as it already 
assumes that the array is terminated by four zero bytes. That 
iterating memory with a stride of 4 instead of 1 will be faster 
is a self-evident truth.
  - You should be benchmarking against a "proper" SIMD-optimised 
strlen implementation.

  — David

Jun 26 2016

Jay Norwood <jayn prismnet.com> writes:

On Sunday, 26 June 2016 at 16:59:54 UTC, David Nadlinger wrote:
 Please keep general discussions like this off the announce 
 list, which would e.g. be suitable for announcing a fleshed out 
 collection of high-performance string handling routines.

 A couple of quick hints:
  - This is not a correct implementation of strlen, as it 
 already assumes that the array is terminated by four zero 
 bytes. That iterating memory with a stride of 4 instead of 1 
 will be faster is a self-evident truth.
  - You should be benchmarking against a "proper" SIMD-optimised 
 strlen implementation.

  — David


This is more of just an observation that the choice of the single 
zero sentinel for C string termination comes at a cost of 4x 
strlen speed vs using four terminating zeros.

I don't see a SIMD strlen implementation in the D libraries.

The strlen2 function I posted works on any string that is 
terminated by four zeros, and returns the same len as strlen in 
that case, but much faster.

How to get strings initialized with four terminating zeros at 
compile time is a separate issue.  I don't know the solution, 
else I might consider doing more with this.

Jun 26 2016

chmike <christophe meessen.net> writes:

If you store the string size in a four byte field, you get a hard 
to beat fast strlen. :)

Jun 26 2016

chmike <christophe meessen.net> writes:

On Monday, 27 June 2016 at 05:10:37 UTC, chmike wrote:
 If you store the string size in a four byte field, you get a 
 hard to beat fast strlen. :)

If you have the constrain to work with null terminated strings, 
then it could be interesting to look at SIMD as was suggested or 
use one of the following algorithms to test the presence of a 0 
byte in a 4 byte or 8 byte integer when simd is not available.

https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord

Ending strings with a single null byte/char is to save space. It 
was critical in the 70´s when C was created and memory space was 
very limited. That's not the case anymore and I guess the reason 
why you assume that ending a string with 4 null chars is OK.

Jun 26 2016

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= writes:

On Monday, 27 June 2016 at 05:27:12 UTC, chmike wrote:
 Ending strings with a single null byte/char is to save space. 
 It was critical in the 70´s when C was created and memory space 
 was very limited. That's not the case anymore and I guess the

Not only to save space, some CPUs also had cheap incrementing 
load/stores and branching on zero is faster than sacrificing 
another register for a counter.

IIRC Pascal has a short string with a single byte length.

Besides there are plenty of other advantages to using a 
terminating sentinel depending on the use scenario. E.g. if you 
want many versions of the same tail or if you are splitting a 
string at white space (overwrite a white space char with a zero).

Jun 26 2016

Jay Norwood <jayn prismnet.com> writes:

On Monday, 27 June 2016 at 06:31:49 UTC, Ola Fosheim Grøstad 
wrote:
 Besides there are plenty of other advantages to using a 
 terminating sentinel depending on the use scenario. E.g. if you 
 want many versions of the same tail or if you are splitting a 
 string at white space (overwrite a white space char with a 
 zero).

This strlen2 doesn't require special alignment or casting of char 
pointer types to some larger type. That keeps the strlen2 
implementation fairly simple.

The implementation is only testing one char per increment.  It 
doesn't require the extra xor processing used in some of the 
examples.

I haven't checked if there is a strlen for dchar or wchar, but it 
should also speed up those.

Jun 27 2016

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= writes:

On Monday, 27 June 2016 at 16:22:56 UTC, Jay Norwood wrote:
 This strlen2 doesn't require special alignment or casting of 
 char pointer types to some larger type. That keeps the strlen2 
 implementation fairly simple.

Yes, and the idea of speeding up strings by padding out with 
zeros is not new. ;-) I recall suggesting it back in 1999 when 
discussing the benefits of having a big endian cpu when sorting 
strings. If it is big endian you can compare ascii as 32/64 bit 
integers, so if you align the string and pad out with zeros then 
you can speed up strcmp() by a significant factor. Oh, here it is:

http://disinterest.org/resource/MUD-Dev/1999q1/009759.html

Of course, this is all moot now, little endian + simd has made 
such tricks redundant. Simd probably makes your strlen2 redundant 
too. The bottle neck tends to be memory access/prefetching for 
simple algorithms.

Jun 27 2016

Jay Norwood <jayn prismnet.com> writes:

On Monday, 27 June 2016 at 16:38:58 UTC, Ola Fosheim Grøstad 
wrote:
 Yes, and the idea of speeding up strings by padding out with 
 zeros is not new. ;-) I recall suggesting it back in 1999 when 
 discussing the benefits of having a big endian cpu when sorting 
 strings. If it is big endian you can compare ascii as 32/64 bit 
 integers, so if you align the string and pad out with zeros 
 then you can speed up strcmp() by a significant factor. Oh, 
 here it is:

Your link's use of padding pads out with a variable number of 
zeros, so that a larger data type can be used for the compare 
operations.  This isn't the same as my example, which is simpler 
due to not having to fiddle with alignment and data type casting.

I didn't find a strlen implementation for dchar or wchar in the D 
libraries.

I also found it strange, the non-zero initialization values for 
char, dchar, wchar.  I suppose there's some reason?

int [100]  to zeros.
char [100]  to 0xff;
dchar [100]   to 0xffff;
wchar [100]   to 0xffff;

Jun 27 2016

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= writes:

On Monday, 27 June 2016 at 19:51:48 UTC, Jay Norwood wrote:
 Your link's use of padding pads out with a variable number of 
 zeros, so that a larger data type can be used for the compare 
 operations.  This isn't the same as my example, which is 
 simpler due to not having to fiddle with alignment and data 
 type casting.

That's true, and it is fun to think about different string 
implementations. Just keep in mind that prior to the 90s, text 
was the essential datatype for many programmers and inventing new 
ways to do strings is heavily explored. I remember the first 
exercise we got at the university when doing the OS course was to 
implement "strlen", "strcpy" and "strcmp" in C or machine 
language. It can be fun.

Just keep in mind that the major bottleneck now is loading 64 
bytes from memory into cache. So if you test performance you have 
to make sure to invalidate the caches before you test and test 
with spurious reads over a very large memory area to get 
realistic results.

But essentially, the operation is not heavy, so to speed it up 
you need to predict and prefetch from memory in time, meaning no 
library solution is sufficient. (you need to prefetch memory way 
before your library function is called)

Jun 27 2016

Jay Norwood <jayn prismnet.com> writes:

On Monday, 27 June 2016 at 20:43:40 UTC, Ola Fosheim Grøstad 
wrote:
 Just keep in mind that the major bottleneck now is loading 64 
 bytes from memory into cache. So if you test performance you 
 have to make sure to invalidate the caches before you test and 
 test with spurious reads over a very large memory area to get 
 realistic results.

 But essentially, the operation is not heavy, so to speed it up 
 you need to predict and prefetch from memory in time, meaning 
 no library solution is sufficient. (you need to prefetch memory 
 way before your library function is called)

I doubt the external memory accesses are involved in these 
measurements. I'm using a 100KB char array terminated by four 
zeros, and doing strlen on substring pointers into it incremented 
by 1 for 100K times.  The middle of the three timings is for 
strlen2, while the two outer timings are for strlen during the 
same program execution.

I'm initializing the 100KB immediately prior to the measurement. 
The 100KB array should all be in L1 or L2 cache by the time I 
make even the first of the three time measurements.

The prefetch shouldn't have a problem predicting this.

2749
688
2783

2741
683
2738

Jun 27 2016

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= writes:

On Monday, 27 June 2016 at 21:41:57 UTC, Jay Norwood wrote:
 measurements. I'm using a 100KB char array terminated by four 
 zeros, and doing strlen on substring pointers into it 
 incremented by 1 for 100K times.

But this is a rather atypical use case for zero terminated 
strings? It would make more sense to test it on a massive amount 
of short strings stuffed into a very large hash-table. 
(filenames, keys, user names, email adresses etc)

Jun 27 2016

Mike Parker <aldacron gmail.com> writes:

On Monday, 27 June 2016 at 19:51:48 UTC, Jay Norwood wrote:

 I also found it strange, the non-zero initialization values for 
 char, dchar, wchar.  I suppose there's some reason?

 int [100]  to zeros.
 char [100]  to 0xff;
 dchar [100]   to 0xffff;
 wchar [100]   to 0xffff;

The same reason float and double are default initialized to nan. 
char, wchar and dchar are default initialized to invalid unicode 
values.

Jun 27 2016

Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= writes:

On Monday, 27 June 2016 at 06:31:49 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 27 June 2016 at 05:27:12 UTC, chmike wrote:
 Ending strings with a single null byte/char is to save space. 
 It was critical in the 70´s when C was created and memory 
 space was very limited. That's not the case anymore and I 
 guess the

 Not only to save space, some CPUs also had cheap incrementing 
 load/stores and branching on zero is faster than sacrificing 
 another register for a counter.

I incidentally just found my 1992 implementation for Motorola 
68K, to illustrate:

_mystrcpy		
		move.l	4(sp),a1	; pointer for destination
		move.l	8(sp),a0	; pointer for source

mystrcpy	move.l	a0,d0
1$		move.b	(a0)+,(a1)+ ; copy
		bne.s	1$		   ; jump back up if not zero
		rts

As you can see it is a tight loop. Other CPUs are even tighter, 
and have single-instruction loops (even 8086?)

So not only storage, also performance on specific CPUs. Which is 
a good reason for keeping datatypes in standard libraries 
abstract, different CPUs favour different representations. Even 
on very basic datatypes.

Jun 27 2016

Brad Roberts via Digitalmars-d-announce writes:

On 6/26/2016 11:47 AM, Jay Norwood via Digitalmars-d-announce wrote:
 On Sunday, 26 June 2016 at 16:59:54 UTC, David Nadlinger wrote:
 Please keep general discussions like this off the announce list, which
 would e.g. be suitable for announcing a fleshed out collection of
 high-performance string handling routines.

 A couple of quick hints:
  - This is not a correct implementation of strlen, as it already
 assumes that the array is terminated by four zero bytes. That
 iterating memory with a stride of 4 instead of 1 will be faster is a
 self-evident truth.
  - You should be benchmarking against a "proper" SIMD-optimised strlen
 implementation.

  — David


 This is more of just an observation that the choice of the single zero
 sentinel for C string termination comes at a cost of 4x strlen speed vs
 using four terminating zeros.

 I don't see a SIMD strlen implementation in the D libraries.

 The strlen2 function I posted works on any string that is terminated by
 four zeros, and returns the same len as strlen in that case, but much
 faster.

 How to get strings initialized with four terminating zeros at compile
 time is a separate issue.  I don't know the solution, else I might
 consider doing more with this.

Yup.. there's a reason that many many hours have been spent optimizing 
strlen and other memory related length and comparison routines.  They 
are used a lot and the number of ways of making them fast varies almost 
as much as the number of cpu's that exist.  This effort is embedded in 
the code gen of compilers (other than dmd) and libc runtimes.  Trying to 
re-invent it is noble, and very educational, but largely redundant.

Jun 27 2016

deadalnix <deadalnix gmail.com> writes:

On Sunday, 26 June 2016 at 16:40:08 UTC, Jay Norwood wrote:
 After watching Andre's sentinel thing, I'm playing with strlen 
 on char strings with 4 terminating 0s instead of a single one.  
 Seems to work and is 4x faster compared to the runtime version.

 nothrow pure size_t strlen2(const(char)* c) {
  if (c is null)
    return 0;
  size_t l=0;
  while (*c){ c+=4; l+=4;}
  while (*c==0){ c--; l--;}
  return l+1;
 }

 This is the timing of my test case, which I can post if anyone 
 is interested.
 strlen\Release>strlen
 2738
 681

If we were in interview, I'd ask you "what does this returns if 
you pass it an empty string ?"

Jun 27 2016

Jay Norwood <jayn prismnet.com> writes:

On Tuesday, 28 June 2016 at 01:53:22 UTC, deadalnix wrote:
 If we were in interview, I'd ask you "what does this returns if 
 you pass it an empty string ?"

I'd say use this one instead, to avoid negative size_t. It is 
also a little faster for the same measurement.

nothrow pure size_t strlen2(const(char)* c) {
   if (c is null)
      return 0;
   const(char)* c_save = c;
   while (*c){ c+=4; }
   while (*c==0){ c--; }
   c++;
   return c - c_save;
}

2738
540
2744

Jun 27 2016

Jay Norwood <jayn prismnet.com> writes:

On Tuesday, 28 June 2016 at 03:11:26 UTC, Jay Norwood wrote:
 On Tuesday, 28 June 2016 at 01:53:22 UTC, deadalnix wrote:
 If we were in interview, I'd ask you "what does this returns 
 if you pass it an empty string ?"


oops.  I see ... need to test for empty string.

nothrow pure size_t strlen2(const(char)* c) {
   if (c is null || *c==0)
      return 0;
   const(char)* c_save = c;
   while (*c){ c+=4; }
   while (*c==0){ c--; }
   c++;
   return c - c_save;
}

Jun 27 2016

Bauss <jj_1337 live.dk> writes:

On Tuesday, 28 June 2016 at 03:58:23 UTC, Jay Norwood wrote:
 On Tuesday, 28 June 2016 at 03:11:26 UTC, Jay Norwood wrote:
 On Tuesday, 28 June 2016 at 01:53:22 UTC, deadalnix wrote:
 If we were in interview, I'd ask you "what does this returns 
 if you pass it an empty string ?"


 oops.  I see ... need to test for empty string.

 nothrow pure size_t strlen2(const(char)* c) {
   if (c is null || *c==0)
      return 0;
   const(char)* c_save = c;
   while (*c){ c+=4; }
   while (*c==0){ c--; }
   c++;
   return c - c_save;
 }

Why not just do

nothrow pure size_t strlen2(const(char)* c) {
   if (!c)
     return 0;

   ...
}

Jun 27 2016

deadalnix <deadalnix gmail.com> writes:

On Tuesday, 28 June 2016 at 03:11:26 UTC, Jay Norwood wrote:
 On Tuesday, 28 June 2016 at 01:53:22 UTC, deadalnix wrote:
 If we were in interview, I'd ask you "what does this returns 
 if you pass it an empty string ?"

 I'd say use this one instead, to avoid negative size_t. It is 
 also a little faster for the same measurement.

 nothrow pure size_t strlen2(const(char)* c) {
   if (c is null)
      return 0;
   const(char)* c_save = c;
   while (*c){ c+=4; }
   while (*c==0){ c--; }
   c++;
   return c - c_save;
 }

 2738
 540
 2744

If we were in an interview, I would insist.

Jun 28 2016

Sebastiaan Koppe <mail skoppe.eu> writes:

On Tuesday, 28 June 2016 at 01:53:22 UTC, deadalnix wrote:
 On Sunday, 26 June 2016 at 16:40:08 UTC, Jay Norwood wrote:
 After watching Andre's sentinel thing, I'm playing with strlen 
 on char strings with 4 terminating 0s instead of a single one.
  Seems to work and is 4x faster compared to the runtime 
 version.

 nothrow pure size_t strlen2(const(char)* c) {
  if (c is null)
    return 0;
  size_t l=0;
  while (*c){ c+=4; l+=4;}
  while (*c==0){ c--; l--;}
  return l+1;
 }

 This is the timing of my test case, which I can post if anyone 
 is interested.
 strlen\Release>strlen
 2738
 681

 If we were in interview, I'd ask you "what does this returns if 
 you pass it an empty string ?"

Since no one is answering:

It depends on the memory right before c. But if there is at least 
one 0 right before it - which is quite likely - then you get some 
crazy big number returned.

Jun 28 2016

Jay Norwood <jayn prismnet.com> writes:

On Tuesday, 28 June 2016 at 09:31:46 UTC, Sebastiaan Koppe wrote:
 If we were in interview, I'd ask you "what does this returns 
 if you pass it an empty string ?"

 Since no one is answering:

 It depends on the memory right before c. But if there is at 
 least one 0 right before it - which is quite likely - then you 
 get some crazy big number returned.

Yes, the test checked for 0 length but not with a preceding 0.  I 
posted the fix.

   if (c is null || *c==0)

Jun 28 2016

qznc <qznc web.de> writes:

On Sunday, 26 June 2016 at 16:40:08 UTC, Jay Norwood wrote:
 After watching Andre's sentinel thing, I'm playing with strlen 
 on char strings with 4 terminating 0s instead of a single one.  
 Seems to work and is 4x faster compared to the runtime version.

 nothrow pure size_t strlen2(const(char)* c) {
  if (c is null)
    return 0;
  size_t l=0;
  while (*c){ c+=4; l+=4;}
  while (*c==0){ c--; l--;}
  return l+1;
 }

 This is the timing of my test case, which I can post if anyone 
 is interested.
 strlen\Release>strlen
 2738
 681

Did you also compare to strlen from libc? I'd guess GNU libc uses 
a lot more tricks like vector instructions.

Jun 28 2016

Jay Norwood <jayn prismnet.com> writes:

On Tuesday, 28 June 2016 at 09:18:34 UTC, qznc wrote:
 Did you also compare to strlen from libc? I'd guess GNU libc 
 uses a lot more tricks like vector instructions.

I did test with the libc strlen, although the D libraries did not 
have a strlen for dchar or wchar.  I'm currently using this for 
comparison, and playing around with shorter string lengths:

nothrow pure size_t strlen(const(char)* c) {
   if (c is null )
     return 0;
   const(char)* c_save = c;
   while (*c) {
     c++;
   }
   return c - c_save;
}

I'm also trying some tests on a PA device where I have tools to 
look at cache hits, misses, branches mispredicted.  Similar C 
code.

Jun 28 2016

D Programming

C/C++ Programming

Other

digitalmars.D.announce - 4x faster strlen with 4 char sentinel