www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - New UTF-8 stride function

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
If there is anything that come out of UTF-8 discussion is that I decided 
to dust off my experimental implementation of UTF-8 stride function. 
Just for fun.

The key difference vs std is in handling non-ASCII case.
I'm replacing bsr intrinsic with a what I call an "in-register lookup 
table" (neat stuff that is a piece of cake, thx to CTFE).

See unittest/benchmark here:
https://gist.github.com/blackwhale/5653927

I'm running tests against wiki titles dumps.

For me the results are mixed but in 2 of 3 _builds_ my version 
consistently wins (sometimes by as much as 50%).

1. build is win32 dmd with -release -O -inline  -noboundscheck
	my version is consistently faster (results below)
2. On linux x64 with the same config
	my version is consistently slower

3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck
	my version is again faster with large margin

It's the kind of thing that is tremendously hard to measure accurately 
since it depends on the workload, architecture and the time spent is 
very small. So don't take it by word I'm almost certain that something 
is amiss (compiler switches and whatnot).

Thus I encourage curious folks to measure/analyze it and report back 
(don't forget to include your processor model).

The unbeatable advantage is however that my version doesn't require 
bsr/lzcount instruction :) BTW do ARM/PowerPC have any analog of it?

Test files I used:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0
https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0

Some dumps of my test runs

win32 runs (time taken in usec):

fast_stride ruwiki-latest-all-titles-in-ns0
stride 313756
myStride 229650
myStride 235091
stride 312563

fast_stride enwiki-latest-all-titles-in-ns0
stride 346577
myStride 279915
myStride 278684
stride 348902

fast_stride enwiki-latest-all-titles-in-ns0
stride 345866
myStride 280902
myStride 279780
stride 345653

fast_stride arwiki-latest-all-titles-in-ns0
stride 46548
myStride 33840
myStride 34959
stride 46342

fast_stride dewiki-latest-all-titles-in-ns0
stride 79715
myStride 64719
myStride 64672
stride 79848

dmd linux 64 runs

./fast_stride enwiki-latest-all-titles-in-ns0
stride 377258
myStride 630367
myStride 633262
stride 378523

./fast_stride arwiki-latest-all-titles-in-ns0
stride 33924
myStride 38807
myStride 47708
stride 40160

./fast_stride arwiki-latest-all-titles-in-ns0
stride 35110
myStride 39750
myStride 49942
stride 33597


-- 
Dmitry Olshansky
May 26 2013
next sibling parent reply "Kiith-Sa" <kiithsacmp gmail.com> writes:
WRT to the worse Linux64 case:
I recommend infinite-cycling it and testing in perf top.

(If you're on Ubuntu/derivative or maybe Debian, just type "perf 
top",
  it will tell you what package to install, and once installed, 
"perf top" again, while the benchmark is running)

You'll get a precise real-time line-wise (with ability to drill 
down to ASM) profile (like "top", but for functions).

With some command-line options (google "linux perf"), you can 
also look
at cache misses, branch mispredictions, and so on. Compare that 
with the original version and you might find why it's slower.

(Don't have time to test anything right now)


On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:
 If there is anything that come out of UTF-8 discussion is that 
 I decided to dust off my experimental implementation of UTF-8 
 stride function. Just for fun.

 The key difference vs std is in handling non-ASCII case.
 I'm replacing bsr intrinsic with a what I call an "in-register 
 lookup table" (neat stuff that is a piece of cake, thx to CTFE).

 See unittest/benchmark here:
 https://gist.github.com/blackwhale/5653927

 I'm running tests against wiki titles dumps.

 For me the results are mixed but in 2 of 3 _builds_ my version 
 consistently wins (sometimes by as much as 50%).

 1. build is win32 dmd with -release -O -inline  -noboundscheck
 	my version is consistently faster (results below)
 2. On linux x64 with the same config
 	my version is consistently slower

 3. LDC on linux x64 with ldc2 -O3 -d-noboundscheck
 	my version is again faster with large margin

 It's the kind of thing that is tremendously hard to measure 
 accurately since it depends on the workload, architecture and 
 the time spent is very small. So don't take it by word I'm 
 almost certain that something is amiss (compiler switches and 
 whatnot).

 Thus I encourage curious folks to measure/analyze it and report 
 back (don't forget to include your processor model).

 The unbeatable advantage is however that my version doesn't 
 require bsr/lzcount instruction :) BTW do ARM/PowerPC have any 
 analog of it?

 Test files I used:
 https://github.com/blackwhale/gsoc-bench-2012/blob/master/arwiki-latest-all-titles-in-ns0
 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
 https://github.com/blackwhale/gsoc-bench-2012/blob/master/dewiki-latest-all-titles-in-ns0
 https://github.com/blackwhale/gsoc-bench-2012/blob/master/ruwiki-latest-all-titles-in-ns0

 Some dumps of my test runs

 win32 runs (time taken in usec):

 fast_stride ruwiki-latest-all-titles-in-ns0
 stride 313756
 myStride 229650
 myStride 235091
 stride 312563

 fast_stride enwiki-latest-all-titles-in-ns0
 stride 346577
 myStride 279915
 myStride 278684
 stride 348902

 fast_stride enwiki-latest-all-titles-in-ns0
 stride 345866
 myStride 280902
 myStride 279780
 stride 345653

 fast_stride arwiki-latest-all-titles-in-ns0
 stride 46548
 myStride 33840
 myStride 34959
 stride 46342

 fast_stride dewiki-latest-all-titles-in-ns0
 stride 79715
 myStride 64719
 myStride 64672
 stride 79848

 dmd linux 64 runs

 ./fast_stride enwiki-latest-all-titles-in-ns0
 stride 377258
 myStride 630367
 myStride 633262
 stride 378523

 ./fast_stride arwiki-latest-all-titles-in-ns0
 stride 33924
 myStride 38807
 myStride 47708
 stride 40160

 ./fast_stride arwiki-latest-all-titles-in-ns0
 stride 35110
 myStride 39750
 myStride 49942
 stride 33597

May 26 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-May-2013 01:04, Kiith-Sa пишет:
 WRT to the worse Linux64 case:
 I recommend infinite-cycling it and testing in perf top.

 (If you're on Ubuntu/derivative or maybe Debian, just type "perf top",
   it will tell you what package to install, and once installed, "perf
 top" again, while the benchmark is running)

 You'll get a precise real-time line-wise (with ability to drill down to
 ASM) profile (like "top", but for functions).

 With some command-line options (google "linux perf"), you can also look
 at cache misses, branch mispredictions, and so on. Compare that with the
 original version and you might find why it's slower.

 (Don't have time to test anything right now)

Just tried it. Now I at least see that in 32bit my version is faster, whereas on 64bit it isn't (that is on DMD). One curiosity is that the code for ASCII case is the same yet even on English text the difference is about the same. Another one is that both function are not even partially inlined. -- Dmitry Olshansky
May 27 2013
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:
 It's the kind of thing that is tremendously hard to measure 
 accurately since it depends on the workload, architecture and 
 the time spent is very small. So don't take it by word I'm 
 almost certain that something is amiss (compiler switches and 
 whatnot).

For such cases, I found Agner's benchmarking utilities to be very useful. They print exact CPU statistics, such as numbers of micro-ops, cache misses, mispredicted branches, etc. I've used them very successfully when tuning my appender implementation. To use them with D, I modified his C++ program to load a DLL and call a function, taking the DLL and function names from the command line. Original program: http://www.agner.org/optimize/testp.zip My patch (to load a DLL): http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff Usage example (sort of): https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat Hope this helps :)
May 26 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-May-2013 01:13, Vladimir Panteleev пишет:
 On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:
 It's the kind of thing that is tremendously hard to measure accurately
 since it depends on the workload, architecture and the time spent is
 very small. So don't take it by word I'm almost certain that something
 is amiss (compiler switches and whatnot).

For such cases, I found Agner's benchmarking utilities to be very useful. They print exact CPU statistics, such as numbers of micro-ops, cache misses, mispredicted branches, etc. I've used them very successfully when tuning my appender implementation.

Yes, Agner is da man. Just hoped I could postpone this but... welcome to the micro-optimization world I guess.
 To use them with D, I modified his C++ program to load a DLL and call a
 function, taking the DLL and function names from the command line.

 Original program:
 http://www.agner.org/optimize/testp.zip

 My patch (to load a DLL):
 http://dump.thecybershadow.net/5f55e8be5f8cd38ad60f218957ef24bb/PMCTestB.diff


 Usage example (sort of):
 https://github.com/CyberShadow/DAppenderResearch/blob/master/go-dll.bat

 Hope this helps :)

Thanks, I'll try it out. But jezz I have win8 so I'd start with linux version :) -- Dmitry Olshansky
May 27 2013
prev sibling next sibling parent "Kiith-Sa" <kiithsacmp gmail.com> writes:
You don't need any instrumentation for perf and can get similar 
output (even in real time). (Don't intend to start a profiler 
war, but recommend looking at perf before messings with DLLs and 
the like) (although perf is is Linux-only)
I think the Windows version of AMD CodeAnalyst might have similar 
features, again without instrumentation, but never tried it 
(CodeAnalyst is gratis). Intel VTune definitely does, but it's 
not gratis.
May 26 2013
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 26 May 2013 at 21:19:12 UTC, Kiith-Sa wrote:
 You don't need any instrumentation for perf and can get similar 
 output (even in real time). (Don't intend to start a profiler 
 war, but recommend looking at perf before messings with DLLs 
 and the like) (although perf is is Linux-only)
 I think the Windows version of AMD CodeAnalyst might have 
 similar features, again without instrumentation, but never 
 tried it (CodeAnalyst is gratis). Intel VTune definitely does, 
 but it's not gratis.

Benchmarking and profiling are slightly different tasks. I've used AMD CodeAnalyst as well (although you need an AMD processor to use all of its features), but no profiler beats running one batch file and instantly getting a number to compare against.
May 26 2013
prev sibling next sibling parent reply "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Sunday, 26 May 2013 at 20:49:36 UTC, Dmitry Olshansky wrote:
 [..]
 Thus I encourage curious folks to measure/analyze it and report 
 back (don't forget to include your processor model).
 [..]

Ok, I just tested on my old trusty linux 64bit (Ubuntu 12.04). I had to download those *wiki* files from this url: https://github.com/blackwhale/gsoc-bench-2012/archive/master.zip because github gave an error trying to download the ones that are more than 10Mb. I ran the tests multiple times before copying them here. And here are the results: $ uname -a Linux lolita 3.2.0-43-generic #68-Ubuntu SMP Wed May 15 03:33:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ grep 'model.name\|cpu.MHz' /proc/cpuinfo model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ cpu MHz : 2109.518 $ grep MemTotal /proc/meminfo MemTotal: 4049780 kB $ dmd -O -inline -release -noboundscheck fast_stride.d $ for a in *wiki*; do echo; echo $a: ; ./fast_stride $a; done arwiki-latest-all-titles-in-ns0: stride 67681 myStride 55908 myStride 66026 stride 66328 dewiki-latest-all-titles-in-ns0: stride 155071 myStride 195449 myStride 196586 stride 154627 enwiki-latest-all-titles-in-ns0: stride 688482 myStride 879950 myStride 879451 stride 689087 ruwiki-latest-all-titles-in-ns0: stride 449133 myStride 364808 myStride 512485 stride 448841 --jm
May 26 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-May-2013 01:50, Juan Manuel Cabo пишет:
 And these are the results for the same linux 64bit system but compiling
 with -m32:

This is mostly in agreement with what I have on my 4-core AMD Phenom. About the same on Core i5-3550 at work. Looks like I indeed need 'clock for clock' analysis to draw any conclusion.
 $ dmd -m32 -O -inline -release -noboundscheck fast_stride.d

 $ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done

 arwiki-latest-all-titles-in-ns0:
 stride 89362
 myStride 49974
 myStride 51140
 stride 88308

 dewiki-latest-all-titles-in-ns0:
 stride 138381
 myStride 116971
 myStride 116662
 stride 139681

 enwiki-latest-all-titles-in-ns0:
 stride 584787
 myStride 490681
 myStride 490909
 stride 584694

 ruwiki-latest-all-titles-in-ns0:
 stride 585372
 myStride 333905
 myStride 341274
 stride 585050

 --jm

-- Dmitry Olshansky
May 27 2013
prev sibling next sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
And these are the results for the same linux 64bit system but 
compiling with -m32:

$ dmd -m32 -O -inline -release -noboundscheck fast_stride.d

$ for a in *wiki*; do echo ; echo $a: ; ./fast_stride $a; done

arwiki-latest-all-titles-in-ns0:
stride 89362
myStride 49974
myStride 51140
stride 88308

dewiki-latest-all-titles-in-ns0:
stride 138381
myStride 116971
myStride 116662
stride 139681

enwiki-latest-all-titles-in-ns0:
stride 584787
myStride 490681
myStride 490909
stride 584694

ruwiki-latest-all-titles-in-ns0:
stride 585372
myStride 333905
myStride 341274
stride 585050

--jm
May 26 2013
prev sibling parent reply Martin Nowak <code dawg.eu> writes:
On 05/26/2013 10:49 PM, Dmitry Olshansky wrote:
 If there is anything that come out of UTF-8 discussion is that I decided
 to dust off my experimental implementation of UTF-8 stride function.
 Just for fun.

 The key difference vs std is in handling non-ASCII case.
 I'm replacing bsr intrinsic with a what I call an "in-register lookup
 table" (neat stuff that is a piece of cake, thx to CTFE).

 See unittest/benchmark here:
 https://gist.github.com/blackwhale/5653927

 Test files I used:
 

 

 

 


bandwith.
May 27 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-May-2013 23:21, Martin Nowak пишет:
 On 05/26/2013 10:49 PM, Dmitry Olshansky wrote:
  > If there is anything that come out of UTF-8 discussion is that I decided
  > to dust off my experimental implementation of UTF-8 stride function.
  > Just for fun.
  >
  > The key difference vs std is in handling non-ASCII case.
  > I'm replacing bsr intrinsic with a what I call an "in-register lookup
  > table" (neat stuff that is a piece of cake, thx to CTFE).
  >
  > See unittest/benchmark here:
  > https://gist.github.com/blackwhale/5653927
  >
 Looks promising.

Cool, I'm not alone in this :) The only definitive results so far is that it takes less cycles on 32 bit. For me AMD CodeAnalyst confirms this is literally in cycles of up to 33% less with smaller samples in a loop. ASCII-only case seems to stay more or less the same (at least cycle-wise but not in time...) saving my sanity.
 These are huge and most likely the performance is limited by the memory
 bandwith.

That could be it. I'll be making measurement on smaller samples of said files and spin on them. More tests to come tomorrow. -- Dmitry Olshansky
May 27 2013
prev sibling parent reply Martin Nowak <code dawg.eu> writes:
On 05/27/2013 09:21 PM, Martin Nowak wrote:
  > See unittest/benchmark here:
  > https://gist.github.com/blackwhale/5653927
  >
 Looks promising.

This will not detect 0xFF as invalid UTF-8 sequence. For sequences with 5 or 6 bytes, that aren't used for unicode, it will return a stride of 4.
May 27 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
28-May-2013 00:42, Martin Nowak пишет:
 On 05/27/2013 09:21 PM, Martin Nowak wrote:
  > See unittest/benchmark here:
  > https://gist.github.com/blackwhale/5653927
  >
 Looks promising.

This will not detect 0xFF as invalid UTF-8 sequence. For sequences with 5 or 6 bytes, that aren't used for unicode, it will return a stride of 4.

First of all there is a minor bug in std.utf in a sense that it accepts sequences of 5 and 6 bytes. They are simply explicitly not defined per Unicode standard and should throw invalid UTF as well. OK I just need to consider the next bit making the whole mask 4bits wide. Thus I need 16 slots in a register. 64bit version will fit just fine in a register 4*16 = 64. 32bit version will have to go with packing 2bits per slot and doing +1 afterwards. Here is an updated version that I'm testing again: https://github.com/blackwhale/gsoc-bench-2012/blob/master/fast_stride.d -- Dmitry Olshansky
May 28 2013