www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 16517] New: topN performance very poor on certain data sets


          Issue ID: 16517
           Summary: topN performance very poor on certain data sets
           Product: D
           Version: D2
          Hardware: x86
                OS: Mac OS X
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody puremagic.com
          Reporter: jrdemail2000-dlang yahoo.com

topN has very poor performance on certain data sets, to the point of being

Note: There was a forum thread, issues (eg. 15553) and merged patches in Jan
2016. Best I can tell, these fixes are all included in the dmd/ldc releases
I've tested, indicating performance issues remain.

The data set used is not artificial, but the distribution of values are quite
skewed, this may contribute to the performance issues.

Data is a file from the google books ngram viewer project:
https://books.google.com/ngrams. File is a TSV file with 4 columns: ngram,
year, total occurrences, number of books occurred in. File used has 10,512,769
lines. Tests were run by reading one of the fields (10M values), storing in an
array, and running topN to get the median value. This was run for fields 2,3,4
and compared to the same algorithm using sort.

Sort was faster in each case, but for the fields 3 and 4 topN was dramatically
slower, to the point of being unusable.

Summary (Times in milliseconds, Compiler: LDC 1.1.0-beta2):

       Field 2    Field 3   Field 4
sort:      289        285       273
topN:     1756     148793    668906

As this was a median calculation, the nth argument to topN was the mid-point.
Some of the earlier fixes discuss performance issues for topN where nth is at
one end.

Data file:
It's the first file under "1-grams" from this URL:

=====Test program (median.d)=====

import std.stdio;
import std.array : appender;
import std.algorithm : sort, topN;
import std.conv;
import std.range;
import std.datetime: StopWatch;

void main()
    StopWatch swTotal, swSort;
    Appender!(double[]) values;
    foreach (line; stdin.byLine)

    size_t medianIndex = values.data.length/2;
        topN(values.data, medianIndex);
        auto method = "topN";
    else {
        auto method = "sort";

    writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s time
(ms): %d",
             method, values.data[medianIndex], values.data.length,
             method, swSort.peek.msecs);


Test session:

$ make
ldc2 -release -O -boundscheck=off -d-version=topn -ofmedian_by_topn median.d
ldc2 -release -O -boundscheck=off -ofmedian_by_sort median.d

$ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort 
median (via sort): 1972; lines: 10512769; total time (ms): 1756; sort time
(ms): 289
$ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn 
median (via topN): 1972; lines: 10512769; total time (ms): 3199; topN time
(ms): 1756

$ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort 
median (via sort): 3; lines: 10512769; total time (ms): 1549; sort time (ms):
$ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 3; lines: 10512769; total time (ms): 150033; topN time (ms):

$ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort 
median (via sort): 2; lines: 10512769; total time (ms): 1523; sort time (ms):
$ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 2; lines: 10512769; total time (ms): 670124; topN time (ms):

$ ldc2 --version
LDC - the LLVM D compiler (1.1.0-beta2):
  based on DMD v2.071.1 and LLVM 3.9.0
  built with LDC - the LLVM D compiler (1.1.0-beta2)
  Default target: x86_64-apple-darwin14.5.0

Sep 20 2016