digitalmars.D.bugs - [Issue 16517] New: topN performance very poor on certain data sets
- via Digitalmars-d-bugs (101/101) Sep 20 2016 https://issues.dlang.org/show_bug.cgi?id=16517
https://issues.dlang.org/show_bug.cgi?id=16517 Issue ID: 16517 Summary: topN performance very poor on certain data sets Product: D Version: D2 Hardware: x86 OS: Mac OS X Status: NEW Severity: enhancement Priority: P1 Component: phobos Assignee: nobody puremagic.com Reporter: jrdemail2000-dlang yahoo.com topN has very poor performance on certain data sets, to the point of being unusable. Note: There was a forum thread, issues (eg. 15553) and merged patches in Jan 2016. Best I can tell, these fixes are all included in the dmd/ldc releases I've tested, indicating performance issues remain. The data set used is not artificial, but the distribution of values are quite skewed, this may contribute to the performance issues. Data is a file from the google books ngram viewer project: https://books.google.com/ngrams. File is a TSV file with 4 columns: ngram, year, total occurrences, number of books occurred in. File used has 10,512,769 lines. Tests were run by reading one of the fields (10M values), storing in an array, and running topN to get the median value. This was run for fields 2,3,4 and compared to the same algorithm using sort. Sort was faster in each case, but for the fields 3 and 4 topN was dramatically slower, to the point of being unusable. Summary (Times in milliseconds, Compiler: LDC 1.1.0-beta2): Field 2 Field 3 Field 4 sort: 289 285 273 topN: 1756 148793 668906 As this was a median calculation, the nth argument to topN was the mid-point. Some of the earlier fixes discuss performance issues for topN where nth is at one end. Data file: http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-0.gz It's the first file under "1-grams" from this URL: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html =====Test program (median.d)===== import std.stdio; import std.array : appender; import std.algorithm : sort, topN; import std.conv; import std.range; import std.datetime: StopWatch; void main() { StopWatch swTotal, swSort; swTotal.start; Appender!(double[]) values; foreach (line; stdin.byLine) values.put(line.to!double); size_t medianIndex = values.data.length/2; swSort.start; version(topn) { topN(values.data, medianIndex); auto method = "topN"; } else { sort(values.data); auto method = "sort"; } swTotal.stop; swSort.stop; writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s time (ms): %d", method, values.data[medianIndex], values.data.length, swTotal.peek.msecs, method, swSort.peek.msecs); } ========================== Test session: $ make ldc2 -release -O -boundscheck=off -d-version=topn -ofmedian_by_topn median.d ldc2 -release -O -boundscheck=off -ofmedian_by_sort median.d $ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 1972; lines: 10512769; total time (ms): 1756; sort time (ms): 289 $ gcut -f 2 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 1972; lines: 10512769; total time (ms): 3199; topN time (ms): 1756 $ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 3; lines: 10512769; total time (ms): 1549; sort time (ms): 285 $ gcut -f 3 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 3; lines: 10512769; total time (ms): 150033; topN time (ms): 148793 $ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort median (via sort): 2; lines: 10512769; total time (ms): 1523; sort time (ms): 273 $ gcut -f 4 googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn median (via topN): 2; lines: 10512769; total time (ms): 670124; topN time (ms): 668906 $ ldc2 --version LDC - the LLVM D compiler (1.1.0-beta2): based on DMD v2.071.1 and LLVM 3.9.0 built with LDC - the LLVM D compiler (1.1.0-beta2) Default target: x86_64-apple-darwin14.5.0 --
Sep 20 2016