www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - D vs Haskell

reply "Szymon Gatner" <noemail gmail.com> writes:
Word counting problem in D and Haskell:

http://leonardo-m.livejournal.com/109201.html
Jun 22 2013
next sibling parent reply Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:
On 06/22/2013 03:25 PM, Szymon Gatner wrote:
 Word counting problem in D and Haskell:
 
 http://leonardo-m.livejournal.com/109201.html

I thought that the D time could be improved further with little changes. Testing against Complete Works of William Shakespeare (5.3 MiB plaintext): http://www.gutenberg.org/ebooks/100 and using "dmd -O -inline -noboundscheck -release" on both the last version of the author, and my version using "byWord" I got these times (minimum of 10 runs): before: 2781 ms after: 805 ms Here is the code, with a "byWord" range using std.ascii.toAlpha: import std.stdio, std.conv, std.file, std.string, std.algorithm, std.range, std.traits, std.ascii; auto hashCounter(R)(R items) if (isForwardRange!R) { size_t[ForeachType!R] result; foreach (x; items) result[x]++; return result.byKey.zip(result.byValue); } void main(string[] args) { //Slow: // args[1] // .readText // .toLower // .tr("A-Za-z", "\n", "cs") // .split //Faster: args[1] .readText .byWord .map!toLower() .array .hashCounter .array .sort!"-a[1] < -b[1]"() .take(args[2].to!uint) .map!q{ text(a[0], " ", a[1]) } .join("\n") .writeln; } /** Range that extracts words from a string. Words are strings composed only of chars accepted by std.ascii.toAlpha() */ struct byWord { string s; size_t pos; string word; this(string s) { this.s = s; popFront(); } property bool empty() const { return s.length == 0; } property string front() { return word; } void popFront() { if (pos == s.length) { //Mark the range as empty, only after popFront fails: s = null; return; } while (pos < s.length && !std.ascii.isAlpha(s[pos])) { ++pos; } auto start = pos; while (pos < s.length && std.ascii.isAlpha(s[pos])) { ++pos; } if (start == s.length) { //No more words. Range empty: s = null; } else { word = s[start .. pos]; } } } unittest { assert([] == array(byWord(""))); assert([] == array(byWord("! #$"))); assert(["a", "b"] == array(byWord("a b"))); assert(["a", "b", "c"] == array(byWord("a b c"))); } --jm
Jun 22 2013
parent reply Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:
On 06/22/2013 06:29 PM, Joseph Rushton Wakeling wrote:
 On 06/22/2013 10:11 PM, Juan Manuel Cabo wrote:
 Testing against Complete Works of William Shakespeare (5.3 MiB
 plaintext): http://www.gutenberg.org/ebooks/100
 and using "dmd -O -inline -noboundscheck -release" on both
 the last version of the author, and my version using "byWord"
 I got these times (minimum of 10 runs):

     before: 2781 ms

     after:   805 ms

What about with GDC or LDC?

Right, the author of the article used ldc. I'm always used to dmd. Keep in mind that he hasn't released the 9MB text that he used, but instead pointed to that 5.3MB Shakespeare file. This time I used LDC 0.11.0, with: ldc2 -O -release -disable-boundscheck Times (minimum of 10 runs): before: 1617 ms after: 554 ms Conclusion: using "byWords" was 3.4 times faster with DMD and 2.9 times faster with LDC. --jm
Jun 22 2013
parent Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:
On 06/23/2013 05:42 AM, Joseph Rushton Wakeling wrote:
 On Saturday, 22 June 2013 at 21:45:48 UTC, Juan Manuel Cabo wrote:
 Right, the author of the article used ldc. I'm always used
 to dmd.

You know that you can use ldmd2 to invoke LDC using the same flags as DMD?
 This time I used LDC 0.11.0, with:

      ldc2 -O -release -disable-boundscheck

 Times (minimum of 10 runs):

      before: 1617 ms

      after:  554 ms

 Conclusion: using "byWords" was 3.4 times faster with DMD
 and 2.9 times faster with LDC.

Any chance of GDC results? You can again use gdmd for DMD-like interface.

The computer where I tested has Kubuntu 12.04, which is too old even for gcc-4.7, and I cannot compile GDC there. The gdc version which comes in the repository fails to compile the program, even after some rearranging so that it doesn't use UFCS. Anyway, using my "byWord" range to iterate the string will surely be faster with GDC too, because tr and split can be a little slow, especially since the string is big (5MB). I tested a little more and found that using a handcoded function instead of tr, and using splitter instead of split, it's almost as fast as the byWord version. The handcoded function assumes ascii and preallocates the resulting string (with result.length = source.length). --jm
Jun 23 2013
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 06/22/2013 10:11 PM, Juan Manuel Cabo wrote:
 Testing against Complete Works of William Shakespeare (5.3 MiB
 plaintext): http://www.gutenberg.org/ebooks/100
 and using "dmd -O -inline -noboundscheck -release" on both
 the last version of the author, and my version using "byWord"
 I got these times (minimum of 10 runs):
 
     before: 2781 ms
 
     after:   805 ms

What about with GDC or LDC?
Jun 22 2013
prev sibling parent "Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:
On Saturday, 22 June 2013 at 21:45:48 UTC, Juan Manuel Cabo wrote:
 Right, the author of the article used ldc. I'm always used
 to dmd.

You know that you can use ldmd2 to invoke LDC using the same flags as DMD?
 This time I used LDC 0.11.0, with:

      ldc2 -O -release -disable-boundscheck

 Times (minimum of 10 runs):

      before: 1617 ms

      after:  554 ms

 Conclusion: using "byWords" was 3.4 times faster with DMD
 and 2.9 times faster with LDC.

Any chance of GDC results? You can again use gdmd for DMD-like interface.
Jun 23 2013