digitalmars.D - D vs Haskell

Szymon Gatner (2/2) Jun 22 2013 Word counting problem in D and Haskell:

Juan Manuel Cabo (84/87) Jun 22 2013 I thought that the D time could be improved further with

Joseph Rushton Wakeling (2/11) Jun 22 2013 What about with GDC or LDC?

Juan Manuel Cabo (13/26) Jun 22 2013 Right, the author of the article used ldc. I'm always used

Joseph Rushton Wakeling (5/14) Jun 23 2013 You know that you can use ldmd2 to invoke LDC using the same

Juan Manuel Cabo (15/36) Jun 23 2013 The computer where I tested has Kubuntu 12.04, which is

"Szymon Gatner" <noemail gmail.com> writes:

Word counting problem in D and Haskell:

http://leonardo-m.livejournal.com/109201.html

Jun 22 2013

Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:

On 06/22/2013 03:25 PM, Szymon Gatner wrote:
 Word counting problem in D and Haskell:
 
 http://leonardo-m.livejournal.com/109201.html

I thought that the D time could be improved further with
little changes.

Testing against Complete Works of William Shakespeare (5.3 MiB
plaintext): http://www.gutenberg.org/ebooks/100
and using "dmd -O -inline -noboundscheck -release" on both
the last version of the author, and my version using "byWord"
I got these times (minimum of 10 runs):

    before: 2781 ms

    after:   805 ms


Here is the code, with a "byWord" range using std.ascii.toAlpha:


    import std.stdio, std.conv, std.file, std.string,
           std.algorithm, std.range, std.traits, std.ascii;


    auto hashCounter(R)(R items) if (isForwardRange!R) {
        size_t[ForeachType!R] result;
        foreach (x; items)
            result[x]++;
        return result.byKey.zip(result.byValue);
    }

    void main(string[] args) {
        //Slow:
        //      args[1]
        //      .readText
        //      .toLower
        //      .tr("A-Za-z", "\n", "cs")
        //      .split

        //Faster:
        args[1]
        .readText
        .byWord
        .map!toLower()
        .array
        .hashCounter
        .array
        .sort!"-a[1] < -b[1]"()
        .take(args[2].to!uint)
        .map!q{ text(a[0], " ", a[1]) }
        .join("\n")
        .writeln;
    }

    /** Range that extracts words from a string. Words are
        strings composed only of chars accepted by std.ascii.toAlpha() */
    struct byWord {
        string s;
        size_t pos;
        string word;

        this(string s) {
            this.s = s;
            popFront();
        }

         property bool empty() const {
            return s.length == 0;
        }

         property string front() {
            return word;
        }

        void popFront() {
            if (pos == s.length) {
                //Mark the range as empty, only after popFront fails:
                s = null;
                return;
            }

            while (pos < s.length && !std.ascii.isAlpha(s[pos])) {
                ++pos;
            }
            auto start = pos;
            while (pos < s.length && std.ascii.isAlpha(s[pos])) {
                ++pos;
            }

            if (start == s.length) {
                //No more words. Range empty:
                s = null;
            } else {
                word = s[start .. pos];
            }
        }
    }

    unittest {
        assert([] == array(byWord("")));

        assert(["a", "b"] == array(byWord("a b")));
        assert(["a", "b", "c"] == array(byWord("a b c")));
    }


--jm

Jun 22 2013

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On 06/22/2013 10:11 PM, Juan Manuel Cabo wrote:
 Testing against Complete Works of William Shakespeare (5.3 MiB
 plaintext): http://www.gutenberg.org/ebooks/100
 and using "dmd -O -inline -noboundscheck -release" on both
 the last version of the author, and my version using "byWord"
 I got these times (minimum of 10 runs):
 
     before: 2781 ms
 
     after:   805 ms

What about with GDC or LDC?

Jun 22 2013

Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:

On 06/22/2013 06:29 PM, Joseph Rushton Wakeling wrote:
 On 06/22/2013 10:11 PM, Juan Manuel Cabo wrote:
 Testing against Complete Works of William Shakespeare (5.3 MiB
 plaintext): http://www.gutenberg.org/ebooks/100
 and using "dmd -O -inline -noboundscheck -release" on both
 the last version of the author, and my version using "byWord"
 I got these times (minimum of 10 runs):

     before: 2781 ms

     after:   805 ms

 
 What about with GDC or LDC?
 

Right, the author of the article used ldc. I'm always used
to dmd.
Keep in mind that he hasn't released the 9MB text that he
used, but instead pointed to that 5.3MB Shakespeare file.

This time I used LDC 0.11.0, with:

     ldc2 -O -release -disable-boundscheck

Times (minimum of 10 runs):

     before: 1617 ms

     after:  554 ms

Conclusion: using "byWords" was 3.4 times faster with DMD
and 2.9 times faster with LDC.

--jm

Jun 22 2013

"Joseph Rushton Wakeling" <joseph.wakeling webdrake.net> writes:

On Saturday, 22 June 2013 at 21:45:48 UTC, Juan Manuel Cabo wrote:
 Right, the author of the article used ldc. I'm always used
 to dmd.

You know that you can use ldmd2 to invoke LDC using the same 
flags as DMD?

 This time I used LDC 0.11.0, with:

      ldc2 -O -release -disable-boundscheck

 Times (minimum of 10 runs):

      before: 1617 ms

      after:  554 ms

 Conclusion: using "byWords" was 3.4 times faster with DMD
 and 2.9 times faster with LDC.


Any chance of GDC results? You can again use gdmd for DMD-like 
interface.

Jun 23 2013

Juan Manuel Cabo <juanmanuel.cabo gmail.com> writes:

On 06/23/2013 05:42 AM, Joseph Rushton Wakeling wrote:
 On Saturday, 22 June 2013 at 21:45:48 UTC, Juan Manuel Cabo wrote:
 Right, the author of the article used ldc. I'm always used
 to dmd.

 
 You know that you can use ldmd2 to invoke LDC using the same flags as DMD?
 
 This time I used LDC 0.11.0, with:

      ldc2 -O -release -disable-boundscheck

 Times (minimum of 10 runs):

      before: 1617 ms

      after:  554 ms

 Conclusion: using "byWords" was 3.4 times faster with DMD
 and 2.9 times faster with LDC.

 
 
 Any chance of GDC results? You can again use gdmd for DMD-like interface.

The computer where I tested has Kubuntu 12.04, which is
too old even for gcc-4.7, and I cannot compile GDC there.
   The gdc version which comes in the repository fails
to compile the program, even after some rearranging so
that it doesn't use UFCS.

Anyway, using my "byWord" range to iterate the string
will surely be faster with GDC too, because tr and split can
be a little slow, especially since the string is big (5MB).

I tested a little more and found that using a handcoded
function instead of tr, and using splitter instead of split,
it's almost as fast as the byWord version. The handcoded
function assumes ascii and preallocates the resulting string
(with result.length = source.length).

--jm

Jun 23 2013

D Programming

C/C++ Programming

Other

digitalmars.D - D vs Haskell