www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Less coding

reply bearophile <bearophileHUGS lycos.com> writes:
D is a system language, so probably most of you are quite interested in fast
running programs. I understand that some people here may not appreciate or be
interested in some of the qualities of the scripting languages (as Python and
the like). But I can show a little example. It's an extreme example, uncommon,
and you may think I have cheated a bit, so you don't have to take it too much
seriously, but it looks nice and interesting.

I have seen this C code:
http://nickmudge.info/c/hashtest.c.txt

It tests how many collisions are produced by seven different hash functions. It
doesn't test their running time (and the demo strings are "wrong", because they
are random, while it's better to test them on some real text), but for me it
was interesting enough to do a D translation. 
It generates an array of distinct random strings of random length, computes
hash values for them, and counts how many repeated results there are. It
contains functions like:

linear_array_search
check_key_duplicates
get_random_character
make_random_strings

The original code tests only hash7, but I have added tests for all of them.
Those four functions, plus the main() require about 80 lines (almost 95 if you
don't use function pointers to shorten the code that calls the various hashing
functions).

Using my d libs I have translated those four functions plus the main like this:

void main() {
    fastSeed = 10;

    string chars = letters ~ digits;
    int[string] tests;
    do {
        auto txt = map((int i){ return fastChoice(chars);},
xrange(fastRandInt(1, 24)));
        tests[txt] = 1;
    } while (tests.length < 10000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7]) {
        auto keys = fromKeys(xmap(func, tests), 0);
        putr("hash", nf+2, ": ", tests.length - keys.length);
    }
}


Explanations of the code:
- map() takes a function, and an one or more iterable(s), and it return an
array of the function applied to every item of the iterable. If the iterable
has a length attribute, it pre-allocates the result, so with a bit of inlining
it's often almost as fast as a normal loop.
- xrange() return an iterable object, it just returns successive numbers in an
interval, with a step too.
- randInt is fast enough, and it gives an integer in the closed integer
interval. The seed is set to the current time() by default, but you may set the
seed too.
- xmap() is like map, but like all the x-something it returns a lazy iterable
object, that yields the items mapped by the given function.
- fromKeys() takes an iterable and a constant, and it returns an associative
array with constant keys (it's similar to the useful fromkeys() method of
python dicts. I hope D AAs will grow some useful methods).

The following code is for an unreleased yet version of my d libs, I am writing
a set()/Set(), that is a class (and function helper) that just wraps D AAs to
give a useful set semantics with many simple but useful methods:

void main() {
    string chars = letters ~ digits;
    auto tests = Set!(int);
    do {
        auto txt = map((int i){ return choice(chars);}, xrange(randInt(1, 24)));
        tests.add(txt);
    } while (tests.length < 50000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7])
        putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length);
}

Notes:
- On my PC with DMD (the version of the code without set()) the creation of the
tests[] requires about 1 second with n=50_000. I have seen that using a lower
level syle of coding it can be lowered to about 0.5 seconds (in most cases the
speed loss is lower than 50%), but it's just initialization time, it runs only
once, so I don't care much for for the extra half second. Losing that half
second I gain a shorter code, that's faster to write and debug.
- The original C version uses linear scans, instead of the D hashes, so it's
much slower, on my PC the D version is 15 times faster than the C one, with N =
10000, compiled with MinGW 4.2.1 (-O3) and the latest DMD.
- The faster Python version (with the Psyco JIT and array.array) needs about
3.5 seconds to generate the test data alone, with n=50000 still.
- I have removed the hash1 from all tests because it was too much slow for the
C version.
- If I replace choice(chars) with choice(letters ~ digits) the code is becomes
4-5 times slower, because DMD 1.026 doesn't join those two constant strings
(dynamic arrays) at compile/initialization time.
- I know that that C code runs everywhere, and it doesn't need external libs,
while my code needs a lib, and I know you can write good libs in C too, that
can help you shorten & simplify you code a lot. And I know you can write hashes
in C too, or you can use the GNU data structures already written.
- The running time for the hash functions is lower for the C version, because
GCC optimizes that code better.

Bye,
bearophile
Feb 25 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Using the set it can written like this too:

void main() {
    string chars = letters ~ digits;
    auto tests = Set!(int);
    do {
        tests ~= map((int i){ return choice(chars);}, xrange(randInt(1, 24)));
    } while (tests.length < 50000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7])
        putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length);
}

Bye,
bearophile
Feb 25 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Sorry, time to go to sleep!
That set is of strings:
auto tests = Set!(string);
Feb 25 2008
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Using the set it can written like this too:

void main() {
    string chars = letters ~ digits;
    auto tests = Set!(int);
    do {
        tests ~= map((int i){ return choice(chars);}, xrange(randInt(1, 24)));
    } while (tests.length < 50000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7])
        putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length);
}

Bye,
bearophile
Feb 25 2008
prev sibling parent reply Jarrod <qwerty ytre.wq> writes:
I like your collection of higher order functions, perhaps you can get 
them added to the core library in the future, or even better, Walter can 
build them into the language (ha).
If anything, D really needs a 'range'. Something like:
int[] arr = [0..10]; //arr would contain the digits 0-9

initialization syntax, and looks similar to using an array slice. Then with that I could experience what I think would be happiness as I used:
foreach(i; [0..10]){...} 

for (int i = 0; i < 10; ++i){...}

that deal with it so much better (perl, python, php, ruby, haskell, lisp, even visual basic) It would also be nice to loop just a certain amount of times without needing the counter:
foreach([1..10]){...} 

Ahem, it seems I've sort of gone off on a tangent here. But yeah, HOP and functional stuff is awesome and would be nice to see being added to D, even if only as part of the library.
Feb 27 2008
parent reply Jarrod <qwerty ytre.wq> writes:
On Wed, 27 Feb 2008 11:01:57 +0000, Jarrod wrote:
stuff

Oh my, I should look at 2.0 more often. std.algorithm is packed full of HOP functions, and a foreach range. I probably should have looked before posting, this is what I get for not keeping track of 2.0 :( Still, I have one valid point left about arrays being initialized with a range, but basically you can ignore the above post =/
Feb 27 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Jarrod:

 I like your collection of higher order functions, perhaps you can get
 them added to the core library in the future, or even better, Walter can
 build them into the language (ha).

Thank you. I think most of that stuff doesn't need to become built-in (I think among them only two things may enjoy being supported by the compiler: an *exact* copy of Python list comps and generators. Partial or approximate copies are not worth). What I'd like to see in the compiler isn't more HOFs, but better basic tools to *build* and use such high-order things. AST macros may help. Other useful tools: - Built-in (or in phobos, if it can be used in a transparent way) ability to iterate on two generators in parallel. - Better type inference, to allow shorter anonymous function syntax and allow more automatic specialization: for example having defined many different foo() templated functions, I'd like to use it as &foo, instead of &foo!(uint, long). - A more logically sound module system (*).
 std.algorithm is packed full of HOP functions, and a foreach range.

D 2.x allows foreach(i; 1..10) too, but I am using D 1.x, and most (95%?) of my stuff isn't present in that std.algorithm. It's nothing revolutionary, but they seem to work well enough, and I have kept the usage very easy because they can be useful (almost) only if they are easy to remember :-) =============================================== (*) beside the other bugs/problems shown in the past by me regarding the module system, I may have found another problem in it. The following ones are 3 modules, the first is the main one: module main_module; import foo1; import foo2; void main() { str(1); } ------------------ module foo1; private import std.string: str = toString; ------------------ module foo2; string str(TyArgs...)(TyArgs args) { return ""; } ------------------ If compiled it produces: main_module.d(4): Error: foo1.str at foo1.d(2) conflicts with foo2.str(TyArgs...) at foo2.d(2) The "private" in foo1 must be unnecessary in a usable module system, but even adding it the "str" name in the "foo1" namespace becomes imported anyway in the "main_module" namespace, and it gives conflict. Bye, bearophile
Feb 27 2008