digitalmars.D - Multilevel language

bearophile (114/114) Mar 17 2008 In this post I show what I meant when I have said "D is a multi-level la...

bearophile (5/5) Mar 17 2008 The "big3.txt" is just a large block of texts in English, it comes from ...

bearophile <bearophileHUGS lycos.com> writes:

In this post I show what I meant when I have said "D is a multi-level
language". First, a low-level example. Few days ago I have written generators
for the audioactive sequence 
http://www.fantascienza.net/leonardo/js/audioactive.zip

See for information:
http://en.wikipedia.org/wiki/Look-and-say_sequence
http://www.research.att.com/~njas/sequences/A005150

One of my faster versions uses one of the lower and faster possible algorithms,
that is simulating a finite state machine, this is the core of the program:

S0:
    if (p1 == pend) goto END;
    switch(*p1++) {
        case '1': goto S1;
        case '2': goto S2;
        case '3': goto S3;
    }

S1:
    if (p1 == pend) { *p2++ = '1'; *p2++ = '1'; goto END; }
    switch (*p1++) {
        case '1': goto S11;
        case '2': *p2++ = '1'; *p2++ = '1'; goto S2;
        case '3': *p2++ = '1'; *p2++ = '1'; goto S3;
    }

S2:
    if (p1 == pend) { *p2++ = '1'; *p2++ = '2'; goto END; }
    switch (*p1++) {
        case '1': *p2++ = '1'; *p2++ = '2'; goto S1;
        case '2': goto S22;
        case '3': *p2++ = '1'; *p2++ = '2'; goto S3;
    }


Using std.intrinsic you can access single bits, so you can encode the 3 digits
with an average of just 1.5 bits (their frequency isn't equal), this code is
slower but allows to generate longer terms of the Look-and-say sequence (up to
76th on a 256 MB RAM PC, in 4-5 minutes):
 
S0:
    if (p1 == pend) goto END;
    if (testbit(s1, p1++)) {
        if (testbit(s1, p1++)) {
            goto S3;
        } else {
            goto S2;
        }
    } else {
        goto S1;
    }

S1:
    if (p1 == pend) { writezero(s2, p2++); writezero(s2, p2++); goto END; }
    if (testbit(s1, p1++)) {
        if (testbit(s1, p1++)) {
            writezero(s2, p2++); writezero(s2, p2++); goto S3;
        } else {
            writezero(s2, p2++); writezero(s2, p2++); goto S2;
        }
    } else {
        goto S11;
    }
...


(Note that using similar code structure and the Psyco JIT you can create a
rather fast version in Python too, only 3 times slower than the faster C
version).


And now an example of high-level style coding. The following is related to the
article "How to Write a Spelling Corrector" by the famous Peter Norvig:
http://norvig.com/spell-correct.html

Python version (21 lines, not counting the caller last one):


import re, collections
from string import ascii_lowercase as alphabet

words = lambda text: re.findall(r'[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

nwords = train(words(file('big3.txt').read()))

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in xrange(n)] +                    

               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in xrange(n-1)] +

               [word[0:i]+c+word[i+1:] for i in xrange(n) for c in alphabet] +

               [word[0:i]+c+word[i:] for i in xrange(n+1) for c in alphabet]) 


def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

known = lambda words: set(w for w in words if w in nwords)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or
[word]
    return max(candidates, key=lambda w: nwords[w] if w in nwords else -1)

print correct('speling'), "\n", correct('korrecter')



D version using my libs (23 lines, not counting the caller last one):
www.fantascienza.net/leonardo/so/libs_d.zip


import std.string, std.regexp, std.file, d.all;

int[string] nwords;
static this() {
    foreach (w; RegExp("[a-z]+",
"g").match(inToLower(cast(string)read("big3.txt"))))
        nwords[w]++;
}

Set!(string) edits1(string w) {
    int i, n = len(w);
    char c;
    return set(list(w[0..i] ~ w[i+1..$], i, xrange(n)) ~                     //
deletion
               list(w[0..i] ~ w[i+1] ~ w[i] ~ w[i+2..$], i, xrange(n-1)) ~   //
transposition
               list(w[0..i] ~ c ~ w[i+1..$], i, xrange(n), c, lowercase[]) ~ //
alteration
               list(w[0..i] ~ c ~ w[i..$], i, xrange(n+1), c, lowercase[])); //
insertion
}

Set!(string) knownEdits2(string word) {
    string s1, s2;
    return set(list(s2, s1, edits1(word), s2, edits1(s1), s2 in nwords));
}

Set!(string) known(T)(T words) { return set(words) & nwords; }

string correct(string w) {
    auto found = lazyOr(known([w]), known(edits1(w)), knownEdits2(w), set([w]));
    return max(found, (string wo){ return wo in nwords ? nwords[wo] : -1; });
}

void main() { putr( correct("speling"), \n, correct("korrecter")); }



This D program has found one performance bug (already fixed) in the D garbage
collector(s), one possible bug using AAs with the ?: operator (see a thread in
d.learn group), has found a bug in my libs, has made me add that lazyOr
function to my libs, and it was rather fun to write :-)

I have created lazyO to write this program (but it will be useful elsewhere,
it's similar to the Python or operator), if you don't want to use lazyOr() you
can define just this line:
T or(T)(T x, lazy T y) { return isTrue(x) ? x : y(); }
That you can use like this:
auto found = or(or(or(known([w]), known(edits1(w))), knownEdits2(w)), set([w]));


That "g" in the Phobos REgex API is ugly, that API needs some cleaning and
improviments. Maybe PCRE can just be plug into Phobos (with an improved API)
intead of the current regex engine, this will allow to remove most regex bugs,
improve its speed, work with a very standard re syntax, etc.

Note that Python program has corner-case bugs, is fragile, it's quite slow,
it's not a good example of programming, etc. Much better Python programs can be
written for this purpose. Probably its only good side is to be short, so it can
show in a short space the idea behind a spelling corrector. The D code shares
most of the same problems. Such high-level style coding in D can be useful to
weite prototypes too, that later if necessary can be translated to faster
lower-level style D (or even assembly) code.

With the patch to the GC the D version runs a little faster than the Python
version (about 7.5 seconds on my very old PC, probably a little more than 1.5
seconds on a more modern PC).

Bye,
bearophile

Mar 17 2008

bearophile <bearophileHUGS lycos.com> writes:

The "big3.txt" is just a large block of texts in English, it comes from this
version:
http://norvig.com/big.txt
But I have removed non-ASCII chars (> 127).

Bye,
bearophile

Mar 17 2008

D Programming

C/C++ Programming

Other

digitalmars.D - Multilevel language