## digitalmars.D - [OT] The horizon of a stream

• Andrei Alexandrescu (30/30) Oct 23 2008 Consider some text stream. We define the horizon of the stream as the
• bearophile (36/36) Oct 23 2008 First try, D1, Phobos:
• Andrei Alexandrescu (6/7) Oct 23 2008 [snip]
• bearophile (7/10) Oct 23 2008 I understand, but I have 2 GB of RAM, so that solution (in Python) works...
• Andrei Alexandrescu (6/27) Oct 23 2008 Now you're talking! This is a solution worth a lot of attention. Under
• bearophile (6/7) Oct 23 2008 Assuming that the hash function is good, very similar strings too have d...
• bearophile (4/5) Oct 23 2008 There's a third possible solution, that is very fast: assuming the hash ...
• bearophile (20/20) Oct 23 2008 This versions compares hash values first, avoiding some slower string co...
• bearophile (4/8) Oct 23 2008 Ignore this solution, it's not faster, because the whole string is hashe...
• Sergey Gromov (13/47) Oct 23 2008 Traverse the stream, line by line. If the word is new, add it to a
• mgen (69/69) Oct 23 2008 If size is a problem use a trie!
• BCS (21/21) Oct 23 2008 (going at this blind so I might dup others ideas)
• Lionello Lunesu (7/7) Oct 23 2008 With if "corpus" would appear one more time but at a smaller distance?
• BCS (2/11) Oct 23 2008 IMNSHO no the max spread would be from 5 to 16
• Jason House (9/46) Oct 23 2008 By hash value, store a linked list of (line #, char offset within file)
• bearophile (104/105) Oct 24 2008 If the file is "small" my first implementation is good enough (small mea...
• Nigel Sandever (29/66) Oct 25 2008 The hash solution is simple and fast, but uses too much memory. So, sub-...
• Nigel Sandever (3/5) Oct 25 2008 Should have been 12MB/sec. I was instrumenting in 5 second intervals.
• mgen (1/1) Oct 25 2008 Could you post that file somewhere so we could get some even level testi...
• Nigel Sandever (5/6) Oct 25 2008 going for these algos?
• bearophile (4/5) Oct 25 2008 It looks a bit too much huge. Better to write a little program that gene...
• Nigel Sandever (8/16) Oct 25 2008 random data with a suitable distribution. To make it portable across
• bearophile (5/8) Oct 25 2008 A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', ...
• Nigel Sandever (34/45) Oct 25 2008 rest.
• bearophile (9/13) Oct 26 2008 This is a D newsgroup, so use D, it allows you to manage bits more effic...
• Nigel Sandever (18/32) Oct 26 2008 Sorry. No disrespect meant to D. I always prototype in Perl and then con...
• Christopher Wright (9/46) Nov 01 2008 Simplicity considering external memory constraints.
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. That is
the farthest apart any two identical lines in the file are; notice that
"ipsum" also occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei
```
Oct 23 2008
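As a baseline, the definition above can be stated as a short executable sketch (Python here for brevity; function and variable names are illustrative). A single pass keeping only the *first* occurrence of each line suffices, since the first occurrence maximizes the distance to any later duplicate:

```python
def horizon(lines):
    """Longest line-number distance between two identical lines; 0 if all unique."""
    first = {}  # line -> index of its first occurrence
    best = 0
    for i, line in enumerate(lines):
        if line in first:
            best = max(best, i - first[line])  # first occurrence maximizes distance
        else:
            first[line] = i
    return best

stream = ("lorem ipsum dolor habeas corpus ipsum sit amet rigor "
          "mortis consectetur coitus interruptus corpus elit").split()
print(horizon(stream))  # "corpus" at positions 4 and 13 -> prints 9
```

This is exactly the hashtable approach the first replies below start from; the rest of the thread is about what to do when the table no longer fits in memory.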
bearophile <bearophileHUGS lycos.com> writes:
```First try, D1, Phobos:

import std.stdio, std.stream;

void main() {
    int[string] seen;
    int horizon;

    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
        nline++;
    }

    writefln(horizon);
}

Idem, with my libs:

import d.all, d.string;

void main() {
    int[string] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }

    putr(horizon);
    putr(seen);
}

I'll show an "improved" version soon...

Bye,
bearophile
```
Oct 23 2008
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```bearophile wrote:
First try, D1, Phobos:

[snip]

At this point a discussion "ok, I'll store every line in a hashtable"
would have sufficed. I can tell straight away that this is not a
workable solution for a large file. It's actually happened to me :o|.

Andrei
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```Andrei Alexandrescu:
At this point a discussion "ok, I'll store every line in a hashtable"
would have sufficed. I can tell straight away that this is not a
workable solution for a large file. It's actually happened to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python) works most
of the time. When you design an algorithm, you start with the simpler version
(that's also easy to debug and test), and only when you experimentally see it
doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it.
When you debug the refined version you can also reuse the unittests developed
with the basic version, which is very useful.

Another solution is to keep an associative array of just the string hash
values, and look back into the input file when you have a (possible) false
positive.

Another solution that requires even less RAM is to use a bloom filter (I have
created a bloom filter class in Python but not in D yet) to record which strings
have been seen, and look back into the input file (as before) to remove (possible)
false positives.

Finally, looking back into the input file every time there's a match is too
slow, so you just store the candidate matches in a file or in RAM, to test
them all in a second read of the file. If the bloom filter is tuned well
enough, then such possible matches aren't too many, and you can keep them in
RAM.

Bye,
bearophile
```
Oct 23 2008
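The bloom filter bearophile mentions can be sketched as follows. This is a hypothetical minimal filter (my own construction, not his Python class; the sizes and salted-md5 trick for simulating k hash functions are my choices):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: false positives possible, false negatives impossible."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # simulate k independent hash functions by salting one digest
        for salt in range(self.k):
            digest = hashlib.md5(b"%d:%s" % (salt, item.encode())).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(m_bits=80_000, k=4)
bf.add("corpus")
bf.add("ipsum")
print("corpus" in bf)  # True: added items are always found
print("elit" in bf)    # False with overwhelming probability: never added
```

A line whose lookup says "seen" is only a *candidate* duplicate; as the post says, candidates still have to be verified against the real lines in a second pass.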
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```bearophile wrote:
Andrei Alexandrescu:
At this point a discussion "ok, I'll store every line in a
hashtable" would have sufficed. I can tell straight away that this
is not a workable solution for a large file. It's actually happened
to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python)
works most of the times. When you design an algorithm, you start with
the simpler version (that's also easy to debug and test), and only
when you experimentally see it doesn't suffice (too much slow, not
enough RAM, etc) you try to improve it. When you debug the refined
version you can also use the unittests developed with the basic
version, this is very useful.

Another solution is to keep an associative array of just the string
hash values, and lookup into the input file when you have a
(possible) false positive.

Now you're talking! This is a solution worth a lot of attention. Under
what circumstances does it do well vs. not so well? How can you ensure it
works for a stream of any size if multiple passes are possible? etc.

Another solution that requires even less RAM is to use a bloom filter
(I have created a bloom filter class in Python but not in D yet) to
keep what strings are seen, and lookup into the input file (as
before) to remove (possible) false positives.

Bloom filter is also an excellent idea!

Andrei
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```Andrei Alexandrescu:
Under what circumstances it does well vs. not so well?

Assuming that the hash function is good, even very similar strings have
different hash values.

When there are lots of equal lines it doesn't work well, I presume, because
there are lots of equal hash values.

Note that in our example the lines are very short, so replacing them with a
hash_t as the key of the associative array doesn't save you that much RAM
(the AA then stores that value twice, because it keeps the hash of the hash
value too; that, being probably a size_t or uint, is hashed to itself).

Bye,
bearophile
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile Wrote:
Another solution that requires even less RAM is to use a bloom filter...

There's a third possible solution that is very fast: assuming the hash values
are uint (32 bit), you can create a bitfield of 2^32 bits. It requires just
1/2 GB, that is 1/4 of my RAM. Each bit then represents a possible distinct
value of the hash.

Bye,
bearophile
```
Oct 23 2008
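Scaled down, the bitfield scheme works like this (a sketch with 2^24 bits = 2 MB instead of the post's 2^32 bits = 512 MB; the width only affects the number of false candidates, not correctness, because the second pass compares real lines):

```python
HBITS = 24                   # 2**24 bits = 2 MB; the post proposes 2**32 = 512 MB
MASK = (1 << HBITS) - 1

def horizon_bitfield(lines):
    # pass 1: one bit per hash value; hashes hit a second time go into dup_hashes
    seen = bytearray(1 << (HBITS - 3))
    dup_hashes = set()
    for line in lines:
        h = hash(line) & MASK
        if seen[h >> 3] & (1 << (h & 7)):
            dup_hashes.add(h)
        else:
            seen[h >> 3] |= 1 << (h & 7)
    # pass 2: only candidate lines enter the (now much smaller) dictionary
    first, best = {}, 0
    for i, line in enumerate(lines):
        if hash(line) & MASK in dup_hashes:
            if line in first:
                best = max(best, i - first[line])
            else:
                first[line] = i
    return best
```

Hash collisions can only add false candidates, never drop a real duplicate, which is exactly the "remove possible false positives" step described above.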
bearophile <bearophileHUGS lycos.com> writes:
```This version compares hash values first, avoiding some slower string
compares, because in D1 strings don't store their hash value:

import d.all, d.string;

void main() {
    int[Record!(hash_t, string)] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }

    putr(horizon);
}

Maybe it's not fully correct yet, so I'll test it some more...

Maybe it's not fully correct yet, so I'll test it some more...

Bye,
bearophile
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile:
This versions compares hash values first, avoiding some slower
string compares, because in D1 strings don't store their hash value:
...
auto pos_ptr = record(hash_clean_line, clean_line) in seen;

Ignore this solution, it's not faster, because the whole string is hashed
twice, etc. You'd have to use a String class like mine that lazily computes
and stores the hash value too...

Bye,
bearophile
```
Oct 23 2008
Sergey Gromov <snake.scaly gmail.com> writes:
```Thu, 23 Oct 2008 14:29:36 -0500,
Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Traverse the stream, line by line.  If the word is new, add it to a
dictionary, size_t[string], with the current line number.  If it's
already in the dictionary, look at the distance to its first occurrence
and update the horizon if that distance is greater than what we already have.
If we don't run out of memory, we're done in one pass.

If we do run out of memory, recover ;) then continue without adding words
into the dictionary but dumping them into a temporary file instead, just
checking and updating the horizon.

The stream ends.  If our current horizon is greater than the number of
lines we dumped into a temp file, we're done.  Otherwise, empty the
dictionary, open the temp file, and repeat.
```
Oct 23 2008
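Sergey's scheme can be sketched with a capped dictionary standing in for "out of memory" and a list standing in for the temp file (the cap and all names are mine). Because the dictionary only grows, once it fills, *every* later unseen line is spilled, so both occurrences of any missed pair land in the same later round with their original line numbers intact:

```python
def horizon_capped(lines, max_entries=1_000_000):
    """Repeated passes: once the dictionary is full, unseen lines are spilled
    (with original line numbers) and reprocessed in the next round.
    Assumes max_entries >= 1."""
    best = 0
    pending = list(enumerate(lines))
    while pending:
        first, spill = {}, []
        for i, line in pending:
            if line in first:
                best = max(best, i - first[line])
            elif len(first) < max_entries:
                first[line] = i
            else:
                spill.append((i, line))  # the "temp file"
        pending = spill  # each round retires up to max_entries distinct lines
    return best
```

Each round handles `max_entries` distinct lines, so the number of full passes is (distinct lines) / (memory budget), matching the post's description.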
mgen <bmeck stedwards.edu> writes:
```If size is a problem use a trie!

//---CODE---
import std.stream;
import std.stdio;

class position_trie {
    bool seen = false;
    int last_seen = 0;
    position_trie[char] branch;
}

void main() {
    int max;
    Stream file = new MemoryStream(
"lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit"
    );
    auto root = new position_trie();
    position_trie node = root;
    position_trie next;
    foreach (ulong lineno, char[] line; file) {
        foreach (int chrno, char chr; line) {
            if (chr in node.branch) {
                node = node.branch[chr];
            } else {
                next = new position_trie();
                node.branch[chr] = next;
                node = next;
                foreach (char chr2; line[chrno+1 .. $]) {
                    next = new position_trie();
                    node.branch[chr2] = next;
                    node = next;
                }
                break;
            }
        }
        if (node.seen) {
            auto dist = cast(int)lineno - node.last_seen;
            if (dist > max)
                max = dist;
        }
        node.seen = true;
        node.last_seen = cast(int)lineno;
        node = root;
    }
    writefln(max);
}
```
Oct 23 2008
```(going at this blind so I might dup others ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- store the line index and the start location of each line as a function of its hash

- look for the hash bucket with the largest spread of line indexes
- page in the needed file segments
- test if the lines are the same; if so, return
- else try again
-- using some kind of LRU cache, keep as many pages in RAM as you can
-- prefer testing stuff that is in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, not
require too much metadata, and work well with the cache.

Another option:

load every line into a database table with its index:

INSERT INTO tmp SELECT MAX(index) - MIN(index) AS dif FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;

if(c == 0)
return 0;
else
return i;
```
Oct 23 2008
Lionello Lunesu <lio lunesu.remove.com> writes:
```What if "corpus" would appear one more time, but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus

Would this stream have a different horizon?
```
Oct 23 2008
```Reply to Lionello,

With if "corpus" would appear one more time but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus
Would this stream have a different horizon?

IMNSHO no; the max spread would be from 5 to 16.
```
Oct 23 2008
Jason House <jason.james.house gmail.com> writes:
```By hash value, store a linked list of (line #, char offset within file)

When a hash collision occurs, append to the list, but also calculate the
horizon from the head of the list.

While the horizon is higher than previously observed, scan the list until a
string match is found and update the maximum observed horizon.

The performance is linear, but it does extra string lookups compared to some
alternatives. It's possible to optimize the string comparisons with some
kind of temporary caching of strings, such as weak pointers.

Andrei Alexandrescu wrote:

Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile Wrote:
There's a third possible solution, that is very fast: assuming the hash values
are uint (32 bit), then you can create a bitfield of 2^32 bits, it requires
just 1/2 GB, that is 1/4 of my RAM. So each bit represents a possible different
value of the hash.

If the file is "small" my first implementation is good enough (small means the
associative array has to fit in for example 2 GB). For larger files here's the
implementation of the third idea, with D1, Phobos + my libs:

import d.all, std.string, std.c.stdlib, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (line; xfile("data.txt")) {
        uint h = hash(line.chomp());
        if (isFlagSet(seen, h))
            // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon;
    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, hash(clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
    }

    putr(horizon);
}

With D1, just Phobos:

import std.stdio, std.string, std.c.stdlib, std.stream, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        uint h = typeid(string).getHash(&clean_line);
        if (isFlagSet(seen, h))
            // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon, nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, typeid(string).getHash(&clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
        nline++;
    }

    writefln(horizon);
}

The worst case is like this:
A
B
C
D
E
E
D
C
B
A

In this case the aa associative array has to store half of the lines of the
file, and that may be too much.

As you can see, with this little program I have found another bug in Phobos:
the intrinsics in this case don't work (and I have improved the Record of my
libs too, used in another version of this code).

Bye,
bearophile
```
Oct 24 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

The hash solution is simple and fast, but uses too much memory. So, sub-divide
the problem.

Obviously, you cannot process the first half and then the second, but you can
make multiple complete passes, and only consider a subset during each pass.

For example, you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with your data) on the first pass, 'b' on the second pass, etc. At
the end of each pass you retain only the best horizon so far discovered and
discard the rest.

Assuming all lower-case and a roughly even distribution, you reduce your memory
requirements to ~4%.

By way of example, I knocked up a data file consisting of 100+ million lines
(1GB) of words chosen randomly from a dictionary. Using Perl & 26 passes, I got
my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file system
caches for the second and subsequent passes which saved considerable time. If
your file is much bigger, that probably wouldn't be the case. But, if you're
going to write this in D, you can probably speed thing up a bit anyway.

wc -l with a cold cache manages to achieve a throughput of 28mb/sec on my
system, where perl only achieved a best of 60mb/sec with a warm cache.

I seem to recall seeing a memory-mapped file module in the D library somewhere.
Maybe that would reduce the IO for multiple passes?

Also, given the minimal 2MB max. memory usage above, if you know (or can probe)
your data, then you can reduce the number of passes by doing (say) a-f on the
first pass, g-l on the second etc.

Or if all your data begins with (say) "Line ...", then key off the appropriate
position in the line to form your pass split. Or, if your file is really huge
use the first two characters for a 676-way split. (And wait :)

Anyway, just a thought from a drive-by reader.
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sun, 26 Oct 2008 01:00:59 GMT, Nigel Sandever <nigelsandever btconnect.com>
wrote:

system, where perl only achieved a best of 60mb/sec with a warm cache.

Should have been 12MB/sec. I was instrumenting in 5 second intervals.
```
Oct 25 2008
mgen <bmeck stedwards.edu> writes:
```Could you post that file somewhere so we could get some even level testing
going for these algos?
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 21:45:29 -0400, mgen <bmeck stedwards.edu> wrote:
Could you post that file somewhere so we could get some even level testing

going for these algos?

Sure, if you know somewhere that would allow me to upload a (bzip -9) 300MB
file?

And are prepared to wait a while; I'm on dial-up :(
```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```mgen:
Could you post that file somewhere so we could get some even level testing
going for these algos?

It looks a bit too huge. Better to write a little program that generates
random data with a suitable distribution. To make it portable across
Phobos/Tango/etc. the RNG can be included in the code; there are very fast
ones around.

Bye,
bearophile
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 22:54:47 -0400, bearophile <bearophileHUGS lycos.com> wrote:
mgen:
Could you post that file somewhere so we could get some even level testing

going for these algos?

It looks a bit too much huge. Better to write a little program that generates

random data with a suitable distribution. To make it portable across
Phobos/Tango/etc the RND generator can be included into the code, there are
very
fast around.

Bye,
bearophile

You'd need the PRNG to produce the same sequence for the same seed
cross-platform (the Mersenne Twister, for example).

And you'd need to distribute (or point to) a common dictionary.
```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```Nigel Sandever:
For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with  your data), and 'b' on the second pass etc. At the end of
each
pass you retain only the best horizion so far discovered and discard the rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to
split the input data into bins. But you want to fill those bins uniformly. So
how can you do that? A good hash value is nearly random, so you can partition
its bits to split your data into more uniform groups :-) The downside is
that you have to compute the hash for each line many times. To speed this up
you can hash just a sample of the chars of each line...

Note that a 1 GB file is too small for a true benchmark, I have 2 GB RAM...
You don't need to load the lines as strings, you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because each line's end point is the end of the file or the
next newline, so you just need the starting points as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same memory whether your keys are 8 or 4 bytes long. So if
you see memory isn't enough to run your program, then you need a less
memory-hungry hash map. You can find plenty around, written in C; I have
translated one to D, it's easy. With that, if the lines aren't too short or
there are some duplicate lines, then the RAM (assuming 2 GB) suffices to
keep the whole associative array, so you need to load the file just one time.

Bye,
bearophile
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS lycos.com> wrote:
Nigel Sandever:
For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with  your data), and 'b' on the second pass etc. At the end of

each
pass you retain only the best horizion so far discovered and discard the

rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc,

to split the input data into bins. But you want to fill those bins uniformly.
So
how can do it? The hash number if good is nearly random, so you can partition
its bits to partition your data into more uniform groups :-) The downside is
that you have to compute the hash for each line many times. To speed this up
you
can hash just a sample of the chars of each line...

I did try that (using md5), but the penalty in Perl was horrible. The hash (AA)
just had to rehash the binary md5 anyway, so it was pure cost for my known
dataset.

For a real application with unknown/non-uniform data it would work like a charm
though ... assuming you don't hit that rarest of non-impossibilities: duplicate
md5s of non-identical inputs :)

That said, for most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain-specific
knowledge for total generality is often the best optimisation one can perform.

Note that 1 GB file is too much small for a true benchmark, I have 2 GB RAM...

Agreed and ditto. But provided algorithms are written to assume files larger
than memory, the numbers do scale linearly.

You don't need to load the lines as strings, you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because their ending point is the end of the file or the
successive newline, so you just need the starting points as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same memory if your keys are 8 or 4 bytes long. So if you
see
memory isn't enough to run your program, then you need to use a less memory-
hungry hash map. You can find plenty around, written in C, I have translated
one
to D, it's easy. With that if the lines aren't too much short or there are some
duplicate lines, then the RAM (assuming to use 2 GB) suffices to keep the whole
associative array, so you need to load the file just one time.

I agree that a 'testset producer' and a freely accessible known dicionary is a
better option.

I used (a slightly modified version of) 2of12inf available from

Bye,
bearophile

```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```Nigel Sandever:

I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D; it allows you to manage bits more efficiently.

md5 is too slow for this purpose; use the built-in hashing of strings.

That said. For most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain specific
knowledge for total generality is often the best optimisation one can perform.<

I agree, when you have gigabytes of data it's better to gain some knowledge
about the data before processing it.

But provided algorithms are written to assume files larger than memory, the
numbers do scale linearly.<

Several of the provided algorithms don't assume files larger than memory, or
they work much better when the file is smaller than memory.

I used (a slightly modified version of) 2of12inf available from<

That's quite a complex file, so I suggest something simpler, like this after a
cleaning of the non-ASCII words:
http://www.norvig.com/big.txt

Bye,
bearophile
```
Oct 26 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sun, 26 Oct 2008 03:39:50 -0400, bearophile <bearophileHUGS lycos.com> wrote:
Nigel Sandever:

I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D, it allows you to manage bits more

efficiently.

Sorry. No disrespect meant to D. I always prototype in Perl and then convert to
C or D if I need performance. I'm just more familiar with Perl.

I used (a slightly modified version of) 2of12inf available from<

That's a quite complex file, so I suggest something simpler, as this after a

cleaning of the non ASCII words:
http://www.norvig.com/big.txt

I don't know what is "complex" about a 1 word per line, 81536 line dictionary
file?

Or how having everyone clean up Conan Doyle would be simpler?

If you have Perl, you can produce a suitable test file from any 1-word-per-line
dictionary with the command line:

perl -l12n0777aF/\n/ -ne'print $F[rand @F] for 1..4e8' yourdict >thedata

With the 2of12inf dictionary file, 4e8 produces a file a little under 4GB in
around 10 minutes. YMWV depending upon the average length of the lines in your
local dict.

Of course they won't all be the same as mine, or anyone else's, but given the
random nature, the results will be broadly comparable.

My D foo is too rusty to try to write that in D, especially for a D audience :)
I'm sure one of you guys can knock that up in the blink of an eye.

Bye,
bearophile

```
Oct 26 2008
Christopher Wright <dhasenan gmail.com> writes:
```Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

Simplicity considering external memory constraints.
Read in the file, line by line. Output the line followed by the line
number. Sort the output (merge sort works wonderfully). Now you go
through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think
we already did sorting in external memory, so I could implement this
much quicker than most alternatives. Depending on the number of times
this has to be done, that might be a net advantage.
```
Nov 01 2008
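Christopher's sort-then-scan approach can be sketched like this (an in-memory `sorted` standing in for the external merge sort; names are mine). Sorting (line, line-number) pairs makes identical lines adjacent, with their original line numbers in ascending order, so one linear scan recovers the maximum distance:

```python
def horizon_sorted(lines):
    # tag lines with their numbers; sorting makes identical lines adjacent
    tagged = sorted((line, i) for i, line in enumerate(lines))
    best = 0
    prev_line, first_idx = None, -1
    for line, i in tagged:
        if line == prev_line:
            best = max(best, i - first_idx)  # first_idx is this line's minimum index
        else:
            prev_line, first_idx = line, i
    return best
```

The second scan is a single linear pass, so the whole cost is the O(n log n) sort, as the post says; swapping `sorted` for an external merge sort keeps memory bounded for files larger than RAM.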
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Christopher Wright wrote:
Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This
is the farthest two lines in the file are; notice that "ipsum" also
occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are
as important as always.

This is a task that's interesting and subtle in quite a number of
ways, and it allows for many creative solutions. Therefore I
intentionally did not restrict it much. Let ideas flow!

Andrei

Simplicity considering external memory constraints.
Read in the file, line by line. Output the line followed by the line
number. Sort the output (merge sort works wonderfully). Now you go
through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think
we already did sorting in external memory, so I could implement this
much quicker than most alternatives. Depending on the number of times
this has to be done, that might be a net advantage.

I like this solution a lot. Kudos!

Andrei
```
Nov 01 2008