
digitalmars.D - [OT] The horizon of a stream

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Consider some text stream. We define the horizon of the stream as the 
longest number of lines between two identical lines minus one, or zero 
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit

This is because the word "corpus" occurs on line 5 and line 14. This is 
the farthest two lines in the file are; notice that "ipsum" also occurs 
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a 
potentially large file. Of course theoretical and practical speed are as 
important as always.

This is a task that's interesting and subtle in quite a number of ways, 
and it allows for many creative solutions. Therefore I intentionally did 
not restrict it much. Let ideas flow!


Andrei
Oct 23 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
First try, D1, Phobos:

import std.stdio, std.stream;

void main() {
    int[string] seen;
    int horizon;

    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) >  horizon)
                 horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
        nline++;
    }

    writefln(horizon);
}


Idem, with my libs:

import d.all, d.string;

void main() {
    int[string] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) >  horizon)
                 horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }

    putr(horizon);
    putr(seen);
}

I'll show an "improved" version soon...

Bye,
bearophile
Oct 23 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 First try, D1, Phobos:
 [snip]

At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.

Andrei
Oct 23 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 At this point a discussion "ok, I'll store every line in a hashtable" 
 would have sufficed. I can tell straight away that this is not a 
 workable solution for a large file. It's actually happened to me :o|.
I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (which is also easier to debug and test), and only when you experimentally see that it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unittests developed for the basic version, which is very useful.

Another solution is to keep an associative array of just the string hash values, and look into the input file when you have a (possible) false positive.

Another solution that requires even less RAM is to use a Bloom filter (I have written a Bloom filter class in Python but not in D yet) to record which strings have been seen, and look into the input file (as before) to remove (possible) false positives.

Finally, looking into the input file every time there's a match is too slow, so you just store the candidate matches in a file or in RAM, to test them all in a second read of the file. If the Bloom filter is sized well enough, such possible matches aren't too many, and you can keep them in RAM.

Bye,
bearophile
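A minimal sketch of the hash-values-only idea in present-day D; the file name is a placeholder, and plain bool-keyed associative arrays stand in for a more compact bit set:

import std.stdio;

void main()
{
    bool[size_t] seenOnce, seenTwice;          // hash -> flag
    foreach (line; File("data.txt").byLine)    // pass 1: hashes only
    {
        auto h = hashOf(line);
        if (h in seenOnce) seenTwice[h] = true;
        else seenOnce[h] = true;
    }
    seenOnce = null;                           // release pass-1 memory

    size_t[string] firstSeen;                  // only lines whose hash repeats
    size_t horizon = 0, lineno = 0;
    foreach (line; File("data.txt").byLineCopy)  // pass 2: verify candidates
    {
        if (hashOf(line) in seenTwice)
        {
            if (auto p = line in firstSeen)
            {
                if (lineno - *p > horizon) horizon = lineno - *p;
            }
            else
                firstSeen[line] = lineno;
        }
        ++lineno;
    }
    writeln(horizon);
}

Only lines whose hash occurs more than once ever reach the string-keyed array, so hash collisions (false positives) are filtered out by real string comparisons in the second pass.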
Oct 23 2008
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 At this point a discussion "ok, I'll store every line in a
 hashtable" would have sufficed. I can tell straight away that this
 is not a workable solution for a large file. It's actually happened
 to me :o|.
 I understand, but I have 2 GB of RAM, so that solution (in Python) works
 most of the time. When you design an algorithm, you start with the simpler
 version (that's also easy to debug and test), and only when you
 experimentally see it doesn't suffice (too slow, not enough RAM, etc.) do
 you try to improve it. When you debug the refined version you can also use
 the unittests developed with the basic version, this is very useful.

 Another solution is to keep an associative array of just the string hash
 values, and look into the input file when you have a (possible) false
 positive.
Now you're talking! This is a solution worth a lot of attention. Under what circumstances does it do well vs. not so well? How can you ensure it works for a stream of any size if multiple passes are possible? And so on.
 Another solution that requires even less RAM is to use a bloom filter
 (I have created a bloom filter class in Python but not in D yet) to
 keep what strings are seen, and lookup into the input file (as
 before) to remove (possible) false positives.
Bloom filter is also an excellent idea!

Andrei
Oct 23 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 Under what circumstances it does well vs. not so well?
Assuming the hash function is good, even very similar strings get different hash values. When there are lots of equal lines it doesn't work well, I presume, because there are lots of equal hash values. Note that in our example the lines are very short, so replacing them with a hash_t as the key of the associative array doesn't save you that much RAM (the AA stores that value twice, because it also keeps the hash of the key; and a hash_t, being probably a size_t or uint, hashes to itself).

Bye,
bearophile
Oct 23 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile Wrote:
 Another solution that requires even less RAM is to use a bloom filter...
There's a third possible solution that is very fast: assuming the hash values are uint (32 bit), you can create a bitfield of 2^32 bits; it requires just 1/2 GB, that is 1/4 of my RAM. Each bit represents one possible hash value.

Bye,
bearophile
Oct 23 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
This version compares hash values first, avoiding some slower string compares,
because in D1 strings don't store their hash value:

import d.all, d.string;

void main() {
    int[Record!(hash_t, string)] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) >  horizon)
                 horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }

    putr(horizon);
}

Maybe it's not fully correct yet, so I'll test it some more...

Bye,
bearophile
Oct 23 2008
parent bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 This versions compares hash values first, avoiding some slower
 string compares, because in D1 strings don't store their hash value:
...
         auto pos_ptr = record(hash_clean_line, clean_line) in seen;
Ignore this solution, it's not faster, because the whole string is hashed twice anyway. You have to use a String class like mine that lazily computes and stores the hash value too...

Bye,
bearophile
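A tiny sketch (present-day D, not bearophile's actual String class) of a key type that hashes the line once, up front, and then reuses the cached value for every AA operation:

struct HashedLine
{
    string text;
    size_t cachedHash;

    this(string s)
    {
        text = s;
        cachedHash = hashOf(s);        // computed exactly once
    }

    size_t toHash() const nothrow @safe
    {
        return cachedHash;             // reused by lookup and insertion
    }

    bool opEquals(const HashedLine rhs) const
    {
        // cheap hash comparison first, full string compare only on a match
        return cachedHash == rhs.cachedHash && text == rhs.text;
    }
}

With this, an int[HashedLine] table and a lookup via HashedLine(clean_line) hash the text only once per input line instead of once for the "in" test and once for the insertion.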
Oct 23 2008
prev sibling next sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Thu, 23 Oct 2008 14:29:36 -0500,
Andrei Alexandrescu wrote:
 Consider some text stream. We define the horizon of the stream as the 
 longest number of lines between two identical lines minus one, or zero 
 if the stream consists of unique lines.
 
 For example, this stream has horizon 9:
 
 lorem
 ipsum
 dolor
 habeas
 corpus
 ipsum
 sit
 amet
 rigor
 mortis
 consectetur
 coitus
 interruptus
 corpus
 adipisicing
 elit
 
 This is because the word "corpus" occurs on line 5 and line 14. This is 
 the farthest two lines in the file are; notice that "ipsum" also occurs 
 twice, but at a smaller distance.
 
 The challenge, should you blah blah, is to compute the horizon of a 
 potentially large file. Of course theoretical and practical speed are as 
 important as always.
 
 This is a task that's interesting and subtle in quite a number of ways, 
 and it allows for many creative solutions. Therefore I intentionally did 
 not restrict it much. Let ideas flow!
Traverse the stream, line by line. If the word is new, add it to a dictionary, size_t[string], with the current line number. If it's already in the dictionary, look at the distance to its first occurrence and update the horizon if that distance is greater than what we already have. If we don't run out of memory, we're done in one pass.

If we do run out of memory, recover ;) then continue without adding words into the dictionary, dumping them into a temporary file instead, just checking and updating the horizon.

The stream ends. If our current horizon is greater than the number of lines we dumped into the temp file, we're done. Otherwise, empty the dictionary, open the temp file, and repeat.
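A rough one-pass sketch of that idea in present-day D. The file names and the memory budget are made up, real out-of-memory recovery is approximated by capping the dictionary size, and the "repeat on the temp file" step is only hinted at in comments:

import std.stdio;

void main()
{
    enum maxEntries = 10_000_000;   // stand-in for "until memory runs out"
    size_t[string] firstSeen;       // line -> line number of first occurrence
    size_t horizon = 0, dumped = 0, lineno = 0;
    auto overflow = File("overflow.tmp", "w");

    foreach (line; File("data.txt").byLineCopy)
    {
        if (auto p = line in firstSeen)
        {
            if (lineno - *p > horizon)
                horizon = lineno - *p;
        }
        else if (firstSeen.length < maxEntries)
            firstSeen[line] = lineno;
        else
        {
            // Dictionary is "full": keep checking, but dump untracked lines
            // (with their numbers) for a later pass.
            overflow.writefln("%d\t%s", lineno, line);
            ++dumped;
        }
        ++lineno;
    }
    overflow.close();

    // If dumped > horizon, empty the dictionary and repeat over overflow.tmp,
    // exactly as described in the post above.
    writeln(horizon);
}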
Oct 23 2008
prev sibling next sibling parent mgen <bmeck stedwards.edu> writes:
If size is a problem use a trie!

//---CODE---
import std.stream;
import std.stdio;

// Trie node: one node per character; a complete line ends at the node whose
// "seen" flag is set, with "first_seen" holding its first line number.
class position_trie
{
bool seen = false;
ulong first_seen = 0;
position_trie[char] branch;
}

void main()
{
  ulong max;  // longest distance (horizon) found so far
  Stream file = new MemoryStream(
    "lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit"
  );
  auto root = new position_trie();
  position_trie node = root;
  position_trie next;
  foreach(ulong lineno,char[] line;file)
  {
    foreach(int chrno,char chr;line)
    {
      if(chr in node.branch)
      {
        node = node.branch[chr];
      }
      else
      {
        next = new position_trie();
        node.branch[chr] = next;
        node = next;
        foreach(char chr2; line[chrno+1..$])
        {
          next = new position_trie();
          node.branch[chr2] = next;
          node = next;
        }
        break;
      }
    }
    if(node.seen)
    {
      // distance from the first occurrence of this exact line
      auto dist = lineno - node.first_seen;
      if(dist > max)
        max = dist;
    }
    else
    {
      node.seen = true;
      node.first_seen = lineno;  // remember only the first occurrence
    }
    node = root;
  }
  writefln(max);
}
Oct 23 2008
prev sibling next sibling parent BCS <ao pathlink.com> writes:
(going at this blind so I might dup others' ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- record the line index and the start location of each line as a function of its hash

- look for the hash bucket with the largest spread of line indexes
- grab that spread
- page in the needed file segments
- test if the lines are the same, if so return
- else try again
-- using some kind of LRU cache keep as many pages in ram as you can
-- prefer testing stuff that is in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, not 
require too much metadata, and work well with the cache.



Another option:

load every line into a database table with its index:

INSERT INTO tmp SELECT MAX(index) - MIN(index) FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;

if(c == 0)
  return 0;
else
  return i;
Oct 23 2008
prev sibling next sibling parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
With if "corpus" would appear one more time but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus

Would this stream have a different horizon?
Oct 23 2008
parent BCS <ao pathlink.com> writes:
Reply to Lionello,

 With if "corpus" would appear one more time but at a smaller distance?
 
 line 5: corpus
 ...
 line 14: corpus
 ..
 line 16: corpus
 Would this stream have a different horizon?
 
IMNSHO no, the max spread would be from 5 to 16.
Oct 23 2008
prev sibling next sibling parent Jason House <jason.james.house gmail.com> writes:


When a hash collision occurs, append to the list, but also calculate the
horizon from the head of the list.

While the horizon is higher than previously observed, scan the list until a
string match is found and update the maximum observed horizon.


The performance is linear, but it has extra string lookups compared to some
alternatives. It's possible to optimize the string comparisons with some
kind of temporary caching of strings, such as weak pointers.
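The opening of this message appears to be missing from the archive, so the following is only one reading of the scheme, sketched in present-day D (the file name is a placeholder): map each line's hash to a chain of (line, first line number) entries, and on a collision walk the chain from the head only while a better horizon is still possible.

import std.stdio;

struct Entry { string text; size_t firstLine; }

void main()
{
    Entry[][size_t] buckets;          // hash -> chain of lines sharing it
    size_t horizon = 0, lineno = 0;

    foreach (line; File("data.txt").byLineCopy)
    {
        auto h = hashOf(line);
        bool matched = false;
        if (auto chain = h in buckets)
        {
            foreach (ref e; *chain)
            {
                // Entries are in first-occurrence order, so once the next
                // entry can't beat the current horizon, nothing later can.
                if (lineno - e.firstLine <= horizon) break;
                if (e.text == line)
                {
                    horizon = lineno - e.firstLine;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            buckets[h] ~= Entry(line, lineno);
        ++lineno;
    }
    writeln(horizon);
}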



Andrei Alexandrescu wrote:

 Consider some text stream. We define the horizon of the stream as the
 longest number of lines between two identical lines minus one, or zero
 if the stream consists of unique lines.
 
 For example, this stream has horizon 9:
 
 lorem
 ipsum
 dolor
 habeas
 corpus
 ipsum
 sit
 amet
 rigor
 mortis
 consectetur
 coitus
 interruptus
 corpus
 adipisicing
 elit
 
 This is because the word "corpus" occurs on line 5 and line 14. This is
 the farthest two lines in the file are; notice that "ipsum" also occurs
 twice, but at a smaller distance.
 
 The challenge, should you blah blah, is to compute the horizon of a
 potentially large file. Of course theoretical and practical speed are as
 important as always.
 
 This is a task that's interesting and subtle in quite a number of ways,
 and it allows for many creative solutions. Therefore I intentionally did
 not restrict it much. Let ideas flow!
 
 
 Andrei
Oct 23 2008
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile Wrote:
 There's a third possible solution, that is very fast: assuming the hash values
 are uint (32 bit), then you can create a bitfield of 2^32 bits, it requires
 just 1/2 GB, that is 1/4 of my RAM. So each bit represents a possible
 different value of the hash.
If the file is "small" my first implementation is good enough (small means the associative array has to fit in for example 2 GB). For larger files here's the implementation of the third idea, with D1, Phobos + my libs: import d.all, std.string, std.c.stdlib, std.outofmemory; const uint bpc = uint.sizeof * 8; bool isFlagSet(uint* flags, uint i) { int offset = i / bpc; uint mask = 1 << (i % bpc); return (flags[offset] & mask) != 0; } void setFlag(uint* flags, uint i) { int offset = i / bpc; uint mask = 1 << (i % bpc); if ((flags[offset] & mask) == 0) flags[offset] |= mask; } void main() { uint* seen = cast(uint*)calloc(1 << 29, 1); uint* seen_again = cast(uint*)calloc(1 << 29, 1); if (seen is null || seen_again is null) throw new OutOfMemoryException(); foreach (line; xfile("data.txt")) { uint h = hash(line.chomp()); if (isFlagSet(seen, h)) // not used std.intrinsics, because they segfault here setFlag(seen_again, h); else setFlag(seen, h); } free(seen); int[string] aa; int horizon; foreach (nline, line; xfile("data.txt")) { auto clean_line = line.chomp(); if (isFlagSet(seen_again, hash(clean_line))) { auto pos_ptr = clean_line in aa; if (pos_ptr) { if ((nline - *pos_ptr) > horizon) horizon = nline - *pos_ptr; } else aa[clean_line.dup] = nline; } } putr(horizon); } With D1, just Phobos: import std.stdio, std.string, std.c.stdlib, std.stream, std.outofmemory; const uint bpc = uint.sizeof * 8; bool isFlagSet(uint* flags, uint i) { int offset = i / bpc; uint mask = 1 << (i % bpc); return (flags[offset] & mask) != 0; } void setFlag(uint* flags, uint i) { int offset = i / bpc; uint mask = 1 << (i % bpc); if ((flags[offset] & mask) == 0) flags[offset] |= mask; } void main() { uint* seen = cast(uint*)calloc(1 << 29, 1); uint* seen_again = cast(uint*)calloc(1 << 29, 1); if (seen is null || seen_again is null) throw new OutOfMemoryException(); foreach (string line; new BufferedFile("data.txt")) { auto clean_line = line.chomp(); uint h = typeid(string).getHash(&clean_line); if (isFlagSet(seen, h)) // not used std.intrinsics, because they segfault here setFlag(seen_again, h); else setFlag(seen, h); } free(seen); int[string] aa; int horizon, nline; foreach (string line; new BufferedFile("data.txt")) { auto clean_line = line.chomp(); if (isFlagSet(seen_again, typeid(string).getHash(&clean_line))) { auto pos_ptr = clean_line in aa; if (pos_ptr) { if ((nline - *pos_ptr) > horizon) horizon = nline - *pos_ptr; } else aa[clean_line.dup] = nline; } nline++; } writefln(horizon); } The worst case is like this: A B C D E E D C B A In this case the aa associative array has to store half of the lines of the file, and it be too much. As you can see with this little program I have found another bug in Phobos, the intrinsics in this case don't work (and I have improved the Record of my libs too, used in another version of this code). Bye, bearophile
Oct 24 2008
prev sibling next sibling parent reply Nigel Sandever <nigelsandever btconnect.com> writes:
On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> wrote:
 Consider some text stream. We define the horizon of the stream as the 
 longest number of lines between two identical lines minus one, or zero 
 if the stream consists of unique lines.
 
 For example, this stream has horizon 9:
 
 lorem
 ipsum
 dolor
 habeas
 corpus
 ipsum
 sit
 amet
 rigor
 mortis
 consectetur
 coitus
 interruptus
 corpus
 adipisicing
 elit
 
 This is because the word "corpus" occurs on line 5 and line 14. This is 
 the farthest two lines in the file are; notice that "ipsum" also occurs 
 twice, but at a smaller distance.
 
 The challenge, should you blah blah, is to compute the horizon of a 
 potentially large file. Of course theoretical and practical speed are as 
 important as always.
 
 This is a task that's interesting and subtle in quite a number of ways, 
 and it allows for many creative solutions. Therefore I intentionally did 
 not restrict it much. Let ideas flow!
 
 
 Andrei
The hash solution is simple and fast, but uses too much memory. So, sub-divide your data. Obviously, you cannot process the first half and then the second, but you can make multiple complete passes, and only consider a subset during each pass.

For example, if you hash all lines beginning with 'a' (or 'A' or whatever is applicable to your data) on the first pass, 'b' on the second pass, etc. At the end of each pass you retain only the best horizon so far discovered and discard the rest. Assuming all lower-case and a roughly even distribution, you reduce your memory requirements to ~4%.

By way of example, I knocked up a data file consisting of 100+ million lines (1GB) of words chosen randomly from a dictionary. Using Perl & 26 passes, I got my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file system caches for the second and subsequent passes, which saved considerable time. If your file is much bigger, that probably wouldn't be the case. But if you're going to write this in D, you can probably speed things up a bit anyway: wc -l with a cold cache manages a throughput of 28mb/sec on my system, where perl only achieved a best of 60mb/sec with a warm cache.

I seem to recall seeing a memory-mapped file module in the D library somewhere. Maybe that would reduce the IO for multiple passes?

Also, given the minimal 2MB max. memory usage above, if you know (or can probe) your data, then you can reduce the number of passes by doing (say) a-f on the first pass, g-l on the second, etc. Or if all your data begins with (say) "Line ...", then key off the appropriate position in the line to form your pass split. Or, if your file is really huge, use the first two characters for a 676-way split. (And wait :)

Anyway, just a thought from a drive-by reader.
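A rough sketch of the multi-pass idea in present-day D (the file name and the 26-way 'a'..'z' split are assumptions; real data may need a different partitioning, as noted above):

import std.stdio;

void main()
{
    size_t horizon = 0;

    foreach (char bucket; "abcdefghijklmnopqrstuvwxyz")
    {
        size_t[string] firstSeen;   // only lines starting with `bucket`
        size_t lineno = 0;
        foreach (line; File("data.txt").byLineCopy)
        {
            if (line.length && line[0] == bucket)
            {
                if (auto p = line in firstSeen)
                {
                    if (lineno - *p > horizon)
                        horizon = lineno - *p;
                }
                else
                    firstSeen[line] = lineno;
            }
            ++lineno;
        }
        // firstSeen goes out of scope here; only `horizon` survives the pass.
    }
    writeln(horizon);
}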
Oct 25 2008
next sibling parent Nigel Sandever <nigelsandever btconnect.com> writes:
On Sun, 26 Oct 2008 01:00:59 GMT, Nigel Sandever <nigelsandever btconnect.com> 
wrote:

 system, where perl only achieved a best of 60mb/sec with a warm cache.
 
Should have been 12MB/sec. I was instrumenting in 5 second intervals.
Oct 25 2008
prev sibling next sibling parent reply mgen <bmeck stedwards.edu> writes:
Could you post that file somewhere so we could get some even level testing
going for these algos?
Oct 25 2008
next sibling parent Nigel Sandever <nigelsandever btconnect.com> writes:
On Sat, 25 Oct 2008 21:45:29 -0400, mgen <bmeck stedwards.edu> wrote:
 Could you post that file somewhere so we could get some even level testing 
 going for these algos?

Sure. If you know somewhere that would allow me to upload a (bzip -9) 300Mb file? And are prepared to wait a while, I'm on dial-up :(
Oct 25 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
mgen:
 Could you post that file somewhere so we could get some even level testing
 going for these algos?

It looks a bit too huge. Better to write a little program that generates random data with a suitable distribution. To make it portable across Phobos/Tango/etc, the random number generator can be included in the code; there are very fast ones around.

Bye,
bearophile
Oct 25 2008
parent Nigel Sandever <nigelsandever btconnect.com> writes:
On Sat, 25 Oct 2008 22:54:47 -0400, bearophile <bearophileHUGS lycos.com> wrote:
 mgen:
 Could you post that file somewhere so we could get some even level testing 
 going for these algos?
 
 It looks a bit too much huge. Better to write a little program that generates 
 random data with a suitable distribution. To make it portable across
 Phobos/Tango/etc the RND generator can be included into the code, there are
 very fast around.
 
 Bye,
 bearophile
 
You'd need the PRNG to produce the same sequence for the same seed cross-platform (the MT, for example). And you'd need to distribute (or point to) a common dictionary.
Oct 25 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Nigel Sandever:
 For example, if you hash all lines beginning with 'a' (or 'A' or whatever is 
 applicable with your data), and 'b' on the second pass etc. At the end of each
 pass you retain only the best horizon so far discovered and discard the rest.
A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly, so how can you do it? A good hash value is nearly random, so you can partition on its bits to split your data into more uniform groups :-) The downside is that you have to compute the hash for each line many times. To speed this up you can hash just a sample of the chars of each line...

Note that a 1 GB file is too small for a true benchmark, I have 2 GB of RAM...

You don't need to load the lines as strings, you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because their ending point is the end of the file or the next newline, so you just need the starting points as keys and the line indexes as values. Unfortunately the built-in AA with the built-in memory allocator uses the same memory whether your keys are 8 or 4 bytes long. So if you see memory isn't enough to run your program, you need a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.

Bye,
bearophile
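A small fragment (present-day D; k is an assumed tuning knob) showing the only change needed versus the first-letter split: pick the bin for the current pass from the top bits of the built-in hash, which spreads lines across the 2^k passes much more evenly.

enum k = 4;                       // 2^k = 16 passes / bins; tune to fit RAM
bool inBin(string line, size_t bin)
{
    // the top k bits of the hash choose which pass handles this line
    return (hashOf(line) >> (size_t.sizeof * 8 - k)) == bin;
}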
Oct 25 2008
parent reply Nigel Sandever <nigelsandever btconnect.com> writes:
On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS lycos.com> wrote:
 Nigel Sandever:
 For example, if you hash all lines beginning with 'a' (or 'A' or whatever is 
 applicable with your data), and 'b' on the second pass etc. At the end of each
 pass you retain only the best horizon so far discovered and discard the rest.
 
 A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc,
 to split the input data into bins. But you want to fill those bins uniformly. So
 how can you do it? The hash number if good is nearly random, so you can partition
 its bits to partition your data into more uniform groups :-) The downside is that
 you have to compute the hash for each line many times. To speed this up you can
 hash just a sample of the chars of each line...

I did try that (using md5), but the penalty in Perl was horrible. The hash (AA) just had to rehash the binary md5 anyway, so it was pure cost for my known dataset. For a real application with unknown/non-uniform data it would work like a charm though... assuming you don't hit that rarest of non-impossibilities: duplicate md5s of non-identical inputs :)

That said, for most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading total generality for domain-specific knowledge is often the best optimisation one can perform.
 
 Note that 1 GB file is too much small for a true benchmark, I have 2 GB RAM... 
Agreed and ditto. But provided the algorithms are written to assume files larger than memory, the numbers do scale linearly.

 You don't need to load the lines as strings, you can load the whole file as a
 single memory block in RAM. Then you don't even need slices (8 bytes) to
 represent a string, because their ending point is the end of the file or the
 successive newline, so you just need the starting points as keys and the line
 indexes as values. Unfortunately the built-in AA with the built-in memory
 allocator uses the same memory if your keys are 8 or 4 bytes long. So if you
 see memory isn't enough to run your program, then you need to use a less
 memory-hungry hash map. You can find plenty around, written in C, I have
 translated one to D, it's easy. With that if the lines aren't too much short
 or there are some duplicate lines, then the RAM (assuming to use 2 GB)
 suffices to keep the whole associative array, so you need to load the file
 just one time.

I agree that a 'testset producer' and a freely accessible known dictionary is a better option. I used (a slightly modified version of) 2of12inf available from
http://downloads.sourceforge.net/wordlist/alt12dicts-4.tar.gz
 
 Bye,
 bearophile
Oct 25 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Nigel Sandever:

I did try that (using md5), but the penalty in Perl was horrible,<
This is a D newsgroup, so use D; it allows you to manage bits more efficiently. md5 is too slow for this purpose, use the built-in hashing of strings.
That said. For most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain specific
knowledge for total generality is often the best optimisation one can perform.<
I agree, when you have gigabytes of data it's better to gain some knowledge about the data before processing it.
But provided algorithms are written to assume files larger than memory, the
numbers do scale linearly.<
Several of the provided algorithms don't assume files larger than memory, or they work much better when the file is smaller than memory.
I used (a slightly modified version of) 2of12inf available from<
That's quite a complex file, so I suggest something simpler, like this one after cleaning out the non-ASCII words:
http://www.norvig.com/big.txt

Bye,
bearophile
Oct 26 2008
parent Nigel Sandever <nigelsandever btconnect.com> writes:
On Sun, 26 Oct 2008 03:39:50 -0400, bearophile <bearophileHUGS lycos.com> wrote:
 Nigel Sandever:
 
I did try that (using md5), but the penalty in Perl was horrible,<
This is a D newsgroup, so use D, it allows you to manage bits more
efficiently.
 
Sorry. No disrespect meant to D. I always prototype in Perl and then convert to C or D if I need performance. I'm just more familiar with Perl.
 
I used (a slightly modified version of) 2of12inf available from<
That's a quite complex file, so I suggest something simpler, as this after a
cleaning of the non ASCII words:
 http://www.norvig.com/big.txt
I don't know what is "complex" about a 1-word-per-line, 81536-line dictionary file? Or how having everyone clean up Conan Doyle would be simpler?

If you have Perl, you can produce a suitable testfile from any 1-word-per-line dictionary with the command line:

perl -l12n0777aF/\n/ -ne'print $F[rand F] for 1..4e8' yourdict >thedata

With the 2of12inf dictionary file, 4e8 produces a file a little under 4GB in around 10 minutes. YMMV depending upon the average length of the lines in your local dict. Of course they won't all be the same as mine, or anyone else's, but given the random nature, the results will be broadly comparable.

My D foo is too rusty to try and write that in D. Especially for a D audience :) I'm sure one of you guys can knock that up in the blink of an eye.
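A rough D equivalent of the Perl one-liner above (sketch only; the dictionary path, output path, and line count are placeholders, and the fixed Mersenne Twister seed follows the reproducibility point made earlier in the thread):

import std.file : readText;
import std.random : Mt19937, uniform;
import std.stdio : File;
import std.string : splitLines;

void main()
{
    auto words = readText("2of12inf.txt").splitLines();   // one word per line
    auto rng = Mt19937(42);                                // fixed seed
    auto output = File("thedata", "w");
    foreach (i; 0 .. 400_000_000)                          // ~4e8 lines
        output.writeln(words[uniform(0, words.length, rng)]);
}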
 
 Bye,
 bearophile
Oct 26 2008
prev sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Andrei Alexandrescu wrote:
 Consider some text stream. We define the horizon of the stream as the 
 longest number of lines between two identical lines minus one, or zero 
 if the stream consists of unique lines.
 
 For example, this stream has horizon 9:
 
 lorem
 ipsum
 dolor
 habeas
 corpus
 ipsum
 sit
 amet
 rigor
 mortis
 consectetur
 coitus
 interruptus
 corpus
 adipisicing
 elit
 
 This is because the word "corpus" occurs on line 5 and line 14. This is 
 the farthest two lines in the file are; notice that "ipsum" also occurs 
 twice, but at a smaller distance.
 
 The challenge, should you blah blah, is to compute the horizon of a 
 potentially large file. Of course theoretical and practical speed are as 
 important as always.
 
 This is a task that's interesting and subtle in quite a number of ways, 
 and it allows for many creative solutions. Therefore I intentionally did 
 not restrict it much. Let ideas flow!
 
 
 Andrei
Simplicity, considering external memory constraints:

Read in the file, line by line. Output the line followed by the line number. Sort the output (merge sort works wonderfully). Now you go through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think we already did sorting in external memory, so I could implement this much quicker than most alternatives. Depending on the number of times this has to be done, that might be a net advantage.
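A compact sketch of the sort-based approach in present-day D, with an in-memory sort standing in for the external merge sort described above (the file name is a placeholder):

import std.algorithm : sort;
import std.stdio;
import std.typecons : Tuple, tuple;

void main()
{
    // Step 1: pair every line with its line number.
    Tuple!(string, size_t)[] tagged;
    size_t lineno = 0;
    foreach (line; File("data.txt").byLineCopy)
        tagged ~= tuple(line, lineno++);

    // Step 2: sort by (line text, line number); equal lines become adjacent,
    // with their original line numbers in ascending order.
    sort(tagged);

    // Step 3: one scan; for each run of equal lines, the distance between
    // the first and last occurrence is a horizon candidate.
    size_t horizon = 0;
    for (size_t i = 0; i < tagged.length; )
    {
        size_t j = i;
        while (j + 1 < tagged.length && tagged[j + 1][0] == tagged[i][0])
            ++j;
        if (tagged[j][1] - tagged[i][1] > horizon)
            horizon = tagged[j][1] - tagged[i][1];
        i = j + 1;
    }
    writeln(horizon);
}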
Nov 01 2008
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Christopher Wright wrote:
 Andrei Alexandrescu wrote:
 Consider some text stream. We define the horizon of the stream as the 
 longest number of lines between two identical lines minus one, or zero 
 if the stream consists of unique lines.

 For example, this stream has horizon 9:

 lorem
 ipsum
 dolor
 habeas
 corpus
 ipsum
 sit
 amet
 rigor
 mortis
 consectetur
 coitus
 interruptus
 corpus
 adipisicing
 elit

 This is because the word "corpus" occurs on line 5 and line 14. This 
 is the farthest two lines in the file are; notice that "ipsum" also 
 occurs twice, but at a smaller distance.

 The challenge, should you blah blah, is to compute the horizon of a 
 potentially large file. Of course theoretical and practical speed are 
 as important as always.

 This is a task that's interesting and subtle in quite a number of 
 ways, and it allows for many creative solutions. Therefore I 
 intentionally did not restrict it much. Let ideas flow!


 Andrei
 Simplicity, considering external memory constraints:
 
 Read in the file, line by line. Output the line followed by the line number.
 Sort the output (merge sort works wonderfully). Now you go through it a
 second time and find the maximum distance.
 
 This takes O(n log n) time, which is far from optimal. However, I think we
 already did sorting in external memory, so I could implement this much
 quicker than most alternatives. Depending on the number of times this has to
 be done, that might be a net advantage.

I like this solution a lot. Kudos!

Andrei
Nov 01 2008