## digitalmars.D - [OT] The horizon of a stream

• Andrei Alexandrescu (30/30) Oct 23 2008 Consider some text stream. We define the horizon of the stream as the
• bearophile (36/36) Oct 23 2008 First try, D1, Phobos:
• Andrei Alexandrescu (6/7) Oct 23 2008 [snip]
• bearophile (7/10) Oct 23 2008 I understand, but I have 2 GB of RAM, so that solution (in Python) works...
• Andrei Alexandrescu (6/27) Oct 23 2008 Now you're talking! This is a solution worth a lot of attention. Under
• bearophile (6/7) Oct 23 2008 Assuming that the hash function is good, very similar strings too have d...
• bearophile (4/5) Oct 23 2008 There's a third possible solution, that is very fast: assuming the hash ...
• bearophile (20/20) Oct 23 2008 This versions compares hash values first, avoiding some slower string co...
• bearophile (4/8) Oct 23 2008 Ignore this solution, it's not faster, because the whole string is hashe...
• Sergey Gromov (13/47) Oct 23 2008 Traverse the stream, line by line. If the word is new, add it to a
• mgen (69/69) Oct 23 2008 If size is a problem use a trie!
• BCS (21/21) Oct 23 2008 (going at this blind so I might dup others ideas)
• Lionello Lunesu (7/7) Oct 23 2008 With if "corpus" would appear one more time but at a smaller distance?
• BCS (2/11) Oct 23 2008 IMNSHO no the max spread would be from 5 to 16
• Jason House (9/46) Oct 23 2008 By hash value, store a linked list of (line #, char offset within file)
• bearophile (104/105) Oct 24 2008 If the file is "small" my first implementation is good enough (small mea...
• Nigel Sandever (29/66) Oct 25 2008 The hash solution is simple and fast, but uses too much memory. So, sub-...
• Nigel Sandever (3/5) Oct 25 2008 Should have been 12MB/sec. I was instrumenting in 5 second intervals.
• mgen (1/1) Oct 25 2008 Could you post that file somewhere so we could get some even level testi...
• Nigel Sandever (5/6) Oct 25 2008 going for these algos?
• bearophile (4/5) Oct 25 2008 It looks a bit too much huge. Better to write a little program that gene...
• Nigel Sandever (8/16) Oct 25 2008 random data with a suitable distribution. To make it portable across
• bearophile (5/8) Oct 25 2008 A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', ...
• Nigel Sandever (34/45) Oct 25 2008 rest.
• bearophile (9/13) Oct 26 2008 This is a D newsgroup, so use D, it allows you to manage bits more effic...
• Nigel Sandever (18/32) Oct 26 2008 Sorry. No disrespect meant to D. I always prototype in Perl and then con...
• Christopher Wright (9/46) Nov 01 2008 Simplicity considering external memory constraints.
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. That is
the farthest apart any two identical lines in the file are; notice that
"ipsum" also occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei
```
Oct 23 2008
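As a baseline, the definition above can be stated as a short executable sketch (Python here for brevity; function and variable names are illustrative). A single pass keeping only the *first* occurrence of each line suffices, since the first occurrence maximizes the distance to any later duplicate:

```python
def horizon(lines):
    """Longest line-number distance between two identical lines; 0 if all unique."""
    first = {}  # line -> index of its first occurrence
    best = 0
    for i, line in enumerate(lines):
        if line in first:
            best = max(best, i - first[line])  # first occurrence maximizes distance
        else:
            first[line] = i
    return best

stream = ("lorem ipsum dolor habeas corpus ipsum sit amet rigor "
          "mortis consectetur coitus interruptus corpus elit").split()
print(horizon(stream))  # "corpus" at positions 4 and 13 -> prints 9
```

This is exactly the hashtable approach the first replies below start from; the rest of the thread is about what to do when the table no longer fits in memory.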
bearophile <bearophileHUGS lycos.com> writes:
```First try, D1, Phobos:

import std.stdio, std.stream;

void main() {
    int[string] seen;
    int horizon;

    int nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
        nline++;
    }

    writefln(horizon);
}

Idem, with my libs:

import d.all, d.string;

void main() {
    int[string] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }

    putr(horizon);
    putr(seen);
}

I'll show an "improved" version soon...

Bye,
bearophile
```
Oct 23 2008
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```bearophile wrote:
First try, D1, Phobos:

[snip]

At this point a discussion "ok, I'll store every line in a hashtable"
would have sufficed. I can tell straight away that this is not a
workable solution for a large file. It's actually happened to me :o|.

Andrei
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```Andrei Alexandrescu:
At this point a discussion "ok, I'll store every line in a hashtable"
would have sufficed. I can tell straight away that this is not a
workable solution for a large file. It's actually happened to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python) works most
of the time. When you design an algorithm, you start with the simpler version
(that's also easy to debug and test), and only when you experimentally see it
doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it.
When you debug the refined version you can also reuse the unittests developed
with the basic version, which is very useful.

Another solution is to keep an associative array of just the string hash
values, and look back into the input file when you have a (possible) false
positive.

Another solution that requires even less RAM is to use a bloom filter (I have
created a bloom filter class in Python but not in D yet) to record which strings
have been seen, and look back into the input file (as before) to remove (possible)
false positives.

Finally, looking back into the input file every time there's a match is too
slow, so you just store the candidate matches in a file or in RAM, to test
them all in a second read of the file. If the bloom filter is tuned well
enough, then such possible matches aren't too many, and you can keep them in
RAM.

Bye,
bearophile
```
Oct 23 2008
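The bloom filter bearophile mentions can be sketched as follows. This is a hypothetical minimal filter (my own construction, not his Python class; the sizes and salted-md5 trick for simulating k hash functions are my choices):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: false positives possible, false negatives impossible."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # simulate k independent hash functions by salting one digest
        for salt in range(self.k):
            digest = hashlib.md5(b"%d:%s" % (salt, item.encode())).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(m_bits=80_000, k=4)
bf.add("corpus")
bf.add("ipsum")
print("corpus" in bf)  # True: added items are always found
print("elit" in bf)    # False with overwhelming probability: never added
```

A line whose lookup says "seen" is only a *candidate* duplicate; as the post says, candidates still have to be verified against the real lines in a second pass.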
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```bearophile wrote:
Andrei Alexandrescu:
At this point a discussion "ok, I'll store every line in a
hashtable" would have sufficed. I can tell straight away that this
is not a workable solution for a large file. It's actually happened
to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python)
works most of the times. When you design an algorithm, you start with
the simpler version (that's also easy to debug and test), and only
when you experimentally see it doesn't suffice (too much slow, not
enough RAM, etc) you try to improve it. When you debug the refined
version you can also use the unittests developed with the basic
version, this is very useful.

Another solution is to keep an associative array of just the string
hash values, and lookup into the input file when you have a
(possible) false positive.

Now you're talking! This is a solution worth a lot of attention. Under
what circumstances does it do well vs. not so well? How can you ensure it
works for a stream of any size if multiple passes are possible? etc.

Another solution that requires even less RAM is to use a bloom filter
(I have created a bloom filter class in Python but not in D yet) to
keep what strings are seen, and lookup into the input file (as
before) to remove (possible) false positives.

Bloom filter is also an excellent idea!

Andrei
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```Andrei Alexandrescu:
Under what circumstances it does well vs. not so well?

Assuming that the hash function is good, even very similar strings have
different hash values.

When there are lots of equal lines it doesn't work well, I presume, because
there are lots of equal hash values.

Note that in our example the lines are very short, so replacing them with a
hash_t as the key of the associative array doesn't save you that much RAM
(the AA then stores that value twice, because it keeps the hash of the hash
value too; that, being probably a size_t or uint, is hashed to itself).

Bye,
bearophile
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile Wrote:
Another solution that requires even less RAM is to use a bloom filter...

There's a third possible solution that is very fast: assuming the hash values
are uint (32 bit), you can create a bitfield of 2^32 bits. It requires just
1/2 GB, that is 1/4 of my RAM. Each bit then represents a possible distinct
value of the hash.

Bye,
bearophile
```
Oct 23 2008
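Scaled down, the bitfield scheme works like this (a sketch with 2^24 bits = 2 MB instead of the post's 2^32 bits = 512 MB; the width only affects the number of false candidates, not correctness, because the second pass compares real lines):

```python
HBITS = 24                   # 2**24 bits = 2 MB; the post proposes 2**32 = 512 MB
MASK = (1 << HBITS) - 1

def horizon_bitfield(lines):
    # pass 1: one bit per hash value; hashes hit a second time go into dup_hashes
    seen = bytearray(1 << (HBITS - 3))
    dup_hashes = set()
    for line in lines:
        h = hash(line) & MASK
        if seen[h >> 3] & (1 << (h & 7)):
            dup_hashes.add(h)
        else:
            seen[h >> 3] |= 1 << (h & 7)
    # pass 2: only candidate lines enter the (now much smaller) dictionary
    first, best = {}, 0
    for i, line in enumerate(lines):
        if hash(line) & MASK in dup_hashes:
            if line in first:
                best = max(best, i - first[line])
            else:
                first[line] = i
    return best
```

Hash collisions can only add false candidates, never drop a real duplicate, which is exactly the "remove possible false positives" step described above.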
bearophile <bearophileHUGS lycos.com> writes:
```This version compares hash values first, avoiding some slower string
compares, because in D1 strings don't store their hash value:

import d.all, d.string;

void main() {
    int[Record!(hash_t, string)] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }

    putr(horizon);
}

Maybe it's not fully correct yet, so I'll test it some more...

Maybe it's not fully correct yet, so I'll test it some more...

Bye,
bearophile
```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile:
This versions compares hash values first, avoiding some slower
string compares, because in D1 strings don't store their hash value:
...
auto pos_ptr = record(hash_clean_line, clean_line) in seen;

Ignore this solution, it's not faster, because the whole string is hashed
twice, etc. You'd have to use a String class like mine that lazily computes
and stores the hash value too...

Bye,
bearophile
```
Oct 23 2008
Sergey Gromov <snake.scaly gmail.com> writes:
```Thu, 23 Oct 2008 14:29:36 -0500,
Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Traverse the stream, line by line.  If the word is new, add it to a
dictionary, size_t[string], with the current line number.  If it's
already in the dictionary, look at the distance to its first occurrence
and update the horizon if that distance is greater than what we already have.
If we don't run out of memory, we're done in one pass.

If we do run out of memory, recover ;) then continue without adding words
into the dictionary but dumping them into a temporary file instead, just
checking and updating the horizon.

The stream ends.  If our current horizon is greater than the number of
lines we dumped into a temp file, we're done.  Otherwise, empty the
dictionary, open the temp file, and repeat.
```
Oct 23 2008
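Sergey's scheme can be sketched with a capped dictionary standing in for "out of memory" and a list standing in for the temp file (the cap and all names are mine). Because the dictionary only grows, once it fills, *every* later unseen line is spilled, so both occurrences of any missed pair land in the same later round with their original line numbers intact:

```python
def horizon_capped(lines, max_entries=1_000_000):
    """Repeated passes: once the dictionary is full, unseen lines are spilled
    (with original line numbers) and reprocessed in the next round.
    Assumes max_entries >= 1."""
    best = 0
    pending = list(enumerate(lines))
    while pending:
        first, spill = {}, []
        for i, line in pending:
            if line in first:
                best = max(best, i - first[line])
            elif len(first) < max_entries:
                first[line] = i
            else:
                spill.append((i, line))  # the "temp file"
        pending = spill  # each round retires up to max_entries distinct lines
    return best
```

Each round handles `max_entries` distinct lines, so the number of full passes is (distinct lines) / (memory budget), matching the post's description.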
mgen <bmeck stedwards.edu> writes:
```If size is a problem use a trie!

//---CODE---
import std.stream;
import std.stdio;

class position_trie {
    bool seen = false;
    int last_seen = 0;
    position_trie[char] branch;
}

void main() {
    int max;
    Stream file = new MemoryStream(
"lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit"
    );
    auto root = new position_trie();
    position_trie node = root;
    position_trie next;
    foreach (ulong lineno, char[] line; file) {
        foreach (int chrno, char chr; line) {
            if (chr in node.branch) {
                node = node.branch[chr];
            } else {
                next = new position_trie();
                node.branch[chr] = next;
                node = next;
                foreach (char chr2; line[chrno+1 .. $]) {
                    next = new position_trie();
                    node.branch[chr2] = next;
                    node = next;
                }
                break;
            }
        }
        if (node.seen) {
            auto dist = cast(int)lineno - node.last_seen;
            if (dist > max)
                max = dist;
        }
        node.seen = true;
        node.last_seen = cast(int)lineno;
        node = root;
    }
    writefln(max);
}
```
Oct 23 2008
```(going at this blind so I might dup others ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- store the line index and the start location of each line as a function of its hash

- look for the hash bucket with the largest spread of line indexes
- page in the needed file segments
- test if the lines are the same; if so, return
- else try again
-- using some kind of LRU cache, keep as many pages in RAM as you can
-- prefer testing stuff that is in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, not
require too much metadata, and work well with the cache.

Another option:

load every line into a database table with its index:

INSERT INTO tmp SELECT MAX(index) - MIN(index) AS dif FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;

if(c == 0)
return 0;
else
return i;
```
Oct 23 2008
Lionello Lunesu <lio lunesu.remove.com> writes:
```What if "corpus" would appear one more time, but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus

Would this stream have a different horizon?
```
Oct 23 2008
```Reply to Lionello,

With if "corpus" would appear one more time but at a smaller distance?

line 5: corpus
...
line 14: corpus
..
line 16: corpus
Would this stream have a different horizon?

IMNSHO no; the max spread would be from 5 to 16.
```
Oct 23 2008
Jason House <jason.james.house gmail.com> writes:
```By hash value, store a linked list of (line #, char offset within file)

When a hash collision occurs, append to the list, but also calculate the
horizon from the head of the list.

While the horizon is higher than previously observed, scan the list until a
string match is found and update the maximum observed horizon.

The performance is linear, but it does extra string lookups compared to some
alternatives. It's possible to optimize the string comparisons with some
kind of temporary caching of strings, such as weak pointers.

Andrei Alexandrescu wrote:

Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

```
Oct 23 2008
bearophile <bearophileHUGS lycos.com> writes:
```bearophile Wrote:
There's a third possible solution, that is very fast: assuming the hash values
are uint (32 bit), then you can create a bitfield of 2^32 bits, it requires
just 1/2 GB, that is 1/4 of my RAM. So each bit represents a possible different
value of the hash.

If the file is "small" my first implementation is good enough (small means the
associative array has to fit in for example 2 GB). For larger files here's the
implementation of the third idea, with D1, Phobos + my libs:

import d.all, std.string, std.c.stdlib, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (line; xfile("data.txt")) {
        uint h = hash(line.chomp());
        if (isFlagSet(seen, h))
            // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon;
    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, hash(clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
    }

    putr(horizon);
}

With D1, just Phobos:

import std.stdio, std.string, std.c.stdlib, std.stream, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        uint h = typeid(string).getHash(&clean_line);
        if (isFlagSet(seen, h))
            // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon, nline;
    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, typeid(string).getHash(&clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
        nline++;
    }

    writefln(horizon);
}

The worst case is like this:
A
B
C
D
E
E
D
C
B
A

In this case the aa associative array has to store half of the lines of the
file, and that may be too much.

As you can see, with this little program I have found another bug in Phobos:
the intrinsics in this case don't work (and I have improved the Record of my
libs too, used in another version of this code).

Bye,
bearophile
```
Oct 24 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

The hash solution is simple and fast, but uses too much memory. So, sub-divide
the problem.

Obviously, you cannot process the first half and then the second, but you can
make multiple complete passes, and only consider a subset during each pass.

For example, you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with your data) on the first pass, 'b' on the second pass, etc. At
the end of each pass you retain only the best horizon so far discovered and
discard the rest.

Assuming all lower-case and a roughly even distribution, you reduce your memory
requirements to ~4%.

By way of example, I knocked up a data file consisting of 100+ million lines
(1GB) of words chosen randomly from a dictionary. Using Perl & 26 passes, I got
my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file system
caches for the second and subsequent passes which saved considerable time. If
your file is much bigger, that probably wouldn't be the case. But, if you're
going to write this in D, you can probably speed thing up a bit anyway.

wc -l with a cold cache manages to achieve a throughput of 28mb/sec on my
system, where perl only achieved a best of 60mb/sec with a warm cache.

I seem to recall seeing a memory-mapped file module in the D library somewhere.
Maybe that would reduce the IO for multiple passes?

Also, given the minimal 2MB max. memory usage above, if you know (or can probe)
your data, then you can reduce the number of passes by doing (say) a-f on the
first pass, g-l on the second etc.

Or if all your data begins with (say) "Line ...", then key off the appropriate
position in the line to form your pass split. Or, if your file is really huge
use the first two characters for a 676-way split. (And wait :)

Anyway, just a thought from a drive-by reader.
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sun, 26 Oct 2008 01:00:59 GMT, Nigel Sandever <nigelsandever btconnect.com>
wrote:

system, where perl only achieved a best of 60mb/sec with a warm cache.

Should have been 12MB/sec. I was instrumenting in 5 second intervals.
```
Oct 25 2008
mgen <bmeck stedwards.edu> writes:
```Could you post that file somewhere so we could get some even level testing
going for these algos?
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 21:45:29 -0400, mgen <bmeck stedwards.edu> wrote:
Could you post that file somewhere so we could get some even level testing

going for these algos?

Sure, if you know somewhere that would allow me to upload a (bzip -9) 300MB
file?

And are prepared to wait a while; I'm on dial-up :(
```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```mgen:
Could you post that file somewhere so we could get some even level testing
going for these algos?

It looks a bit too huge. Better to write a little program that generates
random data with a suitable distribution. To make it portable across
Phobos/Tango/etc. the RNG can be included in the code; there are very fast
ones around.

Bye,
bearophile
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 22:54:47 -0400, bearophile <bearophileHUGS lycos.com> wrote:
mgen:
Could you post that file somewhere so we could get some even level testing

going for these algos?

It looks a bit too much huge. Better to write a little program that generates

random data with a suitable distribution. To make it portable across
Phobos/Tango/etc the RND generator can be included into the code, there are
very
fast around.

Bye,
bearophile

You'd need the PRNG to produce the same sequence for the same seed
cross-platform (the Mersenne Twister, for example).

And you'd need to distribute (or point to) a common dictionary.
```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```Nigel Sandever:
For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with  your data), and 'b' on the second pass etc. At the end of
each
pass you retain only the best horizion so far discovered and discard the rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to
split the input data into bins. But you want to fill those bins uniformly. So
how can you do that? A good hash value is nearly random, so you can partition
its bits to split your data into more uniform groups :-) The downside is
that you have to compute the hash for each line many times. To speed this up
you can hash just a sample of the chars of each line...

Note that a 1 GB file is too small for a true benchmark, I have 2 GB RAM...
You don't need to load the lines as strings, you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because each line's end point is the end of the file or the
next newline, so you just need the starting points as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same memory whether your keys are 8 or 4 bytes long. So if
you see memory isn't enough to run your program, then you need a less
memory-hungry hash map. You can find plenty around, written in C; I have
translated one to D, it's easy. With that, if the lines aren't too short or
there are some duplicate lines, then the RAM (assuming 2 GB) suffices to
keep the whole associative array, so you need to load the file just one time.

Bye,
bearophile
```
Oct 25 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS lycos.com> wrote:
Nigel Sandever:
For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
applicable with  your data), and 'b' on the second pass etc. At the end of

each
pass you retain only the best horizion so far discovered and discard the

rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc,

to split the input data into bins. But you want to fill those bins uniformly.
So
how can do it? The hash number if good is nearly random, so you can partition
its bits to partition your data into more uniform groups :-) The downside is
that you have to compute the hash for each line many times. To speed this up
you
can hash just a sample of the chars of each line...

I did try that (using md5), but the penalty in Perl was horrible. The hash (AA)
just had to rehash the binary md5 anyway, so it was pure cost for my known
dataset.

For a real application with unknown/non-uniform data it would work like a charm
though ... assuming you don't hit that rarest of non-impossibilities: duplicate
md5s of non-identical inputs :)

That said, for most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain-specific
knowledge for total generality is often the best optimisation one can perform.

Note that 1 GB file is too much small for a true benchmark, I have 2 GB RAM...

Agreed and ditto. But provided algorithms are written to assume files larger
than memory, the numbers do scale linearly.

You don't need to load the lines as strings, you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because their ending point is the end of the file or the
successive newline, so you just need the starting points as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same memory if your keys are 8 or 4 bytes long. So if you
see
memory isn't enough to run your program, then you need to use a less memory-
hungry hash map. You can find plenty around, written in C, I have translated
one
to D, it's easy. With that if the lines aren't too much short or there are some
duplicate lines, then the RAM (assuming to use 2 GB) suffices to keep the whole
associative array, so you need to load the file just one time.

I agree that a 'testset producer' and a freely accessible known dicionary is a
better option.

I used (a slightly modified version of) 2of12inf available from

Bye,
bearophile

```
Oct 25 2008
bearophile <bearophileHUGS lycos.com> writes:
```Nigel Sandever:

I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D; it allows you to manage bits more efficiently.

md5 is too slow for this purpose; use the built-in hashing of strings.

That said. For most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain specific
knowledge for total generality is often the best optimisation one can perform.<

I agree, when you have gigabytes of data it's better to gain some knowledge
about the data before processing it.

But provided algorithms are written to assume files larger than memory, the
numbers do scale linearly.<

Several of the provided algorithms don't assume files larger than memory, or
they work much better when the file is smaller than memory.

I used (a slightly modified version of) 2of12inf available from<

That's quite a complex file, so I suggest something simpler, like this after a
cleaning of the non-ASCII words:
http://www.norvig.com/big.txt

Bye,
bearophile
```
Oct 26 2008
Nigel Sandever <nigelsandever btconnect.com> writes:
```On Sun, 26 Oct 2008 03:39:50 -0400, bearophile <bearophileHUGS lycos.com> wrote:
Nigel Sandever:

I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D, it allows you to manage bits more

efficiently.

Sorry. No disrespect meant to D. I always prototype in Perl and then convert to
C or D if I need performance. I'm just more familiar with Perl.

I used (a slightly modified version of) 2of12inf available from<

That's a quite complex file, so I suggest something simpler, as this after a

cleaning of the non ASCII words:
http://www.norvig.com/big.txt

I don't know what is "complex" about a 1 word per line, 81536 line dictionary
file?

Or how having everyone clean up Conan Doyle would be simpler?

If you have Perl, you can produce a suitable test file from any 1-word-per-line
dictionary with the command line:

perl -l12n0777aF/\n/ -ne'print $F[rand @F] for 1..4e8' yourdict >thedata

With the 2of12inf dictionary file, 4e8 produces a file a little under 4GB in
around 10 minutes. YMWV depending upon the average length of the lines in your
local dict.

Of course they won't all be the same as mine, or anyone else's, but given the
random nature, the results will be broadly comparable.

My D foo is too rusty to try to write that in D, especially for a D audience :)
I'm sure one of you guys can knock that up in the blink of an eye.

Bye,
bearophile

```
Oct 26 2008
Christopher Wright <dhasenan gmail.com> writes:
```Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This is
the farthest two lines in the file are; notice that "ipsum" also occurs
twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are as
important as always.

This is a task that's interesting and subtle in quite a number of ways,
and it allows for many creative solutions. Therefore I intentionally did
not restrict it much. Let ideas flow!

Andrei

Simplicity considering external memory constraints.
Read in the file, line by line. Output the line followed by the line
number. Sort the output (merge sort works wonderfully). Now you go
through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think
we already did sorting in external memory, so I could implement this
much quicker than most alternatives. Depending on the number of times
this has to be done, that might be a net advantage.
```
Nov 01 2008
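Christopher's sort-then-scan approach can be sketched like this (an in-memory `sorted` standing in for the external merge sort; names are mine). Sorting (line, line-number) pairs makes identical lines adjacent, with their original line numbers in ascending order, so one linear scan recovers the maximum distance:

```python
def horizon_sorted(lines):
    # tag lines with their numbers; sorting makes identical lines adjacent
    tagged = sorted((line, i) for i, line in enumerate(lines))
    best = 0
    prev_line, first_idx = None, -1
    for line, i in tagged:
        if line == prev_line:
            best = max(best, i - first_idx)  # first_idx is this line's minimum index
        else:
            prev_line, first_idx = line, i
    return best
```

The second scan is a single linear pass, so the whole cost is the O(n log n) sort, as the post says; swapping `sorted` for an external merge sort keeps memory bounded for files larger than RAM.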
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```Christopher Wright wrote:
Andrei Alexandrescu wrote:
Consider some text stream. We define the horizon of the stream as the
longest number of lines between two identical lines minus one, or zero
if the stream consists of unique lines.

For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
elit

This is because the word "corpus" occurs on line 5 and line 14. This
is the farthest two lines in the file are; notice that "ipsum" also
occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a
potentially large file. Of course theoretical and practical speed are
as important as always.

This is a task that's interesting and subtle in quite a number of
ways, and it allows for many creative solutions. Therefore I
intentionally did not restrict it much. Let ideas flow!

Andrei

Simplicity considering external memory constraints.
Read in the file, line by line. Output the line followed by the line
number. Sort the output (merge sort works wonderfully). Now you go
through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think
we already did sorting in external memory, so I could implement this
much quicker than most alternatives. Depending on the number of times
this has to be done, that might be a net advantage.

I like this solution a lot. Kudos!

Andrei
```
Nov 01 2008