
digitalmars.D.learn - Trying to reduce memory usage

Josh <moonburntm gmail.com> writes:
I'm trying to read in a text file that has many duplicated lines 
and output a file with all the duplicates removed. By the end of 
this code snippet, the memory usage is ~5x the size of the infile 
(which can be multiple GB each), and when this is in a loop the 
memory usage becomes unmanageable and often results in an 
OutOfMemory error or just a complete lock up of the system. Is 
there a way to reduce the memory usage of this code without 
sacrificing speed to any noticeable extent? My assumption is the 
.sort.uniq needs improving, but I can't think of an easier/not 
much slower way of doing it.

Windows 10 x64
LDC - the LLVM D compiler (1.21.0-beta1):
   based on DMD v2.091.0 and LLVM 10.0.0

-----------------------------------

auto filename = "path\\to\\file.txt.temp";
auto array = appender!(string[]);
File infile = File(filename, "r");
foreach (line; infile.byLine) {
   array ~= line.to!string;
}
File outfile = File(stripExtension(filename), "w");
foreach (element; (array[]).sort.uniq) {
   outfile.myrawWrite(element ~ "\n"); // used to not print the \r on windows
}
outfile.close;
array.clear;
array.shrinkTo(0);
infile.close;

-----------------------------------

Thanks.
Feb 11 2021
mw <mingwu gmail.com> writes:
On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
 I'm trying to read in a text file that has many duplicated 
 lines and output a file with all the duplicates removed.
If you only need to remove duplicates, keeping (and comparing) a string 
hash for each line is good enough. Memory usage should be just n integers 
(one hash per line).
Feb 11 2021
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 12, 2021 at 01:45:23AM +0000, mw via Digitalmars-d-learn wrote:
 On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
 I'm trying to read in a text file that has many duplicated lines and
 output a file with all the duplicates removed.
 If you only need to remove duplicates, keeping (and comparing) a string 
 hash for each line is good enough. Memory usage should be just n 
 integers (one hash per line).
[...]

+1. This can even be done on-the-fly: you don't even need to use .sort or 
.uniq. Just something like this:

	bool[size_t] hashes;
	foreach (line; stdin.byLine) {
		auto h = hashOf(line); // use a suitable hash function here
		if (h !in hashes) {
			outfile.writeln(line);
			hashes[h] = true;
		}
		// else this line already seen before; just skip it
	}

This turns the OP's O(n log n) algorithm into an O(n) algorithm, doesn't 
need to copy the entire content of the file into memory, and also uses 
much less memory by storing only hashes.


T

--
MASM = Mana Ada Sistem, Man!
Feb 11 2021
frame <frame86 live.com> writes:
On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:

 This turns the OP's O(n log n) algorithm into an O(n) 
 algorithm, doesn't
 need to copy the entire content of the file into memory, and 
 also uses
 much less memory by storing only hashes.
But this kind of hash may be insufficient to avoid hash collisions. For 
such big data, slower but stronger algorithms like SHA are advisable.

Also, associative arrays use the same weak algorithm, where you can run 
into collision issues. Thus using the hash of the string data as the key 
can be a problem. I always use a quick hash as the key but actually hold 
a collection of hashes under it and do a lookup to be on the safe side.
Feb 11 2021
frame <frame86 live.com> writes:
On Friday, 12 February 2021 at 07:23:12 UTC, frame wrote:
 On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:

 This turns the OP's O(n log n) algorithm into an O(n) 
 algorithm, doesn't
 need to copy the entire content of the file into memory, and 
 also uses
 much less memory by storing only hashes.
 But this kind of hash may be insufficient to avoid hash collisions. For 
 such big data, slower but stronger algorithms like SHA are advisable. 
 Also, associative arrays use the same weak algorithm, where you can run 
 into collision issues. Thus using the hash of the string data as the key 
 can be a problem. I always use a quick hash as the key but actually hold 
 a collection of hashes under it and do a lookup to be on the safe side.
Forgot to mention that this kind of solution needs a better approach if 
you don't want to miss a potentially different line: you can use a weak 
hash but track the line positions and count how often the same hash 
occurs as a pre-process. In the post-process you look at these lines 
again and compare whether they are really identical or just hash 
collisions, and correct the output accordingly.
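For illustration, a minimal sketch of this two-pass idea (an editor's 
sketch, not frame's actual code; the function name and file-based I/O are 
assumptions). The first pass counts how often each weak hash occurs; the 
second pass keeps the full text only of lines whose hash is shared, so a 
collision can never be mistaken for a duplicate:

import std.stdio : File;

void dedupTwoPass(string inName, string outName)
{
    uint[size_t] hashCount;

    // Pass 1: count how many lines share each (weak) druntime hash.
    foreach (line; File(inName, "r").byLine)
        ++hashCount[hashOf(line)];

    auto outfile = File(outName, "w");
    bool[string] emittedShared; // full text kept only for shared hashes

    // Pass 2: emit each distinct line exactly once.
    foreach (line; File(inName, "r").byLine)
    {
        if (hashCount[hashOf(line)] == 1)
        {
            // No other line in the file has this hash, so it cannot be a
            // duplicate or a collision; emit it without storing anything.
            outfile.writeln(line);
        }
        else if (line !in emittedShared)
        {
            // The hash is shared: compare by full text so a collision is
            // never mistaken for a duplicate.
            outfile.writeln(line);
            emittedShared[line.idup] = true;
        }
    }
}

The full line text is stored only when its hash is shared with another 
line; everything else costs one small associative-array entry per unique 
hash.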
Feb 11 2021
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 12, 2021 at 07:23:12AM +0000, frame via Digitalmars-d-learn wrote:
 On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:
 
 This turns the OP's O(n log n) algorithm into an O(n) algorithm,
 doesn't need to copy the entire content of the file into memory, and
 also uses much less memory by storing only hashes.
 But this kind of hash may be insufficient to avoid hash collisions. For 
 such big data, slower but stronger algorithms like SHA are advisable.
I used toHash merely as an example. Obviously, you should use a hash that works well with the input data you're trying to process (i.e., minimal chances of collision, not too slow to compute, etc.). SHA hashes are probably a safe bet, as chances of collision are negligible.
 Also, associative arrays use the same weak algorithm, where you can run 
 into collision issues. Thus using the hash of the string data as the key 
 can be a problem. I always use a quick hash as the key but actually hold 
 a collection of hashes under it and do a lookup to be on the safe side.
[...]

You can use a struct wrapper that implements its own toHash method.


T

--
Mediocrity has been pushed to extremes.
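For illustration, a minimal sketch of such a wrapper (an editor's sketch, 
not code from this thread; SHA-1 and the LineDigest name are arbitrary 
choices). Only the strong digest of each line is stored; toHash hands 
part of it to the associative array for bucketing, while opEquals 
compares the whole digest, so weak-hash collisions cannot merge distinct 
lines:

import std.digest.sha : sha1Of;
import std.stdio : stdin, stdout;

// Hypothetical wrapper: stores the full 20-byte SHA-1 of a line instead
// of the line itself.
struct LineDigest
{
    ubyte[20] full;

    this(const(char)[] line)
    {
        full = sha1Of(line);
    }

    size_t toHash() const @safe pure nothrow @nogc
    {
        // Fold the first bytes of the digest into the AA bucket hash.
        size_t h;
        foreach (i; 0 .. size_t.sizeof)
            h |= cast(size_t) full[i] << (8 * i);
        return h;
    }

    bool opEquals(const LineDigest rhs) const @safe pure nothrow @nogc
    {
        // Equality is decided by the full digest, not the bucket hash.
        return full == rhs.full;
    }
}

void main()
{
    bool[LineDigest] seen;
    foreach (line; stdin.byLine)
    {
        auto key = LineDigest(line);
        if (key !in seen)
        {
            stdout.writeln(line);
            seen[key] = true;
        }
    }
}

Each unique line then costs a fixed 20 bytes plus AA overhead regardless 
of its length, and the chance of two different lines sharing a full SHA-1 
digest is negligible.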
Feb 12 2021
Ali Çehreli <acehreli yahoo.com> writes:
On 2/11/21 6:22 PM, H. S. Teoh wrote:

 	bool[size_t] hashes;
I would start with an even simpler solution until it's proven that there 
still is a memory issue:

import std.stdio;

void main() {
    bool[string] lines;

    foreach (line; stdin.byLine) {
        if (line !in lines) {
            stdout.writeln(line);
            lines[line.idup] = true;
        }
        // else this line already seen before; just skip it
    }
}

(Grr... Thanks for the tab characters! :p)

Ali
Feb 12 2021
Daniel N <no public.email> writes:
On Saturday, 13 February 2021 at 04:19:17 UTC, Ali Çehreli wrote:
 On 2/11/21 6:22 PM, H. S. Teoh wrote:

 	bool[size_t] hashes;
 I would start with an even simpler solution until it's proven that there 
 still is a memory issue:

 import std.stdio;

 void main() {
     bool[string] lines;

     foreach (line; stdin.byLine) {
         if (line !in lines) {
             stdout.writeln(line);
             lines[line.idup] = true;
         }
         // else this line already seen before; just skip it
     }
 }

 (Grr... Thanks for the tab characters! :p)

 Ali
Try this Boost Licensed tool. https://github.com/eBay/tsv-utils/tree/master/tsv-uniq
Feb 13 2021
tsbockman <thomas.bockman gmail.com> writes:
On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
 I'm trying to read in a text file that has many duplicated 
 lines and output a file with all the duplicates removed. By the 
 end of this code snippet, the memory usage is ~5x the size of 
 the infile (which can be multiple GB each), and when this is in 
 a loop the memory usage becomes unmanageable and often results 
 in an OutOfMemory error or just a complete lock up of the 
 system. Is there a way to reduce the memory usage of this code 
 without sacrificing speed to any noticeable extent? My 
 assumption is the .sort.uniq needs improving, but I can't think 
 of an easier/not much slower way of doing it.
I spent some time experimenting with this problem, and here is the best 
solution I found, assuming that perfect de-duplication is required. (I'll 
put the code up on GitHub / dub if anyone wants to have a look.)

--------------------------

0) Memory map the input file, so that the program can pass around slices 
to it directly without making copies. This also allows the OS to page it 
in and out of physical memory for us, even if it is too large to fit all 
at once.

1) Pre-compute the required space for all large data structures, even if 
an additional pass is required to do so. This makes the rest of the 
algorithm significantly more efficient with memory, time, and lines of 
code.

2) Do a top-level bucket sort of the file using a small (8-16 bit) hash 
into some scratch space. The target can be either in RAM, or in another 
memory-mapped file if we really need to minimize physical memory use. The 
small hash can be a few bits taken off the top of a larger hash (I used 
std.digest.murmurhash). The larger hash is cached for use later on, to 
accelerate string comparisons, avoid unnecessary I/O, and perhaps do 
another level of bucket sort.

If there is too much data to put in physical memory all at once, be sure 
to copy the full text of each line into a region of the scratch file 
where it will be together with the other lines that share the same small 
hash. This is critical, as otherwise the string comparisons in the next 
step turn into slow random I/O.

3) For each bucket, sort, filter out duplicates, and write to the output 
file. Any sorting algorithm(s) may be used if all associated data fits in 
physical memory. If not, use a merge sort, whose access patterns won't 
thrash the disk too badly.

4) Manually release all large data structures, and delete the scratch 
file, if one was used. This is not difficult to do, since their life 
times are well-defined, and ensures that the program won't hang on to GiB 
of space any longer than necessary.

--------------------------

I wrote an optimized implementation of this algorithm. It's fast, 
efficient, and really does work on files too large for physical memory. 
However, it is complicated at almost 800 lines.

On files small enough to fit in RAM, it is similar in speed to the other 
solutions posted, but less memory hungry. Memory consumption in this case 
is around (sourceFile.length + 32 * lineCount * 3 / 2) bytes. Run time is 
similar to other posted solutions: about 3 seconds per GiB on my desktop.

When using a memory-mapped scratch file to accommodate huge files, the 
physical memory required is around max(largestBucket.data.length + 32 * 
largestBucket.lineCount * 3 / 2, bucketCount * writeBufferSize) bytes. 
(Virtual address space consumption is far higher, and the OS will commit 
however much physical memory is available and not needed by other tasks.) 
The run time is however long it takes the disk to read the source file 
twice, write a (sourceFile.length + 32 * lineCount * 3 / 2) byte scratch 
file, read back the scratch file, and write the destination file.

I tried it with a 38.8 GiB, 380_000_000 line file on a magnetic hard 
drive. It needed a 50.2 GiB scratch file and took about an hour (after 
much optimization and many bug fixes).
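For illustration only, here is a heavily simplified, in-memory sketch of 
steps 0, 2 and 3 (an editor's sketch, not the ~800-line implementation 
described above; it omits the size pre-pass, the scratch file, and the 
external merge sort, so it still needs the whole file to fit in RAM):

import std.algorithm : sort, uniq;
import std.digest : digest;
import std.digest.murmurhash : MurmurHash3;
import std.mmfile : MmFile;
import std.stdio : File;
import std.string : lineSplitter;

struct Line
{
    ulong hash;           // cached hash, compared before the text itself
    const(char)[] text;   // slice into the memory-mapped input
}

void dedupBucketed(string inName, string outName)
{
    // Step 0: memory-map the input so lines are slices, not copies.
    auto mm = new MmFile(inName);
    auto data = cast(const(char)[]) mm[];

    // Step 2 (simplified): bucket lines by the top 8 bits of a murmur hash.
    Line[][256] buckets;
    foreach (line; data.lineSplitter)
    {
        ubyte[16] d = digest!(MurmurHash3!(128, 64))(line);
        ulong h;
        foreach (i; 0 .. 8)
            h = (h << 8) | d[i];
        buckets[h >> 56] ~= Line(h, line);
    }

    // Step 3: sort each bucket (hash first, text as tie-breaker) and emit
    // only the unique lines.
    auto outfile = File(outName, "w");
    foreach (ref bucket; buckets)
    {
        bucket.sort!((a, b) =>
            a.hash != b.hash ? a.hash < b.hash : a.text < b.text);
        foreach (line; bucket.uniq!((a, b) =>
            a.hash == b.hash && a.text == b.text))
            outfile.writeln(line.text);
    }
}

Caching the 64-bit hash next to each slice means most comparisons during 
the sort never have to touch the mapped text, which is the same reason 
the full implementation caches the larger hash.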
Feb 16 2021
tsbockman <thomas.bockman gmail.com> writes:
On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
 On files small enough to fit in RAM, it is similar in speed to 
 the other solutions posted, but less memory hungry. Memory 
 consumption in this case is around (sourceFile.length + 32 * 
 lineCount * 3 / 2) bytes. Run time is similar to other posted 
 solutions: about 3 seconds per GiB on my desktop.
Oops, I think the memory consumption should be (sourceFile.length + 32 * (lineCount + largestBucket.lineCount / 2)) bytes. (In the limit where everything ends up in one bucket, it's the same, but that shouldn't normally happen unless the entire file has only one unique line in it.)
Feb 16 2021
Jon Degenhardt <jond noreply.com> writes:
On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
 I spent some time experimenting with this problem, and here is 
 the best solution I found, assuming that perfect de-duplication 
 is required. (I'll put the code up on GitHub / dub if anyone 
 wants to have a look.)
It would be interesting to see how the performance compares to tsv-uniq 
(https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt 
binaries turn on all the optimizations 
(https://github.com/eBay/tsv-utils/releases).

tsv-uniq wasn't included in the various comparative benchmarks I 
published, but I did run my own benchmarks and it holds up well. However, 
it should not be hard to beat it. What might be more interesting is what 
the delta is.

tsv-uniq uses the most straightforward approach of popping things into an 
associative array. No custom data structures. Enough memory is required 
to hold all the unique keys in memory, so it won't handle arbitrarily 
large data sets. It would be interesting to see how the straightforward 
approach compares with the more highly tuned approach.

--Jon
Feb 18 2021
tsbockman <thomas.bockman gmail.com> writes:
On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
 On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
 I spent some time experimenting with this problem, and here is 
 the best solution I found, assuming that perfect 
 de-duplication is required. (I'll put the code up on GitHub / 
 dub if anyone wants to have a look.)
It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).
My program (called line-dedup below) is modestly faster than yours, with 
the gap gradually widening as files get bigger. Similarly, when not using 
a memory-mapped scratch file, my program is modestly less memory hungry 
than yours, with the gap gradually widening as files get bigger.

In neither case is the difference very exciting though; the real benefit 
of my algorithm is that it can process files too large for physical 
memory. It might also handle frequent hash collisions better, and could 
be upgraded to handle huge numbers of very short lines efficiently.

Test 0 (139 MiB, 1537300 unique lines of 2000004):
    sort source | uniq > dest
        Wall time: 2.04 s
        User time: 9.34 s
        System time: 0.23 s
        Max resident set: 385 MiB

    tsv-uniq source > dest
        Wall time: 0.82 s
        User time: 0.73 s
        System time: 0.13 s
        Max resident set: 229 MiB

    line-dedup source dest
        Wall time: 0.58 s
        User time: 0.52 s
        System time: 0.05 s
        Max resident set: 206 MiB

Test 1 (1.4 GiB, 14338280 unique lines of 20000004):
    sort source | uniq > dest
        Wall time: 22.9 s
        User time: 107 s
        System time: 2.16 s
        Max resident set: 3.76 GiB

    tsv-uniq source > dest
        Wall time: 9.94 s
        User time: 9.25 s
        System time: 1.3 s
        Max resident set: 2.34 GiB

    line-dedup source dest
        Wall time: 6.34 s
        User time: 5.88 s
        System time: 0.44 s
        Max resident set: 1.98 GiB

Test 2 (4.2 GiB, 44379959 unique lines of 60000004):
    sort source | uniq > dest
        Wall time: 87.1 s
        User time: 352.3 s
        System time: 7.41 s
        Max resident set: 7.84 GiB

    tsv-uniq source > dest
        Wall time: 42.3 s
        User time: 40.7 s
        System time: 4.82 s
        Max resident set: 7.86 GiB

    line-dedup source dest
        Wall time: 23 s
        User time: 19.6 s
        System time: 1.4 s
        Max resident set: 5.96 GiB

The test files have a fractal distribution, with many lines having a few 
duplicates, and a few lines having many duplicates.
Feb 22 2021
Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
 On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt 
 wrote:
 It would be interesting to see how the performance compares to 
 tsv-uniq 
 (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The 
 prebuilt binaries turn on all the optimizations 
 (https://github.com/eBay/tsv-utils/releases).
My program (called line-dedup below) is modestly faster than yours, with the gap gradually widening as files get bigger. Similarly, when not using a memory-mapped scratch file, my program is modestly less memory hungry than yours, with the gap gradually widening as files get bigger. In neither case is the difference very exciting though; the real benefit of my algorithm is that it can process files too large for physical memory. It might also handle frequent hash collisions better, and could be upgraded to handle huge numbers of very short lines efficiently.
Thanks for running the comparison! I appreciate seeing how other 
implementations compare. I'd characterize the results a bit differently 
though. Based on the numbers, line-dedup is materially faster than 
tsv-uniq, at least on the tests run. To your point, it may not make much 
practical difference on data sets that fit in memory. tsv-uniq is fast 
enough for most needs. But it's still a material performance delta. Nice 
job!

I agree also that the bigger pragmatic benefit is fast processing of 
files much larger than will fit in memory. There are other useful 
problems like this. One I often need is creating a random weighted 
ordering. Easy to do for data sets that fit in memory, but hard to do 
fast for data sets that do not.

--Jon
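For the in-memory case, one standard technique (an editor's sketch, under 
the assumption that "random weighted ordering" means a weighted random 
shuffle) is the Efraimidis-Spirakis keying trick: give each record the 
key uniform01() ^ (1 / weight) and sort by that key in descending order, 
so heavier records tend to come out earlier:

import std.algorithm : sort;
import std.math : pow;
import std.random : uniform01;
import std.stdio : writeln;
import std.typecons : Tuple, tuple;

void main()
{
    // Hypothetical records: (value, weight) pairs.
    auto records = [tuple("a", 1.0), tuple("b", 5.0), tuple("c", 0.5)];

    // Assign each record the key uniform01() ^ (1 / weight).
    auto keyed = new Tuple!(double, string)[](records.length);
    foreach (i, r; records)
        keyed[i] = tuple(pow(uniform01(), 1.0 / r[1]), r[0]);

    // Sort by key, highest first, to get the weighted random ordering.
    keyed.sort!((x, y) => x[0] > y[0]);

    foreach (k; keyed)
        writeln(k[1], " (key ", k[0], ")");
}

The hard part alluded to above is doing this when the keyed records no 
longer fit in memory, since the sort then has to go external.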
Feb 22 2021