digitalmars.D.learn - Phobos function to check if files are identical?

XavierAP <n3minis-git yahoo.es> writes:
It's not easy to do by hand of course, but I was wondering if 
there was one simple function taking two file names and just 
returning a bool or something like that. I haven't found it in 
std.file.

If such a function doesn't exist in Phobos but there's a good 
implementation in some other library, I'm interested to know. 
Although this time it's for a unit test so I'd rather implement 
it in two lines than add a dependency.

And otherwise to write it by hand, how do you think is the best 
way? And in terms of performance? By chunks in case of a binary 
comparison? And what about the case of a text comparison?

Thanks
Mar 13
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Mon, Mar 13, 2017 at 04:50:49PM +0000, XavierAP via Digitalmars-d-learn
wrote:
 It's not easy to do by hand of course, but I was wondering if there
 was one simple function taking two file names and just returning a
 bool or something like that. I haven't found it in std.file.
Why it is not easy to do by hand? All you have to do is open the two files, then iterate over their data and compare. Of course, you'd want to chunk them up to minimize I/O roundtrips, but it'd be something along the lines of:

	bool isEqual(string filename1, string filename2) {
		import std.algorithm.comparison : equal;
		import std.range : zip;
		import std.stdio : File, chunks;

		auto f1 = File(filename1);
		auto f2 = File(filename2);

		size_t blockSize = 4096; // or something similar

		return f1.chunks(blockSize).equal(f2.chunks(blockSize));
	}

 If such a function doesn't exist in Phobos but there's a good
 implementation in some other library, I'm interested to know. Although
 this time it's for a unit test so I'd rather implement it in two lines
 than add a dependency.
 
 And otherwise to write it by hand, how do you think is the best way?
 And in terms of performance? By chunks in case of a binary comparison?
 And what about the case of a text comparison?
[...]

Binary comparison is easy. Just read the files by fixed-sized chunks and compare them.

Text comparison is a can of worms. What kind of comparison do you have in mind? Case-sensitive or insensitive? Is it ASCII or Unicode? What kind of Unicode encoding is involved? Are the files expected to be in one of the normalization forms, or is it free-for-all? Do you expect grapheme equivalence or just character equivalence? Depending on the answer to these questions, text comparison can range from trivial to hair-raisingly complex.

But the fundamental ideas remain the same: read a stream of characters / graphemes from each file in tandem, preferably buffered, and compare them using some given comparison function.

P.S. I just realized that std.stdio.chunks() doesn't return a range. Bah. File an enhancement request. I might even submit a PR for it. ;-)


T

-- 
There are four kinds of lies: lies, damn lies, and statistics.
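
The "read in tandem and compare using some given comparison function" idea for text could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the names `textFilesEqual`, `equalIgnoringCase` and `equalNormalized` are hypothetical, both files are assumed to be valid UTF-8, and the comparison is done line by line with `File.byLine`, which strips the "\n" terminator (so a missing final newline is ignored and a "\r" from Windows line endings is not).

```d
import std.algorithm.comparison : equal;
import std.stdio : File;
import std.uni : icmp, normalize, NFC;

// Hypothetical helper: compares two text files line by line, using `pred`
// to decide whether a pair of lines counts as equal.
bool textFilesEqual(alias pred)(string path1, string path2)
{
    auto f1 = File(path1);
    auto f2 = File(path2);
    // byLine reuses its internal buffer, which is fine here because each
    // pair of lines is compared immediately and never stored.
    return f1.byLine.equal!pred(f2.byLine);
}

// Hypothetical predicate: case-insensitive comparison via std.uni.icmp.
alias equalIgnoringCase = (a, b) => icmp(a, b) == 0;

// Hypothetical predicate: compare after normalizing both lines to NFC.
alias equalNormalized = (a, b) => normalize!NFC(a) == normalize!NFC(b);
```

A call would then look like `textFilesEqual!equalIgnoringCase("a.txt", "b.txt")`. Grapheme-level or locale-aware comparison would need a different predicate and is not covered by this sketch.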
Mar 13
XavierAP <n3minis-git yahoo.es> writes:
On Monday, 13 March 2017 at 17:47:09 UTC, H. S. Teoh wrote:
 Why it is not easy to do by hand?
Sorry typo, I had intended to type "I know it is easy"
Mar 13
XavierAP <n3minis-git yahoo.es> writes:
On Monday, 13 March 2017 at 17:47:09 UTC, H. S. Teoh wrote:
 Binary comparison is easy. Just read the files by fixed-sized 
 chunks and compare them.
Follow up question... What is the best @safe way? Since File.byChunk() is @system. Just out of curiosity, I would rather use it and flag my code @trusted, although I guess there could be concurrency issues I have to take into account anyway... anything else?
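
One way the "use it and flag my code @trusted" idea might look is sketched below. It is not a definitive answer to the question: the names `sameBytes` and `checkedCompare` are hypothetical, it takes the post's statement that `File.byChunk` is @system as an assumption, and it does nothing about the files being modified concurrently while they are read.

```d
import std.algorithm.comparison : equal;
import std.stdio : File;

// Hypothetical helper. Marked @trusted on the assumption (taken from the
// post above) that File.byChunk is @system. @trusted is a promise by the
// programmer that the body is memory-safe; the compiler does not check it.
bool sameBytes(string path1, string path2) @trusted
{
    enum bufSize = 4096;
    auto f1 = File(path1);
    auto f2 = File(path2);
    // Both files are opened and consumed locally, so no buffer handed out
    // by byChunk escapes this function.
    return f1.byChunk(bufSize).equal(f2.byChunk(bufSize));
}

// A @safe caller can now use the wrapper.
bool checkedCompare(string a, string b) @safe
{
    return sameBytes(a, b);
}
```

Keeping the @trusted function this small limits how much code has to be reviewed by hand for memory safety.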
Mar 13
Andrea Fontana <nospam example.com> writes:
On Monday, 13 March 2017 at 17:47:09 UTC, H. S. Teoh wrote:
 	bool isEqual(string filename1, string filename2) {
 		import std.algorithm.comparison : equal;
 		import std.range : zip;
 		import std.stdio : File, chunks;

 		auto f1 = File(filename1);
 		auto f2 = File(filename2);

 		size_t blockSize = 4096; // or something similar

 		return f1.chunks(blockSize).equal(f2.chunks(blockSize));
 	}
First I would check if the files have different size or if they are the same file (same path, symlink, etc).
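
A minimal sketch of those pre-checks, assuming `std.file.getSize` for the size test and a plain normalized-path comparison for the "same file" test. The name `isEqualWithPrecheck` is hypothetical, and the path comparison does not resolve symlinks or hard links, so it only catches the simplest case.

```d
import std.algorithm.comparison : equal;
import std.file : getSize;
import std.path : absolutePath, buildNormalizedPath;
import std.stdio : File;

// Hypothetical helper: cheap checks first, byte comparison only if needed.
bool isEqualWithPrecheck(string path1, string path2)
{
    // getSize throws a FileException if either file is missing.
    immutable size1 = getSize(path1);
    immutable size2 = getSize(path2);

    // Different sizes: the files cannot be identical, no need to read data.
    if (size1 != size2)
        return false;

    // Same path after normalization: trivially identical.
    // Symlinks and hard links are NOT resolved here.
    if (buildNormalizedPath(absolutePath(path1)) ==
        buildNormalizedPath(absolutePath(path2)))
        return true;

    enum bufSize = 4096;
    return File(path1).byChunk(bufSize).equal(File(path2).byChunk(bufSize));
}
```

The size check is the one that pays off most often, since it rejects most non-identical files without reading any data.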
Mar 14
XavierAP <n3minis-git yahoo.es> writes:
On Tuesday, 14 March 2017 at 08:12:16 UTC, Andrea Fontana wrote:
 First I would check if the files have different size or if they 
 are the same file (same path, symlink, etc).
Good idea. Good reason to have it in std.file. There might also be platform dependent shortcuts?
Mar 14
flamencofantasy <flamencofantasy gmail.com> writes:
On Tuesday, 14 March 2017 at 08:31:20 UTC, XavierAP wrote:
 On Tuesday, 14 March 2017 at 08:12:16 UTC, Andrea Fontana wrote:
 First I would check if the files have different size or if 
 they are the same file (same path, symlink, etc).
 Good idea. Good reason to have it in std.file. There might also be platform dependent shortcuts?

	import std.mmfile;

	auto f1 = new MmFile("file1");
	auto f2 = new MmFile("file2");

	return f1[] == f2[];
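
Wrapped into a complete function, the memory-mapped approach could look something like the sketch below. The name `mappedEqual` is hypothetical, and the explicit `destroy` calls and the cast to bytes are additions to the original snippet. How `MmFile` behaves for zero-length files is not covered here and should be checked before relying on this.

```d
import std.mmfile : MmFile;

// Hypothetical wrapper around the memory-mapped comparison.
bool mappedEqual(string path1, string path2)
{
    // MmFile(filename) maps the whole file for reading.
    auto f1 = new MmFile(path1);
    scope(exit) destroy(f1); // unmap promptly instead of waiting for the GC
    auto f2 = new MmFile(path2);
    scope(exit) destroy(f2);

    // opSlice() returns void[]; casting to bytes makes the element-wise
    // comparison explicit. Slices of different length compare unequal.
    return cast(const(ubyte)[]) f1[] == cast(const(ubyte)[]) f2[];
}
```

The operating system pages the data in on demand, so no explicit buffer size has to be chosen.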
Mar 14
XavierAP <n3minis-git yahoo.es> writes:
On Tuesday, 14 March 2017 at 18:26:52 UTC, flamencofantasy wrote:
   import std.mmfile;

   auto f1 = new MmFile("file1");
   auto f2 = new MmFile("file2");

   return f1[] == f2[];
Nice! I don't have experience with memory-mapped files. What are the pros and cons?
Mar 14
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
P.P.S.  It's not overly hard to write an alternative version of
std.stdio.chunks that returns a real range. Something like this should
do:

	// Warning: untested code
	auto realChunks(File f, size_t blockSize)
	{
		static struct Result
		{
			private File f;
			private ubyte[] buffer;
			bool empty = true;
			ubyte[] front;

			this(File _f, size_t blockSize)
			{
				f = _f;
				buffer.length = blockSize;
				empty = false;
				popFront();
			}
			void popFront()
			{
				front = f.rawRead(buffer);
				if (front.length == 0)
					empty = true;
			}
		}
		return Result(f, blockSize);
	}


T

-- 
A program should be written to model the concepts of the task it performs
rather than the physical world or a process because this maximizes the
potential for it to be applied to tasks that are conceptually similar and, more
important, to tasks that have not yet been conceived. -- Michael B. Allen
Mar 13
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Mon, Mar 13, 2017 at 10:47:09AM -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
[...]
 P.S. I just realized that std.stdio.chunks() doesn't return a range.
 Bah. File an enhancement request. I might even submit a PR for it. ;-)
[...]
 P.P.S.  It's not overly hard to write an alternative version of
 std.stdio.chunks that returns a real range.
[...]

Bah, I'm an idiot. Just use File.byChunk instead of .chunks. Here's a fully-working example:

	import std.stdio;

	bool equal(File f1, File f2)
	{
		import std.algorithm.comparison : equal;
		enum bufSize = 4096;
		return f1.byChunk(bufSize).equal(f2.byChunk(bufSize));
	}

	int main(string[] args)
	{
		if (args.length < 3)
		{
			stderr.writeln("Please specify filenames");
			return 1;
		}

		if (equal(File(args[1]), File(args[2])))
			writeln("Files are identical.");
		else
			writeln("Files differ.");
		return 0;
	}


T

-- 
INTEL = Only half of "intelligence".
Mar 13