digitalmars.D.learn - File size

reply harakim <harakim gmail.com> writes:
I have been doing some backups and I wrote a utility that 
determines if files are an exact match. As a shortcut, I check 
the file size. So far so good on this with millions of files 
until I found something odd: getSize() and DirEntry's .size are 
producing different values.

This is the relevant code:
```
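	// sourceFile is a DirEntry; .size is the size recorded during directory iteration.
	if (sourceFile.size != getSize(destinationFilename)) {
		// Query the source size again, this time by name, to cross-check the DirEntry value.
		if (getSize(sourceFile.name) != getSize(destinationFilename))
			writeln("Also did not match");
		else
			writeln("Did match so this is odd");

		return ArchivalStatus.SizeDidNotMatch;
	}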
```

Whereas before it just returned SizeDidNotMatch, now it also 
prints "Did match so this is odd".

It seems really odd that getSize(sourceFile.name) is returning a 
different number than sourceFile.size. The files are on an 
external HDD formatted as NTFS, which I am reading from Windows. 
I believe I originally wrote the files to the file system in 
Windows, but then today I cut and pasted them (within the same 
drive) in Linux. However, this is the first time this has 
happened after millions of comparisons, and it only happened for 
about 6 files. It does happen consistently, though.

I have verified that the actual file size is the one reported by 
getSize, not sourceFile.size, and that the files open correctly.

This is my compiler version:
DMD32 D Compiler v2.104.2-dirty

If this is actually a problem and I'm not missing something, I 
would not mind trying to fix this whenever I have some time.
Aug 21 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Monday, 21 August 2023 at 07:52:28 UTC, harakim wrote:
 I have been doing some backups and I wrote a utility that 
 determines if files are an exact match. As a shortcut, I check 
 the file size. So far so good on this with millions of files 
 until I found something odd: getSize() and DirEntry's .size are 
 producing different values.
 
 ...
 
 It seems really odd that getSize(sourceFile.name) is returning 
 a different number than sourceFile.size. The files are on an 
 external HDD formatted as NTFS, which I am reading from Windows. 
 I believe I originally wrote the files to the file system in 
 Windows, but then today I cut and pasted them (within the same 
 drive) in Linux. However, this is the first time this has 
 happened after millions of comparisons, and it only happened 
 for about 6 files. It does happen consistently, though.

 I have verified that the actual file size is the one reported by 
 getSize, not sourceFile.size, and that the files open correctly.

 ...
Can you print some of the wrong sizes? D's DirEntry iteration code just calls `FindFirstFileW`/`FindNextFileW`, so this *shouldn't* be a D-specific issue, and it should be possible to reproduce this in C.
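
For anyone wanting to collect such a list, here is a minimal sketch, assuming the goal is simply to compare the two size queries for every regular file under a directory. `SizeMismatch` and `findSizeMismatches` are hypothetical names made up for this illustration, not part of Phobos; only `dirEntries`, `DirEntry.size`, `DirEntry.isFile` and `getSize` come from `std.file`.

```d
import std.file : dirEntries, getSize, SpanMode;

/// One file for which the two size queries disagree (hypothetical helper type).
struct SizeMismatch
{
    string path;
    ulong listedSize;  // DirEntry.size, as recorded during directory iteration
    ulong queriedSize; // getSize(path), looked up again by name
}

/// Walks `root` recursively and collects every regular file whose
/// DirEntry.size differs from getSize called on the same path.
SizeMismatch[] findSizeMismatches(string root)
{
    SizeMismatch[] result;
    foreach (entry; dirEntries(root, SpanMode.depth))
    {
        if (!entry.isFile)
            continue; // skip directories and other non-regular entries

        immutable queried = getSize(entry.name);
        if (entry.size != queried)
            result ~= SizeMismatch(entry.name, entry.size, queried);
    }
    return result;
}
```

Passing each element of the returned array to `writeln` prints the path together with both sizes, which should be enough to compare against what a C program using the same Windows calls reports.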
Aug 21 2023
next sibling parent harakim <harakim gmail.com> writes:
On Monday, 21 August 2023 at 11:05:36 UTC, FeepingCreature wrote:
 Can you print some of the wrong sizes? D's DirEntry iteration 
 code just calls `FindFirstFileW`/`FindNextFileW`, so this 
 *shouldn't* be a D-specific issue, and it should be possible to 
 reproduce this in C.
Yes! I will get that information tomorrow.
Aug 21 2023
prev sibling parent reply harakim <harakim gmail.com> writes:
On Monday, 21 August 2023 at 11:05:36 UTC, FeepingCreature wrote:
 Can you print some of the wrong sizes? D's DirEntry iteration 
 code just calls `FindFirstFileW`/`FindNextFileW`, so this 
 *shouldn't* be a D-specific issue, and it should be possible to 
 reproduce this in C.
Thanks for the suggestion. I was working on getting the list for you when I decided to first try to reproduce this on Linux. I was not able to do so.

Then I opened the Linux file explorer and went to one of the files. There were two files with that name, differing only by case. In Windows, I only saw one, because Windows Explorer only supports one file per case-insensitive name in a directory. Unsurprisingly, that is also the one that was selected by getSize(filename). The underlying Windows functions must ignore case as well and select the same way as Explorer (which makes sense).

That explains why Windows Explorer reported the same size as getSize(name) in every case, while DirEntry.size would match for the file whose case Windows recognized and not for the file with a different case. I was able to get into this state because I copied the files (merged directories) in Linux.

It was interesting to look into. It seems everything is working as designed. It shouldn't be an issue for me going forward either, as I move more and more towards Linux.
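
In case it helps anyone else with a drive in this state, here is a rough sketch of how the colliding names could be found, assuming it is run on a system that lists both entries (Linux did for me). `findCaseCollisions` is a hypothetical helper name, and `std.uni.toLower` is only an approximation of how NTFS folds case, so treat the result as a hint rather than a definitive answer.

```d
import std.file : dirEntries, SpanMode;
import std.path : baseName;
import std.uni : toLower;

/// Groups the entries directly inside `dir` by lower-cased file name and
/// returns only the groups that contain more than one entry, i.e. names
/// that differ only by case. The keys are the lower-cased names.
string[][string] findCaseCollisions(string dir)
{
    string[][string] byLowerName;
    foreach (entry; dirEntries(dir, SpanMode.shallow))
        byLowerName[entry.name.baseName.toLower] ~= entry.name;

    string[][string] collisions;
    foreach (key, names; byLowerName)
    {
        if (names.length > 1)
            collisions[key] = names;
    }
    return collisions;
}
```

Each value in the returned associative array lists the full paths that a case-insensitive lookup would treat as the same file, so those are the ones where getSize(name) and DirEntry.size can disagree on Windows.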
Aug 22 2023
parent reply FeepingCreature <feepingcreature gmail.com> writes:
On Tuesday, 22 August 2023 at 16:22:52 UTC, harakim wrote:
 On Monday, 21 August 2023 at 11:05:36 UTC, FeepingCreature 
 wrote:
 Can you print some of the wrong sizes? D's DirEntry iteration 
 code just calls `FindFirstFileW`/`FindNextFileW`, so this 
 *shouldn't* be a D-specific issue, and it should be possible 
 to reproduce this in C.
 Thanks for the suggestion. I was working on getting the list for 
 you when I decided to first try to reproduce this on Linux. I 
 was not able to do so.

 Then I opened the Linux file explorer and went to one of the 
 files. There were two files with that name, differing only by 
 case. In Windows, I only saw one, because Windows Explorer only 
 supports one file per case-insensitive name in a directory. 
 Unsurprisingly, that is also the one that was selected by 
 getSize(filename). The underlying Windows functions must ignore 
 case as well and select the same way as Explorer (which makes 
 sense).

 That explains why Windows Explorer reported the same size as 
 getSize(name) in every case, while DirEntry.size would match for 
 the file whose case Windows recognized and not for the file with 
 a different case. I was able to get into this state because I 
 copied the files (merged directories) in Linux.

 It was interesting to look into. It seems everything is working 
 as designed. It shouldn't be an issue for me going forward 
 either, as I move more and more towards Linux.
That's hilarious! I'm happy you found it.
Aug 23 2023
parent harakim <harakim gmail.com> writes:
On Wednesday, 23 August 2023 at 08:48:26 UTC, FeepingCreature 
wrote:
 That's hilarious! I'm happy you found it.
Me too! Thanks for the support. (PS I've already reformatted that drive to ext4.)
Aug 24 2023