Sometimes it's desirable to de-dupe a large collection of data files.
I wrote a tool called lsdupes to do that.
Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.
To optimise it a bit, I put in a size check: a first pass looks at the size of each file, and md5sums are only computed when there are multiple files of a given size. I thought this would reduce the md5sum load a bit.
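For the record, a minimal sketch of that two-pass idea in Python (this is not the actual lsdupes code; the function and variable names are made up):

import hashlib
import os
from collections import defaultdict

def find_dupes(root):
    # first pass: group files by size (cheap, one stat per file)
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # second pass: md5 only the files whose size collides with another file
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, so it can't be a duplicate
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    # return groups of two or more paths with identical content
    return [paths for paths in by_hash.values() if len(paths) > 1]

(For a real tool you'd hash in chunks rather than reading whole video files into memory, but the structure is the same.)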
It turns out that in my collection of 44590 photos and videos, size is a perfect hash: there are no two files with the same size but different content.
So while md5sum does get run on the candidate dupes, it finds nothing the size check hadn't already found; and the size check runs a lot faster: walking the tree and getting sizes takes around 15 seconds, while taking md5sums of the roughly 5000 files with non-unique sizes takes around 7 minutes.
2.5 years too late for you, but fdupes does exactly that.
Yeah, deleting duplicate files is a microcosm of "let's reimplement this"...
Though really what I thought was cool was the utility of size-as-hash: everything I read about hashes is about making more secure, more collision-resistant hashes that are cheap to compute, rather than uber-cheap, very shitty hashes.
A different very cheap hash might be "checksum of the first 4096 bytes of the file".
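A hypothetical sketch of that (again, not real lsdupes code), using crc32 over the first 4096 bytes as the deliberately weak pre-filter:

import zlib

def cheap_hash(path, nbytes=4096):
    # crc32 of the first few KB: intentionally collision-prone,
    # only useful to rule files out before a full-content hash
    with open(path, "rb") as f:
        return zlib.crc32(f.read(nbytes))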