Sometimes it's desirable to de-dupe a large collection of data files.
I wrote a tool called lsdupes to do that.
Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.
To optimise it a bit, I put in a size check: a first pass looks at the size of each file, and md5sums are computed only when multiple files share a given size. I thought this would reduce the md5sum load a bit.
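The two-pass approach can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual lsdupes implementation (the function name and chunk size here are my own choices):

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(root):
    """Group files under root by size, then md5-hash only the
    files whose size is shared with at least one other file."""
    # Pass 1: walk the tree, bucketing paths by file size.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip unreadable entries

    # Pass 2: hash only the candidates with a non-unique size.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: cannot be a duplicate
        for path in paths:
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)

    # Each returned group is a set of files with identical content.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

The point of the first pass is that file sizes come straight from the directory walk's stat calls, so no file content needs to be read at all unless a size collision occurs.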
It turns out that in my photo collection of 44590 photos and videos, the size is a perfect hash - there are no photos with the same size that have different content.
So while md5sum does get run on the candidate dupes, it finds nothing beyond what the size check already found; and the size check runs a lot faster: walking the tree and collecting sizes takes around 15 seconds, while taking md5sums of the roughly 5000 files with non-unique sizes takes around 7 minutes.