10 April, 2010

size as a hash function

Sometimes it's desirable to de-dupe a large collection of data files.

I wrote a tool called lsdupes to do that.

Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.

To optimise it a bit, I put in a size check: a first pass looks at the size of each file, and only when there are multiple files of a given size are md5sums computed. I thought this would reduce the md5sum load a bit.
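In rough Python, the idea looks something like this (a sketch of the approach, not the actual lsdupes code):

```python
#!/usr/bin/env python3
# Sketch of the two-pass approach (not the actual lsdupes source):
# group files by size first, and only md5 the ones whose size collides.
import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_dupes(root):
    # Pass 1: cheap "hash" - just the file size.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: only files whose size is shared need the expensive hash.
    by_md5 = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_md5[md5_of(path)].append(path)

    return [paths for paths in by_md5.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_dupes(sys.argv[1]):
        print("\n".join(group), end="\n\n")
```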

It turns out that in my photo collection of 44590 photos and videos, the size is a perfect hash - there are no photos with the same size that have different content.

So while md5sum does get run on the dupes, it doesn't find anything the size-based hash didn't; and the size-based hash runs a lot faster: walking the tree and getting sizes takes around 15 seconds, while taking md5sums of the roughly 5000 files with non-unique sizes takes around 7 minutes.

2 comments:

  1. 2.5 years too late for you, but fdupes does exactly that.

  2. yeah, deleting duplicate files is a microcosm of "let's reimplement this"...

    though really what I thought was cool was the utility of size-as-hash: everything I read about hashes is about making more secure, more unique(?) hashes that are cheap to implement; rather than uber-cheap, very shitty hashes.

    A different very cheap hash might be "checksum of the first 4096 bytes of the file" (a rough sketch follows the comments).

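A rough sketch of that "first 4096 bytes" idea (hypothetical code, not part of lsdupes): checksum a small prefix of each file and use it like the size check, as another cheap pre-filter before paying for a full md5sum.

```python
# Hypothetical cheap hash: CRC32 of the first few KB of a file.
# Lots of collisions are possible, but it is enough to split
# size-collision groups further before running md5sum on whole files.
import zlib

def cheap_hash(path, prefix_len=4096):
    with open(path, "rb") as f:
        return zlib.crc32(f.read(prefix_len))
```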