10 April, 2010

Sometimes it's desirable to de-duplicate a large collection of data files.
I wrote a tool called lsdupes to do that.
Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.
To optimise it a bit, I put in a size check: a first pass looks at the size of each file, and md5sums are computed only when multiple files share a given size. I thought this would reduce the md5sum load a bit.
It turns out that in my photo collection of 44,590 photos and videos, file size is a perfect hash: no two files with the same size have different content.
So while md5sum does get run on the candidate dupes, it doesn't find anything beyond what the size-based check already found; and the size-based check runs a lot faster: walking the tree and collecting sizes takes around 15 seconds, while taking md5sums of the roughly 5000 files with non-unique sizes takes around 7 minutes.
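For illustration, here's a minimal Python sketch of that two-pass approach. This is not the actual lsdupes code; the structure and names are my own.

import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_dupes(root):
    # First pass: group files by size; this needs only a stat() per file.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: md5sum only the files whose size is not unique.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[md5_of(path)].append(path)

    # Any digest with more than one path is a set of duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_dupes(sys.argv[1]):
        print(" ".join(group))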
20 March, 2010
so many ways to hash
I was making command-line tools for stupid to drive the example sha256 code, which resulted in multiple tools that deliberately do the same thing but use different language backends. Then I realised I already have a shitload (where shitload == 4) of md5sum command-line tools:
$ echo abc | md5
0bee89b07a248e27c83fc3d5951213c1
$ echo abc | gmd5sum
0bee89b07a248e27c83fc3d5951213c1  -
$ echo abc | openssl dgst
0bee89b07a248e27c83fc3d5951213c1
$ echo abc | gpg2 --print-md md5
0B EE 89 B0 7A 24 8E 27 C8 3F C3 D5 95 12 13 C1
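For comparison (not one of the four above), the same digest falls out of Python's hashlib; the trailing newline that echo appends is part of the hashed input, which is the usual thing to trip over when the numbers don't match:

import hashlib

# echo "abc" pipes the bytes b"abc\n", newline included
print(hashlib.md5(b"abc\n").hexdigest())  # 0bee89b07a248e27c83fc3d5951213c1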