10 April, 2010

Sometimes it's desirable to de-duplicate a large collection of data files.
I wrote a tool called lsdupes to do that.
Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.
To optimise it a bit, I put in a size check: a first pass looks at the size of each file, and md5sums are computed only when multiple files share a given size. I thought this would reduce the md5sum load a bit.
It turns out that in my photo collection of 44,590 photos and videos, file size is a perfect hash: no two files with the same size have different content.
So while md5sum does get run on the candidate dupes, it doesn't find anything beyond what the size-based check already found; and the size-based check runs a lot faster: walking the tree and collecting sizes takes around 15 seconds, while taking md5sums of the roughly 5000 files with non-unique sizes takes around 7 minutes.
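For illustration, here's a minimal Python sketch of that two-pass approach. This is not the actual lsdupes code; the structure and names are my own.

import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_dupes(root):
    # First pass: group files by size; this needs only a stat() per file.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: md5sum only the files whose size is not unique.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_digest[md5_of(path)].append(path)

    # Any digest with more than one path is a set of duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]

if __name__ == "__main__":
    for group in find_dupes(sys.argv[1]):
        print(" ".join(group))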
20 March, 2010
so many ways to hash
I was making command-line tools for stupid to drive the example sha256 code, which resulted in multiple tools that deliberately do the same thing but use different language backends. Then I realised I already have a shitload (where shitload == 4) of md5sum command-line tools:
$ echo abc | md5
0bee89b07a248e27c83fc3d5951213c1
$ echo abc | gmd5sum
0bee89b07a248e27c83fc3d5951213c1  -
$ echo abc | openssl dgst
0bee89b07a248e27c83fc3d5951213c1
$ echo abc | gpg2 --print-md md5
0B EE 89 B0 7A 24 8E 27 C8 3F C3 D5 95 12 13 C1
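For comparison (not one of the four above), the same digest falls out of Python's hashlib; the trailing newline that echo appends is part of the hashed input, which is the usual thing to trip over when the numbers don't match:

import hashlib

# echo "abc" pipes the bytes b"abc\n", newline included
print(hashlib.md5(b"abc\n").hexdigest())  # 0bee89b07a248e27c83fc3d5951213c1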