Ben Clifford Technical Blog: April 2010

24 April, 2010

google chart venn diagram of disk usage

This compares disk usage of a backup I made of /home two years ago with one made this month. There's a lot of overlap - more than I expected.

Method: creating the second backup, I used rsync with --link-dest specified; for measuring, I used du -hsc old new; du -hsc new old. The size of the overlap is the size of new given by the second du minus the size of new given by the first du.

I'm trying to figure out ways of showing overlapping disk usage for more than two or three rsync trees created in this way (for example, to visualise monthly, weekly and daily backups over a long time period)

17 April, 2010

versions from version control systems

Sometimes its nice for built software to contain version information from the version control system - this means less to users and more to developers.

This is useful when people, for whatever reason, incorrectly report their version: "please test the release candidate for version 1.0 // I found bug B // That's strange because I thought bug B only got introduced yesterday in the dev code // oh yes, well I checked out the dev code and tested that, not actually version 1.0"; or "I have a new bug C // Are you building from a pristine source checkout? // Yes // Why does your log file contain a warning message that the source was modified? // oh, well I hacked at component Q // yes, that's why you're seeing bug C."

Different version control systems have different version representations. SVN has a sequential revision number; git has directory tree hashes; darcs doesn't seem to have any explicit version representation.

So here is code I've used at different times for svn, git and darcs:

For darcs, here's code I found in darcs issue tracker 1142:

echo "$(darcs show tags | head -1) (+ $(darcs changes --count --from-tag .) patches)"

which gives the most recent tag in the repository and how many additional patches have been added:

v1.1 (+ 39 patches)

For svn, we used something like this in Swift

R=$(svn info | grep '^Revision' | sed "s/Revision: /$1-r/")
M=$(svn status | grep --invert-match '^\?' > /dev/null && echo "($1 modified locally)")
echo $R $M

which we use to form version strings like this:

swift-r1833 (swift modified locally)

Using git-svn, so that I was interested in the SVN version information rather than git version information, swap out R and M above for these:

    R=$(git svn info | grep '^Revision' | sed "s/Revision: /$1-r/")
    if git status -a >/dev/null ; then
      M="($1 modified locally)"
    fi

Both of the above two return the version of the last change in a particular checkout. When there are many modules, its sometimes desirable to give version information for those modules independently. The svnversion command that gives that:

svnversion -c somecomponent

that will produce different output forms that vary depending on the state of the checkout but at simplest gives something like:

To get similar per-directory handling for git, you can do something like this anywhere except in the root of a repository:

SUBDIRPREFIX=$(git rev-parse --show-prefix | sed 's%/$%%')
cd ./$(git rev-parse --show-cdup)
git ls-tree HEAD $SUBDIRPREFIX | (read a b c d ; echo $c)

which will emit a tree hash.

So there are a bunch of different techniques for different version control systems. It would be nice if they were all a bit more consistent - really cool would be a single command that knew all version control systems and could output in a consistent format.

10 April, 2010

size as a hash function

Sometimes its been desirable to de-dupe a large collection of data files.

I wrote a tool called lsdupes to do that.

Originally, it was going to run md5sum to hash every file, and then list files with matching hashes to be deleted.

To optimise it a bit, I put in a size check: a first round happens looking at the size of each file, and only when there are multiple files for a given size will md5sums be computed. I though this would the md5sum load a bit.

It turns out that in my photo collection of 44590 photos and videos, the size is a perfect hash - there are no photos with the same size that have different content.

So while md5sum does get run on the dupes, it doesn't find anything different from what the size-based hash does; and the size-based hash runs a lot faster: to walk the tree and get sizes takes around 15 seconds. To then take md5sums of the roughly 5000 files that have non-unique size takes around 7 minutes.

03 April, 2010

HTML forms in the early 90s

Before HTML forms used the <form> tag, there was a simpler mechanism called <ISINDEX>. I was reminded about this because someone asked why CGI examples showed urls without name=value pairs in the ?query section, like this: http://ive.got.syphil.is/disease-registry.html?pox

I then went to see if <ISINDEX> still worked in modern browsers. I discovered that in
Safari, it renders a box, but there seems to be no way to submit the query (traditionally, pressing enter would submit it). Lynx and w3m still support it.

From others, I hear:

apparently the default browser on the g1 doesn't even recognize isindex as some sort of input field, although it renders it as such.
sort of weird; won't let you type in it but it renders an input box

and Chrome is also reported to work correctly.

Anyway, this post has an isindex tag right here:

<--- there, so you can see how it works in your browser...

Ben Clifford Technical Blog