15 July, 2014

containerisation of my own environment

I've encountered docker in a couple of work-related projects, and for a bit more experimentation I've begun containerising a chunk of my own infrastructure.

Previously I've had a few servers around which, over the years, have ended up with me being a bit scared to upgrade them: too many dependencies between what should be separate services. Last time I rebuilt my main server, I looked at Xen, because a bunch of separate VMs was the best way I could then see of keeping services apart; but that machine was a little too out of date to do decent virtualisation.

I've ended up with an LDAP server for unix accounts, and a /home shared between all the containers that need home directory access (which is done with docker volume mounts at the moment, but there is scope for adding NFS onto that if/when I spread to more than one machine).
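
Concretely, that means each such container gets the same host directory mounted as /home; something like this (the paths and image names are illustrative, not my actual layout):

# illustrative only: mount the host's /srv/home as /home in each container
docker run -d --name imap -v /srv/home:/home local/imap
docker run -d --name sshd -v /srv/home:/home local/sshd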

I've got separate containers for each of: inbound smtp, outbound smtp, an apache proxy that redirects to other containers based on URL, imap, webmail, ldap server, ssh server (so you are ssh-ing into a container, not the base OS).
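
Roughly, the idea is that only the apache proxy publishes port 80 on the host, and the web-facing containers behind it are reached over docker links - something like this (container and image names are made up):

# illustrative only: the apache proxy publishes port 80 on the host and
# forwards to other containers (e.g. webmail) based on the request URL
docker run -d --name webproxy -p 80:80 --link webmail:webmail local/apache-proxy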

The plan is that each of these is rebuilt and restarted automatically every week or so, along with a whole-machine reboot at least once a month. I'm hoping that keeps things fairly up to date and helps me discover upgrade-related breakages around the time they happen rather than years later. It also forces me to pay attention to documenting how I set something up: everything that is torn down and rebuilt each time needs to be documented properly, in machine-readable form, so that the rebuild works. In that sense, it is a bit like automated testing of documentation.
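
A rebuild of one service doesn't need to be anything clever; this sketch (with placeholder paths and image names, rather than exactly what I run) is the sort of thing that can sit behind a weekly cron entry:

# illustrative weekly rebuild of one service: rebuild the image from its
# Dockerfile, throw away the old container, and start a fresh one
cd /srv/docker/imap
docker build -t local/imap .
docker stop imap && docker rm imap
docker run -d --name imap -v /srv/home:/home local/imap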

I've also tried to set up things like port forwarding and http forwarding so that it's not too reliant on using docker - so that I can spread onto other machines or use different virtualisation. That is, for example, how I intend to deal with upgrading the base OS in a few years' time - by starting a new VM, moving the services across one by one, and then killing the old one.
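
As a sketch of what I mean by host-level forwarding (not necessarily exactly what I run - the addresses are made up), a plain DNAT rule forwards a port to wherever the service currently lives, whether that's a container, a VM or another machine:

# illustrative only: forward inbound smtp on the host to the address where
# the inbound-smtp service currently runs (here, a container at 172.17.0.10)
iptables -t nat -A PREROUTING -p tcp --dport 25 -j DNAT --to-destination 172.17.0.10:25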

08 July, 2014

balancing tests between 4 workers by external observation

We have a bunch of functional tests that are run by our buildbot every time someone makes a git push.

These tests were taking a long time. Obvious answer: run them in parallel.

This brought up some problems, the biggest of which was how to split the tests between the four test runners. Naively chopping the test list into four pieces turned out to be pretty imbalanced: one test runner was taking 4 minutes, another was taking 14 minutes.

Initially this sounded like something a task farm would be good for. But I didn't want to get into digging round in the groovy test running code, and making the test runners talk back to the buildbot to collect tasks.

So I took an approach that balances less aggressively but has a much simpler interface: a balance program picks some tests for each of the four test runners, runs the tests, takes the wall-clock time of the whole run on each runner, and then iteratively updates its knowledge so that it will hopefully pick a more balanced distribution next time round.

I had a quick play with genetic algorithms but that didn't seem to be going anywhere. Next I implemented this model:

Assume there is a startup cost k, and for each test i, a time t_i that the test takes to run. These cannot directly be measured by the balance program.

Keep an estimate of k and t_i in a state file.

Make a distribution of tests over the four runners based on the estimates.

When each runner finishes, if it took longer than the estimate, nudge up the estimates of k and of the tests that were on that runner; similarly, nudge them down if the run took less time than estimated. (A rough sketch of this step follows below.)

Run this lots of times.
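
As a rough sketch of that nudge step (this is not the real implementation, just the shape of the idea; the state-file format and the 10% step size are invented for illustration):

#!/bin/bash
# sketch only: one nudge step for one runner's result.
# estimates.txt holds lines like "k 30" or "testFoo 12.5" (seconds).
# usage: ./nudge.sh <measured-seconds> <test> [<test> ...]

MEASURED=$1; shift
TESTS="$*"

# predicted time = estimate of k plus the estimates of the tests on this runner
PREDICTED=$(awk -v tests="$TESTS" '
  BEGIN { n = split(tests, t, " "); for (i = 1; i <= n; i++) want[t[i]] = 1 }
  $1 == "k" || ($1 in want) { sum += $2 }
  END { print sum }' estimates.txt)

# share 10% of the error out between k and the tests that were on this runner
awk -v tests="$TESTS" -v measured="$MEASURED" -v predicted="$PREDICTED" '
  BEGIN { n = split(tests, t, " "); for (i = 1; i <= n; i++) want[t[i]] = 1
          delta = 0.1 * (measured - predicted) / (n + 1) }
  $1 == "k" || ($1 in want) { print $1, $2 + delta; next }
  { print }' estimates.txt > estimates.txt.new && mv estimates.txt.new estimates.txt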

After a while this rearranges the tests so that each runner takes about 10 minutes (compared to the 4 to 14 minutes with a naive distribution).

So we've saved a few minutes on the tests, and are hopefully in a position where, as we get more tests, we can scale up the number of runners and still preserve a reasonable balance.

This also copes with converging to a new balance when tests are added or removed, or when a test's run time changes (either because the test itself changed, or because the behaviour being tested changed).

(The other problem was that it turned out that loads of our tests were secretly dependent on each other and failed when run in a different order. This would occasionally cause problems with the naive distribution, but was much more of a problem with the reordering that happens with this balancing approach.)

Source code is at https://github.com/benclifford/weightbal.

01 July, 2014

fascist push master

I've been working with a customer to get some development infrastructure and procedures in place.

A problem we had was that, although we had development branches, some developers were supremely confident that their change definitely wouldn't break things, and so would regularly put some of their commits directly on the master branch; and then, a few minutes later, the build robot would discover that things were broken on master. This was (is) bad because we were trying to keep master always ready to deploy, and trying to keep it so that other developers could always base their new work on the latest master.

So we invented a new policy, and a tool to go with it, the fascist-push-master.

Basically, you can only put something on master if it has already passed the build robot's tests (for example, on a development branch).

This doesn't mean "your dev branch passes, so now you're allowed to merge it to master", because breakages often come from the merge of "independent" work. Instead it means that master can only be updated to a git commit that has actually passed the tests: you have to merge master into your dev branch, check that it works there, and then that is what becomes the new master commit.

I wasn't so mean as to make the git repository reject unapproved commits - if someone is so disobedient as to ignore the policy, they can still push whatever commits they want to master. Rejection only happens in the client-side tool. This was a deliberate decision.

The merge-to-master workflow previously looked like this:

git checkout mybranch
# do some work
git commit -a -m "my work"
git checkout master
git merge mybranch
git push origin master
# now the build robot tests master, with mybranch changes added

but now it looks like this:

git checkout mybranch
# do some work
git commit -a -m "my work"
git merge master
git push origin mybranch
# now the build robot tests mybranch with master merged in
fascist-push-master
# which will either push this successfully tested
# mybranch to master or fail

The output looks like this:

$ fascist-push-master
You claim ba5810edfc67be5118be6c02ab3ffbe215bbe898 on branch mybranch is tested and ready for master push
Checking that story with the build robot.
Umm. Liar. Go merge master to your branch, get jenkins to test it, and come back when it works.

INFORMATION ABOUT YOUR LIES: BRANCHID=mybranch
INFORMATION ABOUT YOUR LIES: COMMITID=ba5810edfc67be5118be6c02ab3ffbe215bbe898

The script is pretty short, and pasted below:

#!/bin/bash

export BRANCHID=$(git rev-parse --abbrev-ref HEAD)
export COMMITID=$(git rev-parse HEAD)

echo You claim $COMMITID on branch $BRANCHID is tested and ready for master push

echo Checking that story with the build robot.
# ask the build robot whether this commit has already passed its tests
ssh buildrobot.example.com grep $COMMITID /var/lib/jenkins/ok-commits
RES=$?

if [ "$RES" == "0" ]; then
  echo OK, you appear to be telling the truth.
  # fast-forward local master to the tested commit, push it, then switch back to the branch
  (git fetch && git checkout master && git merge --ff-only origin/master && git merge --ff-only $COMMITID && git push origin master && git checkout $BRANCHID) || echo "SOMETHING WENT WRONG. CONTACT A GROWN UP"
else
  echo Umm. Liar. Go merge master to your branch, get jenkins to test it, and come back when it works.
  echo 
  echo INFORMATION ABOUT YOUR LIES: BRANCHID=$BRANCHID
  echo INFORMATION ABOUT YOUR LIES: COMMITID=$COMMITID
  exit 1
fi
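
The other half of this, which isn't shown above, is the build robot appending commits to /var/lib/jenkins/ok-commits when a build passes. I won't paste the jenkins job configuration here, but the relevant bit can be as little as a post-build shell step along these lines (a sketch, not the exact job configuration):

# run by the build robot only after the tests have passed, recording
# this commit as known-good so fascist-push-master will accept it
git rev-parse HEAD >> /var/lib/jenkins/ok-commits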