06 July, 2015

A Haskell reddit bot.

I am one of many many moderators on reddit's r/LondonSocialClub. This is a place for organising social gatherings in London.

Post titles usually take the form [DD/MM/YY] Event @ Place. Other moderators have fiddled with the CSS for this subreddit to give us a big red TODAY sticker next to today's events, and to grey out events that are in the past. This uses reddit's flair mechanism, which allows labels to be assigned to posts, with CSS styling applied based on a post's flair.

Unfortunately, this was not entirely automated - some sucker or other had to go in each day and adjust flair on the relevant posts to match up with reality. This bothered me: a manual process that should be fairly easy to automate. Eventually it bothered me enough that I wrote a bot, lsc-todaybot, to do it. Now the moderation logs make it look like I come home from the pub every day and move everything around before going to sleep.
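
The core decision is tiny: pull the [DD/MM/YY] date out of the title and compare it with today. Here is a minimal sketch of that step (illustrative only - not the actual lsc-todaybot code, which also has to talk to reddit):

import Data.Time (Day, fromGregorian)

data Flair = Today | Past | Upcoming deriving Show

-- e.g. parseTitleDate "[06/07/15] Drinks @ Somewhere" == Just 2015-07-06
parseTitleDate :: String -> Maybe Day
parseTitleDate ('[':rest) =
  case splitOn '/' (takeWhile (/= ']') rest) of
    [d, m, y] -> Just (fromGregorian (2000 + read y) (read m) (read d))
    _         -> Nothing
  where
    splitOn c s = case break (== c) s of
                    (a, _:b) -> a : splitOn c b
                    (a, [])  -> [a]
parseTitleDate _ = Nothing

flairFor :: Day -> Day -> Flair
flairFor today eventDay
  | eventDay == today = Today
  | eventDay <  today = Past
  | otherwise         = Upcoming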

Another motivation for writing this bot was that it seemed small enough in scope to be achievable, but would still give me a chance to learn a few new APIs: several new Haskell libraries, and the reddit REST API.

HTTP: I've previously used the HTTP package when hacking at cabal. It doesn't do HTTPS (I think) and the maintainer told me not to use it. So I tried wreq. It was easy enough to get going, and there was a tutorial for me to rip off.
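
A GET with wreq looks roughly like this (a minimal sketch; the listing URL is just illustrative, not necessarily what the bot fetches):

import Control.Lens ((^.))
import Network.Wreq (get, responseBody)

main :: IO ()
main = do
  r <- get "https://www.reddit.com/r/LondonSocialClub/new.json"
  print (r ^. responseBody)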

Configuration: I used the yaml package to parse a YAML configuration file.
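
Roughly like this - yaml hands back an aeson-style Value (the filename here is made up for illustration):

import Data.Yaml (ParseException, Value, decodeFileEither)

loadConfig :: IO Value
loadConfig = do
  parsed <- decodeFileEither "lsc-todaybot.yaml"
              :: IO (Either ParseException Value)
  either (error . show) return parsed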

Lenses: I still haven't got a good grasp of what is happening with lenses, but I used them in a few places and it has developed my understanding a little: lsc-todaybot extracts fields from reddit's JSON responses using aeson-lens. yaml exposes the parsed configuration file as a JSON-style value, so the same lenses can be used for extracting configuration details. wreq also uses lenses for setting HTTP header values and the like.
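
For a flavour of the field extraction, here is a hedged sketch using lens-aeson-style combinators (key and _String); aeson-lens, which the bot uses, is similar in spirit but its combinators differ a little, and the field path here is just the general shape of a reddit post listing:

{-# LANGUAGE OverloadedStrings #-}
import Control.Lens ((^?))
import Data.Aeson (Value)
import Data.Aeson.Lens (key, _String)
import Data.Text (Text)

titleOf :: Value -> Maybe Text
titleOf post = post ^? key "data" . key "title" . _String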

Strings: I seem to have ended up using several different string types, which is icky - ByteString, Text and String at least. I've made the source code a bit more generic by concatenating them with the generic monoid <> operator, which makes things look a bit less horrible.
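
A trivial illustration (not code from the bot): the same shape works whichever string type a function happens to be stuck with.

{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid ((<>))
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T

greetText :: T.Text -> T.Text
greetText name = "hello, " <> name

greetBytes :: BS.ByteString -> BS.ByteString
greetBytes name = "hello, " <> name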

--

28 May, 2015

10 minute Haskell talk: An awkward interaction between lazy ByteStrings and a misbehaving (non-)transparent HTTP middlebox

The slides for a lightning talk I gave at the London Haskell User Group are here. Press a in the browser and you'll get some explanatory notes with the slides; otherwise they're a bit sparse.

01 December, 2014

dive computer subtitles for gopro videos

On a couple of dives recently, I had my own dive computer and wore my GoPro in head-mounted mode.

I thought it would be nice to have the dive computer info displayed on the GoPro video, so I hacked up https://github.com/benclifford/subsurface2srt which pulls data from subsurface and makes it into a VLC subtitles file.
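
SRT is a pleasingly dumb format: numbered cues, each with a time range and a line of text. A toy sketch of the output side (the sample type and wording are invented for illustration; this is not subsurface2srt's actual code):

import Text.Printf (printf)

-- one dive computer sample: seconds into the dive, depth in metres
data Sample = Sample { at :: Int, depthM :: Double }

-- cue number n covering the span from sample s to the next sample s'
srtCue :: Int -> Sample -> Sample -> String
srtCue n s s' =
  show n ++ "\n"
    ++ timestamp (at s) ++ " --> " ++ timestamp (at s') ++ "\n"
    ++ printf "depth %.1fm\n\n" (depthM s)

timestamp :: Int -> String
timestamp secs =
  printf "%02d:%02d:%02d,000" (secs `div` 3600) ((secs `div` 60) `mod` 60) (secs `mod` 60)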

One problem I have is that both the GoPro and the dive computer have manually set clocks, which can be set only to the nearest minute. So guessing a start offset between the video and the dive computer file is a bit hazy.

04 November, 2014

plane wifi

For the first time, I was on a plane that had wifi. I think it was a 777-200 or something like it.

I didn't have much battery power left on my laptop and I didn't want to pay USD 16 for just a few minutes of use, but I did have a poke around the network.

My laptop could see 2 access points with ESSID United_Wi-Fi and 10 with a blank ESSID.

I connected to one of the United_Wi-Fi APs.

They used NAT (I expect) and allocated me an RFC1918 address in a subnet with about 500 usable IPs (a /23, so 510 to be exact).

inet addr:172.19.248.97  Bcast:172.19.249.255  Mask:255.255.254.0

With each passenger carrying at least one wifi device, I wonder if they'll get near address space exhaustion. A 777 is supposed to be able to carry up to about 450 passengers in some configurations.

The default gateway is down at 172.19.248.1

There is a suggestion that DNS paywall-tunnelling hacks might work, though I didn't try: some hostname lookups gave me an IP address, and some gave NXDOMAIN, which suggests some off-plane communication was happening even though the paywall was still in place.

$ host www.google.com
www.google.com has address 74.125.225.51
[...]
$ host blahfkskfdhs.com
Host blahfkskfdhs.com not found: 3(NXDOMAIN)

HTTP GETs were all redirected to www.unitedwifi.com, hosted on-plane at 172.19.248.2.

An nmap of the 172.19.248.0/23 subnet gave 19 addresses responding to pings - mostly passengers, I guess, but probably crew too, and servers/routers.

The three interesting nmap results were:

Nmap scan report for ns.unitedwifi.com (172.19.248.1)
Host is up (0.0020s latency).
Not shown: 997 filtered ports
PORT    STATE  SERVICE
53/tcp  open   domain
80/tcp  open   http
443/tcp closed https
MAC Address: 00:0D:2E:00:40:01 (Matsushita Avionics Systems)

Nmap scan report for www.unitedwifi.com (172.19.248.2)
Host is up (0.0014s latency).
Not shown: 993 filtered ports
PORT      STATE  SERVICE
80/tcp    open   http
443/tcp   open   https
8080/tcp  closed http-proxy
16001/tcp closed fmsascon
16012/tcp closed unknown
16016/tcp closed unknown
16018/tcp closed unknown
MAC Address: 00:0D:2E:00:00:A8 (Matsushita Avionics Systems)

Nmap scan report for 172.19.248.3
Host is up (0.0019s latency).
Not shown: 999 filtered ports
PORT   STATE SERVICE
53/tcp open  domain
MAC Address: 00:0D:2E:00:40:01 (Matsushita Avionics Systems)

I didn't probe any more as my battery had run out.

30 September, 2014

Payment Wristband on the London Underground

I previously blogged about making a paytag sticker into a wristband. Later Barclays Bank released a variation: bpay, a prepay mastercard already in a wristband.

The wristband holder is pretty shitty and falloffable: it is bulky and I know two people (one being myself) who have lost their bands accidentally. I've rehoused mine on a woven bracelet.

Being pre-pay, the band does an online authorisation for every transaction, which sometimes makes it a little slower. But for the same reason, they expose authorisations (not just cleared transactions) in the live online statement.

I recently made my first journey on the London Overground using bpay (I've been on their contactless payment trial for 6 months but using a different card) and I got to see an initial authorisation that I hadn't seen before with my previous (post-paid) card:

0908 Enter train system at Wapping station
0915 bpay sees this authorisation:
    Auth: TfL Travel Charge,TFL.gov.uk/CP,GB 29/09/2014 9:14:50 Posted On: 29/09/2014 GBP 0.10
0922 Leave train system at Shoreditch High Street

Then, around close of business on day+1, that Auth gets replaced with the actual charge:
    Fin: TFL.GOV.UK/CP,VICTORIA,TFL TRAVEL C   30/09/2014  18:07:58  Posted On: 29/09/2014  GBP 7.20

Interesting that they charge 10p for the authorisation rather than the minimum single fare. Also note that the description of the transaction changes (to something less readable, IMO) - that seems to happen with other merchants too. Weirdos.

02 September, 2014

Boris bike tidal flow

Docking status information is available in XML for the London bike hire scheme ("Boris bikes").

I made this video (AVI) (animated GIF) of the tidal flow as areas get busy or empty during the day: an animated version of the image below, using data from Saturday evening until Tuesday lunchtime.

Each point represents a docking station. You can see how the shape of this cloud sits over London on this Google map. Blue means empty docking station. Red means full docking station. Light blue and light red mean almost empty and almost full, respectively.

Not so much on Saturday and Sunday, but clearly (to me) on Monday you can see a 9am rush hour move of bikes into the centre, and a 5pm move of bikes back out to the edges again.

27 August, 2014

ffmpeg X video capture...

It turns out ffmpeg can video-record an X server. I'm using this to capture video of a set of web browsers running tests inside Xvfb virtual frame buffers.
ffmpeg -f x11grab -s 1024x768 -r 4 -i :1 -sameq screencast.flv &   # grab display :1 at 1024x768, 4 fps
VIDEOPID=$!     # remember ffmpeg's PID
xeyes           # or some other X-based automated testing program
kill $VIDEOPID  # stop recording once the tests are done

05 August, 2014

ping reverse dns

Slightly unexpected hostname lookup on a CNAME.

maven.ops is a CNAME to lulu; the reverse DNS points only to lulu.

You can already see the lulu hostname in the first line of output, because it's shown there. But the per-packet lines show the name I gave for the first few seconds, and then switch to the "real" hostname (perhaps when a reverse DNS lookup happens, rather than reusing the name that was looked up forwards to begin with?).

No big deal, but slightly unexpected.

benc@utsire:~$ ping maven.ops.xeus.co.uk
PING lulu.xeus.co.uk (46.4.100.47) 56(84) bytes of data.
64 bytes from maven.ops.xeus.co.uk (46.4.100.47): icmp_req=1 ttl=51 time=466 ms
64 bytes from maven.ops.xeus.co.uk (46.4.100.47): icmp_req=2 ttl=51 time=51.1 ms
64 bytes from maven.ops.xeus.co.uk (46.4.100.47): icmp_req=3 ttl=51 time=51.9 ms
64 bytes from lulu.xeus.co.uk (46.4.100.47): icmp_req=4 ttl=51 time=60.1 ms
64 bytes from lulu.xeus.co.uk (46.4.100.47): icmp_req=5 ttl=51 time=96.5 ms
64 bytes from lulu.xeus.co.uk (46.4.100.47): icmp_req=6 ttl=51 time=50.9 ms
64 bytes from lulu.xeus.co.uk (46.4.100.47): icmp_req=7 ttl=51 time=49.5 ms
64 bytes from lulu.xeus.co.uk (46.4.100.47): icmp_req=8 ttl=51 time=50.6 ms

benc@utsire:~$ ping -V
ping utility, iputils-sss20101006

15 July, 2014

containerisation of my own environment

I've encountered docker in a couple of work-related projects, and for a bit more experimentation I've begun containerising a chunk of my own infrastructure.

Previously I've had a few servers around which, over the years, I've ended up a bit scared to upgrade: too many dependencies between what should be separate services. Last time I rebuilt my main server I looked at Xen, but that machine was a little too out of date for decent virtualisation; at the time, having a bunch of different VMs was the best way I could see of keeping services separate.

I've ended up with an LDAP server for unix accounts, and a /home shared between all the containers that need home directory access (which is done with docker volume mounts at the moment, but there is scope for adding NFS onto that if/when I spread to more than one machine).

I've got separate containers for each of: inbound smtp, outbound smtp, an apache proxy that redirects to other containers based on URL, imap, webmail, ldap server, ssh server (so you are ssh-ing into a container, not the base OS).

The plan is that each of these is built and restarted automatically every week or so; along with a whole machine reboot at least once a month. I'm hoping that keeps stuff fairly up to date and helps me discover upgrade-related breakages around the time they happen rather than years later. It also forces me to pay attention to documenting how I set something up: all the stuff that is torn down and rebuilt each time needs to be documented properly in machine-readable form, so that the rebuild works. In that sense, it is a bit like automated testing of documentation.

I've also tried to set up things like port forwarding and http forwarding so that it's not too reliant on docker - so that I can spread onto other machines or use different virtualisation. That is, for example, how I intend to deal with upgrading the base OS in a few years' time: by starting a new VM, moving the services across one by one, and then killing the old one.

08 July, 2014

balancing tests between 4 workers by external observation

We have a bunch of functional tests that are run by our buildbot every time someone makes a git push.

These tests were taking a long time. Obvious answer: run them in parallel.

This brought up some problems, the biggest of which was how to split the tests between the four test runners. Naively chopping the test list into four pieces turned out to be pretty imbalanced: one test runner was taking 4 minutes, another was taking 14 minutes.

Initially this sounded like something a task farm would be good for. But I didn't want to get into digging around in the Groovy test-running code and making the test runners talk back to the buildbot to collect tasks.

So I took a less dynamic approach with a simpler interface: a balance program picks some tests for each of the four test runners, runs the tests, records how long the whole run took on each runner, and then iteratively updates its knowledge so that it will hopefully pick a more balanced distribution next time round.

I had a quick play with genetic algorithms but that didn't seem to be going anywhere. Next I implemented this model:

Assume there is a startup cost k, and for each test i, a time t_i that the test takes to run. These cannot directly be measured by the balance program.

Keep an estimate of k and t_i in a state file.

Make a distribution of tests over the four runners based on the estimates.

When each runner finishes, if it took longer than the estimate, nudge up the estimates for k and for the tests that were on that runner; similarly nudge down if the run time was less.

Run this lots of times.
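
As a sketch of what that nudging step might look like (my own notation and a made-up learning rate - not necessarily how weightbal actually does it):

import qualified Data.Map as M

type Estimates = (Double, M.Map String Double)   -- (startup cost k, per-test times t_i)

-- predicted wall-clock time for one runner, given the tests assigned to it
predict :: Estimates -> [String] -> Double
predict (k, ts) tests = k + sum [M.findWithDefault 1 t ts | t <- tests]

-- after a runner finishes: share the prediction error out between k and the
-- tests that were on that runner, scaled down by a small rate
nudge :: Double -> Estimates -> [String] -> Double -> Estimates
nudge rate est@(k, ts) tests observed = (k + step, foldr bump ts tests)
  where
    err    = observed - predict est tests
    step   = rate * err / fromIntegral (length tests + 1)
    bump t = M.insert t (M.findWithDefault 1 t ts + step)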

After a while this rearranges the tests so that each runner takes about 10 minutes (compared to the 4 to 14 minutes with a naive distribution).

So we've saved a few minutes on the tests and are hopefully in a position where as we get more tests we can scale up the number of runners and still preserve reasonable balance.

This also copes with converging to a new balance when tests are added or removed, or when test times change (whether because the test itself changed, or the behaviour being tested did).

(The other problem was that loads of our tests turned out to be secretly dependent on each other and failed when run in a different order - this would occasionally cause problems with the naive distribution, but was much more of a problem with the reordering that happens under this balancing approach.)

Source code is https://github.com/benclifford/weightbal