Recording for Posterity

This is a simple mantra I’ve used in many teams I worked in or led over the years:

  • Prefer Services to Software
    • Make services that are robust and functional.
    • Buy services that other folk do better than we could for the time and/or money.
  • Prefer Software to People
    • Automate everything where possible.
    • Output actionable telemetry for all the things.
  • Prefer People to Bureaucracy
    • Trust in the people you’ve employed to do the right thing and do it well.
    • Remove unnecessary paperwork and processes whenever you can.
  • Prefer ChatOps for Everything
    • Email is so 1990s. Put everything on Slack so everyone can see it.
    • Make command and control of your systems intuitive, accessible and secure.

Traildash on ECS

I’ve recently had some issues where I’ve had to investigate the AWS API usage on one of our accounts. Enabling CloudTrail is a start, but all it does is shove a load of gzipped JSON files into an S3 bucket, which is no use if you actually want to work with the data.
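To give a feel for what is inside those files, here is a minimal sketch that tallies API calls in a single log file. It assumes the standard CloudTrail layout of a gzipped JSON document with a top-level "Records" list; the sample records and the count_api_calls helper are made up for illustration.

```python
import gzip
import json
from collections import Counter


def count_api_calls(gzipped_bytes):
    """Count calls per (eventSource, eventName) in one CloudTrail log file."""
    # Each log file is a gzipped JSON document with a top-level "Records" list.
    document = json.loads(gzip.decompress(gzipped_bytes).decode("utf-8"))
    return Counter(
        (record["eventSource"], record["eventName"])
        for record in document.get("Records", [])
    )


# Made-up sample standing in for a file downloaded from the bucket.
sample = gzip.compress(json.dumps({
    "Records": [
        {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
        {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
        {"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets"},
    ]
}).encode("utf-8"))

for (source, name), count in count_api_calls(sample).most_common():
    print(source, name, count)
```

This is fine for one file, but it gets tedious across thousands of them, which is where a proper index and dashboard come in.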

Enter Traildash, a self-contained ELK stack in a Docker image, which will pull that bucket’s contents into Elasticsearch and display them usefully in a Kibana frontend.

“Docker?”, I thought to myself, “Doesn’t AWS have something that could do that?” Of course it does. ECS, the EC2 Container Service, allows you to run your own Docker cluster. So here’s how to set up Traildash on ECS.

First of all, follow the instructions in the Traildash readme to set up Traildash in AWS. This gets CloudTrail up and running and connected to SNS/SQS for pushing to Traildash. The one difference is the IAM role: use the ECS instance and service roles created as part of the cluster build below. If you already have running ECS instances, add the “SQS full access” and “S3 Read Only Access” managed policies to your ECS Instance and ECS Service IAM roles. If not, wait until your ECS cluster instances are built and add the policies afterwards.
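If you would rather script the policy step than click through the console, a rough boto3 sketch might look like this. The role names are an assumption (they are the defaults the ECS wizard usually creates, so check yours), and planned_attachments and attach_policies are hypothetical helper names.

```python
# Role names are an assumption: the defaults created by the ECS setup wizard.
ROLE_NAMES = ["ecsInstanceRole", "ecsServiceRole"]

# AWS managed policies for "SQS full access" and "S3 Read Only Access".
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
]


def planned_attachments(role_names, policy_arns):
    """Return every (role, policy ARN) pair that needs attaching."""
    return [(role, arn) for role in role_names for arn in policy_arns]


def attach_policies(profile_name=None):
    """Attach the managed policies to each role (makes real IAM calls)."""
    import boto3  # imported here so the planning helper works without boto3

    iam = boto3.Session(profile_name=profile_name).client("iam")
    for role, arn in planned_attachments(ROLE_NAMES, POLICY_ARNS):
        iam.attach_role_policy(RoleName=role, PolicyArn=arn)


if __name__ == "__main__":
    # Dry run: only print what would be attached.
    for role, arn in planned_attachments(ROLE_NAMES, POLICY_ARNS):
        print(role, "<-", arn)
```

Call attach_policies() once the roles exist; the dry run on its own touches nothing.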

Next is creating a task and a cluster in ECS.

  • If you’ve not used ECS before, follow the default setup wizard until you have running cluster instances; otherwise you should be able to use your existing ECS configuration.
  • Create a New Task Definition in your ECS console.
  • Create a new volume (name doesn’t matter) and give it the source path /var/lib/elasticsearch/appliedtrust/traildash
  • Then Add Container with the following settings:
    • Image: appliedtrust/traildash
    • Port mappings: Host: 7000 Container: 7000
    • Environment Variables:
      • AWS_SQS_URL <your SQS URL>
      • AWS_REGION <your AWS Region>
      • AWS_ACCESS_KEY_ID <your AWS key>
      • AWS_SECRET_ACCESS_KEY <your AWS secret>
    • Mount point:
      • Source volume: <your previously created volume>
      • Container path: /home/traildash
  • Create the Service with your task definition, a single task and an ELB if you want one. (You may need to edit the instance/ELB security group to allow port 7000 access.)
  • Run the task and enjoy your CloudTrail-y goodness on <your instance or ELB URL>:7000/#/dashboard/file/default.json
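For anyone who prefers the API to the console, the task definition above can be sketched as a plain dictionary and registered with boto3. This is an illustration, not what I actually ran: the family name, volume name and the 1024 MiB memory figure are assumptions, and traildash_task_definition and register are hypothetical helpers.

```python
def traildash_task_definition(sqs_url, region, access_key, secret_key):
    """Build the ECS task definition described in the steps above."""
    volume_name = "traildash-data"  # hypothetical; the name doesn't matter
    return {
        "family": "traildash",  # hypothetical family name
        "volumes": [{
            "name": volume_name,
            "host": {"sourcePath": "/var/lib/elasticsearch/appliedtrust/traildash"},
        }],
        "containerDefinitions": [{
            "name": "traildash",
            "image": "appliedtrust/traildash",
            "memory": 1024,  # MiB; an assumption, size it for your instances
            "essential": True,
            "portMappings": [{"hostPort": 7000, "containerPort": 7000}],
            "environment": [
                {"name": "AWS_SQS_URL", "value": sqs_url},
                {"name": "AWS_REGION", "value": region},
                {"name": "AWS_ACCESS_KEY_ID", "value": access_key},
                {"name": "AWS_SECRET_ACCESS_KEY", "value": secret_key},
            ],
            "mountPoints": [{
                "sourceVolume": volume_name,
                "containerPath": "/home/traildash",
            }],
        }],
    }


def register(definition, profile_name=None):
    """Register the task definition with ECS (makes a real API call)."""
    import boto3  # imported here so the builder works without boto3

    ecs = boto3.Session(profile_name=profile_name).client("ecs")
    return ecs.register_task_definition(**definition)


definition = traildash_task_definition(
    "<your SQS URL>", "<your AWS Region>", "<your AWS key>", "<your AWS secret>"
)
print(definition["containerDefinitions"][0]["portMappings"])
```

Creating the service and wiring up the ELB still happens as described in the list.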

It may take 10-15 minutes for your data to start to appear.

If you already have CloudTrail working in your account and the data has been building up for a while, Traildash provides a backfill script to get it into your dashboard. To use the backfill script, I changed it to use my AWS credentials profile name:

#!/usr/bin/env python

import json

import boto3

# Use a named credentials profile instead of environment variables.
boto3.setup_default_session(region_name='eu-west-1', profile_name='<your credentials profile name>')

AWS_S3_BUCKET = "<your bucket name>"
AWS_SQS_URL = "<your SQS url>"

bucket = boto3.resource('s3').Bucket(AWS_S3_BUCKET)
queue = boto3.resource('sqs').Queue(AWS_SQS_URL)

items_queued = 0
for item in bucket.objects.all():
    # Skip anything that is not a CloudTrail log file.
    if not item.key.endswith('.json.gz'):
        continue

    # Queue a message shaped like the notification sent for a new log file.
    queue.send_message(
        MessageBody=json.dumps({
            'Message': json.dumps({
                's3Bucket': AWS_S3_BUCKET,
                's3ObjectKey': [item.key]
            })
        })
    )
    items_queued += 1

print('Done! {} items were backfilled'.format(items_queued))

And you’re done. Enjoy your useful CloudTrail data!

How to be a Good Tech Lead

Transitioning from engineering to management/leadership is tough unless you understand the sacrifices you have to make. Rarely can you keep your hands in the guts of the engineering you used to know intimately. And you’ll have to do more paperwork. That said, it can be just as rewarding to see your team grow in expertise and experience because of your leadership.

Here are a few simple rules to start with:

  1. If there’s an exciting fun task and a messy unpleasant task, assign the fun task to someone else and do the unpleasant task yourself.
  2. If someone on your team wants to ask you a question, always make yourself available and absolutely pretend that you don’t mind being interrupted. But if you need to ask someone on your team a question, always ask first if it is a good time for them to talk and offer to come back later if they are in the middle of something.
  3. If someone wants to try an approach that you think is wrong, say: “I’m not sure that’s the right approach, because of X, Y, and Z. However, I’ve been wrong before, so I might be wrong about this, too. How long will it take you to research this approach and see if it works out?” If you’re working on a tight schedule, this may not be practical, but if you want to develop good engineers in the long run, this can be beneficial for everyone.
  4. Be humble. Redirect upstream praise for your team’s work onto your team directly (away from yourself). Accept criticism for your team’s work directly onto yourself.
  5. Expect to do less actual engineering, but still keep on top of one or two components (DB, CM, CD etc) for up to 1/3 of your time. This helps you keep an ear to the ground on ongoing work and communicate intelligently with the technical team.

Another good resource is this article from David Loftesness, formerly Twitter’s Director of Engineering:

Loftesness’ 90-day framework has three distinct stages: Own Your Education (Days 1-30), Find Your Rhythm (Days 31-60) and Assessing Yourself (Days 61-90). It also helps with the decision to enter management in the first place.

The Operations Documentation Problem

Documentation tends to be a polarising, all or nothing, topic in the Operations teams I have been a part of. Everyone agrees on its fundamental importance but no one seems to like to spend time or effort producing it. Especially if they see no immediate benefit in it for themselves.

“My code is self documenting”

“It’s all in the readme”


“That guy knows – he built it”

This, of course, becomes a real issue when bringing new staff into the team. Onboarding takes far longer when the majority of the information that new hires require about the systems they will be administering has to be acquired piecemeal, anecdotally, using imperfect recollection and without the benefit of an index or search facility.

Storage, input, display and availability are also all contentious subtopics. People have their own favourite wiki flavours, editors and formatting techniques. Should the content be made available outside the team? Or to third parties e.g. vendors? Or even outside the company?

My current role had just such a problem. Documentation was in an awful state – exported in plain text from an old Trac wiki – partially converted to Github markdown in a private repo. A disorganised, incomplete process started by an engineer who had since left the company.

Our Issues

Numerous problems were apparent when reviewing the old version of the docs:

  • Trac formatting, although similar to Markdown, was incompatible and hence content was mainly unreadable
  • Inter-document links were broken and unformatted
  • Images were not displaying properly
  • Much of the content was outdated, obsolete or plain wrong
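To illustrate the first point, here is a minimal sketch of the kind of conversion the Trac export needed. It only covers a handful of Trac constructs (headings, bold, italic, code block delimiters and external links) and is not the process we actually used; trac_to_markdown is a hypothetical helper.

```python
import re

FENCE = "`" * 3  # Markdown code fence


def trac_to_markdown(text):
    """Convert a few common Trac wiki constructs to Markdown."""
    out = []
    for line in text.splitlines():
        # Trac headings: "== Title ==" becomes "## Title".
        heading = re.match(r"^(=+)\s*(.*?)\s*=+\s*$", line)
        if heading:
            out.append("#" * len(heading.group(1)) + " " + heading.group(2))
            continue
        # Trac code blocks are delimited by {{{ and }}} on their own lines.
        if line.strip() in ("{{{", "}}}"):
            out.append(FENCE)
            continue
        line = re.sub(r"'''(.+?)'''", r"**\1**", line)  # bold
        line = re.sub(r"''(.+?)''", r"*\1*", line)  # italic
        # External links: [http://url label] becomes [label](http://url).
        line = re.sub(r"\[(https?://\S+)\s+([^\]]+)\]", r"[\2](\1)", line)
        out.append(line)
    return "\n".join(out)


print(trac_to_markdown("== Our Issues ==\n'''Broken''' links, see [http://example.com the wiki]"))
```

Inter-document links and images need more care than a few regular expressions, which is why those broke first.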

How Much Doco Is Too Much?

Legend states that minimum viable documentation for a relatively complex software application should consist of:

  • How to install
  • How to create and ship a change
  • A Project roadmap
  • A Changelog
  • A Glossary – if necessary
  • How to troubleshoot or Where to get help

And for Open Source projects:

  • How to contribute

Much of this was lacking for nearly all the projects in our document repository.

So where do you start?

Start by being systematic. Some considerations:

Get some consensus on where the docs should live

Local file server? Github? Dedicated wiki instance?

Public or private?

Private to the team/group/department/company? Public to the world? Or by invitation e.g. username/pass?


Will you need authentication to edit? Or to view?


Do you need group collaboration on documents? Or will a page be locked to one user?

The Winning

Fortunately the total amount of documentation we had to manage wasn’t too great, maybe a few hundred pages. So we settled on keeping the docs in Github and using a wiki front end for reading and searching.

First stop was MDwiki, which is a nice, simple single-file wiki front end. It is ideal for small amounts of documentation or single-project wiki sites, such as when documentation is included in a Github repo. However, we had a significant number of pages in nested directories. And MDwiki has no search facility.

So we went with Gollum.

Gollum wiki

Gollum is the front end used within Github for its repo wikis. So when you click on the link in the right menu, Gollum is the engine organising and serving you the pages.

As a standalone package it works really well to display and make searchable any documentation you throw at it.

As it says in the Gollum documentation, it supports a number of file formats:

  • ASCIIDoc: .asciidoc
  • Creole: .creole
  • Markdown: .markdown, .mdown, .mkdn, .mkd, .md
  • Org Mode: .org
  • Pod: .pod
  • RDoc: .rdoc
  • ReStructuredText: .rest.txt, .rst.txt, .rest, .rst
  • Textile: .textile
  • MediaWiki: .mediawiki, .wiki

And Gollum also allows you to register your own extensions and parsers:

Gollum::Markup.register(:angry, "Angry") { |content| content.upcase }

In my next post I’ll go into how we modified Gollum wiki to work for our documentation process.

Further Notes on SaltStack Monitoring

A few weeks ago I started looking at SaltStack, my current config management package choice, as the central component of an open source componentised monitoring package.

This is now up and running in a rudimentary fashion. I have a scheduler state that is applied to several machines in my estate which sends system monitoring data to both a MySQL instance for storage and reuse and to a Graphite endpoint for display.
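For a sense of what ends up at the Graphite end, here is a rough sketch that flattens a nested status result into Graphite's plaintext protocol (one "path value timestamp" line per metric, normally sent to port 2003). Salt's own returners handle this for you; the to_graphite_lines and send helpers, the "salt" prefix and the sample data are all hypothetical.

```python
import socket
import time


def to_graphite_lines(host, metrics, timestamp=None, prefix="salt"):
    """Flatten nested numeric metrics into Graphite plaintext lines."""
    timestamp = int(time.time() if timestamp is None else timestamp)
    lines = []

    def walk(path, value):
        if isinstance(value, dict):
            for key in sorted(value):
                walk(path + [str(key)], value[key])
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append("{} {} {}".format(".".join(path), value, timestamp))
        # Non-numeric values are skipped: Graphite only stores numbers.

    # Dots separate path components in Graphite, so escape them in the hostname.
    walk([prefix, host.replace(".", "_")], metrics)
    return lines


def send(lines, graphite_host, port=2003):
    """Send the lines to a Carbon plaintext listener (opens a real connection)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((graphite_host, port), timeout=5) as sock:
        sock.sendall(payload)


sample = {"loadavg": {"1-min": 0.42}, "mem": {"free": 2048}}
for line in to_graphite_lines("web1.example.com", sample, timestamp=1400000000):
    print(line)
```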

I’m also looking at some other components:

I will continue to work on supplementing the data the systems send back and on the front end display. Once I have something worth looking at, I may even post some screenshots.

Elasticsearch on AWS or How I learned to stop worrying and love the Lucene index

Ahh, Elasticsearch, the cause of, and solution to, all of life’s problems.

I run a Logstash/Elasticsearch/Kibana cluster on EC2 as an application/system log aggregator for the web service I’m supporting. And it’s not been plain sailing. I have a limited AWS budget so I am somewhat restricted in the instances I can fire up. No cc2.8xlarges for me. So I was stuck with two m1.larges. And they struggled.

It was processing around 3 million documents for a total index size of around 4 GB per day. And sometimes it coped and sometimes it didn’t. I often found myself restarting the Logstash and Elasticsearch services once or twice a week, sometimes losing 7-9 hours of processed logs.
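To put those numbers in perspective, here is the back-of-the-envelope arithmetic, assuming "4 GB" means decimal gigabytes and that the load is spread evenly over the day (it never is, so peaks will be higher):

```python
DOCS_PER_DAY = 3000000
INDEX_BYTES_PER_DAY = 4 * 10**9  # "around 4 GB", taken as decimal gigabytes

avg_doc_bytes = INDEX_BYTES_PER_DAY / DOCS_PER_DAY
avg_docs_per_second = DOCS_PER_DAY / (24 * 60 * 60)

print("average indexed size per document: {:.0f} bytes".format(avg_doc_bytes))
print("average ingest rate: {:.1f} docs/second".format(avg_docs_per_second))
```

That works out at roughly 1.3 KB per document and about 35 documents a second on average.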

And the most frustrating thing? I had no idea what I was doing wrong. Had I misconfigured? Is it just that the instances were too small?

So I’ve upped my game a bit. Not without some trial and error. “Fake it til you make it” as those of us without an extensive background in Lucene indices and grid clustering are fond of saying.

But I think I’ve cracked it. And this may be a good lesson for people starting out with a set up like this.

  • I’ve now got two c3.xlarges, which gives me 10 more compute units to play with and makes a big difference to the throughput.
  • I’ve tweaked the Logstash command line to give me 8 filter workers instead of the default 1. Helps a lot when the document volume increases.

And the most important thing? I’ve done my homework and put some effort into making my Elasticsearch config right.

  • Port specification to prevent mismatch

transport.tcp.port: 9301

  • EC2 discovery plugin with filtering to ensure the instances see each other, and an increased ping timeout to account for network irregularities.

discovery.type: ec2
discovery.ec2.groups: elasticsearch-node
discovery.ec2.ping_timeout: 65s
discovery.ec2.tag.Elasticsearch: true

  • Making sure my nodes are given specific workloads using SaltStack jinja templating of the config .yml
{% if grains['elasticsearch_master'] == False %}
node.master: false
{% endif %}

For now my problems seem to be mitigated. We’ll see how easy it is in future to scale the service as my user load increases.

Saltstack Monitoring

I spoke recently at the February London DevOps meetup about my adventures with solo DevOps and the SaltStack config management system (Slides).

One of the other talks at the meetup, Stop Using Nagios by Andy Sykes from Forward3D (@supersheep) got me thinking about using Salt as the core component of a distributed monitoring system. I believe it fits the mould very well:

  • It has an established, secure and most importantly, fast, master-minion setup
  • It has built in scheduling capability both for the master and separately for the minions.
  • It already has built-in support, in its Returners, for piping whatever comes back from your minion status checks into Graphite for graphing and MySQL/Postgres/SQLite or Cassandra/Mongo/Redis/Couch for storage/trending etc.
  • It can act on events on the minions using its Events and Reactor systems

Some other interesting work has already been done in prototypes of Salt modules to run NRPE checks on minions.
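Those checks follow the usual Nagios plugin convention, where the exit code carries the status (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN) and the first line of output carries the message. A minimal sketch of a wrapper that could sit behind a minion-side check looks like this. It is not the prototype Salt module itself; run_check and the fake plugin are hypothetical.

```python
import subprocess
import sys

STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command, timeout=30):
    """Run a Nagios-style plugin and map its exit code to a status."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        # Plugin missing or hung: report UNKNOWN rather than raising.
        return {"status": "UNKNOWN", "output": str(error)}
    output = result.stdout.decode("utf-8", "replace").strip()
    first_line = output.splitlines()[0] if output else ""
    return {
        "status": STATUS_BY_EXIT_CODE.get(result.returncode, "UNKNOWN"),
        "output": first_line,
    }


# Stand-in for a real plugin: prints a message and exits with code 1.
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('DISK WARNING - 12% free'); sys.exit(1)",
]
print(run_check(fake_plugin))
```

A dictionary like this is easy to hand to a returner for storage or graphing.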

And there are plenty of Graphite Dashboards that could be co-opted and built upon to provide other views of check data, not to mention Salt’s experimental Halite which may have possibilities as another UI facet.

I’ve started doing some testing of my own on this, but I’d be very interested in feedback.

CloudWatchr for AWS Instances

For some time I’ve been using Cloudwatch as a supplement to my other graphing and monitoring packages, but I finally got tired of the poor UI and lack of customisation. I had seen and tested some other GitHub releases that seemed like they might do the trick and let me run my own CW graphs, but none had the features I required. So I wrote my own.

Based on aws-cloudchart I have built a set of tools that will give a much better at-a-glance insight into your AWS EC2, RDS and ELB instances.

Take a look here:

The code should be fairly easy to follow and it uses the older AWS PHP-SDK v 1.62 as its base interface with the Amazon Cloudwatch API. And all it needs is a set of IAM creds. Feel free to fork and improve.
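If you want to poke at the same data without PHP, the underlying request is easy to sketch in Python with boto3. This is a different SDK from the one the tool uses, shown purely for illustration; cpu_metric_request and fetch_datapoints are hypothetical helpers and the instance ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def cpu_metric_request(instance_id, hours=3, period_seconds=300, now=None):
    """Build GetMetricStatistics parameters for an instance's average CPU."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


def fetch_datapoints(request, region_name, profile_name=None):
    """Call CloudWatch and return datapoints in time order (real API call)."""
    import boto3  # imported here so the request builder works without boto3

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    response = session.client("cloudwatch").get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


request = cpu_metric_request("i-0123456789abcdef0")  # placeholder instance ID
print(request["MetricName"], request["Period"], request["Statistics"])
```

The same shape works for RDS and ELB by swapping the namespace, metric name and dimension.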