
The Operations Documentation Problem

Documentation tends to be a polarising, all-or-nothing topic in the Operations teams I have been a part of. Everyone agrees on its fundamental importance, but no one seems to want to spend the time or effort producing it, especially if they see no immediate benefit in it for themselves.

“My code is self documenting”

“It’s all in the readme”


“That guy knows – he built it”

This, of course, becomes a real issue when bringing new staff into the team. Onboarding takes far longer when the majority of the information that new hires require about the systems they will be administering has to be acquired piecemeal, anecdotally, using imperfect recollection and without the benefit of an index or search facility.

Storage, input, display and availability are also all contentious subtopics. People have their own favourite wiki flavours, editors and formatting techniques. Should the content be made available outside the team? Or to third parties e.g. vendors? Or even outside the company?

My current role had just such a problem. The documentation was in an awful state: exported as plain text from an old Trac wiki, then partially converted to GitHub Markdown in a private repo. A disorganised, incomplete process started by an engineer who had since left the company.

Our Issues

Numerous problems were apparent when comparing the converted docs against the old version:

  • Trac formatting, although similar to Markdown, is incompatible with it, so much of the content was unreadable
  • Inter-document links were broken and unformatted
  • Images were not displaying properly
  • Much of the content was outdated, obsolete or plain wrong

How Much Doco Is Too Much?

Legend states that minimum viable documentation for a relatively complex software application should consist of:

  • How to install
  • How to create and ship a change
  • A Project roadmap
  • A Changelog
  • A Glossary – if necessary
  • How to troubleshoot or Where to get help

And for Open Source projects:

  • How to contribute

Much of this was lacking for nearly all the projects in our document repository.

So where do you start?

Start by being systematic. Some considerations:

Get some consensus on where the docs should live

Local file server? Github? Dedicated wiki instance?

Public or private?

Private to the team/group/department/company? Public to the world? Or by invitation e.g. username/pass?


Will you need authentication to edit? Or to view?


Do you need group collaboration on documents? Or will a page be locked to one user?

The Winning

Fortunately the total amount of documentation we had to manage wasn’t too great, maybe a few hundred pages. So we settled on keeping the docs in GitHub and using a wiki front end for reading and searching.

First stop was MDwiki, which is a nice, simple, single-file wiki front end. It is ideal for small amounts of documentation or single-project wiki sites, such as when documentation is included in a GitHub repo. However, we had a significant number of pages in nested directories, and MDwiki has no search facility.

So we went with Gollum.

Gollum wiki

Gollum is the front end GitHub uses for its repo wikis. So when you click on the wiki link in the right-hand menu, Gollum is the engine organising and serving you the pages.

As a standalone package it works really well to display and make searchable any documentation you throw at it.
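Getting a standalone instance running is minimal. A sketch, assuming the docs already live in a local git clone (the repo path here is hypothetical):

```shell
# Install the wiki front end, then serve an existing git repo of docs
gem install gollum
# Serves on http://localhost:4567 by default; --ref selects the branch
gollum --ref master /path/to/docs-repo
```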

As it says in the Gollum documentation, it supports a number of markup file types:

  • ASCIIDoc: .asciidoc
  • Creole: .creole
  • Markdown: .markdown, .mdown, .mkdn, .mkd, .md
  • Org Mode: .org
  • Pod: .pod
  • RDoc: .rdoc
  • ReStructuredText: .rest.txt, .rst.txt, .rest, .rst
  • Textile: .textile
  • MediaWiki: .mediawiki, .wiki

And Gollum also allows you to register your own extensions and parsers. For example, a toy “Angry” markup whose render step simply upcases the content:

Gollum::Markup.register(:angry, "Angry") do |content|
  content.upcase
end

In my next post I’ll go into how we modified Gollum wiki to work for our documentation process.

Further Notes on SaltStack Monitoring

A few weeks ago I started looking at SaltStack, my current config management package of choice, as the central component of an open source, componentised monitoring package.

This is now up and running in a rudimentary fashion. I have a scheduler state that is applied to several machines in my estate which sends system monitoring data to both a MySQL instance for storage and reuse and to a Graphite endpoint for display.
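The scheduler side of that boils down to something like the following minimal sketch (the entry name is hypothetical, and the mysql and carbon/Graphite returners each need their own connection settings elsewhere in the minion config):

```yaml
# Hypothetical schedule entry applied to the monitored minions
schedule:
  system_monitoring:
    function: status.loadavg   # any status.* module function can be swapped in
    minutes: 5                 # run every five minutes
    returner: mysql,carbon     # ship results to MySQL and to Graphite (carbon)
```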

I’m also looking at some other components:

I will continue to work on supplementing the data the systems send back and on the front end display. Once I have something worth looking at, I may even post some screenshots.

Elasticsearch on AWS or How I learned to stop worrying and love the lucene index

Ahh, Elasticsearch, the cause of, and solution to, all of life’s problems.

I run a Logstash/Elasticsearch/Kibana cluster on EC2 as an application/system log aggregator for the web service I’m supporting. And it’s not been plain sailing. I have a limited AWS budget, so I am somewhat restricted in the instances I can fire up. No cc2.8xlarges for me. So I was stuck with two m1.larges. And they struggled.

It was processing around 3 million documents per day, for a total index size of around 4 GB. Sometimes it coped and sometimes it didn’t. I often found myself restarting the Logstash and Elasticsearch services once or twice a week, sometimes losing 7-9 hours of processed logs.

And the most frustrating thing? I had no idea what I was doing wrong. Had I misconfigured something? Or were the instances simply too small?

So I’ve upped my game a bit. Not without some trial and error. “Fake it til you make it” as those of us without an extensive background in Lucene indices and grid clustering are fond of saying.

But I think I’ve cracked it. And this may be a good lesson for people starting out with a set up like this.

  • I’ve now got two c3.xlarges, which, with 10 more compute units to play with, make a big difference to the throughput.
  • I’ve tweaked the Logstash command line to give me 8 filter workers instead of the default 1, which helps a lot when the document volume increases.
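For reference, the filter worker count on Logstash 1.x is just a command-line flag. A sketch, with hypothetical jar and config paths:

```shell
# -w / --filterworkers sets the number of threads running the filter stage
java -jar logstash.jar agent -f /etc/logstash/indexer.conf -w 8
```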

And the most important thing? I’ve done my homework and put some effort into making my Elasticsearch config right.

  • Port specification to prevent mismatch

transport.tcp.port: 9301

  • EC2 discovery plugin with filtering to ensure the instances see each other, and an increased ping timeout to account for network irregularities.

    discovery:
      type: ec2
      ec2:
        groups: elasticsearch-node
        ping_timeout: 65s
        tag:
          Elasticsearch: true
  • Making sure my nodes are given specific workloads using SaltStack jinja templating of the config .yml
{% if grains['elasticsearch_master'] == False %}
    node.master: false
{% else %}
    node.master: true
{% endif %}

For now my problems seem to be mitigated. We’ll see how easy it is in future to scale the service as my user load increases.

Saltstack Monitoring

I spoke recently at the February London DevOps meetup about my adventures with solo DevOps and the SaltStack config management system ( Slides )

One of the other talks at the meetup, Stop Using Nagios by Andy Sykes from Forward3D (@supersheep), got me thinking about using Salt as the core component of a distributed monitoring system. I believe it fits the mould very well:

  • It has an established, secure and, most importantly, fast master-minion setup
  • It has built-in scheduling capability, both for the master and separately for the minions
  • Its Returners already have built-in support for piping whatever comes back from your minion status checks into Graphite for graphing, and into MySQL/Postgres/SQLite or Cassandra/Mongo/Redis/Couch for storage/trending etc.
  • It can act on events on the minions using its Events and Reactor systems
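You can see the returner plumbing in the list above at work straight from the master. A sketch, assuming Salt's Graphite returner (named carbon) has its endpoint configured on the minions:

```shell
# Run a status check across all minions and ship the results to Graphite
salt '*' status.loadavg --return carbon
```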

Some other interesting work has already been done on prototype Salt modules to run NRPE checks on minions.

And there are plenty of Graphite Dashboards that could be co-opted and built upon to provide other views of check data, not to mention Salt’s experimental Halite which may have possibilities as another UI facet.

I’ve started doing some testing of my own on this, but I’d be very interested in feedback.

CloudWatchr for AWS Instances

For some time I’ve been using CloudWatch as a supplement to my other graphing and monitoring packages, but I finally got tired of the poor UI and lack of customisation. I had seen and tested some other GitHub releases that looked like they might do the trick and let me run my own CloudWatch graphs, but none had the features I required. So I wrote my own.

Based on aws-cloudchart, I have built a set of tools that give a much better at-a-glance insight into your AWS EC2, RDS and ELB instances.

Take a look here:

The code should be fairly easy to follow. It uses the older AWS PHP-SDK v 1.62 as its base interface with the Amazon CloudWatch API, and all it needs is a set of IAM creds. Feel free to fork and improve.