Immutable Servers With Packer and Puppet
Lately I’ve become more and more of a fan of the concept of Immutable Servers while automating our infrastructure at Zapier. The concept is simple: never do server upgrades or changes on live servers, instead just build out new servers with the updates applied and throw away the old ones. You basically get all the benefits of immutability in programming at the infrastructure level, plus you never have to worry about configuration drift. And even better, I no longer have to fear that, despite extensive tests, someone might push a puppet manifest change that out of the blue breaks our front web servers (sure, we can roll back the changes and recover, but there is still a small potential outage to worry about).
Obviously you need some good tooling to make this happen. Some recent fooling around with packer has allowed me to put together a setup that I’ve been a little pleased with so far.
The Nodes
In our infrastructure project we have a nodes.yaml that defines node names and the AWS security groups they belong to. This is pretty straightforward and used for a variety of other tools (for example, vagrant).
elasticsearch:
  group: logging
zookeeper:
  group: zookeeper
redis:
  group: redis
  size: m2.2xlarge
The Rakefile
We use this nodes.yaml file with rake to produce packer templates to build out new AMIs. This keeps me from having to manage a ton of packer templates as they mostly have the same features.
require 'erb'
require 'yaml'

namespace :packer do
  task :generate do
    current_dir = File.dirname(__FILE__)
    nodes = YAML.load_file("#{current_dir}/nodes.yaml")
    nodes.each_key do |node_name|
      include ERB::Util
      template = File.read("#{current_dir}/packs/template.json.erb")
      erb = ERB.new(template)
      File.open("#{current_dir}/packs/#{node_name}.json", "w") do |f|
        f.write(erb.result(binding))
      end
    end
  end
end
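With that in place, regenerating the templates is a single rake invocation (the task name follows from the namespace and task defined above):

rake packer:generate    # writes packs/elasticsearch.json, packs/zookeeper.json and packs/redis.json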
The rake task is used in conjunction with a simple ERB template that injects the node name into it.
{ "builders": [{ "type": "amazon-ebs", "region": "us-east-1", "source_ami": "ami-10314d79", "instance_type": "t1.micro", "ssh_username": "ubuntu", "ami_name": "<%= node_name %> {{.CreateTime}}", "security_group_id": "packer" }], "provisioners": [{ "type": "shell", "script": "packs/install_puppet.sh" }, { "type": "shell", "inline": [ "sudo apt-get upgrade -y", "sudo sed -i /etc/puppet/puppet.conf -e \"s/nodename/<%= node_name %>-$(hostname)/\"", "sudo puppet agent --test || true" ] }]
This will generate a packer template for each node that will
- Create an AMI in us-east-1
- Use an Ubuntu Server 13.04 AMI as the starting point
- Set the security group to packer in EC2. We create this group ahead of time and allow it access to the puppetmaster’s security group; otherwise packer will create a random temporary security group that won’t have access to any other groups (if you follow best practices at least)! A sketch of that one-time setup follows this list.
- Install puppet
- Run puppet once to configure the system
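Here is a rough sketch of that one-time security group setup using the AWS CLI (it can just as easily be done by hand in the console or with boto). The group names packer and puppetmaster and the standard puppet master port 8140 are assumptions based on the setup described above, so adjust to taste:

# create the security group that packer build instances will launch into
aws ec2 create-security-group \
    --group-name packer \
    --description "Temporary packer build instances"

# let members of the packer group reach the puppet master (default port 8140)
aws ec2 authorize-security-group-ingress \
    --group-name puppetmaster \
    --protocol tcp \
    --port 8140 \
    --source-group packer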
We also never enable puppet agent (it defaults to not starting) so that it never polls for updates. We could also remove puppet from the server after it completes so the AMI doesn’t have it baked in.
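If we went that route, a final inline provisioner could strip puppet out just before the AMI is snapshotted. This is only a sketch of the idea, not something we currently do:

# hypothetical last provisioner step: remove puppet and its certs before imaging
sudo apt-get remove --purge -y puppet puppetlabs-release
sudo apt-get autoremove -y
sudo rm -rf /var/lib/puppet/ssl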
The Script
Packer has a nice feature that lets you specify shell commands and shell scripts to run. This is fine for bootstrapping but not so fine for the level of configuration management that puppet is better suited for. So our packer templates call a shell script that makes sure we don’t use the ancient version of Ruby that Linux distros love to default to, and then installs puppet. As part of the installation it also specifies the puppet master server name (if you’re using VPC instead of EC2 Classic you don’t need this, as you can just assign the internal DNS name “puppet” to the puppetmaster).
#!/bin/bash
# give the instance a moment to settle before installing anything
sleep 30

# add the Puppet Labs apt repo (raring, to match the Ubuntu 13.04 base AMI)
wget http://apt.puppetlabs.com/puppetlabs-release-raring.deb
sudo dpkg -i puppetlabs-release-raring.deb
sudo apt-get update

# swap the distro's old Ruby for 1.9.3 and install puppet
sudo apt-get remove ruby1.8 -y
sudo apt-get install ruby1.9.3 puppet -y

# point the agent at the puppet master; "nodename" is replaced by the packer template
# ($vardir and $confdir are escaped so they land literally in puppet.conf)
sudo su -c 'echo "[main]
logdir=/var/log/puppet
vardir=/var/lib/puppet
ssldir=/var/lib/puppet/ssl
rundir=/var/run/puppet
factpath=\$vardir/lib/facter
templatedir=\$confdir/templates

[agent]
server = ip-10-xxx-xx-xx.ec2.internal
report = true
certname=nodename" >> /etc/puppet/puppet.conf'
Building It
Now all we need to do to build out a new AMI for redis is run

packer build packs/redis.json
and boom! A server is created, configured, imaged and terminated. Now just set up a few jobs in Jenkins to generate these based on certain triggers and you’re one step closer to automating your immutable infrastructure.
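As a sketch, such a Jenkins job’s build step could be as simple as regenerating the templates and building each node; the node list here is just the example from nodes.yaml above:

rake packer:generate
for node in elasticsearch zookeeper redis; do
    packer build "packs/${node}.json"
done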
Cleaning Up
Of course, each AMI you generate is going to cost you a penny a day or some such. This might seem small, but once you have 100 revisions of each AMI it’s going to cost you! So as a final step I whipped up a simple fabfile script to clean up the old images. This proved to be a simple task because we include a unix timestamp in the AMI name.
import os

import boto
from fabric.api import task


class Images(object):
    def __init__(self, **kwargs):
        self.conn = boto.connect_ec2(**kwargs)

    def get_ami_for_name(self, name):
        (keys, AMIs) = self.get_amis_sorted_by_date(name)
        return AMIs[keys[0]]

    def get_amis_sorted_by_date(self, name):
        amis = self.conn.get_all_images(filters={'name': '{}*'.format(name)})
        AMIs = {}
        for ami in amis:
            (name, creation_date) = ami.name.split(' ')
            AMIs[creation_date] = ami

        # newest timestamp first
        keys = AMIs.keys()
        keys.sort()
        keys.reverse()
        return (keys, AMIs)

    def remove_old_images(self, name):
        (keys, AMIs) = self.get_amis_sorted_by_date(name)
        # keep the newest AMI, deregister the rest
        while len(keys) > 1:
            key = keys.pop()
            print("deregistering {}".format(key))
            AMIs[key].deregister(delete_snapshot=True)


@task
def cleanup_old_amis(name):
    '''
    Usage: cleanup_old_amis:name={{ami-name}}
    '''
    images = Images(
        aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
        aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
    )
    images.remove_old_images(name)
Set this up as a post-build step on the Jenkins job that generates the AMI and you ensure you always have only the latest one. You could also tweak this to keep the last 5 AMIs around for archiving purposes (just change the while len(keys) > 1 check accordingly).
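Invoking the task from that post-build shell step looks something like this, assuming the AWS credentials are already exported into the job’s environment:

# deregister everything but the newest redis AMI
fab cleanup_old_amis:name=redis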
What’s Next?
I admit I’m still a little fresh with this concept. Ideally I’d be happy as hell to get our infrastructure to the point where each month (or week!) servers get recycled with fresh copies. For servers that are more transient, like web servers or queue workers, this is easy. With data stores it can be a little trickier, as you need an effective strategy to boot up replicas of primary instances, promote the replicas to primaries and retire the old primaries.
A final challenge is deciding what level of mutability is allowed. Deployments are obviously fine as they don’t tweak the server configuration but what about adding / removing users? Do we take an all or nothing approach or allow tiny details like SSH public keys to be updated without complete server rebuilds?
What does this do to long-term statistics gathering? Presumably newly-built servers get new names, and over time this adds up to lots of servers. How do you graph web server CPU over the course of a year in this scenario?
– Henry
What we typically do is bucket these together under the applications they run. In the monitoring software we use, we wind up with the average across those servers.