Zero Downtime Deployment with AWS ECS and ELB

Florian Motlik, August 27th, 2015 (Last Updated: August 26th, 2015)


As development teams push further toward continuous delivery, deploying updates to an application without disrupting users is becoming an increasingly sought-after practice. Amazon’s EC2 Container Service (ECS) makes that easier than ever with tight Elastic Load Balancer (ELB) integration.

Who Needs Zero Downtime Deployment?

The answer depends on who you ask. It used to be global websites with steady traffic around the clock, or high-availability services whose Service Level Agreements (SLAs) include guarantees about downtime. Anything that doesn’t fit in that box can theoretically just be deployed after hours with minimal user disruption.

As more teams move to continuous delivery/deployment with an emphasis on fast feedback, the desire grows to be able to deploy multiple times per day, in the middle of the day, and while users are active on the site. Other teams could just value their sleep. Regardless of the reason, deploying without a zero downtime process in the middle of the day will create noticeable outages for your users, damaging their confidence in your site or service. That is bad.

What Does Zero Downtime Deployment Look Like?

At a basic level, a zero downtime deploy involves swapping the servers running the old code for servers running the new code behind a load balancer. Here is the general scripted process:

1. Create a new Virtual Machine (VM) image with the new code.
2. Start a number of VMs using that image, equal to the number currently running.
3. Verify that each of these instances is running correctly and responding to health checks.
4. Add the new instances to the ELB while removing the old ones (with connection draining).
5. Verify that everything is working properly. If not, swap the old instances back in for the new ones and diagnose the problem. If so, delete the old instances.
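
The swap-and-verify part of the process above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a Classic Load Balancer client such as boto3.client("elb") is passed in, that connection draining is already enabled on the load balancer, and that the new instances have been started beforehand. The function name swap_instances and its parameters are hypothetical.

```python
import time


def swap_instances(elb, lb_name, old_ids, new_ids, timeout=300, poll=5):
    """Put new instances behind the load balancer, then remove the old ones.

    `elb` is assumed to be a Classic ELB client, e.g. boto3.client("elb").
    Returns True if the swap completed, False if it was rolled back.
    """
    new = [{"InstanceId": i} for i in new_ids]
    old = [{"InstanceId": i} for i in old_ids]

    # Add the new instances first so capacity never drops below the old level.
    elb.register_instances_with_load_balancer(LoadBalancerName=lb_name, Instances=new)

    deadline = time.monotonic() + timeout
    while True:
        health = elb.describe_instance_health(LoadBalancerName=lb_name, Instances=new)
        if all(s["State"] == "InService" for s in health["InstanceStates"]):
            break
        if time.monotonic() >= deadline:
            # Roll back: the old instances were never removed, so only the
            # new ones have to be taken out again.
            elb.deregister_instances_from_load_balancer(
                LoadBalancerName=lb_name, Instances=new
            )
            return False
        time.sleep(poll)

    # Deregistering starts connection draining if it is enabled on the ELB.
    elb.deregister_instances_from_load_balancer(LoadBalancerName=lb_name, Instances=old)
    return True
```

Because the old instances stay registered until the new ones pass their health checks, a failed deploy never reduces serving capacity. Deleting the old instances is deliberately left out so that a later manual rollback remains possible.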

The process will look a little different if database changes are involved. Strict developer policies are needed to manage schema changes, specifically to ensure that no deployment breaks the possibility of a rollback. As a general rule, never delete or change a column or table that is currently in use. If you do and there is a problem, you can’t roll back without restoring from backup. Hold off on that change until the following deploy.

Zero downtime database changes are a much more involved topic, and the right approach can vary with how much load a given change puts on the database. But the general rule of enforcing backwards compatibility across deployments covers most of the bases. As long as nothing breaks with old code and new code running side by side for a couple of minutes, you should be in the clear.
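
To illustrate the backwards compatibility rule, here is a small sketch using Python’s built-in sqlite3 module as a stand-in for a production database. The users table, the email column, and the two insert functions are hypothetical. The point is only that an additive, nullable column lets the old and new releases write to the same table at the same time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# This deploy: additive change only. The new column is nullable, so old
# code that does not know about it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")


def old_code_insert(conn, name):
    # Old release: unaware of the email column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn, name, email):
    # New release: writes the new column as well.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


# Both releases run side by side while the load balancer swap is in progress.
old_code_insert(conn, "alice")
new_code_insert(conn, "bob", "bob@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
print(rows)  # [('alice', None), ('bob', 'bob@example.com')]
```

Removing or renaming a column that the old release still reads would wait until a following deploy, once no running code depends on it.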

How Is It Different with ECS?

ECS doesn’t deploy to individual virtual machines. It uses a cluster of a few instances and deploys Docker containers onto them via task definitions. However, the basic building blocks of a zero downtime deploy are the same: we need to start the new container, verify it’s running, and then swap it in on the load balancer. This matters for a cluster because it must have enough resources available to start the new containers while the old ones are still running. If the necessary resources aren’t available, you’ll see a note in the events console that looks like this:

service sample-webapp was unable to place a task because the resources could not be found.

If you do have those resources available, you’ll see a set of messages along these lines:

service sample-webapp has started 2 tasks: task TASK-ID-1 task TASK-ID-2.
service sample-webapp registered 2 instances in elb LOAD-BALANCER-NAME
service sample-webapp has begun draining connections on 2 tasks.
service sample-webapp stopped 2 running tasks.
service sample-webapp has reached a steady state.

The messages that you see are the work of ECS’s existing integration with Elastic Load Balancer, which executes those zero downtime deployments without you needing to intervene. If you don’t have the resources available, all that’s necessary is to add instances to the cluster so that you do. That can be done by raising the desired capacity of an Auto Scaling group or by going directly to EC2 to add more instances to the cluster.
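
A rough way to estimate how many instances to add is to compare what each task needs against what is still free on every instance. The sketch below is plain Python and does not call AWS. The function name and all numbers are hypothetical, it assumes positive CPU and memory requirements, and it ignores other placement constraints such as host port conflicts.

```python
import math


def extra_instances_needed(free, task_cpu, task_mem, new_tasks, inst_cpu, inst_mem):
    """Return how many instances to add so `new_tasks` tasks can be placed.

    `free` is a list of (cpu_units, memory_mib) still available per instance.
    `inst_cpu` and `inst_mem` describe one empty instance of the type to add.
    """
    def fits(cpu, mem):
        # How many copies of the task fit into the given CPU and memory.
        return min(cpu // task_cpu, mem // task_mem)

    placeable = sum(fits(cpu, mem) for cpu, mem in free)
    missing = max(0, new_tasks - placeable)
    if missing == 0:
        return 0
    per_new_instance = fits(inst_cpu, inst_mem)
    if per_new_instance == 0:
        raise ValueError("task does not fit on an empty instance of this size")
    return math.ceil(missing / per_new_instance)


# Two instances, each already running one task and left with 512 CPU units
# and 400 MiB of memory. The task needs 512 CPU units and 600 MiB.
needed = extra_instances_needed(
    free=[(512, 400), (512, 400)],
    task_cpu=512, task_mem=600,
    new_tasks=2,
    inst_cpu=1024, inst_mem=1000,
)
print(needed)  # 2
```

In this example neither existing instance has enough free memory for a second task, so both new tasks need a fresh instance before the old ones can be drained.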

Try It Yourself

If you would like to step through this process yourself with Amazon’s sample-webapp, follow these steps:

1. First, if you haven’t already, complete the Setting up with Amazon ECS process.
2. Then step over to the ECS console’s first run wizard.
3. Step through the ECS Getting Started guide to get the sample web app running.
4. Hop over to Task Definitions and create a new revision of console-sample-app-static.
5. Edit the JSON for “command” to change the HTML displayed in a noticeable way.
6. Now go to Clusters and select your cluster.
7. In the cluster, click on your service.
8. In the service, click Update and change the task definition to the revision you made in step 4. It will be indicated by the revision number next to it. Then click Update Service.
9. In the Deployments tab, you should see the pending and running counts change after a few seconds. Feel free to keep refreshing the browser tab that is pointed at the app to watch the transition.
10. When you’re done, don’t forget to go through the Clean Up process.
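
The service update and the wait for the rollout can also be scripted instead of clicked through. The following is a minimal sketch, assuming an ECS client such as boto3.client("ecs") is passed in. The function name deploy_revision, the cluster name, and the revision number in the usage note are hypothetical.

```python
import time


def deploy_revision(ecs, cluster, service, task_definition, timeout=600, poll=10):
    """Point the service at a new task definition and wait for the rollout.

    `ecs` is assumed to be an ECS client, e.g. boto3.client("ecs").
    Returns True once a single, fully running deployment remains,
    or False if that has not happened before the timeout.
    """
    ecs.update_service(cluster=cluster, service=service, taskDefinition=task_definition)

    deadline = time.monotonic() + timeout
    while True:
        response = ecs.describe_services(cluster=cluster, services=[service])
        deployments = response["services"][0]["deployments"]
        for d in deployments:
            # The same counts the Deployments tab shows in the console.
            print(d["status"], "pending:", d["pendingCount"], "running:", d["runningCount"])

        # The old deployment disappears once its tasks are drained and stopped.
        if len(deployments) == 1 and deployments[0]["runningCount"] == deployments[0]["desiredCount"]:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A call might look like deploy_revision(boto3.client("ecs"), "default", "sample-webapp", "console-sample-app-static:2"), where the cluster name and the revision number are placeholders for your own values.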