Autoscaling on Complex Telemetry
Autoscaling to meet system demand is one of the most important techniques for cloud deployments. Too many machines waste money, too few machines lose it. Autoscaling effectively is hugely important for a business as system size increases, but doing so is a challenge, owing to the nature of the tools available to us as engineers.
In this article, I will discuss common autoscaling techniques and their drawbacks, and propose a technique suited to systems with complex behavior. I'll focus on, and use the terminology of, Amazon Web Services; other clouds offer similar autoscaling capability, so the information presented here should transfer.
How Should I Think About Autoscaling?
Just to be sure we’re on the same page, what is it we’re doing when we enable autoscaling for a system?
Obviously we’re saving money by turning off machines we don’t need (or making money by turning on those we do need), but what is autoscaling apart from business concerns? What does this thing look like viewed purely with an engineer’s eyes?
First, it’s a complication. A system that uses a constant, well-known number of machines is less complex. At any given time you, the engineer, can say how many machines are in service and know what the utilization of the system should be.
Now, implicit in this is the assumption that you’re measuring the system’s utilization, have a reasonable knowledge of what it is through time from repeat observation, and have allocated enough machines to handle some percentile of that utilization. This is resource planning.
Static deployment has two glaring flaws from an engineering standpoint. In the event your system is driven above its load threshold, unfortunate and often unexpected things occur. Additionally, except in rare cases, the system utilization will change through time. Perhaps your business becomes more successful and the load on the system creeps higher and higher as traffic increases.
Both are maintenance burdens and require ongoing monitoring and occasional human intervention.
Autoscaling — dynamic allocation of machines — resolves these flaws at the cost of making the system more complex. In particular, once a system is made to autoscale, it is no longer necessarily possible to say without examination how many machines are in operation. This is acceptable if autoscaling can be done with respect to the utilization of the system at any given time.
First Steps: CPU Scaling
CPU consumption is a rough analog of a system's utilization, tracking the "true" figure within some margin of error, and it is usually the first target when someone begins to autoscale a system.
CPU utilization is a well-understood thing to talk about; there’s no added training required to reason about the behavior of the system, and that’s great for making decisions in a team. Additionally, CPU utilization is effectively free in terms of engineering effort, as this is a default CloudWatch metric.
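For concreteness, here is roughly what turning that on can look like. This is a minimal sketch, assuming the AWS SDK for Java and a target-tracking policy on the predefined average-CPU metric; the group name and the 60 percent target are invented for illustration.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PredefinedMetricSpecification;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.TargetTrackingConfiguration;

public class CpuScalingPolicy {
    public static void main(String[] args) {
        AmazonAutoScaling autoscaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Keep the group's average CPU utilization near 60 percent.
        TargetTrackingConfiguration tracking = new TargetTrackingConfiguration()
                .withPredefinedMetricSpecification(new PredefinedMetricSpecification()
                        .withPredefinedMetricType("ASGAverageCPUUtilization"))
                .withTargetValue(60.0);

        autoscaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("bidder-asg")      // illustrative group name
                .withPolicyName("cpu-target-tracking")
                .withPolicyType("TargetTrackingScaling")
                .withTargetTrackingConfiguration(tracking));
    }
}
```

With a policy like this in place, the autoscale group adds or removes instances to hold average CPU near the target.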
Unfortunately, except in cases where your system is almost entirely CPU bound and the IO necessary to shuffle data toward your CPU is negligible, the view of your system's utilization as seen through the CPU metric can be fairly rough.
I work extensively on Erlang systems. One curious feature of Erlang systems, from an operational perspective, is that the VM threads busy-wait for work. The scheduler threads are effectively doing this:
    while (true) {
      if (have_work) {
        ...
      }
    }
Why? A busy-waiting thread can pick up new work faster than one that must be woken by the operating system. Erlang is a language designed for soft real-time applications; it makes sense for the VM to prioritize latency.
The downside here is that a moderately loaded Erlang system will fairly well peg its CPUs. While it might look like the system is struggling if you just go by CPU utilization, in actuality it could be chugging along just fine and keeping latency to a minimum.
The good news is that CPU scaling errs on the conservative side; it will tend to settle somewhere north of the system's true needs. This is often good enough when the total number of machines needed is rather small, fewer than ten or so. The system will be over-provisioned, but in a way that adapts dynamically through time and costs essentially no engineering effort.
Single-Stream Telemetry
What if you have a very large system, one where wringing out a more efficient autoscale strategy is desirable, or where being able to communicate more precisely about the system’s utilization is necessary?
In any moderately complex system, you must have insight into its real-time behavior through a telemetry stream. This is distinct from and complementary to batch-oriented logging of structured data.
If you work in Java, Dropwizard Metrics is a suitable library; in Erlang, exometer. These libraries each have a particular focus that makes them valuable: They aggregate on the machine and ship to external services. Consider this example from the Dropwizard Metrics documentation:
    private final Histogram responseSizes = metrics.histogram(
        name(RequestHandler.class, "response-sizes"));

    public void handleRequest(Request request, Response response) {
        // etc
        responseSizes.update(response.getContent().length);
    }
Here, metrics is an instance of a MetricRegistry, a store in local memory where telemetry is aggregated. A MetricRegistry can be introspected by user code, which will be important in the next section, and the library provides several Reporter implementations that ship off the contents of a MetricRegistry on a periodic interval. Exometer has similar capability.
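As a sketch of the reporting side, here is the stock ConsoleReporter from Dropwizard Metrics flushing a registry once a minute; in production you would swap in a reporter that ships to your telemetry backend, but the periodic-flush shape is the same.

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

import java.util.concurrent.TimeUnit;

public class Reporting {
    public static void main(String[] args) {
        MetricRegistry metrics = new MetricRegistry();

        // Dump the contents of the registry once a minute.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(metrics)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(1, TimeUnit.MINUTES);
    }
}
```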
With a telemetry stream established, you have the ability to measure key system metrics. Exactly what is key is a matter of the engineer's expert understanding of the system. Say you're building, oh, I don't know, a high-scale, low-latency, real-time bidding system. For a start, the key metrics here will be the following (a sketch of instrumenting them follows the list):
- transactions per second (TPS)
- latency of transactions at some percentile per second (LPS)
- unchecked exceptions per second (EPS)
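To make that concrete, here is a minimal sketch of instrumenting those three streams with Dropwizard Metrics; the class, method, and metric names are made up for illustration. A Meter yields per-second rates (TPS, EPS), and a Timer yields latency percentiles (LPS).

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class BidHandler {
    private final MetricRegistry metrics = new MetricRegistry();

    private final Meter transactions = metrics.meter("transactions");        // feeds TPS
    private final Meter exceptions   = metrics.meter("exceptions");          // feeds EPS
    private final Timer latency      = metrics.timer("transaction-latency"); // feeds LPS

    // Wrap one unit of work so that every outcome is measured.
    public void handleBid(Runnable work) {
        final Timer.Context timing = latency.time();
        try {
            work.run();
            transactions.mark();
        } catch (RuntimeException e) {
            exceptions.mark();
            throw e;
        } finally {
            timing.stop();
        }
    }
}
```

The details will differ in a real handler; the point is only that each key metric gets its own named stream in the registry.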
CloudWatch has convenient support for ingesting custom metrics. Once your metric is in CloudWatch, you can autoscale from it.
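A minimal sketch of shipping one such value, assuming the AWS SDK for Java: the namespace and metric name are illustrative, and in practice you would push something like the Meter's one-minute rate on a periodic schedule rather than call this by hand.

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
import com.amazonaws.services.cloudwatch.model.StandardUnit;

public class CloudWatchShipper {
    private final AmazonCloudWatch cloudwatch = AmazonCloudWatchClientBuilder.defaultClient();

    // Push a single TPS sample into a custom namespace.
    public void shipTps(double transactionsPerSecond) {
        MetricDatum datum = new MetricDatum()
                .withMetricName("TransactionsPerSecond") // illustrative name
                .withUnit(StandardUnit.Count)
                .withValue(transactionsPerSecond);

        cloudwatch.putMetricData(new PutMetricDataRequest()
                .withNamespace("Bidder")                 // illustrative namespace
                .withMetricData(datum));
    }
}
```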
At AdRoll we saw, in the transition from CPU scaling to TPS scaling, a 15-percent reduction in machine instances without degradation in service. Why? It turned out that 15 percent was the "roughness" factor of the CPU metric for our system.
While reducing machine counts is cool, the stronger benefit was reorienting the system's provisioning around something directly aligned with expert understanding. It became possible to talk as a team about autoscaling's behavior in terms of our shared understanding of the domain.
Aggregate Custom Metric
Autoscaling on the back of single-stream telemetry is great but not without an issue. Notice in the previous section I said “key metrics” and listed three. Which should you choose? Let’s walk through a thought exercise.
Say the target for autoscaling is 1000 TPS per machine. Below 500 TPS per machine, the autoscale system will remove instances from the autoscale group; above 1500 TPS per machine, it will add them. What happens if the system reports a transaction only when its processing raised no exceptions? Failed transactions still consume machine resources but never show up in the metric, so you under-report "real" transactions: the group scales in further than it should, and the system ends up under-provisioned.
This gets especially hairy if your error rate climbs toward 100 percent of traffic: reported TPS falls toward zero, and autoscaling will attempt to scale the group in to the minimum allowed, precisely when the system is in trouble. Not great.
Now consider the case where there are no errors, but processing latency grows. Rising latency lowers the measured TPS and has a similar end result to the previous thought experiment: the group scales in just as the system is struggling to keep up. Also not great.
What can be done? As a first step, you might consider autoscaling based on multiple custom metrics. This is possible to do, but I don't advise it, for two reasons. Most important, I think, is that a multi-metric autoscale policy makes the group's behavior difficult to reason about and to communicate. "Why did the group scale?" is a very important question, one which should be answerable without elaborate deduction.
Of secondary concern, multiple scaling streams can issue contradictory scaling decisions. In our second thought experiment, if we were to scale up on increased latency, the system could pogo between latency-driven scale-out and TPS-driven scale-in, one fighting the other.
As system domain experts, we know what the key metrics of the system are and how they relate to one another. The trick here is to teach autoscaling this relationship. Here’s how you do it: Combine the metrics into a weighted sum and scale on that. Something like this:
Index = (1.0*TPS + 15.0*EPS) - 2.0*LPS
Exometer makes this particularly convenient: you need only write a function that performs the computation:

    -module(stat).
    -export([index/0]).

    %% Weighted sum of the key per-second metrics.
    index() ->
        (per_second(tps) + per_second(eps)) - per_second(lps).

    %% Convert exometer's one-minute datapoint into a weighted per-second rate.
    per_second(Metric) ->
        PerMinute = case exometer:get_value(Metric, one) of
                        {ok, [{one, PM}]} -> PM;
                        {error, not_found} -> 0.0
                    end,
        PerMinute * (1 / 60.0) * weight(Metric).

    weight(tps) -> 1.0;
    weight(eps) -> 15.0;
    weight(lps) -> 2.0;
    weight(_)   -> 1.0.
Once you have this in hand, configure exometer to periodically poll stat:index/0 and report the result to CloudWatch. Scale on the index, and you're in business.
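To close the loop, here is a sketch of what pointing a scaling policy at the custom index can look like, again assuming the AWS SDK for Java. It mirrors the CPU sketch from earlier but swaps in a customized metric specification; the namespace, metric name, group name, and target value are all invented for illustration.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.CustomizedMetricSpecification;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.TargetTrackingConfiguration;

public class IndexScalingPolicy {
    public static void main(String[] args) {
        AmazonAutoScaling autoscaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Track the per-machine index rather than CPU.
        TargetTrackingConfiguration tracking = new TargetTrackingConfiguration()
                .withCustomizedMetricSpecification(new CustomizedMetricSpecification()
                        .withNamespace("Bidder")             // illustrative namespace
                        .withMetricName("Index")             // illustrative metric name
                        .withStatistic("Average"))
                .withTargetValue(1000.0);                    // illustrative target

        autoscaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("bidder-asg")      // illustrative group name
                .withPolicyName("index-target-tracking")
                .withPolicyType("TargetTrackingScaling")
                .withTargetTrackingConfiguration(tracking));
    }
}
```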
An index in a production system with ten weighted factors would not be uncommon, and it’s very convenient to extend the index computation. If you measure it, you can put it in the index.
Conclusion
Scaling based on a weighted sum has every advantage of the single-stream telemetry approach from the last section — efficiency of allocation, efficiency of communication, and simplicity of reasoning — with none of its downsides. An index has the added advantage of being entirely an engineering artifact. Everyone can review the code which produces it, alter it based on evolving need, and gain insight into metrics of central importance to the global health of the system in one sum.
Reference: Autoscaling on Complex Telemetry from our JCG partner Brian Troutwine at the Codeship Blog.