Microservices for Java Developers: Microservices and Fallacies of Distributed Computing
1. Introduction
The journey of implementing a microservice architecture inherently implies building a complex distributed system. Fairly speaking, most real-world software systems are far from simple, but the distributed nature of microservices amplifies the complexity a lot.
In this part of the tutorial we are going to talk about some of the common traps many developers fall into, known as the fallacies of distributed computing. We should not let these false assumptions mislead us, and we are going to spend a fair amount of time talking about different patterns and techniques for building resilient microservices.
Any complex system can (and will) fail in surprising ways … – https://queue.acm.org/detail.cfm?id=2353017
2. Local != Distributed
How many times have you been caught by surprise discovering that the invocation of a seemingly innocent method or function causes a storm of remote calls? Indeed, these days most frameworks and libraries hide the genuinely important details behind multiple levels of convenient abstractions, trying to make us believe there is no difference between local (in-process) and remote calls. But the truth is, the network is not reliable and network latency is not equal to zero.
Although most of our topics are going to be centered on the traditional request/response communication style, asynchronous message-driven microservices are not hassle-free either. You still have to reach remote brokers and be ready to deal with idempotency and message de-duplication.
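To give the de-duplication part some shape, below is a minimal, hedged sketch of a consumer that remembers which message identifiers it has already processed. The Message type and the in-memory set are placeholders for illustration only; a real implementation would use a durable store shared across instances.

import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

private final Set<UUID> processed = ConcurrentHashMap.newKeySet();

public void onMessage(final Message message) {
    // Skip deliveries we have already seen: brokers may redeliver the same message
    if (!processed.add(message.getId())) {
        return;
    }
    // ... actual processing goes here ...
}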
3. SLA
We are going to start off with the service-level agreement, or simply SLA. It is a very often overlooked subject, but each service in your microservices ensemble had better have one defined. Defining it is a difficult and thoughtful process, unique to the nature of the service in question, and it should take into account a lot of different constraints.
Why is it so essential? First of all, it gives the development team a certain level of freedom in picking the right technology stack. And secondly, it hints to the consumers of the service what to expect in terms of response times and availability (so the consumers can derive their own SLAs).
In the next sections we are going to discuss a number of techniques the consumers (which are often other services themselves) may use to protect themselves from the instability or outages of the services they depend upon.
4. Health Checks
Is there a quick way to check that the service is up and running, even before embarking on a potentially complex business transaction? Health checks are the standard practice for a service to report its readiness to take on work.
All the services of the JCG Car Rentals platform expose the health check endpoints by default. Below, the Customer Service is picked to showcase the health check in action:
$ curl http://localhost:18800/health

{
    "checks": [
        {
            "data": {},
            "name": "db",
            "state": "UP"
        }
    ],
    "outcome": "UP"
}
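Under the hood, a database check like the one above takes only a few lines. The sketch below is a minimal illustration using the MicroProfile Health API, which produces responses of this shape; it is an assumption for illustration purposes and not necessarily how the actual Customer Service wires it up.

import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;
import javax.sql.DataSource;

import org.eclipse.microprofile.health.Health;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;

@Health
@ApplicationScoped
public class DatabaseHealthCheck implements HealthCheck {
    @Inject
    private DataSource dataSource;

    @Override
    public HealthCheckResponse call() {
        // Report the "db" check as UP only if a connection can actually be obtained
        try (java.sql.Connection connection = dataSource.getConnection()) {
            return HealthCheckResponse.named("db").up().build();
        } catch (final Exception ex) {
            return HealthCheckResponse.named("db").down().build();
        }
    }
}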
As we are going to see later on, health checks are actively used by infrastructure and orchestration layers to probe the service, alert, and/or apply compensating actions.
5. Timeouts
When one party calls another one over the wire, configuring the proper timeouts (connection, read, write, request, …) is probably the simplest yet most effective strategy to use. We have already seen it in the previous part of the tutorial; here is just a short reminder.
final CompletableFuture customer = client
    .setRequestTimeout(500)
    .setReadTimeout(100)
    .execute()
    .toCompletableFuture();
When the other side is unresponsive, or communication channels are unreliable, waiting indefinitely in the hope that a response may finally come in is not the best option. Now, the question is: what should the timeouts be set to? There is no single magic number which works for everyone, but the service SLAs we have discussed earlier are the key source of information to answer this question.
Great, so let us assume the right values are in place, but what should the consumer do if the call to the service times out? Unless the consumer no longer cares about the response, the typical strategy in this case is to retry the call. Let us talk about that for a moment.
6. Retries
From the consumer perspective, retrying the request to the service in case of intermittent failures is the easiest thing to do. For these purposes, libraries like Spring Retry, Failsafe or Resilience4j are of great help, offering a range of retry and back-off policies. For example, the snippet below demonstrates the Spring Retry approach.
final SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(5);

final ExponentialBackOffPolicy backOffPolicy = new ExponentialBackOffPolicy();
backOffPolicy.setInitialInterval(1000);
backOffPolicy.setMaxInterval(20000);

final RetryTemplate template = new RetryTemplate();
template.setRetryPolicy(retryPolicy);
template.setBackOffPolicy(backOffPolicy);

final Result result = template.execute(new RetryCallback<Result, IOException>() {
    public Result doWithRetry(RetryContext context) throws IOException {
        // Any logic which needs retry here
        return ...;
    }
});
Besides these general-purpose ones, most libraries and frameworks have their own built-in idiomatic mechanisms to perform retries. The example below comes from the Spring Reactive WebClient we touched upon in the previous part of the tutorial.
final WebClient client = WebClient
    .builder()
    .clientConnector(new ReactorClientHttpConnector(httpClient))
    .baseUrl("http://localhost:8080/api/customers")
    .build();

final Mono<Customer> customer = client
    .get()
    .uri("/{uuid}", uuid)
    .retrieve()
    .bodyToMono(Customer.class)
    .retryBackoff(5, Duration.ofSeconds(1));
The importance of a back-off policy, rather than fixed delays, should not be neglected. Retry storms, better known as the thundering herd problem, often cause outages since all the consumers may decide to retry the request at the same time.
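Adding some randomness (jitter) to the back-off helps to spread the retries of different consumers over time. As a quick sketch, Spring Retry ships a randomized variant of the exponential back-off policy which could be dropped into the earlier example; the exact intervals below are illustrative only.

final ExponentialRandomBackOffPolicy backOffPolicy = new ExponentialRandomBackOffPolicy();
backOffPolicy.setInitialInterval(1000);   // start around 1 second ...
backOffPolicy.setMultiplier(2.0);         // ... roughly doubling on each attempt ...
backOffPolicy.setMaxInterval(20000);      // ... but never waiting longer than 20 seconds

template.setBackOffPolicy(backOffPolicy); // reuses the RetryTemplate from the snippet above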
And last but not least, one serious consideration when using any retry strategy is idempotency: preventive measures should be taken on both the consumer side and the service side to make sure there are no unexpected side effects.
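A common way to make non-idempotent calls safe to retry is to attach a unique idempotency key to the request, so the service can detect and discard duplicates. The sketch below is purely illustrative: the header name, the request payload and the reservations endpoint are assumptions, not part of the JCG Car Rentals APIs.

// The same key is carried by every retry attempt, allowing the service to de-duplicate
final String idempotencyKey = UUID.randomUUID().toString();

final Mono<Reservation> reservation = client
    .post()
    .uri("/reservations")
    .header("Idempotency-Key", idempotencyKey)
    .bodyValue(request)
    .retrieve()
    .bodyToMono(Reservation.class)
    .retryBackoff(3, Duration.ofSeconds(1));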
7. Bulkheading
The concept of the bulkhead is borrowed from the shipbuilding industry and has found its direct analogy in software development practices.
Bulkheads are used in ships to create separate watertight compartments which serve to limit the effect of a failure – ideally preventing the ship from sinking. – https://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html
Although we are building software, not ships, the main idea stays the same: minimize the impact of failures in the application, ideally preventing it from crashing or becoming unresponsive. Let us discuss a few scenarios where bulkheading manifests itself, especially in microservices.
The Reservation Service, part of the JCG Car Rentals platform, might be asked to retrieve all reservations for a particular customer. To do that, it first consults the Customer Service to make sure the customer exists, and in case of a successful response, fetches the available reservations from the underlying data store, limiting the results to the first 20 records.
@Autowired private WebClient customers;
@Autowired private ReservationsByCustomersRepository repository;
@Autowired private ConversionService conversion;

@GetMapping("/customers/{uuid}/reservations")
public Flux<Reservation> findByCustomerId(@PathVariable UUID uuid) {
    return customers
        .get()
        .uri("/{uuid}", uuid)
        .retrieve()
        .bodyToMono(Customer.class)
        .flatMapMany(c -> repository
            .findByCustomerId(uuid)
            .take(20)
            .map(entity -> conversion.convert(entity, Reservation.class)));
}
The conciseness of the Spring Reactive stack is amazing, isn’t it? So what could be the problem with this code snippet? It all depends on repository, really. If the call is blocking, a catastrophe is about to happen, since the event loop is going to be blocked as well (remember, the Reactor pattern). Instead, the blocking call should be isolated and offloaded to a dedicated pool (using subscribeOn).
return customers
    .get()
    .uri("/{uuid}", uuid)
    .retrieve()
    .bodyToMono(Customer.class)
    .flatMapMany(c -> repository
        .findByCustomerId(uuid)
        .take(20)
        .map(entity -> conversion.convert(entity, Reservation.class))
        .subscribeOn(Schedulers.elastic()));
Arguably, this is just one example of bulkheading: using dedicated thread pools, queues, or processes to minimize the impact on the critical parts of the application. Deploying and balancing over multiple instances of the service, isolating tenants in multitenant applications, prioritizing request processing, harmonizing resource utilization between background and foreground workers: this is just a short list of interesting challenges you may run into.
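Taking the thread pool flavor of bulkheading one step further, each blocking dependency could get its own bounded pool, so that a saturated data store exhausts only its own threads. A minimal sketch, assuming Project Reactor; the pool size is arbitrary, and such a scheduler would replace Schedulers.elastic() in the snippet above.

import java.util.concurrent.Executors;

import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

// A dedicated, bounded pool just for repository calls: if it saturates, only this
// dependency suffers while the event loop and other dependencies keep running
private final Scheduler repositoryScheduler =
    Schedulers.fromExecutorService(Executors.newFixedThreadPool(10));

// ... and in the reactive flow above:
//     .subscribeOn(repositoryScheduler)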
8. Circuit Breakers
Awesome, so we have learned about retry strategies and bulkheading, and we know how to apply these principles to isolate failures and progressively get the job done. However, that is not really our goal: we have to stay responsive and fulfill the SLA promises. And even if you do not have any, responding within a reasonable time frame is a must. The circuit breaker pattern, popularized by Michael Nygard in the terrific and highly recommended Release It! book, is what we really need.
A circuit breaker implementation could get quite sophisticated, but we are going to focus on its two core features: the ability to keep track of the status of the remote invocation and to use a fallback in case of failures or timeouts. There are quite a few excellent libraries which provide circuit breaker implementations. Besides Failsafe and Resilience4j, which we have mentioned before, there are also Hystrix, Apache Polygene and Akka. Hystrix is probably the best known and most battle-tested circuit breaker implementation as of today, and it is the one we are going to use as well.
Getting back to our Reservation Service, let us take a look at how Hystrix could be integrated into the reactive flow.
public Flux<Reservation> findByCustomerId(@PathVariable UUID uuid) {
    final Publisher<Customer> customer = customers
        .get()
        .uri("/{uuid}", uuid)
        .retrieve()
        .bodyToMono(Customer.class);

    final Publisher<Customer> fallback = HystrixCommands
        .from(customer)
        .eager()
        .commandName("get-customer")
        .fallback(Mono.empty())
        .build();

    return Mono
        .from(fallback)
        .flatMapMany(c -> repository
            .findByCustomerId(uuid)
            .take(20)
            .map(entity -> conversion.convert(entity, Reservation.class)));
}
We have not tuned any Hystrix configuration in this example but if you are curious to learn more about the internals, please check out this article.
The usage of circuit breakers not only helps the consumer to make smart decisions based on operational statistics, it also may help the service provider to recover from intermittent load conditions faster.
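For comparison, here is a rough sketch of what an equivalent, non-reactive circuit breaker could look like with Resilience4j; the thresholds, as well as the fetchCustomer and fallbackCustomer helpers, are hypothetical and only illustrate the pattern.

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

final CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open the circuit once half of the calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30)) // how long to stay open before probing again
    .build();

final CircuitBreaker circuitBreaker = CircuitBreaker.of("get-customer", config);
final Supplier<Customer> decorated =
    CircuitBreaker.decorateSupplier(circuitBreaker, () -> fetchCustomer(uuid));

try {
    return decorated.get();
} catch (final Exception ex) {
    return fallbackCustomer(); // fallback when the circuit is open or the call fails
}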
9. Budgets
Circuit breakers, along with sensible timeouts and retry strategies, help your service to deal with failures, but they also eat into your service's SLA budget. It is absolutely possible that by the time the service has finally gotten all the data it needs to assemble the final response, the other side is not interested anymore and has dropped the connection long ago.
This is a difficult problem to solve, although there is one quite straightforward technique to apply: consider calculating the approximate time budget the service has left while progressively fulfilling the request. Going over the budget should be the exception rather than the rule, but when it happens, you are well prepared to cut off the throwaway work.
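To make the idea concrete, here is a hedged sketch of budget tracking: the overall 500 ms figure, the surrounding method and the downstream call are all illustrative, not part of the JCG Car Rentals code.

import java.time.Duration;

final long budgetMs = 500;             // overall budget for this request, e.g. derived from the SLA
final long started = System.nanoTime();

// ... some work already done: validation, a first downstream call, etc. ...

final long elapsedMs = Duration.ofNanos(System.nanoTime() - started).toMillis();
final long remainingMs = budgetMs - elapsedMs;
if (remainingMs <= 0) {
    throw new IllegalStateException("Request budget exhausted, giving up early");
}

// The next downstream call is never allowed to wait longer than the remaining budget
final Mono<Customer> customer = client
    .get()
    .uri("/{uuid}", uuid)
    .retrieve()
    .bodyToMono(Customer.class)
    .timeout(Duration.ofMillis(remainingMs));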
10. Persistent Queues
This is somewhat obvious, but if your microservice architecture is built using asynchronous message passing, the queues which store the messages or events must be persistent (and, very desirably, replicated). Most of the message brokers we have discussed previously support durable persistent storage out of the box, but there are special cases where you may be trapped.
Let us get back to the example of sending the confirmation email upon successful customer registration, which we implemented using asynchronous CDI 2.0 events.
customerRegisteredEvent
    .fireAsync(new CustomerRegistered(entity.getUuid()))
    .whenComplete((r, ex) -> {
        if (ex != null) {
            LOG.error("Customer registration post-processing failed", ex);
        }
    });
The problem with this approach is that the event queuing happens entirely in memory. If the process crashes before the event gets delivered to the listeners, it is lost forever. Perhaps in the case of a confirmation email it is not a big deal, but the issue is still there.
For the cases when the loss of such events or messages is not acceptable, one of the options is to use a persistent in-process queue, like for example Chronicle Queue. But in the long run, using a dedicated message broker or data store might be a better choice overall.
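Another simple option is to persist the event in the service's own data store before firing it, so that an undelivered event can be re-published after a crash. The sketch below is purely hypothetical: pendingEvents and its methods are placeholders, not part of the actual Customer Service.

final CustomerRegistered event = new CustomerRegistered(entity.getUuid());
pendingEvents.save(event);                        // durable write first, so the event survives a crash

customerRegisteredEvent
    .fireAsync(event)
    .whenComplete((r, ex) -> {
        if (ex == null) {
            pendingEvents.markDelivered(event);   // delivered, no need to re-publish it later
        } else {
            LOG.error("Customer registration post-processing failed", ex);
        }
    });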
11. Rate Limiters
One of the unpleasant but unfortunately very realistic situations you should prepare your services for is dealing with abusive clients. We would exclude purposely malicious clients and DDoS attacks, since those require sophisticated mitigation solutions. But bugs do happen, and even internal consumers may go wild and try to bring your service to its knees.
Rate limiting is an efficient technique to control the rate of requests from a particular source and shed the load when the limits are violated.
Although it is possible to bake rate limiting into each service (using, for example, Redis to coordinate all service instances), it makes more sense to offload such a responsibility to API gateways and orchestration layers. We are going to get back to this topic in more detail later in the tutorial.
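For the in-process flavor, here is a rough sketch using the Resilience4j RateLimiter; it only protects a single instance (no Redis coordination), and the limits, the limiter name and the findReservations helper are illustrative assumptions.

import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;

final RateLimiterConfig config = RateLimiterConfig.custom()
    .limitRefreshPeriod(Duration.ofSeconds(1)) // the window the limit applies to
    .limitForPeriod(50)                        // at most 50 calls per window
    .timeoutDuration(Duration.ZERO)            // fail immediately instead of queueing
    .build();

final RateLimiter rateLimiter = RateLimiter.of("reservations", config);
final Supplier<List<Reservation>> restricted =
    RateLimiter.decorateSupplier(rateLimiter, () -> findReservations(uuid));

// Throws RequestNotPermitted once the limit for the current window is exhausted
final List<Reservation> reservations = restricted.get();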
12. Sagas
Let us forget about individual microservices for a moment and look at the big picture. The typical business flow is usually a multistep process and relies on several microservices to successfully do their part. The reservation flow which JCG Car Rentals implements is a good example of that. There are at least three services involved in it:
- the Inventory Service has to confirm the vehicle availability
- the Reservation Service has to check that the vehicle is not already booked and make the reservation
- the Payment Service has to process the charges (or refunds)
The flow is a bit simplified, but the point is that every step may fail for a variety of reasons. The traditional approach monoliths take is to wrap everything in a huge all-or-nothing database transaction, but that is not going to work here. So what are the options?
One of them is to use distributed transactions and the two-phase commit protocol, with all the complexity and scalability issues they bring to the table. Another approach, more aligned with the microservice architecture, is to use sagas.
A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule then the saga executes a series of compensating transactions that undo the changes that were made by the preceding local transactions. – https://microservices.io/patterns/data/saga.html
It is very likely that you will need to rely on sagas while implementing business flows spanning multiple microservices. Axon and Eventuate Tram Saga are two examples of frameworks which support sagas, but the chances of ending up in a DIY situation are high.
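In the DIY spirit, here is a deliberately simplified, framework-free sketch of an orchestrated reservation saga. The service facades, method names and the synchronous style are assumptions for illustration; real sagas coordinate local transactions through messaging, as the definition above describes.

import java.util.ArrayDeque;
import java.util.Deque;

final Deque<Runnable> compensations = new ArrayDeque<>();
try {
    inventory.reserveVehicle(vehicleId);
    compensations.push(() -> inventory.releaseVehicle(vehicleId));

    reservations.book(customerId, vehicleId);
    compensations.push(() -> reservations.cancel(customerId, vehicleId));

    payments.charge(customerId, amount);
} catch (final Exception ex) {
    // A step has failed: undo the already completed steps, most recent first
    compensations.forEach(Runnable::run);
    throw ex;
}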
13. Chaos
At this point it may look like building microservices is a fight against chaos: anything anywhere could break and you have to deal with it somehow. In some sense it is true, and this is probably why the discipline of chaos engineering was born.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – https://principlesofchaos.org/
The goal of chaos engineering is not to crash the system but to make sure that the mitigation strategies work and to reveal problems, if any. In the part of the tutorial dedicated to testing we are going to spend some time discussing fault injection, but if you are curious to learn more right away, please check out this great introductory article.
14. Conclusions
In this part of the tutorial we have talked about the importance of thinking about and mitigating failures while implementing a microservice architecture. The network is unreliable, and staying resilient and responsive should be among the core guiding principles followed by each microservice in your fleet.
We have covered a set of generally applicable techniques and practices, but this is just the tip of the iceberg. Advanced approaches like, for example, Java GC pause detection or load balancing were left out of the scope of our discussion; however, the upcoming parts of the tutorial will dive into some of those.
To run a bit ahead, it is worth mentioning that a lot of concerns which used to be the responsibility of the applications are moving into the infrastructure or orchestration layers. Still, it is valuable to know that such problems exist and how to deal with them.
15. What’s next
In the next part of the tutorial we are going to talk about security and secret management, exceptionally important topics in the age when everything is deployed into the public cloud.