Designing a real-time platform in a microservices ecosystem
Microservices here, microservices there… There is a lot of knowledge sharing and plenty of best practices out there.
In this post, I will share my real experience using microservices under real-time requirements, which add significant challenges to a distributed architecture.
What would be the definition of "real-time"? I consider real-time to be a request/response cycle that takes no longer than 50ms. Strictly speaking that's not real-time – it's closer to "near real time" – but it still has to be fast. Add to this a million requests per second – now we are talking, eh?
The challenges required of us: 1. Low latency. 2. High scale. When you combine those two requirements you understand that you are in trouble. Why? Because most solutions out there (queue providers, frameworks, databases, etc.) focus either on low-latency response or on high scale; having both together creates more challenges. I will add one more requirement to this equation – 3. Persistence. Every transaction must be tracked and handled; whether it succeeds or fails, it cannot be lost.
Let's assume you have an API gateway service that receives HTTP requests. Those requests are processed and communicated internally between multiple services. Once the "magic" is done, a response must be returned to the invoker as soon as possible.
First challenge -> Being fast. The system must respond ASAP. Think about that – you get a request, and the response must arrive within a few dozen milliseconds or you will be timed out. It feels like someone is constantly chasing you with a knife and never lets go. Our solution? We separate the flow in the system into two flows: a real-time flow and an offline flow (detailed later on). The real-time flow is designed to be fast, using non-blocking frameworks, databases like Redis, and an optimized queuing system to dispatch requests as soon as possible. Additionally, we leverage asynchronous frameworks so that nothing blocks or delays the flow. When we detect such an incident, we act immediately to release the bottleneck or redesign it.
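To make the idea concrete, here is a minimal sketch of a non-blocking handler using plain Java CompletableFuture. The 50ms budget comes from the definition above; the thread pool size, method names, and the enrich/dispatch steps are hypothetical and only illustrate the pattern, not our actual code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a non-blocking real-time flow: nothing in the request path blocks
// the calling thread, so one slow step cannot stall the gateway.
public class RealTimeHandler {

    private final Executor pool = Executors.newFixedThreadPool(16); // size is illustrative

    public CompletableFuture<String> handle(String requestId) {
        return CompletableFuture
                .supplyAsync(() -> enrichFromCache(requestId), pool) // e.g. a Redis lookup
                .thenApplyAsync(this::dispatchToQueue, pool)         // publish the internal event
                .orTimeout(50, TimeUnit.MILLISECONDS)                // the real-time budget
                .exceptionally(ex -> "TIMEOUT:" + requestId);        // never leave the caller hanging
    }

    private String enrichFromCache(String requestId) { return requestId + ":enriched"; }

    private String dispatchToQueue(String enriched) { return enriched + ":dispatched"; }
}
```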
Marking messages – Every time we process a request, each service "marks" for itself that the message was processed. This gives us idempotence capabilities (usages will be detailed later on).
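As an illustration of such a mark (a sketch, assuming a Redis-backed store accessed via the Jedis client; the key prefix and TTL are made up):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Sketch of "marking" a message so the same event is never processed twice.
public class IdempotencyMarker {

    private final Jedis redis = new Jedis("localhost", 6379); // address is illustrative

    /** Returns true only the first time this messageId is seen by this service. */
    public boolean markIfFirst(String serviceName, String messageId) {
        String key = "processed:" + serviceName + ":" + messageId; // hypothetical key scheme
        // SET ... NX EX: succeeds only if the key does not exist yet,
        // and expires after 24h so the store does not grow forever.
        String result = redis.set(key, "1", SetParams.setParams().nx().ex(24 * 60 * 60));
        return "OK".equals(result);
    }
}
```

A service would call markIfFirst before applying its business logic; if it returns false, the event was already handled and can be acknowledged without reprocessing.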
We measure our real-time flow all the time. We have a CI/CD system, which makes our life easier, that deploys versions to a separate LTE environment to spot latency bottlenecks. We measure constantly.
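As an example of what this measurement can look like at the code level (a sketch, assuming Micrometer; the metric name is invented):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Sketch of timing every request in the real-time flow.
// In practice the registry would be the one wired to your monitoring backend.
public class LatencyMetrics {

    private final MeterRegistry registry = new SimpleMeterRegistry();
    private final Timer requestTimer = registry.timer("realtime.request.latency");

    public String timedHandle(String requestId) {
        return requestTimer.record(() -> process(requestId)); // records the duration of each request
    }

    private String process(String requestId) { return requestId + ":done"; }
}
```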
We make business decisions together with our customers to determine what should be in the real-time flow and what shouldn't. For example, if a transaction needs to be cancelled (for any reason), does the customer want it cancelled right away, or is it OK if we let them know after processing the events via offline reports? Business decisions can help a lot to minimize complexity – always listen to your project's demands. I discuss this topic in another post: https://www.linkedin.com/pulse/business-oriented-programming-your-key-rapid-idan-fridman/
Second challenge -> High scale. We designed each of our services to scale from day one by keeping them stateless and recoverable. We created a backoff mechanism (details are out of this post's scope) which tolerates high peaks. When a service starts to feel it is in "trouble", it slows down its event consumption in order not to die. Once the max threshold is reached, an alert is raised and an auto-scaling process creates a new instance right away. Once the backoff mechanism feels things are "back to normal", it lets the service consume messages normally again.
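The mechanism itself is out of scope here, but a toy version of the idea – pausing a Kafka consumer when too many events are in flight and resuming once the backlog drains – could look like this (thresholds, topic name, and the alert hook are made up):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustration of a backoff-style consumer: it stops fetching when it is in
// "trouble" and resumes once it is "back to normal", instead of dying.
public class BackoffConsumer {

    private static final int MAX_IN_FLIGHT = 500; // "trouble" threshold (illustrative)
    private static final int RESUME_BELOW  = 100; // "back to normal" threshold (illustrative)

    private final AtomicInteger inFlight = new AtomicInteger();

    public void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("transactions")); // hypothetical topic
        boolean paused = false;
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                inFlight.incrementAndGet();
                process(record);
            }
            if (!paused && inFlight.get() > MAX_IN_FLIGHT) {
                consumer.pause(consumer.assignment()); // stop fetching but keep the instance alive
                paused = true;                         // an alert / auto-scaling hook would fire here
            } else if (paused && inFlight.get() < RESUME_BELOW) {
                consumer.resume(consumer.paused());    // back to normal consumption
                paused = false;
            }
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // placeholder: real code would hand the record to a worker pool and
        // decrement inFlight only when that work completes
        inFlight.decrementAndGet();
    }
}
```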
Third challenge -> Persistence. I wrote previously about our real-time flow; now I will detail the offline one. The offline flow is more "relaxed". It is responsible for persisting the data into different data sources to ensure each transaction is persisted. Relaxed, from our perspective, means that messages can be consumed slowly, in bigger batches, when no timeout is "chasing" them. We tuned our Kafka topics on that flow to be less "real-time" in order to maximize our resources. We use different data sources to denormalize our data for reporting and statistics queries. Additionally, the offline flow persists each transaction's data to allow future cancellations and status checks. The separation of these flows gives us more flexibility to handle events according to their specific requirements.
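As an example of what "less real-time" tuning can mean for the offline consumers (a sketch; the broker address, group id, and values are illustrative, not our production settings):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Illustrative configuration for the "relaxed" offline flow:
// bigger batches and longer waits trade latency for throughput.
public class OfflineConsumerFactory {

    public static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offline-persisters");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);    // consume in bigger batches
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1 << 20);  // wait for ~1MB of data per fetch
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);    // but never wait longer than 500ms
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // commit only after persisting
        return new KafkaConsumer<>(props);
    }
}
```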
Inter-communication. Our services are based on a choreography architecture, which makes each service an event subscriber. Events flow through the system and each service consumes the events targeted at it. This enables us to easily maintain old and new business logic side by side, and to make modifications easily by versioning the topics and the events themselves. I am a fan of small, decoupled, single-concern domain services. I always encourage splitting and merging services as the business evolves. I believe architecture is something that grows with the business and must be constantly maintained. This gives us a flexible architecture that speeds things up, since there are no blocking "god" services around.
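To illustrate the choreography and topic-versioning idea (the topic names, versions, and the service are invented for the example):

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each service subscribes only to the (versioned) topics it is targeted by,
// so old and new event versions can coexist while consumers migrate.
public class BillingServiceSubscriptions {

    public static void subscribe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of(
                "order-created.v1",     // legacy producers still publish here
                "order-created.v2",     // new schema, consumed side by side
                "payment-authorized.v1" // only the events this service cares about
        ));
    }
}
```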
We have an event expiration mechanism. There is no point in processing an event which is "too old" and is a candidate for being timed out. In case an event is considered timed out, it is forwarded to error processing logic.
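A sketch of such an expiration check (the 50ms budget follows the definition at the top of the post; the method names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the expiration gate: an event older than the time budget is not
// worth processing and is routed to error handling instead of silently dropped.
public class EventExpirationFilter {

    private static final Duration MAX_AGE = Duration.ofMillis(50); // illustrative budget

    public void process(Instant eventCreatedAt, Runnable businessLogic, Runnable errorProcessing) {
        if (Duration.between(eventCreatedAt, Instant.now()).compareTo(MAX_AGE) <= 0) {
            businessLogic.run();   // still within the time budget
        } else {
            errorProcessing.run(); // "too old" -> forward to error processing logic
        }
    }
}
```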
Supervising transactions. Since each of our transactions must be persisted, we need to supervise them (while still ensuring we don't add extra latency). In the real world, events can get lost. Why? A consumer does not function as it is supposed to, the queue provider is down or lagging, a machine dies, etc. That's why we must supervise each transaction using an external component, while keeping the architecture flexible enough that the supervisor does not become a bottleneck. How do we do that? Every time a new request comes in, we send a "copy" to our almighty Supervisor service. The supervisor listens to "Start" and "End" transaction events. In case a "Start" was consumed but never ended, the supervisor's tracker logic recognizes the scenario and notifies the system: "Hey, transaction XYZ never ended and will be considered a candidate for a timed-out transaction." Then we take an action -> cancel or retry, depending on the scenario.
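A sketch of the tracker idea – an in-memory version for illustration only (the timeout, sweep interval, and the notification hook are made up; a real supervisor would persist its state):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the supervisor's tracker: it receives copies of Start/End events
// and periodically flags transactions that started but never ended.
public class TransactionSupervisor {

    private static final Duration TIMEOUT = Duration.ofSeconds(30); // illustrative

    private final Map<String, Instant> openTransactions = new ConcurrentHashMap<>();
    private final ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();

    public TransactionSupervisor() {
        sweeper.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.SECONDS);
    }

    public void onStart(String transactionId) {
        openTransactions.put(transactionId, Instant.now());
    }

    public void onEnd(String transactionId) {
        openTransactions.remove(transactionId); // completed, nothing left to supervise
    }

    private void sweep() {
        Instant now = Instant.now();
        openTransactions.forEach((id, startedAt) -> {
            if (Duration.between(startedAt, now).compareTo(TIMEOUT) > 0) {
                openTransactions.remove(id);
                notifyTimedOut(id); // -> cancel or retry, depending on the scenario
            }
        });
    }

    private void notifyTimedOut(String transactionId) {
        System.out.println("Transaction " + transactionId + " never ended; candidate for time-out.");
    }
}
```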
Retrying transactions. Retrying is a complicated topic. Things you need to consider: When do you stop retrying? When is it worth retrying? How do you distinguish "retry-able" events from the rest? What happens if we retry an already completed state (idempotency)? Now add the technology barrier on top of this and it becomes a real challenge. We created libraries that enable each service to retry by itself. We pushed this library as part of our infrastructure, which lets each service stay focused on its business logic and have this capability out of the box. We believe in the method: "Each service is on its own." Meaning: I take care of myself, nobody else needs to. A service understands when it should retry, how many times, and when to give up. Additionally, a service knows how to roll back to its previous state – but then what does that mean with respect to the other services and the transaction? This is a broad topic which I will detail in a separate post.
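To give a feel for what such a library can expose (a simplified sketch, not our actual library; the retryable-check, attempt count, and backoff are parameters the service owns):

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Sketch of a shared retry helper: each service gets "retry yourself"
// behaviour out of the box and stays focused on its business logic.
public class Retryer {

    public static <T> T callWithRetry(Supplier<T> call,
                                      Predicate<Exception> isRetryable,
                                      int maxAttempts,
                                      long initialBackoffMs) throws Exception {
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();     // the call is assumed idempotent (see "marking messages")
            } catch (Exception e) {
                last = e;
                if (!isRetryable.test(e)) {
                    throw e;           // not worth retrying, fail fast
                }
                Thread.sleep(backoff); // give the downstream side time to recover
                backoff *= 2;          // exponential backoff between attempts
            }
        }
        throw last;                    // gave up after maxAttempts
    }
}
```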
So we talked about real-time platform challenges and a way to solve them, and added more tips on how to treat failures and challenging scenarios within a microservices ecosystem. Each topic is broad by itself, but I hope this gives you a jumpstart when you face these challenges. I will be happy to answer questions in the comments.
The next posts will demonstrate in detail how we wrote our retry libraries and the backoff mechanism, leveraging our consumer methodology to keep things fast and tolerant to failures.
Best,
Idan