New Technique Cuts Microservice Latency

TL;DR: A technique called adaptive hedging reduces high-percentile (p99) latency in complex microservice systems by re-issuing requests that are running slower than expected. The author reports p99 latency reductions of up to 74%. The system relies on real-time latency monitoring and a budget on duplicate requests to avoid overwhelming services.
Key facts
- Category: Infrastructure
- Impact: High
- Published:
- Source: InfoQ
Full summary
A new "adaptive hedging" technique reportedly reduces high-percentile latency in microservice architectures by up to 74 percent.
In complex microservice architectures, where a single user action triggers requests across many services, overall system performance is often dictated by the slowest responses. These "straggler" requests, which are slow but eventually complete, inflate tail latency, commonly tracked as p99 latency: the response time that the slowest 1 percent of requests exceed. A new technical paper details a method called "adaptive hedged requests" to combat this problem. The approach combines three pieces: DDSketch, a quantile sketch used to estimate latency percentiles in real time; a windowed scheme that lets those estimates adapt to changing performance patterns; and a token-bucket mechanism that acts as a budget, preventing the system from being overloaded with duplicate requests. The author reports this technique can reduce p99 latency by as much as 74 percent.
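The three pieces described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: LogBucketSketch is a simplified stand-in for DDSketch (it keeps the log-spaced buckets but omits bucket collapsing), and the names WindowedQuantiles, TokenBucket, should_hedge, and the 95th-percentile threshold are all hypothetical choices made for the example.

```python
import math
from collections import deque


class LogBucketSketch:
    """Simplified DDSketch-style sketch: log-spaced buckets, relative accuracy alpha."""

    def __init__(self, alpha=0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.counts = {}  # bucket index -> number of samples
        self.total = 0

    def add(self, value_ms):
        # Bucket i covers (gamma^(i-1), gamma^i]; value_ms must be positive.
        index = math.ceil(math.log(value_ms) / self.log_gamma)
        self.counts[index] = self.counts.get(index, 0) + 1
        self.total += 1

    def quantile(self, q):
        rank = q * (self.total - 1)
        seen = 0
        for index in sorted(self.counts):
            seen += self.counts[index]
            if seen > rank:
                # Representative value within alpha of everything in the bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("no samples recorded")


class WindowedQuantiles:
    """Keeps the last few sketches so estimates follow recent behaviour."""

    def __init__(self, windows=3, alpha=0.01):
        self.alpha = alpha
        self.sketches = deque([LogBucketSketch(alpha)], maxlen=windows)

    def add(self, value_ms):
        self.sketches[-1].add(value_ms)

    def rotate(self):
        # Called periodically by the owner; the oldest window falls off when full.
        self.sketches.append(LogBucketSketch(self.alpha))

    def quantile(self, q):
        merged = LogBucketSketch(self.alpha)
        for sketch in self.sketches:
            for index, count in sketch.counts.items():
                merged.counts[index] = merged.counts.get(index, 0) + count
            merged.total += sketch.total
        return merged.quantile(q)


class TokenBucket:
    """Hedging budget: each duplicate request costs one token."""

    def __init__(self, rate_per_s, capacity):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_s = 0.0

    def try_take(self, now_s):
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now_s - self.last_s) * self.rate)
        self.last_s = now_s
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def should_hedge(elapsed_ms, latencies, budget, now_s, q=0.95):
    """Hedge only when the request has outlived the q-quantile and budget remains."""
    return elapsed_ms > latencies.quantile(q) and budget.try_take(now_s)


# Usage: record observed latencies, then decide whether to send a duplicate.
latencies = WindowedQuantiles()
for ms in range(1, 101):
    latencies.add(float(ms))
budget = TokenBucket(rate_per_s=1.0, capacity=2)
hedge_now = should_hedge(200.0, latencies, budget, now_s=0.0)
```

Time is passed in explicitly (now_s) so the budget is easy to test; a real client would likely read a monotonic clock instead. Because the threshold check runs before the token check, a request that is still within normal latency never spends budget.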
This is significant for any organization running large-scale distributed systems. High p99 latency directly translates to a poor user experience for a meaningful number of customers, which can lead to frustration, abandonment, and lost revenue. For developers and Site Reliability Engineers (SREs), managing this tail latency is a persistent and difficult challenge. Adaptive hedging offers a data-driven way to improve responsiveness, while its request budget limits the risk of adding load to services that are already struggling. For CTOs and engineering leaders, it is a practical strategy for making platform performance more consistent.
Primary source: InfoQ