MANAGING RETRY STORMS AND METASTABLE FAILURES IN MICROSERVICE ARCHITECTURES

Isaev D.
DOI - 10.32743/UniTech.2025.139.10.20946

 

ABSTRACT

Microservice architectures, characterized by highly decoupled and independently scalable services, have become a standard approach to building large-scale distributed systems. However, the reliance on network-based interactions between services introduces significant risks, particularly when failures occur and clients initiate multiple retries. While retry mechanisms can mitigate transient errors, they may inadvertently trigger catastrophic “retry storms,” induce overload, and lead to metastable failure states. This article offers a rigorous examination of core reliability concepts—from exponential backoff and jitter to more advanced techniques such as retry circuit breakers and retry budgets—highlighting their theoretical underpinnings and practical implications. Mathematical models, including Little’s Law, elucidate how systems with partial or absent feedback loops can spiral into sustained outages. In addition, the article discusses load shedding and deadline propagation as complementary strategies. Empirical simulations validate the effectiveness of these techniques by measuring metrics such as load amplification, recovery time, and error rates. The findings underscore the critical balance between maintaining high availability for transient issues and preventing large-scale systemic failures. This work thus serves as a guide for engineers and researchers seeking robust, evidence-based methods to improve fault tolerance in microservice ecosystems.


 

Keywords: microservices, reliability, retry mechanisms, exponential backoff, retry budget, circuit breaker, metastable failure, distributed systems.


 

Introduction

In modern distributed systems, especially those built according to the principles of microservice architectures, ensuring reliability and fault tolerance has taken on paramount importance [1]. Microservices are characterized by a high degree of interconnectivity and dynamically changing loads [2], which creates additional risks of cascading failures and metastable states. At the same time, retry mechanisms used to recover from short-term errors can both increase reliability and trigger “retry storms,” potentially causing a significant deterioration in the availability of a service or the entire system [3].

The relevance of this issue stems from the fact that many real-world failures are transient in nature: individual network glitches, brief spikes in latency, or hardware faults on a single node can often be resolved by a simple retry. However, in the case of more prolonged problems, such as incorrect releases, misconfigurations, or hardware failures, the system risks entering an overloaded state. Each failed request attempt provokes a new wave of retries, and the already burdened server receives an even larger flow of incoming requests [4]. In some scenarios, once the initial “trigger” cause has been removed, the system is unable to recover automatically due to the accumulated requests and increased traffic, indicating a metastable state [5].

The goal of this work is to investigate and consolidate modern methods of managing retries in microservice architectures so as to maintain high availability in the face of transient errors, while avoiding overloads and prolonged outages during serious failures. To achieve this goal, it is necessary to address several tasks:

  1. Analyze the basic concepts related to retries, including exponential backoff and jitter, and assess their effectiveness and limitations.
  2. Examine mechanisms that limit the total number of retries (such as a retry circuit breaker and a retry budget) and demonstrate how they help reduce “storm-like” overloads.
  3. Investigate supplementary approaches (deadline propagation, load shedding on the server side, etc.) and evaluate their impact on the overall system recovery time.
  4. Confirm the conclusions with empirical results via simplified simulations and identify the optimal combinations of these techniques.

This research thus contributes to the field of improving reliability in distributed systems, showing that a “simple” implementation of retries can lead to severe consequences if the risks of overload are not taken into account. The combination of exponential backoff, jitter, and mechanisms for limiting retries (retry budget, retry circuit breaker) can significantly reduce the likelihood of metastable states, while still providing a rapid response to transient failures [6]. The findings and recommendations presented here can be applied in the development and operation of microservice platforms of various scales and complexities.

Materials and Methods

In a microservice architecture, each functional subsystem is relatively small and communicates with others over a network [2]. This setup introduces additional risks: a failure or degradation in one service can cascade and affect all dependent components. The nature of failures is heterogeneous: some errors are short-lived (transient)—for example, network anomalies, latency spikes, or isolated hardware malfunctions [4]. Such transient problems are often resolved automatically, which justifies the use of retry mechanisms. However, if a failure is systemic and persists for a considerable time, uncontrolled retries may exacerbate the situation.

In microservice-based systems, reliability is frequently expressed via SLI, SLO, and SLA metrics (Service Level Indicators/Objectives/Agreements), which specify a target level of successful responses [1]. When an error is short-lived, retries help achieve the agreed level of availability by recovering from transient failures. However, during prolonged outages, an unrestricted increase in the number of retries creates an excessive load that may push the system into a so-called metastable failure state, where, even after the root cause is resolved, the system cannot return to normal operation without external intervention [4].

A “retry storm” arises when numerous clients receive errors simultaneously and, following their logic, issue additional requests. In open-loop systems, the outgoing request rate does not adapt to the actual state of the service. Thus, if the service stops responding, clients—unaware of the real overload—continue to send requests at the same or even greater volume [5]. Consequently, the service, already at its limit, becomes overloaded by a new wave of traffic.

In closed-loop systems, by contrast, there is feedback. For example, limiting the number of simultaneous threads (1 request = 1 thread) prevents new requests if all threads are already busy waiting for responses. This negative feedback loop automatically reduces the risk of collapse [1]. In practice, however, distributed systems are seldom “purely” closed; their behavior depends on client logic, timeout settings, resource limits, and so on [2].
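
As an illustration of such a feedback loop, the minimal sketch below caps the number of in-flight requests with a semaphore; the do_network_request call is a placeholder for the actual client, and the limit value is arbitrary. When the downstream service slows down, callers block on the semaphore instead of generating additional traffic.

import threading

MAX_IN_FLIGHT = 100  # illustrative cap on concurrent downstream requests
in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def call_with_concurrency_cap():
    # Blocks while all "slots" are occupied: slow responses automatically
    # throttle the caller, which is the negative feedback described above.
    with in_flight:
        return do_network_request(...)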

Exponential backoff is a classic way to “spread out” load over time. It involves increasing the pause between retries according to a geometric progression, usually capped by a maximum interval [3]. In the case of short-term errors, this approach gives the service a chance to “recover” and avoid an immediate crash under a simultaneous wave of requests. Yet, for prolonged failures, exponential backoff merely delays the peak load—once the service recovers, all postponed retries arrive again, provoking a peak surge in requests [4].

Many implementations also introduce jitter—randomization of the delay—to reduce client synchronization, which can produce oscillatory RPS (Requests per Second) spikes. A common Full Jitter scheme chooses a random delay between 0 and the current backoff value [3].

To limit the total number of retries during extended service problems, certain mechanisms are used:

● Retry circuit breaker: The client stops retrying if the error rate exceeds a defined threshold (e.g., 10%). When the metric returns to normal, retries resume. This method offers a quick reaction to a critical level of failures. However, during partial degradations (affecting only a subset of users), a threshold set low enough to react may trip and disable retries for all traffic, while a threshold set too high provides little protection [7].

● Retry budget: This approach allocates a “budget” for retries—for example, 10% of the number of successful requests. If the client has completed 100 successful operations in one minute, it is allowed 10 retry attempts. This ensures that during widespread failures, overall load does not exceed the normal rate by more than 10% [4]. Such a method requires careful tracking of successful/failed calls but fits well with the “lightweight” client library philosophy.
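
One possible formulation of a retry budget is a token bucket in which every successful request earns a fraction of a token and every retry spends a whole one. The sketch below is illustrative; the helper names and constants are not taken from any specific library, and thread safety and windowed accounting (for example, per-minute counters) are omitted for brevity.

MAX_TOKENS = 10     # upper bound on accumulated retry "credit"
TOKEN_RATIO = 0.1   # each success earns 10% of a token

tokens = MAX_TOKENS

def on_success():
    global tokens
    tokens = min(MAX_TOKENS, tokens + TOKEN_RATIO)

def try_spend_retry_token():
    # A retry is allowed only if a full token is available, so retries
    # never exceed roughly 10% of the successful request rate.
    global tokens
    if tokens >= 1:
        tokens -= 1
        return True
    return False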

A request circuit breaker, in contrast, rejects all incoming requests when errors exceed a certain threshold, effectively “zeroing out” the load and giving the system time to recover [7]. Nevertheless, during partial failures (for instance, one shard out of five goes down), this method can be excessive: healthy streams are also blocked [1]. An alternative or complementary approach is load shedding, in which the server discards requests that clearly cannot be served correctly (e.g., if the queue is too long or the request’s deadline has expired). This reduces wasted load but does not fully resolve the problem of “garbage” retries if the client continues generating them [5].
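
A server-side load shedding check can be as simple as the sketch below, where the queue-length limit, the reject helper, and the response codes are hypothetical placeholders rather than the API of any particular framework.

MAX_QUEUE_LENGTH = 1000  # illustrative threshold

def maybe_shed(request, queue_length):
    # Discard work that cannot be served correctly anyway: either the backlog
    # is already too long, or the client's deadline has expired.
    if queue_length > MAX_QUEUE_LENGTH:
        return reject(request, code=OVERLOADED)
    if request.deadline_ms is not None and now_ms() >= request.deadline_ms:
        return reject(request, code=DEADLINE_EXCEEDED)
    return handle(request)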

Deadline propagation involves having the client set a maximum allowable response time, which is passed along to the server (and can even propagate further along the chain of microservices). The server may reject the request immediately if it sees that the time has already elapsed or periodically check the remaining deadline and terminate processing to avoid unnecessary resource consumption [7]. This technique reduces the volume of “expired” operations but does not necessarily spare the system from the overall load of retries if the client persists in initiating them [4].
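
The following sketch illustrates deadline propagation under the assumption of a hypothetical x-deadline-ms header and helper functions (now_ms, call_downstream, Response); real RPC frameworks typically carry the deadline in their own metadata.

DEFAULT_TIMEOUT_MS = 500

def client_call(payload):
    # The client fixes an absolute deadline and attaches it to the request.
    deadline_ms = now_ms() + DEFAULT_TIMEOUT_MS
    return do_network_request(payload, headers={"x-deadline-ms": str(deadline_ms)})

def server_handle(request):
    deadline_ms = int(request.headers["x-deadline-ms"])
    if now_ms() >= deadline_ms:
        return Response(code=DEADLINE_EXCEEDED)  # the caller has already given up
    # Pass the remaining budget further down the call chain.
    return call_downstream(request.payload, deadline_ms=deadline_ms)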

Queueing analysis and load evaluation in distributed systems often rely on Little’s Law, which states that the average number of requests in the system (L) equals the product of the arrival rate (λ) and the average time a request spends in the system (W) [5]. Practically speaking, if request processing time grows due to backoff or internal overload, the system may reach saturation, causing new requests to wait indefinitely [2].

In the context of retries, Little’s Law clarifies the mechanism of “automatic” load reduction in closed-loop systems. When delays (backoff plus waiting for a response) are large, the number of “busy” clients grows to some limit, and the system stops accepting new requests until some clients finish [4]. In open-loop systems, however, no such safeguard exists: λ remains constant, and the “retry storm” intensifies the situation.
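
A quick numeric illustration of this effect, with arbitrarily chosen figures, shows how growing latency inflates the number of requests held inside the system:

arrival_rate_rps = 1000        # lambda: incoming requests per second
avg_time_in_system_s = 0.05    # W: average time a request spends in the system

# Little's Law: L = lambda * W
print(arrival_rate_rps * avg_time_in_system_s)   # 50 requests in flight on average

# If overload and backoff push W to 2 seconds, the same arrival rate
# requires 2000 concurrent requests; a pool of, say, 500 worker threads
# would saturate long before that point.
print(arrival_rate_rps * 2.0)                    # 2000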

To assess the effectiveness of these mechanisms, simulation models are often employed [3]. Clients run in a virtual environment with configurable retry counts, delays, and jitter, while the server simulates response delays or errors within a specified timeframe. Performance metrics such as peak RPS, the degree of load amplification (the increase relative to the baseline load), and time to full recovery following a failure are then recorded. Although simplified, these models quickly validate hypotheses and compare different retry management algorithms over a wide range of parameters [2].
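
As a rough indication of what such a simulation looks like, the discrete-time sketch below (not the simulator used for the figures in this article) models clients that send one request per second and retry up to three times with a crude one-second backoff; the outage window and client count are arbitrary.

SIM_SECONDS = 10
CLIENTS = 100
MAX_RETRIES = 3
OUTAGE = range(2, 4)   # the server fails every request during seconds [2, 4)

def server_ok(t):
    return t not in OUTAGE

def simulate():
    rps = [0] * SIM_SECONDS
    for t in range(SIM_SECONDS):
        for _ in range(CLIENTS):
            attempts = 0
            while attempts <= MAX_RETRIES:
                slot = min(t + attempts, SIM_SECONDS - 1)  # 1-second backoff per retry
                rps[slot] += 1
                attempts += 1
                if server_ok(slot):
                    break
    return rps

print(simulate())   # per-second request counts; the outage window shows the load amplification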

Hence, the theoretical underpinnings of retry management in microservice systems rest on several core principles: (1) distinguishing short-term from prolonged failures, (2) understanding the roles of open-loop versus closed-loop designs with respect to feedback, (3) employing exponential backoff and jitter to mitigate synchronized retry waves, (4) implementing limiting mechanisms (retry circuit breaker and retry budget) to prevent “storms” and accelerate recovery, and (5) using server-side techniques (load shedding, deadline propagation, request circuit breaker) as supplementary tools. Mathematical reasoning, including Little’s Law and classical queueing models, provides insights into the dynamics of load growth and the possibilities for limiting it. Based on these theoretical considerations, the subsequent practical section explores concrete simulations and code examples that illustrate system behavior and confirm the efficacy of the described approaches.

Results and Discussion

Within the system under discussion, a “client” is defined as a backend service sending requests to another backend service (the “server” or “service”). The observations and experiments were carried out in a system composed of multiple microservices that send requests to each other.

The initial motivation for this research came from encountering transient errors such as timeouts when one service called another. The idea arose to introduce retries, under the assumption that the API methods were idempotent. However, it quickly became apparent that there was a risk of “retry storms,” in which a struggling service is further overloaded by a surge of retries, hampering recovery. Exponential backoff was thus proposed as the first step to mitigate the load from repeated requests. Below is an example of exponential backoff code.

import time

MAX_RETRY_COUNT = 3
MAX_DELAY_MS = 1000
DELAY_BASE_MS = 50

def do_request_with_retries():
    # do_network_request, OK and Error stand in for the service client being wrapped.
    attempt_count = 0
    max_attempt_count = MAX_RETRY_COUNT + 1

    while True:
        result = do_network_request(...)
        attempt_count += 1

        if result.code == OK:
            return result.data
        if attempt_count == max_attempt_count:
            raise Error(result.error)

        # Exponential backoff: the pause doubles after every failed attempt,
        # capped at MAX_DELAY_MS.
        delay = min(DELAY_BASE_MS * pow(2, attempt_count), MAX_DELAY_MS)
        time.sleep(delay / 1000)  # delay is in milliseconds

To validate the hypothesis that exponential backoff was beneficial, simulation-based testing was used. This method models client and server behavior in a simplified manner yet allows for quick assessments of various algorithms. In one such simulation (fig. 1), clients send a request and wait for a response with a 100 ms timeout, while the server deliberately returns 100% errors in the interval [0.5s; 1s] to simulate being “down.” Each client makes up to three retries upon error. Without exponential backoff, the server’s request rate (RPS) quadrupled at the moment when retries kicked in. With exponential backoff, the surge was still present but with a noticeably reduced amplitude (fig. 2).

 

Figure 1. Server load amplification

 

Figure 2. Server load amplification with exponential backoff

 

However, the effectiveness of exponential backoff also depends on whether the system is open-loop or closed-loop. In an open-loop setup, the request flow remains constant, regardless of server load. In more realistic scenarios, there are feedback mechanisms such as a thread limit (1 request = 1 thread), which prevents new requests if all threads are busy. This kind of “closed-loop” system stops “excess” new requests, allowing exponential backoff to be more effective. This hypothesis was tested in another simulation (fig. 3), which introduced a cap on the number of active clients (those waiting for a response or sleeping between retries). As a result, the benefit of exponential backoff became more pronounced.

 

Figure 3. Closed-loop system

 

Another significant effect observed in the simulations was client synchronization. If all clients receive errors at the same moment, then wait for the same delay, they all resend requests in unison, producing “waves” of load. To mitigate this, jitter is added—a random component in the sleep duration between attempts. One popular implementation is called Full Jitter, where the delay is chosen uniformly between 0 and the nominal backoff time. The example code is:

import random   # in addition to the imports from the previous example

# Same constants and request/attempt-counting logic as in the
# exponential backoff example above.

while True:
    ...  # same code as in the exponential backoff example

    delay = min(DELAY_BASE_MS * pow(2, attempt_count), MAX_DELAY_MS)
    delay = random.uniform(0, delay)   # Full Jitter: uniform in [0, current backoff]
    time.sleep(delay / 1000)           # delay is in milliseconds

Simulations confirm that jitter spreads out request spikes and leads to a more uniform server load (fig. 4). The effect is even more pronounced when CPU headroom is low or when many clients are making retries (fig. 5).

 

Figure 4. Jitter and client synchronization

 

Figure 5. Jitter and client synchronization with less CPU headroom

 

Although exponential backoff and jitter can safely resolve short-lived issues and lessen the risk of a “storm,” long and severe outages can produce lasting consequences. In one real-world system incident, a microservice release caused widespread errors, and rolling back the release did not restore normal operation. The entire backend remained at 100% CPU until traffic was drastically reduced. This phenomenon is known as a metastable failure state (MFS): once the initial cause is removed (the failed release), the system continues to perform poorly instead of returning to normal. Retries, even with exponential backoff and jitter, were identified as a possible contributing factor, because they amplify load whenever errors occur.

This leads to the question of why exponential backoff alone cannot eliminate load amplification. Analyses and diagrams showed that delayed retries simply push the peak load further in time. If the service outage window is extended, eventually all postponed retries still arrive and cause a spike. This was illustrated in another simulation (fig. 6), where the service downtime was lengthened: the same magnitude of load amplification was observed, only later in time.

 

Figure 6. Exponential backoff delays the retries

 

The fundamental challenge with retries is this: once the service is healthy again (for instance, after rolling back a failed release), it could handle the normal request flow immediately, if that were the only load present. However, a backlog of retries arrives in addition to normal traffic. The service is then overwhelmed, fails to respond in time, and triggers even more retries. Without retries, the system would have recovered as soon as the trigger was removed, but with retries, the “tail” of additional requests extends the time to recovery. Further simulations confirmed (fig. 7) that any retries—even with exponential backoff—prolong the recovery period.

 

Figure 7. Retries slow down recovery

 

One possible reaction is to eliminate retries altogether, but that degrades reliability when occasional transient errors occur. Hence, the goal is to enable retries only when the service is healthy, and to minimize or disable them when the service is in trouble. Two main techniques achieve this:

• Retry circuit breaker: The client halts retries entirely if the error rate from the server exceeds a configured threshold, e.g., 10%. Once the server’s error rate drops below that threshold, retries resume. A minimal sketch of this policy appears after this list.

• Retry budget (or adaptive retries): A fixed fraction of additional requests (e.g., 10% of successful requests) is allocated as a “budget” for retries. This ensures that if the server fails, it will not receive more than 10% extra load from retries.
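
The sketch below tracks the outcomes of the last hundred requests in a sliding window and permits retries only while the observed error rate stays under the threshold; the window size and threshold are illustrative, and production implementations would add per-endpoint scoping and thread safety.

import collections

ERROR_RATE_THRESHOLD = 0.1   # stop retrying above 10% errors
WINDOW = 100                 # number of recent responses to consider

recent_results = collections.deque(maxlen=WINDOW)

def record(result_ok):
    recent_results.append(result_ok)   # True for success, False for failure

def retries_allowed():
    # The "retry breaker" opens when the recent error rate is too high and
    # closes again once the error rate falls back below the threshold.
    if not recent_results:
        return True
    error_rate = 1 - sum(recent_results) / len(recent_results)
    return error_rate < ERROR_RATE_THRESHOLD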

Both methods keep extra load on the failing server within a defined limit. To compare these techniques, simulations were performed (fig. 8) alongside others such as simple retries with exponential backoff or no retries at all. The results confirmed that both the retry circuit breaker and retry budget effectively reduced load amplification and hastened recovery, although each has specific trade-offs. For instance, a circuit breaker can completely turn off traffic if a certain percentage of users see errors, thus depriving users who might otherwise receive a successful response. Meanwhile, a retry budget caps surplus traffic at a fixed percentage. Popular open-source libraries (AWS SDK, gRPC, etc.) implement similar concepts (e.g., “HasRetryQuota”). They also typically use exponential backoff and jitter in combination with the budget. Importantly, global statistics are not required; local monitoring of success/failure on each client is sufficient, assuming the client is long-lived.

 

Figure 8. Retry budget vs retry circuit breaker

 

Discussions on whether to use a retry circuit breaker or a retry budget also addressed server-side protections (load shedding), but if a server crashes or lacks resources, it cannot effectively shed load. Hence, allowing a “thick client” with logic that limits retries can be justified. Another alternative, sometimes called a request circuit breaker (as opposed to a retry circuit breaker), was tested. It may cut off all requests if an error threshold is reached, but in partial failures (e.g., a single shard outage), it can unnecessarily affect all users. A higher error threshold can avoid this, but reduces protection against load amplification. Thus, a retry budget remains a more flexible solution.

Finally, “deadline propagation” is another relevant mechanism. In this pattern, the client sets a maximum response time, and the server periodically checks whether the request’s deadline is already exceeded. If it is, the server terminates processing immediately. The rationale is that this avoids performing needless work if the client has given up. In additional experiments (fig. 9), deadline propagation was shown to shorten recovery time and keep queue lengths lower, even though it can increase “amplification” by allowing more rapid resubmissions (the server quickly discards requests that have timed out from the client’s perspective). Consequently, deadline propagation does not replace limiting the total number of retries but remains a useful complement.

 

Figure 9. Deadline propagation

 

In conclusion, while the seemingly straightforward practice of “just add some retries for transient errors” can be helpful in certain cases, it poses a serious risk in major outages, as retries can perpetuate metastable failure states and significantly lengthen recovery times. Exponential backoff and jitter are beneficial but do not resolve all issues, because they merely spread or delay load without eliminating it. Techniques such as a retry budget or retry circuit breaker (combined with exponential backoff and jitter) are necessary to prevent retry-driven “storms” and to manage the extra load during recovery. Moreover, server-side strategies—like load shedding and deadline propagation—provide additional safeguards. In real implementations, such as AWS SDK or gRPC, some form of adaptive retry limit is already in use, with local counters proving sufficient for success/failure tracking. Overall, balancing reliability (for intermittent or short outages) against the danger of massive overload (for severe incidents) requires careful tuning of retries. The simulations, sample code, and concepts presented here confirm that adopting methods like retry budgets, retry circuit breakers, and deadline propagation—alongside proper delay management—significantly reduces the risk of avalanche-like load growth and shortens the system’s recovery time in the event of serious failures.

Conclusion

This study underscores the complexity inherent in managing reliability through retry mechanisms in microservice-based architectures. By examining key theoretical constructs—ranging from exponential backoff and jitter to more advanced approaches such as retry budgets and circuit breakers—it becomes clear that naive implementations of retries can lead to severe overload scenarios or even metastable failure states. Empirical simulations provide evidence that balancing feedback mechanisms, selective retry policies, and server-side tactics (such as load shedding and deadline propagation) enhances the system’s ability to recover from prolonged disruptions. At the same time, these techniques must be adapted to the specific operational and architectural constraints of each environment, given the variability in service behaviors, latencies, and network conditions. Future work could explore more granular, adaptive retry algorithms integrated with real-time monitoring and autoscaling, pushing the boundaries of what is feasible for large-scale distributed systems. Ultimately, the insights provided herein demonstrate that thoughtful, empirically validated strategies for managing retries play a pivotal role in advancing the reliability and resilience of modern microservices.

 

References:

  1. Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site reliability engineering: How Google runs production systems. O'Reilly Media.
  2. Kleppmann, M. (2017). Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O'Reilly Media.
  3. Brooker, M. (2015). Exponential backoff and jitter. AWS Architecture Blog. Retrieved from https://aws.amazon.com/ru/blogs/architecture/exponential-backoff-and-jitter/
  4. Wang, Y., Kadiyala, H., & Rubin, J. (2021). Promises and challenges of microservices: An exploratory study. Empirical Software Engineering, 26(4), 63.
  5. Little, J. D. C. (1961). A proof for the queuing formula: L = λW. Operations Research, 9(3), 383–387.
  6. Patterson, D. A., & Hennessy, J. L. (1994). Computer organization and design. Morgan Kaufmann.
  7. Nygard, M. (2018). Release it!: Design and deploy production-ready software. Pragmatic Bookshelf.
Information about the author

Engineering Director at Yandex Cloud, Russia, Moscow
