Introduction
Have you ever watched a system fail, then fail harder because it tried to recover?
That’s a retry storm. It is a pattern where retries multiply load, amplify tail latency, and turn a partial failure into a wider outage. Instead of “trying again” being a small safety net, it becomes an accelerating feedback loop.
I care about this one because it wastes the two things I can’t get back: user time and engineer attention. Users get errors and slowness. Engineers get paged and end up guessing under pressure.
What a retry storm is
A retry storm occurs when a large number of clients retry failing or slow requests at roughly the same time, creating a surge of traffic that worsens the underlying problem.
You will often see it as:
- A sudden spike in request rate or queue depth.
- Error rate rising, then rising faster.
- Latency percentiles (p95, p99) exploding.
- Downstream dependencies getting hammered in lockstep.
Retries are not the villain. Unbounded, synchronized, or poorly targeted retries are.
Imagine calling a busy restaurant that is already behind. If everyone redials over and over, the phone line gets jammed, the staff is interrupted, and the kitchen falls even further behind. Retries do the same thing to a dependency that is already struggling.
How retries turn into a storm
Retries create hidden multiplication, which then creates a feedback loop.
If a user request triggers retries across layers, you may be doing more work than expected.
Here is a tiny example in code. It looks reasonable at first glance.
    def call_dependency():
        # Imagine this sometimes fails or times out.
        raise TimeoutError("dependency slow")

    def naive_retry(call, attempts=3):
        last_exc = None
        for _ in range(attempts):
            try:
                return call()
            except TimeoutError as exc:
                last_exc = exc
        raise last_exc

    def handler():
        # One request. Three attempts.
        return naive_retry(call_dependency, attempts=3)

Imagine 1,000 clients hit your API simultaneously due to a cron job, cache expiry, or batch process. If a dependency is degraded, those requests could each make 3 attempts, totaling 3,000 attempts when the dependency is least able to handle them.
Increased load causes failures, which lead to more retries, further increasing load.
This gets worse when retries exist in multiple layers:
- A browser retries a fetch.
- The API gateway retries.
- The service retries the database call.
- The database client retries when the connection drops.
When layers stack retries, the multiplication compounds, and no single engineer can say how many requests actually hit the dependency; the sketch below shows how quickly the worst case grows.
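As a rough sketch of that multiplication (the layers and attempt counts here are made up for illustration), the worst case is the product of the attempts at each layer:

    # Hypothetical attempts per layer; real values depend on your stack's defaults.
    layers = {
        "browser fetch": 2,
        "API gateway": 3,
        "service -> database": 3,
    }

    worst_case_calls = 1
    for attempts in layers.values():
        worst_case_calls *= attempts

    # 2 * 3 * 3 = 18 database-level attempts for a single user action,
    # all arriving while the database is already struggling.
    print(worst_case_calls)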
Retry storms also pair well with a thundering herd: many clients wake up at once and all do the same work, often after a cache expires. If the recompute is slow or failing, the herd becomes a retry storm.
Storms cause saturation, leading to queued work and spikes in tail latency (p95, p99).
If you want the broader mental model for tails and saturation, see Fundamentals of Software Performance.
What makes retry storms more likely
Fixed retry delays (synchronization)
If every client retries exactly 100 milliseconds apart, they will retry together. This creates synchronized waves of load.
The usual antidote is randomness (jitter) plus a backoff curve so clients spread out rather than marching in lockstep. Jitter is a slight random offset added to the delay. AWS has clear guidance on this in AWS retries.
Retrying non-idempotent operations
Retries are much safer when the operation is idempotent, meaning doing it twice has the same effect as doing it once (or at least is safe).
Retrying “charge credit card” without idempotency can mean double charges. That is a different kind of incident, but it tends to surface during exactly these widespread failures.
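A minimal sketch of the usual mitigation, an idempotency key, where `charge_card` is a stand-in for the real payment call and the in-memory dict stands in for a durable store:

    import uuid

    # Stand-in for a durable store (database row, cache entry with a TTL).
    _processed: dict[str, dict] = {}

    def charge_card(amount_cents: int) -> dict:
        # Placeholder for the real payment call.
        return {"status": "charged", "amount_cents": amount_cents}

    def charge_once(idempotency_key: str, amount_cents: int) -> dict:
        # A retried request with the same key returns the stored result
        # instead of charging the card a second time.
        if idempotency_key in _processed:
            return _processed[idempotency_key]
        result = charge_card(amount_cents)
        _processed[idempotency_key] = result
        return result

    # The client generates the key once and reuses it on every retry.
    key = str(uuid.uuid4())
    charge_once(key, 1999)
    charge_once(key, 1999)  # Safe: same key, no second charge.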
Retrying on the wrong errors
Some failures are retryable. Some are not.
Blindly retrying on every error makes downstream outages worse. HTTP 429 or 503, for example, is the downstream explicitly telling you it is overloaded, while a 400 or 404 will fail the same way no matter how many times you resend it.
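A minimal sketch of that distinction; the status-code sets are illustrative and should follow your dependency's actual contract:

    # Transient failures: worth retrying (with backoff and a budget).
    RETRYABLE_STATUS = {429, 502, 503, 504}

    # The request itself is wrong: retrying cannot help.
    NON_RETRYABLE_STATUS = {400, 401, 403, 404, 422}

    def should_retry(status_code: int) -> bool:
        if status_code in NON_RETRYABLE_STATUS:
            return False
        # Default to not retrying unknown codes; opt in explicitly.
        return status_code in RETRYABLE_STATUS

    assert should_retry(503) is True
    assert should_retry(404) is False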
Retries without a budget
If each request makes up to 3 attempts and traffic doubles at the same time, the dependency does not see double the load; it can see six times the baseline, exactly when it is least able to absorb it.
A “retry budget” caps how much extra traffic retries may add on top of first attempts (for example, no more than 10%), which keeps the amplification bounded no matter how bad the failure gets.
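One way to enforce a budget, sketched with an illustrative 10% ratio (a real implementation would also decay the counters over a sliding time window):

    class RetryBudget:
        """Allow retries only while they stay under a fraction of first attempts."""

        def __init__(self, ratio: float = 0.1):
            self.ratio = ratio
            self.first_attempts = 0
            self.retries = 0

        def record_first_attempt(self) -> None:
            self.first_attempts += 1

        def can_retry(self) -> bool:
            # A retry is allowed only while retries stay under the budget.
            return self.retries < self.ratio * self.first_attempts

        def record_retry(self) -> None:
            self.retries += 1

    budget = RetryBudget(ratio=0.1)
    for _ in range(100):
        budget.record_first_attempt()

    allowed = 0
    while budget.can_retry():
        budget.record_retry()
        allowed += 1
    print(allowed)  # 10: retries add at most 10% on top of first attempts.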
What reduces retry storm risk
A handful of design choices prevent most of the common failure modes.
Timeouts that match reality
Timeouts are a contract. If they are too short, you create avoidable retries. If they are too long, you tie up resources and increase queueing.
When timeouts match real latency percentiles, you get fewer pointless retries and less resource waste.
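A minimal sketch of deriving a timeout from observed latency rather than a guess, assuming you have recent latency samples for the dependency (the 1.5x margin is an arbitrary choice for illustration):

    import statistics

    def timeout_from_samples(latencies_ms: list[float], margin: float = 1.5) -> float:
        """Set the timeout near the observed p99, plus a small safety margin."""
        # quantiles(..., n=100) returns the 1st..99th percentiles; index 98 is p99.
        p99 = statistics.quantiles(latencies_ms, n=100)[98]
        return p99 * margin

    # Illustrative samples: mostly fast, with a slow tail.
    samples = [20.0] * 950 + [80.0] * 45 + [200.0] * 5
    print(timeout_from_samples(samples))  # 120.0 ms, driven by the tail, not a guess.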
Exponential backoff with jitter
Exponential backoff delays retries to allow recovery, while jitter prevents retries from syncing.
If you want a concrete reference, see AWS retries and Exponential Backoff and Jitter.
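A minimal sketch of exponential backoff with “full jitter,” the variant the AWS post recommends; the base and cap values are placeholders:

    import random

    def backoff_full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
        """Sleep time in seconds before retry number `attempt` (0-based)."""
        # Exponential ceiling: base, 2*base, 4*base, ..., capped.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients spread out
        # instead of retrying in synchronized waves.
        return random.uniform(0, ceiling)

    for attempt in range(5):
        print(round(backoff_full_jitter(attempt), 3))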
Limit concurrency
Unbounded concurrency is how one slow dependency takes everything else down. Cap the number of in-flight operations per dependency, a pattern often called a bulkhead, so a slow dependency cannot consume every worker.
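A minimal sketch of a bulkhead using a semaphore as the cap; the limit of 10 and the fail-fast choice are illustrative:

    import threading

    # At most 10 in-flight calls to this dependency across all worker threads.
    _bulkhead = threading.BoundedSemaphore(10)

    class BulkheadFull(Exception):
        pass

    def call_with_bulkhead(call):
        # Fail fast rather than queueing more work behind a slow dependency.
        if not _bulkhead.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls to this dependency")
        try:
            return call()
        finally:
            _bulkhead.release()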
Add parallelism with care
Parallelism can cut latency, but it also adds load and can make tail behavior worse.
If you add concurrency, verify you did not create:
- A thundering herd.
- Retry storms.
- A shared bottleneck that now saturates faster.
Circuit breakers and fast failure
A circuit breaker stops sending traffic to a clearly unhealthy dependency. The caller fails fast instead of burning resources on requests that will not succeed, and the dependency gets room to recover.
It trips after a run of failures, then lets a few probe requests through to check whether the dependency has recovered.
The point isn’t to give up, but to stop worsening what is broken.
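A minimal sketch of that state machine; the failure threshold and cooldown are illustrative:

    import time

    class CircuitBreaker:
        """Closed -> open after repeated failures; half-open probe after a cooldown."""

        def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None

        def allow_request(self) -> bool:
            if self.opened_at is None:
                return True  # Closed: traffic flows normally.
            if time.monotonic() - self.opened_at >= self.reset_after_s:
                return True  # Half-open: let a probe through to test recovery.
            return False  # Open: fail fast, protect the dependency.

        def record_success(self) -> None:
            self.failures = 0
            self.opened_at = None  # Probe succeeded: close the breaker.

        def record_failure(self) -> None:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # Trip the breaker.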
Respect backpressure
When a downstream says “slow down”, you need a mechanism that actually slows down. Without one, you keep pushing work into a saturated system and retrying when it fails.
Backpressure includes rate limits, queue limits, load shedding, and signals like HTTP 429 with Retry-After.
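A minimal sketch of honoring that signal; the response shape here mirrors `requests`-style status codes and headers but is an assumption, and only the seconds form of Retry-After is handled:

    import random

    def delay_before_retry(status_code: int, headers: dict, attempt: int) -> float:
        # If the server said when to come back, believe it.
        # (Retry-After can also be an HTTP date; only the seconds form is handled here.)
        retry_after = headers.get("Retry-After")
        if status_code in (429, 503) and retry_after is not None:
            return float(retry_after)
        # Otherwise fall back to exponential backoff with full jitter.
        return random.uniform(0, min(10.0, 0.1 * (2 ** attempt)))

    print(delay_before_retry(429, {"Retry-After": "5"}, attempt=2))  # 5.0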
How retry storms show up in production
I look for a few “shapes” that show up together:
- Request rate rises while success rate falls.
- Latency percentiles spike before error rate spikes.
- Dependency traffic rises even though user traffic is flat.
- Repeated, similar errors in logs, often timeouts.
When tracing, the smoking gun is often multiple attempts within a single trace, especially when retries span layers.
Where retries hide (and why storms surprise teams)
When a storm hits, the hardest part is that retries are often invisible. They live in:
- HTTP client libraries.
- Service meshes.
- Load balancers and gateways.
- SDKs for cloud services.
The quickest way to cut confusion is to document retry behavior per layer.
- Which errors are retried?
- How many attempts?
- What delay pattern?
- Is there jitter?
- Is there a retry budget?
If you do not know where retries are happening, you cannot reason about load.
Where to go next
If you want adjacent concepts that make retry storms easier to reason about:
- Read Fundamentals of Software Performance, for saturation, queueing, and tail latency.
- Read The Tail at Scale for why tails dominate large distributed systems.
- If you are using a cloud SDK, read its retry docs and confirm defaults. AWS starts you off in AWS retries.
Closing
Retry storms aren’t mysterious. They are feedback loops built from a good intention, trying again, applied without limits. The aim isn’t to eliminate retries but to make them proportional and bounded, so recovery doesn’t turn into self-harm.
References
- AWS retries, for practical retry guidance and recommended backoff patterns.
- Exponential Backoff and Jitter, for why jitter matters and how synchronized retries cause bursts.
- The Tail at Scale, for why tail latency dominates large distributed systems and why compounding tails hurt at scale.
