Reliability

Fundamentals of Timeouts

Timeout fundamentals for software: why timeouts exist, how connection, read, and write timeouts differ, how to choose values, and how to avoid cascading failures in distributed systems.

What Is the Exponential Backoff Pattern?

Exponential backoff is a retry strategy that increases the wait time after each failed attempt, typically by doubling it. Learn why it exists, how it works, and when to combine it with jitter.

Fundamentals of Software Availability

Diagram showing availability workflow from redundancy through health checks to graceful degradation

Software availability explained: uptime metrics, redundancy patterns, health checks, and graceful degradation for keeping systems accessible.

What Is Jitter?

Diagram showing how small timing variation spreads retries over time instead of creating synchronized bursts.

Jitter is timing variation. Unintended jitter creates bursts and tail latency, while deliberate backoff jitter spreads retries out. Understand measured jitter, backoff jitter, and why it matters in retries.

What Is a Retry Storm?

Diagram showing a retry storm as a feedback loop of increasing load and tail latency.

Retry storm: when retries multiply load and turn partial failures into outages. Learn how they happen, how to detect them, and how to prevent them.

What Is a Thundering Herd?

Diagram showing a thundering herd as a synchronized wave of clients stampeding a shared bottleneck.

Thundering herd: when many clients do the same work at once and overload a dependency. Understand why it happens, what it looks like, and how to reduce risk.

What Is Backpressure?

Diagram showing backpressure as a signal from a downstream component to an upstream component to slow down.

Backpressure: a system’s way of saying “slow down” before overload turns into timeouts and retries. Understand why it matters and what signals it uses.

What Is Load Shedding?

Diagram showing load shedding as a way to keep the system in a controlled state under stress.

Load shedding rejects work during overload so systems stay usable. Learn why it matters, what it looks like, and how it prevents retry storms.

Fundamentals of Reliability Engineering

Diagram showing SLOs, error budgets, and reliability targets working together

Understand reliability engineering fundamentals: how to define SLOs and error budgets, design reliable systems, balance reliability with innovation, and make data-driven decisions about system reliability.

Fundamentals of Incident Management

Diagram showing incident lifecycle from detection through resolution with runbooks, alerts, and automation

Understand incident management fundamentals: how to respond effectively when systems fail, build runbooks that work, create actionable alerts, and prevent incidents before they happen.