Fundamentals of Software Availability


Diagram showing availability workflow from redundancy through health checks to graceful degradation

Software availability explained: uptime metrics, redundancy patterns, health checks, and graceful degradation for keeping systems accessible.

What Is a Retry Storm?


Diagram showing a retry storm as a feedback loop of increasing load and tail latency.

Retry storm: when retries multiply load and turn partial failures into outages. Learn how they happen, how to detect them, and how to prevent them.

What Is a Thundering Herd?


Diagram showing a thundering herd as a synchronized wave of clients stampeding a shared bottleneck.

Thundering herd: when many clients do the same work at once and overload a dependency. Understand why it happens, what it looks like, and how to reduce risk.

What Is Backpressure?


Diagram showing backpressure as a signal from a downstream component to an upstream component to slow down.

Backpressure: a system’s way of saying “slow down” before overload turns into timeouts and retries. Understand why it matters and what signals it uses.

What Is Load Shedding?


Diagram showing load shedding as a way to keep the system in a controlled state under stress.

Load shedding rejects work during overload so systems stay usable. Learn why it matters, what it looks like, and how it prevents retry storms.

Fundamentals of Reliability Engineering


Diagram showing SLOs, error budgets, and reliability targets working together

Understand reliability engineering fundamentals: how to define SLOs and error budgets, design reliable systems, balance reliability with innovation, and make data-driven reliability decisions.

Fundamentals of Incident Management


Diagram showing incident lifecycle from detection through resolution with runbooks, alerts, and automation

Understand incident management fundamentals: how to respond effectively when systems fail, build runbooks that work, create actionable alerts, and prevent incidents before they happen.