Introduction

Caching often improves performance quickly. It also introduces new failure modes and new correctness questions.

Adding a cache to a slow system reduces latency and backend load, but it can also introduce stale-data bugs, confusing miss behavior, load spikes when entries expire, and new dependency failures in production.

Caching is straightforward as a concept. It speeds up systems by avoiding redundant work. It also creates a second copy of the data, which has its own rules.

What this is (and isn’t): This article explains caching concepts, trade-offs, and failure modes, focusing on why caching works and why it sometimes backfires. It does not walk through a step-by-step setup for Redis, a content delivery network, or browser caching.

Why software caching fundamentals matter:

  • Lower latency - Serving from memory or nearby edge is faster than recomputing or reloading.
  • Higher throughput - Your expensive backend does less work per request.
  • Better resilience - A cache absorbs spikes and can keep serving during partial failures.
  • Fewer surprise outages - Understanding stampedes, staleness, and eviction reduces incident amplification.

Caching is a trade that works well when constraints are explicit.

I use a simple mental model when I evaluate a cache:

  1. Decide what “fresh enough” means.
  2. Pick a caching pattern that matches your write and read behavior.
  3. Plan for misses (cold start, eviction, invalidation).
  4. Design for predictable failure (stampede protection, fallbacks, observability).
Cover: Diagram showing requests flowing through cache hits and misses, with freshness checks and eviction.

Type: Explanation (understanding-oriented).
Primary audience: beginner to intermediate software engineers building services that read data more often than they change it.

Prerequisites & Audience

Prerequisites: Basic understanding of HTTP requests, databases, and what it means for data to be “stale”.

Primary audience: Engineers working on web services, APIs, and backend systems. Also useful if you own a system and want to monitor its performance or reliability.

Jump to: Why caching works · Freshness and correctness · Common caching patterns · Failure modes · When not to cache · Misconceptions · Observability · Glossary · References

TL;DR: Caching fundamentals in one pass

Caching trades repeated work for state management. The performance win is measurable. The added complexity shows up later as staleness, misses, and operational risk.

  • Caches are about avoiding work (computation, network calls, disk reads).
  • Freshness is a product requirement (not a technical detail).
  • Misses are guaranteed (cold start, eviction, invalidation).
  • A cache is part of reliability (it can fail, overload, and amplify incidents).

The caching workflow:

[FRESH ENOUGH?] → [PATTERN] → [MISS PLAN] → [FAILURE PLAN]

This is the basic flow:

flowchart LR
    R[Request] --> C{Cache has fresh value?}
    C -->|Yes| H[Return cached value]
    C -->|No| B[Read source of truth]
    B --> P[Populate cache]
    P --> S[Return value]

Learning outcomes

By the end of this article, you will be able to:

  • Explain why caching reduces latency and load, and why the benefit depends on locality.
  • Explain why freshness is the central design constraint (and why “cache invalidation” is really product semantics).
  • Describe why common patterns exist (cache-aside, read-through, write-through, write-back).
  • Explain why cache stampedes happen and how to reason about preventing them.
  • Explain why a cache can reduce or increase system availability depending on how you integrate it.

Section 1: Why caching works – Avoiding work

At a high level, caching is keeping a copy of something expensive so I can reuse it.

The “something expensive” is usually one of these:

  • A network trip to another service.
  • A database query with input/output (I/O) and locking.
  • A slow disk read.
  • A computation that is expensive and repeatable.

Caching happens at many layers, even when it is not labeled as such. CPU caches, the OS page cache, browser caches, DNS caches, CDN caches, and application caches all exist because reusing a nearby copy is cheaper than repeating the work.

Locality is the real engine

Caching only works when requests repeat in a way the cache can exploit. This is locality.

  • Temporal locality: The same thing is requested again soon.
  • Spatial locality: Related things are requested near each other (for example, nearby keys).

If your access pattern is genuinely random, a cache introduces overhead. That’s the workload’s reality, not a bug.

Hit rate is not a vanity metric

This simple model is useful:

$$ L_{avg} \approx (H \cdot L_{hit}) + ((1 - H) \cdot L_{miss}) $$

Where:

  • $L_{avg}$ is average latency.
  • $H$ is the hit rate (0 to 1).
  • $L_{hit}$ is the latency to serve from cache.
  • $L_{miss}$ is the latency on a miss (including the backend call and possibly populating the cache).

Two implications matter:

  • A cache with a “pretty good” hit rate can still lose if misses are catastrophically slow.
  • Improving hit rate is only valuable if it reduces the misses that matter.
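
To make the formula concrete, here is a minimal sketch with made-up numbers (the latencies are hypothetical):

def average_latency(hit_rate, hit_latency_ms, miss_latency_ms):
    # Weighted average from the formula above: hits are fast, misses pay
    # the full backend cost (plus cache population).
    return hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms


# Hypothetical numbers: even at a 95% hit rate, 200 ms misses pull the
# average to ~11 ms. Pushing the hit rate to 99% drops it to ~3 ms.
print(average_latency(0.95, 1, 200))  # 10.95
print(average_latency(0.99, 1, 200))  # 2.99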

Caching work pairs well with monitoring and observability. Track hit rate, hit latency, miss latency, and backend saturation together.

Section 2: Freshness and correctness – The price of a second reality

Caching is not about storage; it’s about meaning.

Caching claims: “This value is accurate enough for a while.” That claim is a product decision, whether or not anyone admits it.

A photo of a whiteboard is a helpful analogy; it’s easier to share than bringing someone into the room, but it can also be wrong if the board changes.

“Fresh enough” is a requirement, not an implementation detail

Early in the design, define the cost of being wrong for a bounded window (for example, 30 seconds): who is harmed, and how badly.

Some data tolerates staleness:

  • Blog content.
  • Product catalogs that change slowly.
  • Analytics dashboards.

Some data does not:

  • Money movement.
  • Permissions.
  • Inventory at the last item.

Caching sensitive data is possible, but the constraints are stricter. This is where fundamentals of software security and fundamentals of privacy and compliance apply directly to caching decisions.

Time-to-live is an opinion in seconds

Time-to-live (TTL) specifies how long cached data is considered valid.

TTL works well when:

  • Data changes slowly compared to reads.
  • Staleness is acceptable.
  • You can tolerate a minor inconsistency.

TTL breaks down when:

  • Updates need to be visible quickly.
  • A wrong answer is worse than a slow answer.
  • Data changes unpredictably, so there is no safe TTL.

How caches go stale (and why invalidation is hard)

“Cache invalidation” is notoriously hard because it bundles several problems: detecting that something changed, defining what “changed” means, and making the update visible quickly.

Here are three common ways staleness is managed:

  • Time-based expiry (TTL): The cache expires after a set time, which is robust and straightforward but imprecise. It can serve stale data for up to the TTL, and cause misses even when the data hasn’t changed.

  • Explicit invalidation: A write triggers a cache delete or update, which is precise but couples your write path to cache behavior. If invalidation fails, stale data persists until the TTL expires or someone corrects it manually.

  • Versioned keys: Instead of deleting keys, you move to a new key version when data changes, simplifying correctness as old values become unreachable. However, it increases cache churn and may raise memory pressure.

None of these is universally “best”. The choice depends on product semantics and operational constraints.
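
As a small illustration of versioned keys, here is a minimal sketch; where the version number lives (a plain in-process dict here, rather than a database column or a cache key) is an assumption for the example:

# Hypothetical version registry. In practice the version would live
# somewhere shared, such as a column on the row or a key in the cache.
profile_versions = {}


def profile_cache_key(user_id):
    version = profile_versions.get(user_id, 1)
    return f"user:{user_id}:profile:v{version}"


def on_profile_write(user_id):
    # Bump the version instead of deleting keys. Readers now build a new
    # key, so the old value is never read again and simply ages out.
    profile_versions[user_id] = profile_versions.get(user_id, 1) + 1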

Consistency vs performance: pick your pain consciously

Caching is an explicit trade between strict consistency and performance.

When strict correctness matters, either:

  • Avoid caching that data.
  • Couple cache visibility tightly to updates (invalidation, versioning, or transactional approaches).

When performance outweighs strict consistency, accept some staleness and design the product behavior around it.

Section 3: Common caching patterns – Why they exist

Patterns exist because teams keep tripping over the same problems.

This section identifies common patterns and explains what they assume and why they work.

Cache-aside (lazy loading)

Idea: The application reads from cache first. On a miss, it loads from the source of truth (usually a database) and writes to the cache.

Why it’s popular: It keeps the cache optional. If the cache is down, you can still serve from the database (maybe slower, but alive).

Where it bites: Miss storms and stale data on writes.

This minimal example shows the shape of the code. It is not a production cache client.

import time


class TTLCache:
    def __init__(self):
        self._store = {}

    def get(self, key):
        item = self._store.get(key)
        if not item:
            return None
        value, expires_at = item
        if time.time() >= expires_at:
            # Expired: drop the entry and report a miss.
            self._store.pop(key, None)
            return None
        return value

    def set(self, key, value, ttl_seconds):
        # Store the value together with its absolute expiry time.
        self._store[key] = (value, time.time() + ttl_seconds)


def get_user_profile(user_id, cache, db):
    cache_key = f"user:{user_id}:profile"
    cached = cache.get(cache_key)
    if cached is not None:
        # Hit: serve the cached copy and skip the database entirely.
        return cached

    # Miss: read the source of truth, then populate for later readers.
    profile = db.load_user_profile(user_id)
    cache.set(cache_key, profile, ttl_seconds=60)
    return profile

A hit avoids the backend call. The key design questions are freshness (TTL or invalidation) and miss behavior under load.

Read-through

Idea: The cache is responsible for loading data on a miss, usually via a callback or integration.

Why it exists: It centralizes cache population. That can improve consistency and reduce duplicated logic.

When to use it: Use read-through when you want the cache layer to handle all data-loading logic, such as when multiple services need the same cached data, and you want centralized cache management.

Trade-off: It couples your cache layer more tightly to the source of truth, which can make failures sharper.
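
A minimal read-through sketch, assuming the cache wrapper owns the loader function (the names and the TTL are illustrative):

import time


class ReadThroughCache:
    def __init__(self, loader, ttl_seconds):
        # The loader is the only code path that touches the source of truth.
        self._loader = loader
        self._ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        item = self._store.get(key)
        if item and time.time() < item[1]:
            return item[0]
        # Miss: the cache itself loads and repopulates.
        value = self._loader(key)
        self._store[key] = (value, time.time() + self._ttl)
        return value


# Callers never talk to the database directly:
# profiles = ReadThroughCache(loader=db.load_user_profile, ttl_seconds=60)
# profile = profiles.get(user_id)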

Write-through

Idea: Writes go to the cache and the source of truth at the same time, and the write only “succeeds” when both are done.

Why it exists: It reduces stale reads immediately after a write.

When to use it: Use write-through when read-after-write consistency is required, and you can accept the write latency cost, such as user profile updates that must be immediately visible.

Trade-off: You pay write latency on every write. You also have to decide what happens when the cache is down.
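
A minimal write-through sketch, mirroring the cache-aside example (db.save_user_profile and the cache interface are assumed, and error handling is deliberately naive):

def update_user_profile_write_through(user_id, profile, cache, db):
    # Source of truth first; if this fails, nothing stale gets cached.
    db.save_user_profile(user_id, profile)
    # Then the cache, so the next read sees the new value immediately.
    # If this step fails, you must decide: retry, delete the key, or error out.
    cache.set(f"user:{user_id}:profile", profile, ttl_seconds=60)
    return profile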

Write-back (write-behind)

Idea: Writes go to the cache first. The cache asynchronously flushes changes to the source of truth.

Why it exists: It speeds up writes and smooths write spikes.

When to use it: Use write-back when write throughput is critical and you can tolerate potential data loss if the cache fails, such as analytics event logging, where losing a few events is acceptable.

Trade-off: It can lose data if the cache fails, and it makes consistency harder to reason about.

Write-back is an advanced move because it changes durability and correctness assumptions.
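
A minimal write-back sketch, assuming an in-process buffer and a periodic flush; a real implementation needs durability, retries, and ordering guarantees:

import threading


class WriteBackBuffer:
    def __init__(self, db):
        self._db = db
        self._pending = {}  # key -> latest value; repeated writes coalesce
        self._lock = threading.Lock()

    def write(self, key, value):
        # Fast path: acknowledge the write after updating memory only.
        with self._lock:
            self._pending[key] = value

    def flush(self):
        # Called periodically by a timer or background thread.
        with self._lock:
            batch, self._pending = self._pending, {}
        for key, value in batch.items():
            # Anything not yet flushed is lost if the process dies.
            self._db.save(key, value)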

HTTP caching and content delivery networks

HTTP caching often gets overlooked because it lives “outside” the application.

HTTP caching is powerful because it can eliminate or shorten network round trips. Round-trip latency often dominates user-perceived latency, and it compounds at the tail. For tail behavior in distributed systems, read The Tail at Scale by Jeffrey Dean and Luiz André Barroso.

When caching rules can be expressed in HTTP (Cache-Control, Entity Tag (ETag), If-None-Match), caching happens at browsers, reverse proxies, and content delivery networks, not just in the application.
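
From the client side, conditional requests look roughly like this (a sketch assuming the requests library is available; the URL and ETag handling are illustrative):

import requests


def fetch_with_etag(url, etag=None):
    headers = {"If-None-Match": etag} if etag else {}
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        # Not Modified: the copy we already have is still valid,
        # and no response body was transferred.
        return None, etag
    # Fresh body plus the validator to send on the next request.
    return response.content, response.headers.get("ETag")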

Section 4: Cache failure modes – Why they hurt

Caching is a performance feature that changes failure behavior. That’s why it belongs in the same mental bucket as fundamentals of software availability.

Cold start and cache warmup

After deploying a new cache cluster or restarting a cache, the hit rate often drops toward zero for a while. That means the backend sees full load.

If the backend cannot handle the full load, the cache is not optional, and the system is fragile.
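
One mitigation is warming the cache before it takes full traffic. A minimal sketch, reusing the profile example and assuming you can identify hot keys (for example, from recent access logs):

def warm_cache(hot_user_ids, cache, db):
    # Pre-populate the most frequently read keys so the first wave of
    # traffic after a deploy or restart does not miss all at once.
    for user_id in hot_user_ids:
        profile = db.load_user_profile(user_id)
        cache.set(f"user:{user_id}:profile", profile, ttl_seconds=60)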

Cache stampede (dogpile) on expiration

A stampede typically looks like this:

  • A hot key expires.
  • Many requests arrive at the same time.
  • They all miss.
  • They all hit the backend.

If the backend is slow, those concurrent requests queue, time out, and retry. The cache “optimization” becomes the trigger for an incident.

One conceptual fix is request coalescing, sometimes called single-flight: one request recomputes, and the rest wait.

Another conceptual fix is to avoid synchronized expiry. If all instances set identical TTLs, hot keys can expire together across the fleet. Jittered expiry and stale-while-revalidate behavior spread load over time instead of concentrating it.

Many requests → one recompute → shared result
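
A minimal in-process sketch of single-flight combined with TTL jitter (per-process only; a distributed version needs a shared lock, and the timeout and jitter range are arbitrary):

import random
import threading

_inflight = {}  # key -> Event that is set when the recompute finishes
_inflight_lock = threading.Lock()


def get_or_recompute(key, cache, recompute, ttl_seconds=60):
    value = cache.get(key)
    if value is not None:
        return value

    with _inflight_lock:
        waiter = _inflight.get(key)
        if waiter is None:
            # We are the single flight for this key.
            _inflight[key] = threading.Event()

    if waiter is not None:
        # Someone else is already recomputing: wait briefly, then re-read.
        waiter.wait(timeout=5)
        return cache.get(key) or recompute(key)

    try:
        value = recompute(key)
        # Jitter the TTL so hot keys do not all expire at the same instant.
        cache.set(key, value, ttl_seconds * random.uniform(0.8, 1.2))
        return value
    finally:
        with _inflight_lock:
            _inflight.pop(key).set()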

Cache penetration

Penetration is when you repeatedly request keys that do not exist, so you always miss.

This can come from:

  • User input that creates high-cardinality keys.
  • Bots scraping random IDs.
  • A bug generating nonsense cache keys.

This bypasses the cache by definition and pushes load to the backend.

Cache poisoning and unsafe keying

If a cache key does not include all the inputs that affect the output, the cache can serve the wrong data.

This ranges from annoying to catastrophic. Personalization bugs are nasty. Authorization bugs are worse.

This is one place where caching is a security surface, not just a performance feature.
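
A small sketch of keying that includes every input affecting the output; the specific fields (tenant, locale, permissions version) are illustrative:

def search_results_key(query, user):
    # Anything that changes the response must be part of the key,
    # otherwise one user's results can be served to another.
    return ":".join([
        "search",
        user.tenant_id,
        user.locale,
        str(user.permissions_version),
        query.strip().lower(),
    ])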

Eviction surprises

If the cache is full, it will evict something. Eviction is not a failure, but it is a source of unpredictability.

Eviction surprises happen when:

  • The working set is larger than the cache capacity.
  • A new feature changes access patterns and thrashes the cache.
  • A few huge values crowd out many useful small ones.

Eviction explains many cases where performance regresses without a noticeable code change.
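
For intuition, here is a minimal least-recently-used sketch built on OrderedDict; a production cache would also account for value sizes, not just entry counts:

from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self._capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        # Touch the key so it becomes the most recently used.
        self._store.move_to_end(key)
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self._capacity:
            # Evict the least recently used entry. A working set larger than
            # capacity turns this line into constant churn.
            self._store.popitem(last=False)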

Negative caching and “not found” bugs

Caching “not found” results can be a significant optimization. It can also create hard-to-debug behavior when the data later appears.

If you negative-cache, you need an obvious story for how a later write becomes visible.
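
A minimal negative-caching sketch, reusing the cache-aside shape from earlier. The sentinel and the short TTL are the important parts, and the same idea blunts penetration from nonexistent keys:

_NOT_FOUND = object()  # distinguishes "not in cache" from "known to not exist"


def get_user_profile_with_negative_caching(user_id, cache, db):
    cache_key = f"user:{user_id}:profile"
    cached = cache.get(cache_key)
    if cached is _NOT_FOUND:
        return None
    if cached is not None:
        return cached

    profile = db.load_user_profile(user_id)
    if profile is None:
        # Cache the miss briefly so repeated lookups for a nonexistent user
        # do not hammer the backend; keep the TTL short so later writes show up.
        cache.set(cache_key, _NOT_FOUND, ttl_seconds=5)
        return None

    cache.set(cache_key, profile, ttl_seconds=60)
    return profile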

Read-after-write surprises

One common correctness failure is read-after-write inconsistency: a user changes something and immediately sees an older value.

This happens because caching changes visibility. The source of truth updates, but a cached read can still return an earlier copy. If the product expects immediate visibility after a write, this is a correctness issue, not just a performance side effect.

The key decision is whether read-your-writes consistency is required for this data. If it is necessary, caching behavior must enforce it.
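
One common way to preserve read-your-writes with cache-aside is to invalidate as part of the write path. A minimal sketch, assuming the cache exposes a delete method and that db.save_user_profile exists:

def update_user_profile_with_invalidation(user_id, profile, cache, db):
    db.save_user_profile(user_id, profile)
    # Drop the cached copy so the next read misses and reloads the new value.
    # If this delete fails or is skipped, the user can see their old profile
    # until the TTL expires.
    cache.delete(f"user:{user_id}:profile")
    return profile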

Quick check: failure mode recognition

These are the core recognition signals:

  • Cold start: Hit rate collapses, and the backend absorbs full traffic.
  • TTL-driven stampede: Expiry synchronizes misses for hot keys and concentrates load.
  • Penetration: Sustained misses driven by nonexistent or effectively uncacheable keys (often high-cardinality input).

Section 5: When not to use caching

Caching works best when there is repetition and an acceptable freshness window. Without those, caching is usually the wrong first move.

Here are common situations to avoid or postpone:

  • The data must be correct right now (authorization, payments, safety-critical state).
  • The access pattern is too random (low locality workloads).
  • The system is already fragile and a cache would become a new dependency.
  • The main issue is slow code, not repeated work (a cache hides a bottleneck instead of fixing it).
  • You cannot observe it (no hit/miss metrics, no per-key visibility, no alerting).

If the problem is latency, start with fundamentals of software performance testing. A cache can help, but only after understanding the baseline and the bottleneck.

Section 6: Common caching misconceptions

Caching problems are often reasoning problems. The implementation can be correct, but the mental model can be wrong.

Here are a few misconceptions that come up repeatedly:

  • “A high hit rate means the cache is working.” Hit rate matters, but it isn’t enough on its own. Focus on what the cache protects (backend saturation) and its costs (staleness bugs, stampedes, operational risks).

  • “TTL is a correctness guarantee.” TTL is a freshness policy, not a correctness policy. A value can be “not expired” but still wrong if the world changed.

  • “Caches are safe because they’re optional.” They are optional only if the backend can survive cold starts and miss storms.

  • “Invalidation is always better than TTL.” Invalidation is more precise, but it adds coupling; sometimes the precision is worth it, and sometimes the coupling is exactly what breaks the system.

  • “Caching is a performance change, not a product change.” If users can see stale data, it’s a product behavior whether you like it or not.

Quick check: misconceptions

These are common ways the mental model drifts:

  • High hit rate, slow system: Tail misses dominate perceived latency.
  • TTL set, still stale: Staleness is about update visibility and invalidation, not just expiry.
  • “Optional cache” assumption: A cache is only optional if the backend can absorb full load and miss storms.

Section 7: Observability for caches – Measure the trade-offs

If a cache can’t be measured, it’s a risk, not an optimization.

These are the signals to look for first:

  • Hit rate, by endpoint and by key family. One global hit rate hides the ugly parts.
  • Hit latency vs miss latency. This shows the value of a hit versus the cost of a miss.
  • Backend load as hit rate changes. If backend load does not drop, the cache is not protecting what it is intended to protect.
  • Evictions and memory pressure. Eviction is a common reason performance regresses after an initial improvement.
  • Error rates and timeouts to the cache. A cache can become a dependency even when you didn’t mean it to.

Also, track whether users and operators trust the data. If caching causes repeated correctness disputes, the performance gain is usually not worth the damage to the product.
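
A minimal sketch of the counters behind these signals, assuming the same cache interface as the earlier example; a real service would export these through its metrics system rather than a module-level dict:

import time
from collections import defaultdict

metrics = defaultdict(float)


def get_with_metrics(key, cache, load_from_backend):
    start = time.monotonic()
    value = cache.get(key)
    if value is not None:
        metrics["cache.hits"] += 1
        metrics["cache.hit_latency_total_s"] += time.monotonic() - start
        return value

    metrics["cache.misses"] += 1
    value = load_from_backend(key)
    cache.set(key, value, ttl_seconds=60)
    metrics["cache.miss_latency_total_s"] += time.monotonic() - start
    return value


def hit_rate():
    total = metrics["cache.hits"] + metrics["cache.misses"]
    return metrics["cache.hits"] / total if total else 0.0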

Integrating caching without instability

A cache should behave predictably under load and during failures: if the cache degrades, the system should lose performance, not correctness or availability.

The connections that matter

Caching sits at the intersection of a few fundamentals:

  • Performance: lower latency and lower backend load.
  • Scalability: higher throughput by doing less work per request.
  • Availability: degraded operation during backend incidents (if you design for it).
  • Correctness: defined staleness rules, safe keying, and safe invalidation.

If you’re thinking about growth, pair this with fundamentals of software scalability and fundamentals of reliability engineering.

Questions to answer next

When moving from understanding to implementation, these decisions matter most.

  • What does “fresh enough” mean for the highest-volume read paths?
  • Which cache signals will you treat as first-class health indicators (hit rate, hit latency, miss latency, backend saturation)?
  • What misbehavior is acceptable during an incident: wait, fall back, or fail fast?

A quick conclusion

Caching exploits locality and avoids repeated work, but it fails if freshness, eviction, and failure behaviors aren’t explicitly defined.

When those rules are explicit, caching is a predictable tool. When they stay implicit, caching becomes a source of intermittent correctness bugs and incident amplification.

What you now understand:

You have a mental model for evaluating any caching decision using four questions: (1) What does “fresh enough” mean for this data? (2) Which caching pattern matches your read and write behavior? (3) How will you handle misses (cold start, eviction, invalidation)? (4) How will the cache behave during failures (stampede protection, fallbacks, observability)?

This framework helps you make caching decisions that improve performance without introducing surprise outages or correctness bugs. Apply it to your next caching decision, and you’ll see the trade-offs more clearly.

Glossary

Cache hit: The requested value is found in the cache and served from it.

Cache miss: The requested value is either not in the cache or expired, so the system must fetch or recompute it.

Hit rate: The fraction of requests that are cache hits.

Time-to-live (TTL): A duration after which a cached value is considered expired.

Eviction: Removal of cached items to make room for others, usually based on a policy like least recently used.

Stampede (dogpile): Many concurrent requests miss at the same time and overload the backend.

Negative caching: Caching a “not found” result to avoid repeated lookups.

Source of truth: The system that owns the authoritative version of the data (often a database).

References