Introduction

Have you ever had a system that looked fine on its dashboards while users still found it slow?

That gap defines performance work. Software performance is the time from user intent to a useful result along the slowest path.

Performance covers response time distribution, system capacity under load, and balancing speed with reliability.

In this article, I explain a mental model for performance: where latency comes from, why percentiles beat averages, how bottlenecks form, and what sorts of changes usually make a difference.

Type: Explanation (understanding-oriented).
Primary audience: beginner developers and leaders who want a usable model for performance, not a list of tools.

Scope and audience

Scope: software performance across backend services, databases, and user-facing applications. I focus on fundamentals that apply whether you write Python, Rust, JavaScript, or something else.

Not a how-to: I will show small examples, but I am not walking through a specific profiler or cloud vendor.

Prerequisites: basic familiarity with shipping software and reading metrics and logs. If you want the adjacent foundation, start with Fundamentals of monitoring and observability and Fundamentals of metrics.

TL;DR: software performance in one pass

When performance is bad, four questions clarify what is happening:

  • What is slow, for whom, and when? Define the failure in terms of percentiles and scope.
  • Where is time going? Split latency into components (compute, input/output, network, dependency waits).
  • What is saturating? Find the resource at or near capacity (processor time, memory, disk, network, locks, queues, connection pools).
  • What change will reduce work or waiting, and how will I prove it? Make one change, measure again, and confirm the effect.

If you skip the first step and jump to “optimize code”, you will often fix the wrong thing.

If you are skimming, read this section, then jump to “Why percentiles matter more than averages” and “Where performance usually goes wrong”. The rest of the article fills in the model.

A mental model: performance is time and waiting

Most performance problems are not “my code is slow” problems. They are “my code is waiting” problems.

At a high level, request latency is:

latency = service time + queue time

  • Service time is actual work: computation, serialization, database work, compression, encryption.
  • Queue time is waiting: waiting for a thread, a lock, a database connection, I/O, the network, or a downstream service.

An analogy is a lunchtime coffee shop. Service time is how long it takes to make a drink. Queue time is how long you wait in line when there are more orders than baristas. Near capacity, the line grows quickly even if each drink takes the same time.

“It was fine yesterday” can coexist with “it is slow today” because increased load or decreased capacity quickly leads to queueing.
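
To make that concrete, here is a minimal, hypothetical simulation of a single worker with a mean service time of 100 milliseconds, using exponential arrivals and service times. The numbers are illustrative, not a benchmark.

import random

def simulate_queue(arrival_rate_per_ms, mean_service_ms=100.0, num_requests=20_000, seed=1):
    # Single worker, first-come-first-served: each request waits until the worker is free.
    rng = random.Random(seed)
    arrival = 0.0
    worker_free_at = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        arrival += rng.expovariate(arrival_rate_per_ms)      # next arrival time
        start = max(arrival, worker_free_at)                 # queue if the worker is busy
        total_wait += start - arrival                        # queue time for this request
        worker_free_at = start + rng.expovariate(1.0 / mean_service_ms)
    return total_wait / num_requests

# Mean service time never changes; only how close we run to capacity does.
for utilization in (0.5, 0.8, 0.9, 0.95):
    wait = simulate_queue(arrival_rate_per_ms=utilization / 100.0)
    print(f"utilization {utilization:.0%}: average queue time ~{wait:.0f} ms")

The drink never takes longer to make; all of the extra latency is time spent waiting in line.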

End-to-end performance is a critical path problem

End-to-end performance is the time it takes a user request to traverse the slowest path.

That path usually crosses boundaries:

  • The client (input handling, rendering, JavaScript execution, mobile constraints).
  • The network (latency, packet loss, name resolution, encrypted connection setup).
  • The edge and load balancer (routing, caching, rate limits).
  • The application service (compute, serialization, memory pressure).
  • Dependencies (database, cache, queue, third-party services).

If you want concrete “shapes” to picture, I usually see one of these:

  • Browser → edge cache → API service → database.
  • API service → database with caching.
  • API service fan-out to multiple internal services, each with its own dependency.

End-to-end performance work starts by identifying the critical path and measuring where the total time goes. For web performance specifically, begin with Web Vitals; they are a good model of user-perceived speed.
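
To picture what identifying the critical path looks like, here is a toy breakdown of a single slow request; the hop names and timings are invented for illustration.

# Hypothetical breakdown of one request along its critical path (milliseconds).
critical_path_ms = {
    "client (parse, render)": 80,
    "network (DNS, TLS, transfer)": 120,
    "edge / load balancer": 10,
    "API service": 140,
    "database": 550,
}

total = sum(critical_path_ms.values())
for hop, ms in sorted(critical_path_ms.items(), key=lambda item: -item[1]):
    print(f"{hop:<30} {ms:>4} ms ({ms / total:.0%} of {total} ms)")

Most of the time lives in one hop; that is where the work should start.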

Why percentiles matter more than averages

Averages hide pain.

If 95% of requests take 100 milliseconds and 5% take 5 seconds, the average is about 345 milliseconds, which looks fine, but the users hitting the slow case think the system is broken.

Teams track percentiles like p95 and p99 to capture tail behavior, the slow end of the latency distribution that users remember.

If you have not used percentiles before, p95 is the response time that 95% of requests beat, and the remaining 5% are the tail.
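
To put numbers on the example above, here is a toy distribution with 95% of requests at 100 milliseconds and 5% at 5 seconds.

import statistics

# Toy distribution: 950 fast requests, 50 slow ones.
samples = [100.0] * 950 + [5000.0] * 50

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int((len(ordered) - 1) * p)]

print(f"average = {statistics.mean(samples):.0f} ms")    # ~345 ms: looks acceptable
print(f"p95     = {percentile(samples, 0.95):.0f} ms")   # 100 ms
print(f"p99     = {percentile(samples, 0.99):.0f} ms")   # 5,000 ms: what the tail feels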

For the classic argument, read “The Tail at Scale” by Jeffrey Dean and Luiz André Barroso, which explains why large systems amplify tails even when each component is “pretty fast”.

The core performance metrics (and what they imply)

I use a small set of metrics that map cleanly to decisions.

Latency

Latency answers: “How long does it take?”

  • Track the median (p50), the 95th percentile (p95), and the 99th percentile (p99), not just averages.
  • Separate client-perceived latency from server-side time when you can.
  • Split by endpoint and by cohort (region, device class, customer tier) when it changes the story.
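
As a sketch of what splitting by endpoint can look like, here is a hypothetical request log grouped before computing percentiles.

import statistics
from collections import defaultdict

# Hypothetical request log entries: (endpoint, duration in milliseconds).
requests = [
    ("GET /search", 120), ("GET /search", 95), ("GET /search", 2400), ("GET /search", 110),
    ("POST /checkout", 310), ("POST /checkout", 290), ("POST /checkout", 780), ("POST /checkout", 305),
]

by_endpoint = defaultdict(list)
for endpoint, duration_ms in requests:
    by_endpoint[endpoint].append(duration_ms)

for endpoint, durations in sorted(by_endpoint.items()):
    cuts = statistics.quantiles(durations, n=20, method="inclusive")  # cut points at 5% steps
    print(f"{endpoint}: p50={cuts[9]:.0f}ms p95={cuts[18]:.0f}ms n={len(durations)}")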

Throughput

Throughput answers: “How much work per unit time?”

Throughput is usually requests per second, jobs per minute, or rows per second. It is not automatically good or bad. Higher throughput is useful only when latency and correctness stay acceptable.
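
Throughput usually comes from bucketing completions over a time window. A minimal sketch, with invented timestamps:

from collections import Counter

# Hypothetical completion timestamps in seconds (e.g., parsed from access logs).
completion_times = [10.01, 10.20, 10.45, 10.90, 11.05, 11.30, 11.31, 11.80, 12.02, 12.40]

per_second = Counter(int(t) for t in completion_times)
for second in sorted(per_second):
    print(f"t={second}s: {per_second[second]} requests completed")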

Error rate

Error rate answers: “Is it failing while it is fast?”

A system that becomes “fast” by returning errors is not performant. Performance and reliability are coupled.

Utilization and saturation

Utilization answers: “How busy is the resource?”

Saturation answers: “Is work waiting because this resource is at capacity?”

High utilization by itself is not necessarily a problem; saturation is. Once a resource saturates, work starts to queue behind it, and even a slight increase in load can cause a sharp latency spike.

For finding saturation, Brendan Gregg’s USE method is effective: check utilization, saturation, and errors for each primary resource. See The USE Method.
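
A small sketch of the difference, using made-up samples from a worker pool: utilization is how busy the workers are; saturation is whether jobs are waiting.

# Hypothetical once-per-second samples of an 8-worker pool: (busy workers, queued jobs).
pool_size = 8
samples = [(4, 0), (6, 0), (8, 0), (8, 3), (8, 17), (7, 0)]

for busy, queued in samples:
    utilization = busy / pool_size
    saturated = queued > 0            # work is waiting, so latency is growing
    print(f"utilization={utilization:.0%} queued={queued} saturated={saturated}")

The third sample is 100% utilized with nothing waiting; the fourth and fifth are the ones users feel.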

Performance requirements: what “good” means

Performance work needs a definition of success. Otherwise, you can optimize forever.

I like requirements that look like this:

  • “For POST /checkout, p95 latency is under 800 milliseconds under expected peak load.”
  • “For search, p99 latency is under 2 seconds for logged-in users.”
  • “The system sustains 1,000 requests per second with an error rate under 0.1%.”

A service level objective (SLO) is a target level of reliability or performance you commit to hitting. If you already use SLOs, performance targets like the ones above fit naturally into them, and they help you treat performance as a contract, not a vague preference.
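
One way to keep a requirement honest is to make it checkable in code. A minimal sketch against the first requirement above, with invented measurements:

def meets_requirement(latencies_ms, p=0.95, target_ms=800.0):
    # Compare the observed percentile against the target (nearest-rank, no interpolation).
    ordered = sorted(latencies_ms)
    observed = ordered[int((len(ordered) - 1) * p)]
    return observed, observed < target_ms

# Hypothetical POST /checkout latencies measured under expected peak load.
checkout_ms = [240, 310, 290, 450, 1200, 380, 300, 760, 330, 295]
observed_p95, ok = meets_requirement(checkout_ms)
print(f"p95={observed_p95}ms target=800ms met={ok}")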

Measurement is a feature, not a phase

Performance is hard to evaluate without cheap, repeatable measurement. Performance work is a cycle, not a checklist: the goal is to reduce guessing by tightening the feedback loop.

stateDiagram-v2
    [*] --> Measure
    Measure --> Explain
    Explain --> Change
    Change --> Verify
    Verify --> Measure
    Measure: Measure reality (percentiles, errors, saturation)
    Explain: Build a model (where time goes, what saturates)
    Change: Make one change
    Verify: Re-measure and confirm the effect

If you can’t measure, you can’t verify. If you can’t verify, you’re guessing.

A small code example: measuring and summarizing latency

You can build helpful intuition by measuring latency and percentiles yourself, even without a full tracing or observability stack.

Here’s a Python example that times an operation and reports p50, p95, and the average.

import time
import statistics

def percentile(samples, p):
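    # Nearest-rank percentile: pick an element of the sorted samples, no interpolation.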
    if not samples:
        raise ValueError("no samples")
    samples = sorted(samples)
    k = int((len(samples) - 1) * p)
    return samples[k]

def timed_call(fn, iterations=200):
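    # Call fn() repeatedly with a monotonic clock and record each duration in milliseconds.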
    durations_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        end = time.perf_counter()
        durations_ms.append((end - start) * 1000)
    return durations_ms

def do_work():
    sum(i * i for i in range(50_000))

samples = timed_call(do_work, iterations=300)
print(f"p50={percentile(samples, 0.50):.1f}ms")
print(f"p95={percentile(samples, 0.95):.1f}ms")
print(f"avg={statistics.mean(samples):.1f}ms")

This isn’t production-grade benchmarking; it’s a way to train your instincts: tails exist even on your laptop.

Where performance usually goes wrong

Most performance incidents follow common patterns.

Pattern: saturation and queueing

A resource hits capacity, and work waits in line.

Common culprits:

  • Thread pools.
  • Database connection pools.
  • Locks and contention hotspots.
  • Single partitions (a “hot key” in a cache or database).
  • Downstream dependencies that slow down.

Queueing theory is a whole field, but one idea shows up everywhere: Little’s Law, which says the average number of items in a system equals the arrival rate times the average time each item spends in the system. If arrivals speed up while service capacity stays the same, work accumulates, and near capacity even small increases in traffic produce large increases in waiting.
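
A quick worked example of Little’s Law, with invented numbers:

# Little's Law: items in the system = arrival rate x time in the system.
arrival_rate = 200.0        # requests per second
time_in_system_s = 0.25     # seconds per request, including waiting
print(f"{arrival_rate * time_in_system_s:.0f} requests in flight on average")  # 50

# If queueing pushes time in the system to 1.0 s at the same arrival rate,
# the system now holds 200 in-flight requests, enough to exhaust many connection pools.
print(f"{arrival_rate * 1.0:.0f} requests in flight at 1.0 s per request")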

Pattern: the “fast path” gets slower

A “fast” cache stops helping when it is too small, when the miss rate climbs, or when a stampede sends a burst of misses to the backend at once.

Caching improves performance but adds complexity, making invalidation and failure modes more challenging to handle.
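
A back-of-the-envelope sketch of why the miss rate matters, with hypothetical latencies for the cache and the backend:

def effective_latency_ms(hit_rate, cache_ms=2.0, backend_ms=80.0):
    # Average latency seen by callers: hits are cheap, misses pay the backend cost.
    return hit_rate * cache_ms + (1.0 - hit_rate) * backend_ms

for hit_rate in (0.99, 0.95, 0.80, 0.50):
    print(f"hit rate {hit_rate:.0%}: ~{effective_latency_ms(hit_rate):.0f} ms average")

Dropping from a 99% to an 80% hit rate raises the average roughly sixfold, and during a stampede those misses all land on the backend at once.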

Pattern: work increases silently

The system is doing more than it used to.

Examples:

  • A query now returns 5,000 rows instead of 50 due to dataset growth.
  • A feature adds a loop that scales with customer size.
  • A retry policy turns a partial slowdown into a request storm.

Much of performance work is keeping cost proportional to the result the user actually needs, not to how large the dataset or customer has grown.
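
The retry example is worth making concrete. A minimal model, assuming each failed attempt is retried immediately and failures are independent:

# During an incident, suppose 90% of calls to a slow dependency time out,
# and the client retries each failure up to 3 times.
base_rps = 1000
failure_rate = 0.9
max_retries = 3

expected_attempts_per_request = sum(failure_rate ** i for i in range(max_retries + 1))
print(f"~{base_rps * expected_attempts_per_request:.0f} attempts/second "
      f"({expected_attempts_per_request:.1f}x the load) hit the struggling dependency")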

Pattern: percentiles drift, not averages

This is the drift that catches teams out: the average stays flat while p95 and p99 climb.

If you only track averages, you will miss the early warning.

Optimization levers that stay true across stacks

Step back far enough and most performance fixes fall into a few buckets.

Reduce the work

  • Choose better algorithms and data structures.
  • Avoid unnecessary parsing, serialization, and copying.
  • Move costly work out of the request path if it can be asynchronous.

Do the work less often

  • Cache results where correctness allows it.
  • Precompute (carefully) and invalidate (carefully).
  • Batch small operations into fewer larger ones.
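
Batching in particular is easy to reason about with a quick sketch; the per-call overhead and per-row cost below are invented.

def cost_ms(calls, rows_per_call, overhead_ms=5.0, per_row_ms=0.1):
    # Each round trip pays a fixed overhead plus a small per-row cost.
    return calls * (overhead_ms + rows_per_call * per_row_ms)

print(f"1,000 single-row calls: {cost_ms(1000, 1):,.0f} ms")   # 5,100 ms
print(f"10 calls of 100 rows:   {cost_ms(10, 100):,.0f} ms")   # 150 ms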

Reduce waiting

  • Fix contention and lock hotspots.
  • Right-size pools based on downstream capacity.
  • Remove unnecessary synchronous dependency calls.

Add parallelism with care

Parallelism can cut latency, but it also adds load and can make the tail worse.

If you add concurrency, verify that you did not simply create new contention, push more load onto a downstream dependency, or build a fan-out whose latency is set by its slowest branch.
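
One concurrency effect is easy to quantify: if a request fans out to N dependency calls in parallel and waits for all of them, it is slow whenever any branch is slow (assuming independent branches).

# If a single dependency call lands in its slow tail 1% of the time...
p_branch_slow = 0.01

for fan_out in (1, 10, 50, 100):
    p_request_slow = 1 - (1 - p_branch_slow) ** fan_out
    print(f"fan-out {fan_out:>3}: {p_request_slow:.1%} of requests wait on at least one slow call")

This is the effect The Tail at Scale describes: the wider the fan-out, the more often the tail bites.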

Performance changes have failure modes

Most performance improvements shift work and risk.

Caching leads to invalidation, stampedes, and stale reads. Adding retries can cause request storms, while tightening timeouts improves tail latency but raises errors. The solution isn’t to avoid these entirely but to understand the trade-offs and plan for failure.

How performance fits with reliability and testing

Performance is part of reliability: if your system is “up” but unusably slow, users experience it as down.

Two adjacent fundamentals matter most for end-to-end performance work: reliability, because a system that gets faster by shedding correctness or availability has not improved, and testing, because performance claims need to be verified under realistic load rather than assumed.

Common misconceptions I often see

  • “Performance is just speed.” Performance includes tails, capacity, and behavior under load.
  • “If the average is good, we’re good.” Averages hide tail pain.
  • “We should optimize before we measure.” Without measurement, you are guessing.
  • “Caching is always a win.” Caching shifts problems into invalidation, stampedes, and correctness edge cases.
  • “Faster is always better.” Some speedups trade away reliability, debuggability, or cost discipline.

Key takeaways

  • Performance is time plus waiting, and waiting grows fast under saturation.
  • Percentiles (p95, p99) matter because tails are what users feel.
  • A small set of metrics maps to decisions: latency percentiles, throughput, error rate, and saturation.
  • The performance loop is measure, explain, change one thing, verify.
  • Most optimizations reduce work, reduce frequency, reduce waiting, or add safe parallelism.

Next steps

If you want to go deeper on adjacent fundamentals, start with Fundamentals of monitoring and observability and Fundamentals of metrics. For user-perceived web performance, see Web Vitals; for tail behavior and saturation, see The Tail at Scale and The USE Method.

Glossary

Latency: Time between a request and its response.

Throughput: Work completed per unit time (for example, requests per second).

Percentile: A value below which a given percentage of observations fall (for example, p95).

Tail latency: The slow end of the latency distribution (often p95 and p99).

Saturation: A resource is at capacity, and work must wait (queues form).

Service level objective (SLO): A target for system behavior, like reliability or latency, used to make trade-offs.

References