Jul 31, 2022 - 16 MIN READ
Performance Engineering End-to-End: From TTFB to Tail Latency

Performance isn’t a tuning phase — it’s an architecture property. This month I lay out an end-to-end mental model (browser → edge → app → data) and the practical playbook for improving both “fast on average” and “fast under load” without shipping fragile optimizations.

Axel Domingues

Most teams treat performance like this:

  1. Build features
  2. Ship
  3. Panic about “slow”
  4. Sprinkle caches and pray

That’s not performance engineering.

That’s performance debt collection.

This month is about a different posture:

Performance is an end-to-end architecture property — and the “end-to-end” part is the trap.

Because your users don’t experience “the backend” or “the frontend”.
They experience the critical path from click → pixels → interaction → trust.

And the brutal part is:

  • improving the average is often easy
  • improving the tail is where systems reveal their real design

This is the month we stop saying “it’s probably the DB” and start using a real mental model:

Browser → Edge/CDN → Load balancer → App → Dependencies → Data → Back again.

Performance is the sum of that path — plus the queueing you created along the way.

The goal this month

Build a practical end-to-end performance model: measure → attribute → fix → validate.

The core shift

Stop optimizing components in isolation. Optimize the critical path and the tails.

What “good” looks like

Fast for real users, fast under load, and changes don’t regress silently.

The secret weapon

Budgets + guardrails (CI, dashboards, alerts) so performance stays a property, not a hero project.


The End-to-End Performance Stack (A Useful Map)

When a page feels slow, the question is not “where is the bottleneck?”

The question is:

Where is time being spent on the critical path, and why does it get worse at p95/p99?

Here’s the map I use. It’s not perfect, but it’s actionable.

The main segments

  • Browser work
    • HTML parsing, JS execution, hydration, rendering, main-thread contention
  • Network + edge
    • DNS/TLS, CDN cache hits, compression, connection reuse, routing
  • Backend request path
    • auth, routing, serialization, thread pools/event loops, contention
  • Dependencies
    • DB queries, caches, queues, third-party APIs, service-to-service hops
  • Data & storage
    • index usage, lock contention, slow queries, write amplification, cold pages

If you want one rule:

Don’t optimize what you can’t attribute.

You can “feel” that something is slow — but you can only fix what you can locate on the path.

Metrics That Matter (And the Lies They Tell)

Performance metrics are like financial metrics: easy to game, easy to misread.

Frontend user experience metrics (what people feel)

  • TTFB (Time To First Byte): “how quickly did the server start responding?”
  • LCP (Largest Contentful Paint): “when did the page look meaningfully loaded?”
  • INP (Interaction to Next Paint) / responsiveness: “does it feel interactive?”
  • CLS: “does it jump around while loading?”

Backend service metrics (what systems do)

  • Latency distribution: p50, p90, p95, p99 (not just average)
  • Throughput: requests per second, jobs per minute, etc.
  • Errors + retries: the silent killers of tails
  • Saturation signals: CPU, memory, I/O, connection pools, queues

Averages are performance propaganda.

If your p50 is great and your p99 is awful, your users will still tell you “it’s slow” — because the tail is where trust dies.
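
To make the p50/p99 split concrete, here’s a minimal sketch of the percentile math your dashboards run for you (nearest-rank method; the sample distribution is invented for illustration):

```typescript
// Nearest-rank percentile over raw latency samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Illustrative workload: 97% of requests are fast, 3% hit a slow dependency.
const latenciesMs = Array.from({ length: 1000 }, (_, i) =>
  i < 970 ? 40 + Math.random() * 20 : 900 + Math.random() * 600
);

const avg = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;
console.log("avg:", avg);                          // ~85 ms — “looks fine”
console.log("p50:", percentile(latenciesMs, 50));  // ~50 ms — the average’s alibi
console.log("p99:", percentile(latenciesMs, 99));  // 900 ms+ — what 1 in 100 users feel
```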


TTFB vs Tail Latency: Two Different Problems

TTFB is often “path length”

TTFB usually improves when you reduce:

  • hops (too many services)
  • cold starts
  • expensive authentication / policy checks
  • synchronous dependency calls
  • payload serialization overhead

Think of TTFB as:

“How quickly can the system produce the first byte when it’s not stuck waiting in line?”

Tail latency is usually “queueing + contention + retries”

Tail latency usually explodes when you introduce:

  • shared resources (DB, caches, connection pools)
  • lock contention
  • GC pauses / memory pressure
  • noisy neighbors (multi-tenancy, shared clusters)
  • retry storms (client retries turning into load multipliers)
  • thundering herds (cache expiry, fan-out calls)

Think of tail latency as:

“What happens when the system is under pressure, and variance becomes your enemy?”

Many teams ship “optimizations” that help p50 and worsen p99. For example:

  • add a cache with a low hit rate → adds latency on misses
  • add retries “for reliability” → creates retry storms under partial outages

Performance and reliability are married. You can’t fix one by breaking the other.

The Performance Pyramid (Where to Spend Your Time)

Not all optimizations are equal.

Here’s the order I like — because it produces stable wins:

  1. Architecture & critical path: remove hops, sync dependencies, and unnecessary fan-out
  2. Data & dependencies: fix slow queries, missing indexes, and chatty service calls
  3. Caching & payloads: edge caches, precomputation, smaller responses
  4. Code-level tuning: profiling, flame graphs, micro-optimizations (last, not first)

This is the reason “performance experts” start with architecture, not flame graphs:

Most systems are slow because of shape, not because of a slow for loop.


A Practical Playbook: Measure → Attribute → Fix → Validate

I use the same loop for almost every performance project.

Define what “good” means (in numbers)

Pick targets that reflect user experience:

  • page-level: LCP, INP, TTFB
  • service-level: p95 latency per endpoint, error rate, saturation ceilings
  • and define “no regressions” (budgets in CI)

Capture the latency breakdown

You want a breakdown like:

  • frontend: JS execution, render, network
  • backend: time in app vs time waiting on dependencies
  • dependency: DB query time, cache hit/miss, third-party time

Use correlation IDs across logs/traces so “one request” stays one story.
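
Here’s a minimal Express-style middleware sketch of that idea — the `X-Correlation-Id` header name and where you stash the ID are assumptions, not a prescribed standard:

```typescript
import { randomUUID } from "node:crypto";
import type { Request, Response, NextFunction } from "express";

// Attach one correlation ID per request so every log line and
// downstream call can be stitched back into a single story.
export function correlationId(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's ID if present so the trace spans services.
  const id = req.header("X-Correlation-Id") ?? randomUUID();
  res.setHeader("X-Correlation-Id", id);
  // Illustrative: stash it where your logger and outbound clients can read it.
  (req as Request & { correlationId?: string }).correlationId = id;
  next();
}
```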

Fix the biggest contributor on the critical path

Pick one:

  • remove a hop
  • remove a sync dependency
  • cache or precompute
  • reduce payload / response size
  • fix the slow query / missing index

Validate with real traffic patterns

Test in a way that matches reality:

  • warm vs cold caches
  • a mix of endpoints (not just one)
  • concurrency levels you actually see
  • failure modes (slow DB, partial outage, retry behavior)
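
As a sketch of what “concurrency levels you actually see” means in practice, here’s a bare-bones probe (assuming Node 18+ for global `fetch`; this is illustrative, not a replacement for a real load-testing tool):

```typescript
// Minimal concurrency probe: N workers looping against an endpoint
// for a fixed duration, collecting per-request latencies.
async function probe(url: string, concurrency: number, durationMs: number) {
  const latencies: number[] = [];
  const deadline = Date.now() + durationMs;

  async function worker() {
    while (Date.now() < deadline) {
      const start = performance.now();
      try {
        await fetch(url);
      } catch {
        // A real harness would count failures separately.
      }
      latencies.push(performance.now() - start);
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return latencies; // feed into the percentile() helper from earlier
}
```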

Add guardrails so it stays fixed

If the improvement isn’t guarded, it’s temporary.

  • dashboards + alerts on p95/p99
  • performance budgets in CI
  • load test on critical flows in your release pipeline

The “User Journey” Lens: Performance Isn’t One Endpoint

Most meaningful slowdowns happen on flows, not single calls.

A checkout flow might be:

  • load catalog
  • add to cart
  • price calculation
  • payment intent
  • confirm order
  • send email

If each step is “acceptable”, the flow can still feel slow.

So I prefer this split:

Endpoint health

Latency and errors per API, per dependency, per region.

Journey health

Time across multi-step flows (critical business transactions).

A performance culture measures both.

Because your users live in journeys.
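
A minimal sketch of journey-level timing — the endpoints and step names are illustrative, and in practice a tracing library does this with spans:

```typescript
// Time each step AND the whole flow; alert on the journey total,
// not just the individual calls.
async function timeCheckoutJourney() {
  const steps: Record<string, number> = {};
  const timed = async <T>(name: string, fn: () => Promise<T>): Promise<T> => {
    const start = performance.now();
    try {
      return await fn();
    } finally {
      steps[name] = performance.now() - start;
    }
  };

  await timed("loadCatalog", () => fetch("/api/catalog"));
  await timed("addToCart", () => fetch("/api/cart", { method: "POST" }));
  await timed("priceCalc", () => fetch("/api/price"));

  const total = Object.values(steps).reduce((a, b) => a + b, 0);
  console.log(steps, "journey total:", total);
}
```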


A Few High-Leverage Techniques (That Don’t Age Badly)

1) Reduce payloads aggressively

  • Trim response fields (don’t ship internal “everything” objects)
  • Use pagination and “include=…” expansions intentionally
  • Compress responses (gzip/brotli) and avoid huge JSON blobs

Why it works: less network, less parsing, less memory, less time everywhere.
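
A sketch of both ideas in an assumed Express app, using the `compression` middleware for gzip; the route and trimmed fields are illustrative:

```typescript
import express from "express";
import compression from "compression"; // gzip on the way out

const app = express();
app.use(compression());

// Assumed data-access helper for the sketch.
async function loadProduct(id: string) {
  return { id, name: "…", price: 0, cost: 0, auditLog: [] as string[] };
}

// Return a trimmed view, not the internal “everything” object.
app.get("/api/products/:id", async (req, res) => {
  const product = await loadProduct(req.params.id);
  res.json({
    id: product.id,
    name: product.name,
    price: product.price,
    // cost, auditLog, and other internal fields never leave the service
  });
});
```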

2) Make expensive work asynchronous (without lying)

  • Move non-critical side effects off the request path
  • Use outbox + idempotent consumers for reliability
  • Return quickly with a stable state transition

Why it works: your p95/p99 stop paying for background work.
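
Here’s a transactional-outbox sketch under heavy assumptions — the `Db` interface and SQL are illustrative stand-ins for your own data layer:

```typescript
interface Tx { execute(sql: string, params: unknown[]): Promise<void>; }
interface Db { transaction(fn: (tx: Tx) => Promise<void>): Promise<void>; }

// Commit the state change and the "send email" intent in ONE transaction,
// then return immediately. A background consumer delivers the event later,
// idempotently, off the request path.
async function confirmOrder(db: Db, orderId: string) {
  await db.transaction(async (tx) => {
    await tx.execute(
      "UPDATE orders SET status = 'confirmed' WHERE id = $1",
      [orderId]
    );
    await tx.execute(
      "INSERT INTO outbox (event_type, payload) VALUES ($1, $2)",
      ["order.confirmed", JSON.stringify({ orderId })]
    );
  });
  return { status: "confirmed" }; // p95 no longer pays for email delivery
}
```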

3) Cache at the edge when possible

  • CDN cache static assets
  • Cache public content pages
  • Use stale-while-revalidate patterns where correctness allows

Why it works: edge caching removes entire backend hops from the path.
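
The `stale-while-revalidate` directive is standard Cache-Control; the route, TTLs, and render helper below are illustrative:

```typescript
import express from "express";

const app = express();

function renderPricingPage(): string {
  return "<html>…</html>"; // assumed render helper
}

// CDN serves from cache for 60s; for the next 300s it may serve the
// stale copy instantly while revalidating in the background.
app.get("/pricing", (_req, res) => {
  res.setHeader("Cache-Control", "public, max-age=60, stale-while-revalidate=300");
  res.send(renderPricingPage());
});
```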

4) Control fan-out

The classic tail-latency killer is “call 6 services in parallel”.

Even if each is “fast enough”, the combined result is gated by the slowest.

Mitigations:

  • remove calls from the critical path
  • combine data behind a single boundary
  • batch requests
  • degrade gracefully (show partial content)
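
A sketch of the last two mitigations combined — per-call time budgets plus graceful degradation. The service URLs and the 200 ms budget are assumptions:

```typescript
// Give each dependency a hard time budget instead of letting the
// slowest of N calls gate the whole response.
async function fetchWithBudget(url: string, budgetMs: number): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}

async function renderDashboard() {
  const results = await Promise.allSettled([
    fetchWithBudget("https://svc-a.internal/widget", 200),
    fetchWithBudget("https://svc-b.internal/widget", 200),
    fetchWithBudget("https://svc-c.internal/widget", 200),
  ]);
  // Fulfilled entries render; rejected or timed-out ones degrade
  // to a placeholder instead of stalling the page.
  return results.map((r) => (r.status === "fulfilled" ? r.value : null));
}
```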

5) Tame retries

Retries are not free.

Guardrails:

  • use timeouts with intent (shorter than your SLO budget)
  • jitter backoff
  • circuit breakers
  • cap retries
  • avoid retrying non-idempotent operations unless you have idempotency keys
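
Those guardrails translate into surprisingly little code. A sketch — every constant is illustrative, so tune them against your SLO budget, and only wrap idempotent operations (or operations with idempotency keys):

```typescript
// Retries with the guardrails above: a hard attempt cap, a per-attempt
// timeout, and jittered exponential backoff.
async function callWithRetries<T>(
  fn: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const maxAttempts = 3;        // cap retries
  const attemptTimeoutMs = 500; // shorter than your SLO budget
  let lastError: unknown;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), attemptTimeoutMs);
    try {
      return await fn(controller.signal);
    } catch (err) {
      lastError = err;
      // Full jitter: random delay up to an exponentially growing cap,
      // so synchronized clients don't stampede a recovering service.
      const backoffMs = Math.random() * Math.min(2000, 100 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```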

Tail Latency: Where Architecture Gets Honest

If you want a mature performance practice, this is the line you cross:

You optimize not for “fast when idle”, but for “predictable under load”.

That means learning to love:

  • queue depth metrics
  • saturation dashboards
  • p99 per dependency
  • “how many concurrent requests can we handle without collapse?”

The biggest tail-latency traps I see

  • retry storms that amplify load exactly when the system is weakest
  • thundering herds on cache expiry for hot keys
  • connection pool exhaustion quietly queueing requests
  • unbounded fan-out gated by the slowest dependency
  • GC pauses and memory pressure under sustained load

Performance Budgets: The Architectural Guardrail Most Teams Skip

Budgets are the difference between “we fixed it once” and “we stay fast”.

Examples:

  • page: LCP must stay under X seconds on median devices
  • API: p95 under Y ms for critical endpoints
  • payload: response size under Z KB for mobile flows
  • DB: query must stay under N ms for hot paths

Budgets are political in a healthy way. They force tradeoffs to be explicit:

  • “Is this extra field worth 30ms on mobile?”
  • “Do we need this additional dependency call?”
  • “Should this be async?”

Where budgets live:

  • CI checks (Lighthouse/Playwright, API perf tests, payload size checks)
  • dashboards and alerts
  • release gates for your critical flows
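
A minimal release-gate sketch: compare measured numbers against budgets and fail the build on violations. The budget values and the `measurements.json` input format are assumptions:

```typescript
import { readFileSync } from "node:fs";

type Measurement = { endpoint: string; p95Ms: number; payloadKb: number };

// Illustrative budgets for critical endpoints.
const budgets: Record<string, { p95Ms: number; payloadKb: number }> = {
  "/api/checkout": { p95Ms: 300, payloadKb: 50 },
  "/api/catalog": { p95Ms: 200, payloadKb: 120 },
};

// Assumed input: per-endpoint results from your perf test step.
const measured: Measurement[] = JSON.parse(
  readFileSync("measurements.json", "utf8")
);

const violations = measured.filter((m) => {
  const budget = budgets[m.endpoint];
  return budget && (m.p95Ms > budget.p95Ms || m.payloadKb > budget.payloadKb);
});

if (violations.length > 0) {
  console.error("Performance budget violations:", violations);
  process.exit(1); // release gate: budgets are enforced, not aspirational
}
```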

Resources

Web Vitals (Core Web Vitals)

A practical map of user-experience metrics (LCP/INP/CLS) and how they relate to real perceived performance.

The Tail at Scale (Jeff Dean)

A classic framing of why tail latency dominates large-scale systems — and why “average” is not enough.


What’s Next

This month was about performance as an end-to-end discipline:

  • the critical path
  • the difference between p50 and p99
  • how queueing, retries, and fan-out create tails
  • and why budgets are architecture, not bureaucracy

Next month we go up a layer:

Cloud Infrastructure Without the Fanaticism: IaaS, PaaS, Serverless, Kubernetes

Because most performance discussions get stuck at “tune the app”…
…when the platform choices you made are often the largest constraint on latency, reliability, and cost.

Axel Domingues - 2026