Observability

The shape of observable systems: three pillars, four canonical metric types, two measurement methods, and the SLO vocabulary. Vendor-agnostic — no SDK APIs, no product feature matrices.

The three pillars

Metrics

What is the rate / value over time?

Shape:
Numeric time series sampled at a fixed interval (e.g. every 10s or every 1min)
Cost shape:
Constant per metric per time window — scales with cardinality, not with traffic
Strength:
Cheap, queryable, great for dashboards and alerts
Weakness:
No context for individual events — you cannot ask "why did this one request fail"

Logs

What exactly happened?

Shape:
Timestamped events — structured (JSON) or unstructured (free text)
Cost shape:
Scales linearly with traffic; storage + indexing dominates cost
Strength:
Detailed, unbounded vocabulary, easiest to add
Weakness:
Volume + cardinality blow out costs if not managed; querying at scale is slow

Traces

How did this one request travel through the system?

Shape:
Tree of spans per request; each span has start/end + parent span id
Cost shape:
Expensive at high volume — sampled in practice
Strength:
Only pillar that shows cross-service causality for a specific request
Weakness:
Requires propagation (every service must pass the trace context); sampled views miss rare problems

The four canonical metric types

Nearly every metrics system inherits this taxonomy. Names may vary, but the shapes are stable.

Counter

Invariant:
Monotonically non-decreasing; resets only on process restart
Use for:
Counting events: requests received, errors, bytes written
Example:
http_requests_total
Pitfall:
Never expose as a raw value — always view as rate(x[5m]). Raw counters increase forever

Gauge

Invariant:
Arbitrary up-or-down current value
Use for:
Snapshots: queue depth, memory in use, temperature, open connections
Example:
queue_depth
Pitfall:
Averaging gauges across instances usually loses information — prefer sum, max, or quantile

Histogram

Invariant:
Observations bucketed by range; record count + sum + bucket counts
Use for:
Distributions: request latency, response size
Example:
http_request_duration_seconds
Pitfall:
Quantiles are computed from buckets — accuracy is tied to bucket choice. Bad buckets → bad p99

Summary

Invariant:
Client-side computed quantiles at fixed percentiles
Use for:
When you need exact quantiles from a single source and the client-side cost is fine
Example:
rpc_duration_seconds{quantile="0.99"}
Pitfall:
Quantiles from summaries CANNOT be aggregated across instances — this is the main reason to prefer histograms

Histogram vs summary

Prefer histograms when you need to aggregate quantiles across replicas — summary quantiles are incorrect when merged. Use summaries only when one source has the full truth and cross-instance aggregation is not required.
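A minimal sketch of why this holds, using made-up cumulative bucket counts from two hypothetical replicas: bucket counts add cleanly, while pre-computed per-replica quantiles have no correct merge.

```python
from collections import Counter

# Cumulative bucket counts (upper_bound -> count) from two replicas.
replica_a = {0.1: 80, 0.3: 95, 1.0: 100}   # 100 requests, mostly fast
replica_b = {0.1: 10, 0.3: 40, 1.0: 100}   # 100 requests, mostly slow

def merge(*hists):
    merged = Counter()
    for h in hists:
        merged.update(h)        # bucket counts simply add
    return dict(merged)

def quantile(hist, q):
    """Bucket-based estimate: first upper bound whose cumulative
    count reaches q of the total."""
    total = max(hist.values())
    rank = q * total
    for bound in sorted(hist):
        if hist[bound] >= rank:
            return bound
    return float("inf")

fleet = merge(replica_a, replica_b)
print(quantile(fleet, 0.95))    # fleet-wide p95 estimate: 1.0s
```

A summary would have exported p95=0.3 from replica A and p95=1.0 from replica B; averaging those gives 0.65s, which is not the fleet p95 — no combination of the two numbers recovers it.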

RED vs USE

Two complementary methods. RED describes request flow; USE describes resource health.

RED

Request-driven services (APIs, web servers, RPC)

  • Rate — requests per second
  • Errors — number or percentage failing
  • Duration — latency distribution (p50, p95, p99)

USE

Resources (CPU, disk, network, database connections)

  • Utilization — % of time the resource was busy
  • Saturation — how much extra work is queued waiting
  • Errors — error events
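The same shape as a sketch, for a hypothetical worker pool sampled once per second (pool size, sample values, and the sampling scheme are all illustrative):

```python
pool_size = 8
# Samples: (busy_workers, queued_tasks, error_events)
samples = [
    (8, 3, 0), (8, 5, 0), (7, 0, 0), (8, 9, 1), (6, 0, 0),
    (8, 2, 0), (8, 7, 0), (5, 0, 0), (8, 4, 1), (8, 6, 0),
]

utilization = sum(b for b, _, _ in samples) / (pool_size * len(samples))
saturation = sum(q for _, q, _ in samples) / len(samples)   # mean queue depth
errors = sum(e for _, _, e in samples)

print(f"utilization={utilization:.1%} saturation={saturation:.1f} errors={errors}")
```

Note the distinction the sketch surfaces: utilization is high but bounded at 100%, while saturation (queued work) is what actually grows without bound when the resource falls behind.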

SLI / SLO / SLA / error budget

SLI — Service Level Indicator

A measured value expressing one aspect of reliability. Typically a ratio: successful requests / total requests, or requests under 300ms / total requests.

SLO — Service Level Objective

A target value for an SLI over a rolling window. "99.9% of requests in the last 28 days return within 300ms." Your internal commitment.

SLA — Service Level Agreement

An SLO written into a contract with a penalty attached. Typically weaker than internal SLOs — if internal is 99.9%, SLA might be 99.5% with credits owed below that.

Error budget

1 − SLO, expressed as allowed downtime. A 99.9% SLO over 30 days = ~43 minutes of budget. Spent budget pauses risky releases; unspent budget authorizes them.
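The arithmetic as a small helper (the function name is illustrative):

```python
def error_budget_minutes(slo, window_days):
    """Allowed full-outage minutes for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999, 30))   # ~43.2 minutes
print(error_budget_minutes(0.9999, 30))  # ~4.3 minutes — one more nine, ten times less budget
```

Each extra nine divides the budget by ten, which is why SLO targets above 99.9% get expensive fast.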

Why budgets are load-bearing

An error budget turns reliability from a preference into a trade-off. When the budget is spent, risky launches pause by policy — nobody has to argue about it. When the budget is healthy, velocity is authorized. Teams without budgets argue about reliability forever because there is no shared currency.

Distributed tracing — the mental model

Senior insight

Tracing only works if every service in the request path propagates context. A single service that drops the incoming trace header breaks the tree downstream — and tracing bugs almost always turn out to be propagation bugs.
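The propagation rule can be shown as a toy sketch. The header names here are made up for illustration; real systems standardize on the W3C traceparent header.

```python
import uuid

def make_span(trace_id=None, parent_id=None):
    """A minimal span: starts a new trace if no incoming context,
    otherwise joins the existing one as a child."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by the whole request
        "span_id": uuid.uuid4().hex,               # unique per unit of work
        "parent_id": parent_id,                    # links the span tree together
    }

def outgoing_headers(span):
    # Hypothetical header names for the sketch.
    return {"x-trace-id": span["trace_id"], "x-parent-span-id": span["span_id"]}

def handle_request(headers):
    # The one rule: read incoming context, attach it to your span,
    # and forward it on every downstream call.
    return make_span(headers.get("x-trace-id"), headers.get("x-parent-span-id"))

edge = make_span()                               # root span at the edge
svc_a = handle_request(outgoing_headers(edge))   # propagates: same trace_id
svc_b = handle_request({})                       # drops the headers: new trace!

assert svc_a["trace_id"] == edge["trace_id"]
assert svc_b["trace_id"] != edge["trace_id"]     # the tree is broken here
```

svc_b is exactly the propagation bug described above: one dropped header, and everything downstream of it appears as an unrelated trace.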