Observability

The shape of observable systems: three pillars, four canonical metric types, two measurement methods, and the SLO vocabulary. Vendor-agnostic — no SDK APIs, no product feature matrices.

The three pillars

Metrics

What is the rate / value over time?

Shape:
Numeric time series sampled at a fixed interval (e.g. every 10s or every 1min)
Cost shape:
Constant per metric per time window — scales with cardinality, not with traffic
Strength:
Cheap, queryable, great for dashboards and alerts
Weakness:
No context for individual events — you cannot ask "why did this one request fail"

Logs

What exactly happened?

Shape:
Timestamped events — structured (JSON) or unstructured (free text)
Cost shape:
Scales linearly with traffic; storage + indexing dominates cost
Strength:
Detailed, unbounded vocabulary, easiest to add
Weakness:
Volume + cardinality blow out costs if not managed; querying at scale is slow

Traces

How did this one request travel through the system?

Shape:
Tree of spans per request; each span has start/end + parent span id
Cost shape:
Expensive at high volume — sampled in practice
Strength:
Only pillar that shows cross-service causality for a specific request
Weakness:
Requires propagation (every service must pass the trace context); sampled views miss rare problems

The four canonical metric types

Nearly every metrics system inherits this taxonomy. Names may vary, but the shapes are stable.

Counter

Invariant:
Monotonically non-decreasing; resets only on process restart
Use for:
Counting events: requests received, errors, bytes written
Example:
http_requests_total
Pitfall:
Never expose as a raw value — always view as rate(x[5m]). Raw counters increase forever

Gauge

Invariant:
Arbitrary up-or-down current value
Use for:
Snapshots: queue depth, memory in use, temperature, open connections
Example:
queue_depth
Pitfall:
Averaging gauges across instances usually loses information — prefer sum, max, or quantile

Histogram

Invariant:
Observations bucketed by range; record count + sum + bucket counts
Use for:
Distributions: request latency, response size
Example:
http_request_duration_seconds
Pitfall:
Quantiles are computed from buckets — accuracy is tied to bucket choice. Bad buckets → bad p99

Summary

Invariant:
Client-side computed quantiles at fixed percentiles
Use for:
When you need exact quantiles from a single source and the client-side cost is fine
Example:
rpc_duration_seconds{quantile="0.99"}
Pitfall:
Quantiles from summaries CANNOT be aggregated across instances — this is the main reason to prefer histograms

Histogram vs summary

Prefer histograms when you need to aggregate quantiles across replicas — summary quantiles are incorrect when merged. Use summaries only when one source has the full truth and cross-instance aggregation is not required.
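A minimal sketch of why this holds, using made-up cumulative bucket counts from two hypothetical replicas: bucket counts add cleanly, while pre-computed per-replica quantiles have no correct merge.

```python
from collections import Counter

# Cumulative bucket counts (upper_bound -> count) from two replicas.
replica_a = {0.1: 80, 0.3: 95, 1.0: 100}   # 100 requests, mostly fast
replica_b = {0.1: 10, 0.3: 40, 1.0: 100}   # 100 requests, mostly slow

def merge(*hists):
    merged = Counter()
    for h in hists:
        merged.update(h)        # bucket counts simply add
    return dict(merged)

def quantile(hist, q):
    """Bucket-based estimate: first upper bound whose cumulative
    count reaches q of the total."""
    total = max(hist.values())
    rank = q * total
    for bound in sorted(hist):
        if hist[bound] >= rank:
            return bound
    return float("inf")

fleet = merge(replica_a, replica_b)
print(quantile(fleet, 0.95))    # fleet-wide p95 estimate: 1.0s
```

A summary would have exported p95=0.3 from replica A and p95=1.0 from replica B; averaging those gives 0.65s, which is not the fleet p95 — no combination of the two numbers recovers it.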

RED vs USE

Two complementary methods. RED describes request flow; USE describes resource health.

RED

Request-driven services (APIs, web servers, RPC)

  • Rate — requests per second
  • Errors — number or percentage failing
  • Duration — latency distribution (p50, p95, p99)

USE

Resources (CPU, disk, network, database connections)

  • Utilization — % of time the resource was busy
  • Saturation — how much extra work is queued waiting
  • Errors — error events
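The same shape as a sketch, for a hypothetical worker pool sampled once per second (pool size, sample values, and the sampling scheme are all illustrative):

```python
pool_size = 8
# Samples: (busy_workers, queued_tasks, error_events)
samples = [
    (8, 3, 0), (8, 5, 0), (7, 0, 0), (8, 9, 1), (6, 0, 0),
    (8, 2, 0), (8, 7, 0), (5, 0, 0), (8, 4, 1), (8, 6, 0),
]

utilization = sum(b for b, _, _ in samples) / (pool_size * len(samples))
saturation = sum(q for _, q, _ in samples) / len(samples)   # mean queue depth
errors = sum(e for _, _, e in samples)

print(f"utilization={utilization:.1%} saturation={saturation:.1f} errors={errors}")
```

Note the distinction the sketch surfaces: utilization is high but bounded at 100%, while saturation (queued work) is what actually grows without bound when the resource falls behind.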

SLI / SLO / SLA / error budget

SLI — Service Level Indicator

A measured value expressing one aspect of reliability. Typically a ratio: successful requests / total requests, or requests under 300ms / total requests.

SLO — Service Level Objective

A target value for an SLI over a rolling window. "99.9% of requests in the last 28 days return within 300ms." Your internal commitment.

SLA — Service Level Agreement

An SLO written into a contract with a penalty attached. Typically weaker than internal SLOs — if internal is 99.9%, SLA might be 99.5% with credits owed below that.

Error budget

1 − SLO, expressed as allowed downtime. A 99.9% SLO over 30 days = ~43 minutes of budget. Spent budget pauses risky releases; unspent budget authorizes them.
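The arithmetic as a small helper (the function name is illustrative):

```python
def error_budget_minutes(slo, window_days):
    """Allowed full-outage minutes for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

print(error_budget_minutes(0.999, 30))   # ~43.2 minutes
print(error_budget_minutes(0.9999, 30))  # ~4.3 minutes — one more nine, ten times less budget
```

Each extra nine divides the budget by ten, which is why SLO targets above 99.9% get expensive fast.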

Why budgets are load-bearing

An error budget turns reliability from a preference into a trade-off. When the budget is spent, risky launches pause by policy — nobody has to argue about it. When the budget is healthy, velocity is authorized. Teams without budgets argue about reliability forever because there is no shared currency.

Distributed tracing — the mental model

Senior insight

Tracing only works if every service in the request path propagates context. A single service that drops the incoming trace header breaks the tree downstream — and tracing bugs almost always turn out to be propagation bugs.
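The propagation rule can be shown as a toy sketch. The header names here are made up for illustration; real systems standardize on the W3C traceparent header.

```python
import uuid

def make_span(trace_id=None, parent_id=None):
    """A minimal span: starts a new trace if no incoming context,
    otherwise joins the existing one as a child."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by the whole request
        "span_id": uuid.uuid4().hex,               # unique per unit of work
        "parent_id": parent_id,                    # links the span tree together
    }

def outgoing_headers(span):
    # Hypothetical header names for the sketch.
    return {"x-trace-id": span["trace_id"], "x-parent-span-id": span["span_id"]}

def handle_request(headers):
    # The one rule: read incoming context, attach it to your span,
    # and forward it on every downstream call.
    return make_span(headers.get("x-trace-id"), headers.get("x-parent-span-id"))

edge = make_span()                               # root span at the edge
svc_a = handle_request(outgoing_headers(edge))   # propagates: same trace_id
svc_b = handle_request({})                       # drops the headers: new trace!

assert svc_a["trace_id"] == edge["trace_id"]
assert svc_b["trace_id"] != edge["trace_id"]     # the tree is broken here
```

svc_b is exactly the propagation bug described above: one dropped header, and everything downstream of it appears as an unrelated trace.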