Scaling

The axes you can push on, in roughly the order you should push them. Most teams reach for sharding or microservices long before the simpler moves are exhausted.

The senior move

When asked "how would you scale this?", resist naming an architecture. Start with the bottleneck: where is time actually spent, what is the hot path, what is the cheapest reversible move? Sharding is the last move, not the first one.

Order of operations

  1. Measure. Know your p50/p95/p99 latency, throughput, and where time is spent.
  2. Optimize queries + add indexes. The usual 10× wins come from fixing N+1 patterns and missing indexes.
  3. Cache. Cheap, reversible, huge win on read-heavy paths.
  4. Scale vertically. Simplest; postpones the architectural change.
  5. Add read replicas. Splits read load without touching write path.
  6. Offload async work. Move anything that can be eventual out of the request path.
  7. Scale horizontally (stateless tier). Needs shared session store + stateless code.
  8. Shard. Last resort; commit to the chosen key carefully — it is very hard to change.
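Step 2 is usually the biggest single win. A minimal sketch of the N+1 shape and its batched fix, using an in-memory SQLite database (the `users`/`orders` schema is illustrative):

```python
import sqlite3

# Illustrative schema: users and their orders.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_orders_user_id ON orders(user_id);  -- the "missing index"
""")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25)])

def totals_n_plus_one():
    # N+1: one query for the users, then one query per user.
    users = db.execute("SELECT id, name FROM users").fetchall()
    return {name: db.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
        (uid,)).fetchone()[0] for uid, name in users}

def totals_batched():
    # One round trip: the JOIN replaces N per-user queries.
    rows = db.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)

assert totals_n_plus_one() == totals_batched() == {"ada": 12.5, "bob": 7.25}
```

Same result either way; the difference is N+1 round trips versus one, which dominates latency once the database is over a network hop.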

Scaling axes

| Axis | What it is | First move | Ceiling | Cost |
| --- | --- | --- | --- | --- |
| Vertical (scale up) | Bigger box: more CPU, RAM, faster disk. | Always the first move. Fewer moving parts = fewer bugs. | One machine. Usually 10×–100× current load before you hit a cloud-provider instance limit. | Roughly linear in price, plus downtime per upgrade and a single point of failure. |
| Horizontal (scale out) | More boxes, load-balanced. | Make the app stateless — session in Redis / JWT, no per-pod disk state. | Bottleneck shifts to shared dependencies (DB, cache, queue). | Operational complexity: deployment orchestration, cross-instance observability, warm-up. |
| Read replicas | One primary for writes, many replicas for reads. | Route reads to replicas; keep "read my writes" paths on the primary. | Replication lag becomes user-visible at high write volume. | Staleness, failover complexity, and the replica fleet itself. |
| Caching | Return hot reads from a fast store instead of the database. | Cache the most expensive + most-read queries. Cache-aside with a short TTL. | Hit rate. Past 90% hit rate, each further point is expensive. | Invalidation complexity. A stale cache is worse than a slow query. |
| Sharding | Partition one dataset across many DBs. | Pick a shard key that matches the dominant access pattern. | Hot-shard problem; cross-shard queries are expensive. | Dramatic — schema migrations, resharding, cross-shard txns all become hard problems. |
| Asynchronous offload | Move slow / bursty work to a queue for background workers. | Identify synchronous non-critical work (emails, thumbnails, search index updates). | Only limited by worker fleet and queue durability. | Eventual consistency in the user flow; need a dedupe and DLQ story. |
| CDN / edge | Serve static + cacheable dynamic content from POPs close to users. | Put every public GET through the CDN. Version assets with content hashes. | Personalized / authenticated paths cannot be edge-cached trivially. | Cache-key discipline; invalidation fan-out. |
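The caching row can be made concrete. A minimal cache-aside sketch with a short TTL — the in-process dict stands in for Redis or similar, and `load_user` is a hypothetical expensive query:

```python
import time

cache = {}          # key -> (value, expires_at); stands in for Redis
TTL_SECONDS = 30    # short TTL keeps staleness bounded

calls = 0
def load_user(user_id):
    # Hypothetical expensive database query.
    global calls
    calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    # Cache-aside: check the cache, fall through to the DB on a miss,
    # then populate the cache so the next read is a hit.
    now = time.monotonic()
    hit = cache.get(user_id)
    if hit is not None and hit[1] > now:
        return hit[0]
    value = load_user(user_id)
    cache[user_id] = (value, now + TTL_SECONDS)
    return value

get_user(42); get_user(42)   # second call is served from cache
assert calls == 1
```

The short TTL is the hedge against the invalidation problem noted in the Cost column: worst-case staleness is bounded at TTL_SECONDS without any explicit invalidation logic.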

Patterns worth naming

| Pattern | Description | When to use | Gotcha |
| --- | --- | --- | --- |
| Stateless services behind a load balancer | Any instance serves any request. Session state in a shared store. | Horizontal scaling requires this. Treat it as table stakes. | Websockets / long-lived connections need sticky sessions or a pub/sub backplane. |
| CQRS (command-query separation at scale) | Write side and read side are separate stores; read side is optimized per query. | Read-heavy + many distinct read shapes (search, analytics, timeline). | Introduces eventual consistency between write and read models. Not free. |
| Event sourcing | Store the events, derive state. Read models are projections. | Auditability critical; replayable history; evolving read needs. | Event schema evolution is the hardest part; versioning events is forever. |
| Circuit breaker | Stop calling a failing dependency after N errors; fail fast; probe periodically. | Any synchronous cross-service call in a high-traffic path. | Half-open probe storms — randomize the retry timing. |
| Backpressure | Signal upstream to slow down when downstream saturates (bounded queues, 429s). | Any pipeline where unbounded buffering can exhaust memory. | Bounded queue + blocking producer can deadlock — prefer drop + retry headers. |
| Bulkhead | Isolate resource pools per dependency so one slow client cannot starve others. | Multi-tenant services, or services calling multiple downstream deps. | Over-partitioning wastes capacity; size pools by observed concurrency, not guesses. |
| Hedged requests | For tail-latency-sensitive reads, fire a second request after p95; take the first response. | Read paths where tail latency matters (search, recommendations). | Doubles load on the slow path; pair with tight timeouts. |
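The circuit breaker row fits in a few lines. A minimal, deterministic sketch — threshold and cooldown values are illustrative, and a production breaker would also jitter the half-open probe timing as noted in the Gotcha column:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast until `cooldown` elapses."""

    def __init__(self, threshold=3, cooldown=5.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Open: fail fast without touching the dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit and resets the count
        return result
```

After `threshold` consecutive failures the breaker raises immediately instead of calling the dependency; once the cooldown passes, a single probe call decides whether the circuit closes again.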