Databases

The four decisions interviewers push on: relational vs not, isolation level, replication mode, and sharding strategy. Plus a reminder on indexing.

The senior framing

"Which database" is the wrong first question. The right questions are: what is the access pattern, what consistency do I need, what is the read/write mix, and what is the failure mode I cannot tolerate? The database falls out of the answers.

SQL vs NoSQL

SQL vs NoSQL — pick by access pattern, not by hype

| Criterion | Relational (Postgres, MySQL) | Non-relational (varies) |
|---|---|---|
| Data shape | Normalized tables with a known schema; cross-entity joins cheap | Denormalized document / KV / wide-column; joins expensive or absent |
| Consistency | Strong by default; ACID transactions across rows and tables | Eventual by default in most distributed stores; tunable per op in some |
| Scale ceiling | Vertical first; horizontal needs an explicit sharding strategy | Horizontal by design (DynamoDB, Cassandra); scale-out is the point |
| Query flexibility | Ad-hoc queries on any indexed column; analytics-friendly | Only the access patterns you designed for; redesign to add new ones |
| Best fit | OLTP, financial, anything with cross-entity integrity | High-volume KV, time-series, session stores, known-shape workloads |

Rule of thumb: Default to Postgres until you have a concrete reason not to. "We might need scale" is not a reason; millions of rows fit on one Postgres node.

Isolation levels

Ordered weakest → strongest. Know what each level allows, not just what it prevents — that is where bugs live.

| Level | Prevents | Still allows | Typical use |
|---|---|---|---|
| Read Uncommitted | Nothing beyond corruption | Dirty reads, non-repeatable reads, phantoms, lost updates | Never in practice. Many engines silently upgrade to Read Committed. |
| Read Committed | Dirty reads | Non-repeatable reads, phantoms, lost updates | Postgres default. Fine for most web apps — each statement sees a fresh snapshot. |
| Repeatable Read | Dirty reads, non-repeatable reads | Phantoms per the SQL standard (MySQL InnoDB blocks them anyway via gap locks); lost updates without SELECT FOR UPDATE | MySQL InnoDB default. Postgres Repeatable Read is actually snapshot isolation. |
| Snapshot Isolation | Dirty reads, non-repeatable reads, phantoms | Write skew (two transactions read overlapping sets, write non-overlapping rows, both commit) | Postgres Repeatable Read. Good general default; watch for write skew in banking-style invariants. |
| Serializable | All of the above, including write skew | Nothing — equivalent to some serial order | Financial / invariant-critical paths. Highest cost; expect aborts and retry loops. |
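Write skew is the subtlest row in that table, so here is a toy simulation of it in plain Python (no real database involved). The scenario and invariant are the classic on-call-doctors example, used here purely as an illustration: each "transaction" reads its own consistent snapshot, checks the invariant against that snapshot, then writes a row the other transaction never touched — so a snapshot-isolation engine's write-conflict check passes for both, yet the combined result violates the invariant.

```python
# Simulated write skew under snapshot isolation. Invariant: at least one
# doctor must remain on call. Neither transaction writes a row the other
# wrote, so neither commit is rejected.

def run_write_skew_demo():
    # Shared "table": doctor -> on_call flag.
    db = {"alice": True, "bob": True}

    # Both transactions take their snapshot before either commits.
    snapshot_t1 = dict(db)
    snapshot_t2 = dict(db)

    # T1: Alice goes off call. Her snapshot shows 2 on call, so it's allowed.
    if sum(snapshot_t1.values()) >= 2:
        db["alice"] = False  # commit succeeds: T1 wrote only "alice"

    # T2: Bob does the same. HIS snapshot also shows 2 on call.
    if sum(snapshot_t2.values()) >= 2:
        db["bob"] = False    # commit succeeds: T2 wrote only "bob"

    return db

result = run_write_skew_demo()
assert sum(result.values()) == 0  # nobody on call: invariant broken
```

Serializable isolation would abort one of the two; under snapshot isolation the standard fix is `SELECT FOR UPDATE` on the rows whose count you checked, which materializes the conflict.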

Replication modes

| Mode | How it works | Consistency | Trade-off |
|---|---|---|---|
| Single-leader async | One primary accepts writes; replicas apply the WAL after commit. | Read-your-writes on the primary; stale reads on replicas. | Simple + fast writes. Failover may lose the committed-but-unreplicated tail of the log. |
| Single-leader sync (or semi-sync) | Primary waits for ≥1 replica to ack before acknowledging the client. | No data loss on failover for acked writes. | Write latency = primary + slowest synced replica. One slow replica stalls writes. |
| Multi-leader | Writes accepted at any leader; leaders replicate asynchronously to each other. | Conflicting writes require resolution (last-write-wins, CRDTs, app-level merge). | Good for multi-region write locality. Conflict resolution is the hard part. |
| Leaderless (quorum) | Client writes to W replicas, reads from R; overlap ensures freshness when W + R > N. | Tunable per request. Hinted handoff and read repair fix stragglers. | Complex failure modes; strong consistency is possible but expensive. DynamoDB, Cassandra. |
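The quorum-overlap condition is worth internalizing as arithmetic, not folklore: if every write lands on W of N replicas and every read consults R, then W + R > N forces the two sets to intersect, so at least one replica in every read set has the latest acked write. A minimal sketch (this ignores sloppy quorums and hinted handoff, which weaken the guarantee in practice):

```python
# Leaderless-quorum freshness check: any W-replica write set and any
# R-replica read set must share at least one replica when W + R > N
# (pigeonhole: W + R replicas drawn from only N distinct ones).

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return w + r > n

# Typical Dynamo-style config: N=3, W=2, R=2 -> fresh reads guaranteed.
assert quorums_overlap(3, 2, 2)
# W=1, R=1 on N=3 is fast, but a read may hit a replica the write missed.
assert not quorums_overlap(3, 1, 1)
```

Tuning per request means choosing W and R per operation: W=N, R=1 for read-heavy workloads; W=1, R=N when writes must be cheap and reads can pay.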

Sharding strategies

| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Range | Partition by key range: ids 0–1M → shard A, 1M–2M → shard B. | Range scans stay on one shard; easy to reason about. | Monotonic keys (timestamps, autoincrement) create a hot shard on inserts. |
| Hash | Partition by hash(key) mod N (or consistent hashing). | Even load distribution; no natural hotspots. | Destroys range-scan locality; rescaling is painful without consistent hashing. |
| Directory / lookup | A routing service stores a key → shard mapping. | Maximum flexibility; resharding is metadata-only. | Directory is a new bottleneck + single point of failure; needs its own scale plan. |
| Geographic | Partition by region / tenant jurisdiction. | Latency + data-residency (GDPR, HIPAA) solved in one move. | Cross-region queries are the slow path; rebalance is political, not just technical. |
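The "rescaling is painful" cell deserves a number. A sketch comparing naive modulo sharding against a consistent-hash ring (virtual-node style); the class and vnode count are illustrative choices, not any particular library's API:

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    # md5 gives a stable, well-mixed value; Python's hash() varies per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Naive modulo sharding: adding one shard remaps most keys.
def mod_shard(key: str, n_shards: int) -> int:
    return stable_hash(key) % n_shards

# Consistent hashing: each shard owns many points ("vnodes") on a ring; a key
# belongs to the first shard point clockwise. Adding a shard only steals
# ~1/N of the keyspace from the existing shards.
class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]

# Modulo: going from 3 to 4 shards moves ~3/4 of all keys.
moved_mod = sum(mod_shard(k, 3) != mod_shard(k, 4) for k in keys)

# Consistent hashing: the same change moves only ~1/4 of keys.
before = ConsistentHashRing(["a", "b", "c"])
after = ConsistentHashRing(["a", "b", "c", "d"])
moved_ring = sum(before.shard_for(k) != after.shard_for(k) for k in keys)

print(f"modulo moved {moved_mod}, ring moved {moved_ring} of {len(keys)}")
```

The intuition for the modulo number: a key stays put only when `h mod 3 == h mod 4`, which holds for just 3 of every 12 hash values.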

Index types

| Type | Best for | Avoid for | Note |
|---|---|---|---|
| B-tree (default) | Equality + range + ORDER BY on indexed columns. | Full-text search, high-cardinality multi-column filters with arbitrary order. | Postgres / MySQL default. 99% of indexes you create are B-tree. |
| Hash | Pure equality lookup at scale. | Range, ORDER BY, prefix matching. | Rarely worth it — B-tree equality lookups are already fast. |
| GIN / inverted | Full-text, JSONB containment, arrays. | Low-cardinality columns where a bitmap scan is better. | Postgres GIN. Larger + slower to write than B-tree but enables `@>` and full-text queries. |
| Covering (INCLUDE) | Index-only scans — query reads only indexed + included columns. | Wide payloads (defeats the memory-density point of an index). | Index stores extra non-key columns so the row lookup is skipped. |
| Partial | Indexing a hot subset (`WHERE status = 'active'`) to shrink index size. | Predicates that drift over time (requires reindex). | Huge wins when the hot slice is <5% of rows. |
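A runnable sketch of the partial-index row, using SQLite as a stand-in for Postgres (the `CREATE INDEX ... WHERE` syntax is the same, though the planners differ; the `orders` schema is invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, user_id INTEGER)"
)
# ~5% of rows are 'active' -- the hot slice worth indexing.
conn.executemany(
    "INSERT INTO orders (status, user_id) VALUES (?, ?)",
    [("active" if i % 20 == 0 else "archived", i % 100) for i in range(2000)],
)

# Partial index: only the active slice is indexed, so the index stays tiny.
conn.execute(
    "CREATE INDEX idx_orders_active ON orders (user_id) WHERE status = 'active'"
)

# A query whose predicate implies the index predicate is eligible to use it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM orders WHERE status = 'active' AND user_id = 7"
).fetchall()
print(plan)  # the detail column should name idx_orders_active
```

Drop the `status = 'active'` conjunct from the query and the index is no longer eligible — the planner cannot prove the predicate holds, which is exactly the "predicates that drift" failure mode from the table.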

Indexing rules

  • Index the columns you filter or join on, not the ones you select.
  • Composite index order matters: `(user_id, created_at)` serves `WHERE user_id = ? ORDER BY created_at`, not the reverse.
  • An index on a low-cardinality column (boolean, enum-of-3) rarely earns its write cost — consider a partial index.
  • `EXPLAIN (ANALYZE, BUFFERS)` is the ground truth. Never tune blind.
  • The N+1 query is the #1 interview-day bug. Detect it with query logs; fix it with a `JOIN` or a batched `IN (…)`.
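The N+1 rule is easiest to see side by side. A sketch with SQLite and an invented two-table schema — the first version issues one query per user (N+1 round trips), the second fetches everything in one batched `IN (...)` and groups in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin'), (3, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]

# N+1 shape: one query for the users, then one query PER user.
posts_n_plus_1 = {
    uid: [t for (t,) in conn.execute(
        "SELECT title FROM posts WHERE user_id = ?", (uid,)
    )]
    for uid in user_ids
}

# Fix: a single batched query, grouped in application code. 2 queries total.
placeholders = ",".join("?" for _ in user_ids)
rows = conn.execute(
    f"SELECT user_id, title FROM posts WHERE user_id IN ({placeholders})",
    user_ids,
).fetchall()
posts_batched = {uid: [] for uid in user_ids}
for uid, title in rows:
    posts_batched[uid].append(title)

assert posts_batched == posts_n_plus_1  # same data, N-1 fewer round trips
```

In an ORM the same fix usually goes by eager loading (`JOIN`-based or `IN`-based prefetch); the query-log symptom to watch for is the identical statement repeated with only the bound id changing.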