Testing

The test pyramid, the five test doubles everyone conflates, pytest idioms to know by heart, and the real causes of flakes.

What interviewers grade

Juniors say "I write unit tests." Seniors can distinguish unit / integration / e2e, pick the right test double for the seam, and explain why a specific test flakes. If you can diagnose a flake in two sentences, you signal real production experience.

The test pyramid

| Level | Scope | Speed | Fragility | Role |
| --- | --- | --- | --- | --- |
| Unit | One function / class; no I/O, no external services. | Milliseconds; run thousands per second. | Low — fail only when the code truly broke. | 80% of tests. Cover logic branches, edge cases, boundaries. |
| Integration | Multiple units together with real collaborators (DB, cache, queue). | Tens to hundreds of ms each. | Medium — schema drift, fixture data, timing. | 15%. Cover wiring: migration ran, ORM query works, txn boundaries correct. |
| End-to-end | Full system through the public entrypoint (HTTP, CLI). | Seconds to tens of seconds. | High — any cross-cutting change breaks them. | 5%. One happy path per critical user flow; more is a tax on velocity. |
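The unit/integration split is easiest to see side by side. A minimal sketch, assuming a hypothetical `apply_discount` function, with an in-memory sqlite3 database standing in for the real collaborator:

```python
import sqlite3


def apply_discount(price: float, pct: float) -> float:
    """Pure logic: the ideal unit-test target (hypothetical example)."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    return round(price * (1 - pct / 100), 2)


# Unit test: no I/O, milliseconds, fails only when the logic breaks.
def test_apply_discount_unit():
    assert apply_discount(100.0, 25) == 75.0


# Integration test: a real (if tiny) database collaborator; it verifies
# the SQL and the wiring, not just the arithmetic.
def test_discount_persists_integration():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.execute(
        "INSERT INTO orders (total) VALUES (?)", (apply_discount(100.0, 25),)
    )
    (total,) = conn.execute("SELECT total FROM orders").fetchone()
    assert total == 75.0
    conn.close()
```

The unit test would survive a database migration untouched; the integration test is exactly the one that should break if the schema drifts.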

Test doubles

Five distinct things that get called "mock" in casual conversation. Knowing the difference is a senior tell.

| Double | What it is | Use for | Avoid |
| --- | --- | --- | --- |
| Dummy | Placeholder that satisfies a signature; never called. | Filling a required constructor parameter that the test path does not exercise. | Using a dummy when the test actually invokes the collaborator — silent pass. |
| Stub | Returns canned responses to configured calls. No assertions. | Testing code that queries something — you control what comes back. | Re-implementing the real dependency; stubs should be dumb. |
| Spy | Records calls made to it; test later asserts what happened. | Verifying a logger or analytics call was emitted correctly. | Over-asserting on internal calls — couples tests to implementation. |
| Mock | Pre-programmed with expectations; verifies calls during or after the test. | Verifying behavior at a seam, e.g., "payment service was called once with these args." | Mocking objects you own. Mock at the boundary of your system, not inside it. |
| Fake | Working implementation, but simplified (in-memory DB, in-process queue). | Integration-ish tests without the real dependency — fast and deterministic. | Divergence from real behavior. Test the real thing at least once in CI. |
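In Python, `unittest.mock.Mock` can play several of these roles depending on how you use it. A sketch, assuming a hypothetical `charge` function as the seam to a payment gateway:

```python
from unittest.mock import Mock


def charge(gateway, amount):
    """Hypothetical seam: delegates to a payment-gateway collaborator."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return gateway.charge(amount)


# Stub: canned return value; the test asserts on output, not on calls.
stub = Mock()
stub.charge.return_value = {"status": "ok"}
assert charge(stub, 10) == {"status": "ok"}

# Spy/mock: the same object, but now the test verifies the interaction.
spy = Mock()
charge(spy, 25)
spy.charge.assert_called_once_with(25)


# Fake: a real, simplified implementation with its own state.
class FakeGateway:
    def __init__(self):
        self.charged = []

    def charge(self, amount):
        self.charged.append(amount)
        return {"status": "ok"}


fake = FakeGateway()
charge(fake, 5)
assert fake.charged == [5]
```

The distinction is in the assertion: the stub test checks what came back, the spy/mock test checks what was sent, and the fake keeps enough state to support either.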

pytest idioms

Parametrize

— Run the same test body against a table of inputs.
```python
@pytest.mark.parametrize("n,expected", [(0, 0), (1, 1), (5, 120)])
def test_factorial(n, expected):
    assert factorial(n) == expected
```

Fixture

— Shared setup/teardown with arbitrary scope.
```python
@pytest.fixture
def db():
    conn = make_conn()
    yield conn
    conn.close()
```

Fixture scopes

— Amortize expensive setup across many tests.
```python
@pytest.fixture(scope="session")  # or "module", "class", "function"
def app():
    return create_app()
```

monkeypatch

— Patch attribute, env var, or dict entry for one test.
```python
def test_uses_env(monkeypatch):
    monkeypatch.setenv("FEATURE_X", "on")
    assert feature_enabled("x")
```

tmp_path

— Per-test temp directory; automatically cleaned up.
```python
def test_writes_file(tmp_path):
    p = tmp_path / "out.txt"
    write_report(p)
    assert p.read_text() == "ok"
```

raises

— Assert a block raises an exception.
```python
def test_rejects_negative():
    with pytest.raises(ValueError, match="must be >= 0"):
        withdraw(-1)
```

Marks + filtering

— Group tests by speed, tier, or feature; select at CLI.
```python
@pytest.mark.slow
def test_big_scan(): ...

# Run fast tests only:
#   pytest -m "not slow"
```

Flake causes

| Cause | Symptom | Fix |
| --- | --- | --- |
| Time-dependent assertions | Test passes locally, fails on slow CI. "Expected 1.00, got 1.02." | Freeze time with `freezegun` or inject a clock. Compare timings with a tolerance. |
| Shared state between tests | Works alone; fails when another test runs first. Order-dependent. | Reset global/singleton state in teardown. Prefer pure functions and fresh fixtures. |
| Network / real external service | Fails when the network hiccups or the service is down for maintenance. | Fake / record-replay (VCR) for most tests. Real network only in a tagged slow suite. |
| Concurrency / races | Passes 99 times, fails 1 time. Harder on more cores. | Deterministic scheduling; inject a fake event loop; replace sleeps with explicit sync points. |
| Unseeded randomness | A rare input pattern fails once per week in CI. | Fix the seed in `conftest.py`. Record the failing input on assertion failure. |
| Filesystem / tmp collisions | Parallel test runs trip over each other on disk. | Use `tmp_path` (per-test dir). Never hard-code `/tmp/foo`. |
| Leaky test data | Today's test fails because yesterday's test left a row behind. | Roll back a transaction per test, or truncate tables in teardown. Factories over fixtures for volume. |
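Injecting a clock, the fix for the first row, is small enough to show in full. A sketch: `Timer` and its injected `now` callable are invented for illustration.

```python
import time


class Timer:
    """Hypothetical class that depends on wall-clock time."""

    def __init__(self, now=time.monotonic):
        self.now = now          # injected clock; defaults to the real one
        self.start = self.now()

    def elapsed(self):
        return self.now() - self.start


# Deterministic test: the fake clock advances only when we say so,
# so the assertion is exact on any machine, however slow.
def test_elapsed_is_exact():
    ticks = iter([100.0, 101.5])
    timer = Timer(now=lambda: next(ticks))
    assert timer.elapsed() == 1.5
```

Production code keeps the real clock by default; only the test swaps it out, so the seam costs one constructor parameter.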

Senior rules of thumb

TDD — when to reach for it

TDD buys you fast feedback, a design pressure toward testable seams, and a safety net while refactoring. It is not a faith: if a test does not pay rent by catching real regressions, it is waste. Skip TDD when the shape of the code is obvious; reach for it when the logic is gnarly or the cost of a bug is high.