
Phase 04 — Test

How 'tests pass' stops being a feeling and starts being a contract.

TL;DR (human)

Five tiers of tests, ordered by cost. Spend most budget on tiers 1–2 (schema parse + unit). Reserve E2E for golden paths. Tests assert on codes / structure, not rendered text. Hermetic over E2E for bug repro. Verify-first before "fixing" a flaky test.

For agents

Test layers (target distribution)

Tier  Type                                                 Runtime   % of suite
1     Schema parse / contract                              <1 ms     ~30%
2     Unit (pure functions, single class)                  <10 ms    ~40%
3     Integration (handler + store + adapter, in-process)  <500 ms   ~25%
4     Visual regression / a11y                             seconds   ~4%
5     E2E (real app, real services)                        minutes   ~1%

Inverted pyramids (mostly E2E) produce flaky, slow suites with poor signal.

Per pillar — Test-phase discipline

Architecture

  • Every contract has a parse test (happy + reject).
  • Every error code is asserted somewhere in the suite (a separate gate scans for code: "<CODE>" assertions).
  • Handler return values are parsed by the result schema (catches handler bugs at boundary).
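
The happy + reject pair for a contract could look roughly like the sketch below. It uses a hand-rolled parser to stay dependency-free; `parseCreateNote`, the `CreateNote` shape, and the `VALIDATION_FAILED` code are hypothetical names for illustration, not part of this playbook. In a real suite the two cases would be `it(...)` blocks against the project's own schema library.

```typescript
// Hypothetical contract: the type, function, and error code are illustrative only.
type CreateNote = { title: string; body: string };

type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; code: "VALIDATION_FAILED"; path: string };

function parseCreateNote(input: unknown): ParseResult<CreateNote> {
  if (typeof input !== "object" || input === null) {
    return { ok: false, code: "VALIDATION_FAILED", path: "" };
  }
  const record = input as Record<string, unknown>;
  const title = record["title"];
  const body = record["body"];
  if (typeof title !== "string" || title.length === 0) {
    return { ok: false, code: "VALIDATION_FAILED", path: "title" };
  }
  if (typeof body !== "string") {
    return { ok: false, code: "VALIDATION_FAILED", path: "body" };
  }
  return { ok: true, value: { title, body } };
}

// Happy case: a well-formed payload parses.
const happy = parseCreateNote({ title: "Hello", body: "World" });

// Reject case: a malformed payload fails with a code and a path, not a message.
const rejected = parseCreateNote({ title: "", body: "World" });
```

Running the same parser over a handler's return value before it leaves the handler gives the boundary check described in the third bullet.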

Security

  • Auth tests: missing principalId → AUTH_REQUIRED.
  • Tenancy tests: caller cannot access other-workspace data.
  • Egress tests: blocked host produces SECURITY_EGRESS_DENIED.
  • Audit tests: privileged action writes intent before execute.
  • Secrets tests: logger redaction works on known patterns.
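
As one illustration of the secrets bullet, here is a minimal sketch of a redaction check. The `redact` function and both patterns are assumptions made up for this example; the project's real logger and pattern list may look quite different.

```typescript
// Hypothetical redactor: these patterns are examples, not the project's real list.
const SECRET_PATTERNS: RegExp[] = [
  /Bearer\s+[A-Za-z0-9._-]+/g, // bearer tokens
  /(password|api_key)=[^\s&]+/gi, // key=value secrets in query strings
];

function redact(line: string): string {
  // Apply every pattern in turn; each match is replaced by a fixed marker.
  return SECRET_PATTERNS.reduce(
    (out, pattern) => out.replace(pattern, "[REDACTED]"),
    line,
  );
}

const redactedHeader = redact("Authorization: Bearer abc.def-123");
const redactedQuery = redact("login?password=hunter2&next=/home");
const untouched = redact("user opened settings");
```

A test built on this would assert that the known secret never appears in the output and that ordinary log lines pass through unchanged.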

UI-UX

  • A11y: axe scan on every changed screen (@axe-core in CI).
  • Visual regression: per-primitive snapshot in default + test brand kit.
  • Intl parity: every key exists in every shipped locale.
  • Empty-state coverage: every list surface has at least one empty-state test.
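
The intl parity check is essentially a key-set comparison. A minimal sketch, assuming message catalogs are flat key/value objects and that one locale acts as the reference; `findParityGaps` and the sample keys are hypothetical:

```typescript
type Catalog = Record<string, string>;
type ParityGap = { missing: string[]; extra: string[] };

function hasKey(catalog: Catalog, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(catalog, key);
}

// For each non-reference locale, list keys it lacks and keys it has
// that the reference locale does not. Locales with no gaps are omitted.
function findParityGaps(
  catalogs: Record<string, Catalog>,
  referenceLocale: string,
): Record<string, ParityGap> {
  const reference = catalogs[referenceLocale];
  if (reference === undefined) {
    throw new Error(`Unknown reference locale: ${referenceLocale}`);
  }
  const referenceKeys = Object.keys(reference);
  const gaps: Record<string, ParityGap> = {};
  for (const [locale, catalog] of Object.entries(catalogs)) {
    if (locale === referenceLocale) continue;
    const missing = referenceKeys.filter((key) => !hasKey(catalog, key));
    const extra = Object.keys(catalog).filter((key) => !hasKey(reference, key));
    if (missing.length > 0 || extra.length > 0) {
      gaps[locale] = { missing, extra };
    }
  }
  return gaps;
}

const parityGaps = findParityGaps(
  {
    en: { "save.label": "Save", "list.empty": "Nothing here yet" },
    de: { "save.label": "Speichern" },
  },
  "en",
);
```

A parity test would then assert that the returned object is empty for the shipped catalogs.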

Quality

  • Per-package coverage hits its threshold.
  • Mutation testing on stable utility modules.
  • Property-based tests for parsers / serializers / math.
  • No it("works") / it("test 1") — names read like sentences.
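
To make the property-based bullet concrete, here is a hand-rolled round-trip property for a small serializer. This is a sketch under assumptions: `encodePairs` / `decodePairs` are a hypothetical module under test, and the seeded generator stands in for a real property-testing library, which would also shrink counterexamples.

```typescript
type Pairs = Record<string, string>;

// Hypothetical serializer under test.
function encodePairs(pairs: Pairs): string {
  return Object.entries(pairs)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
}

function decodePairs(text: string): Pairs {
  const out: Pairs = {};
  if (text === "") return out;
  for (const part of text.split("&")) {
    const eq = part.indexOf("=");
    const rawKey = eq === -1 ? part : part.slice(0, eq);
    const rawValue = eq === -1 ? "" : part.slice(eq + 1);
    out[decodeURIComponent(rawKey)] = decodeURIComponent(rawValue);
  }
  return out;
}

// Small seeded PRNG (mulberry32-style) so any failure reproduces from the seed.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Strings drawn from an alphabet that includes the separator characters.
function randomString(rand: () => number): string {
  const alphabet = "ab=&% é";
  const length = Math.floor(rand() * 6);
  let s = "";
  for (let i = 0; i < length; i++) {
    s += alphabet.charAt(Math.floor(rand() * alphabet.length));
  }
  return s;
}

// Property: decodePairs(encodePairs(x)) equals x for every generated x.
function checkRoundTrip(
  seed: number,
  runs: number,
): { ok: true } | { ok: false; counterexample: Pairs } {
  const rand = seededRandom(seed);
  for (let run = 0; run < runs; run++) {
    const input: Pairs = {};
    const size = Math.floor(rand() * 4);
    for (let i = 0; i < size; i++) {
      input[randomString(rand)] = randomString(rand);
    }
    const output = decodePairs(encodePairs(input));
    if (JSON.stringify(output) !== JSON.stringify(input)) {
      return { ok: false, counterexample: input };
    }
  }
  return { ok: true };
}

const roundTrip = checkRoundTrip(42, 200);
```

The fixed seed keeps the run deterministic, in line with the determinism rules later in this phase.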

Governance

  • PR-intent gate passes (manifest matches diff).
  • ADR / RFC integrity gate passes.

AI-collaboration

  • Verify-first before "fixing" a red signal.
  • Honest test reporting (failed tests quoted, skipped tests stated).

Triage protocol — when a test fails

  1. Reproduce locally. Confirm the failure on your machine.
  2. Stash your changes and check whether the test is also red on origin/main. If main is red, the failure is pre-existing: file an issue; do not "fix" it in your branch.
  3. Determine tier. Could a lower-tier test pin this? If yes, add the lower-tier test, fix the bug, both turn green.
  4. Fix. The fix is the smallest diff that flips the test from red to green without changing other behavior.
  5. Add a regression test if one is missing, that is, if the failure was a real bug that no existing test covered.

Hermetic over E2E for bug repro

When a bug is reported:

  1. Try to reproduce in a unit test against the suspect module. Pin it.
  2. If that's not enough, write an integration test that wires stores + handlers in-process.
  3. Use E2E only if the bug depends on cross-process or browser-only behavior.

A 2-second unit test that fails reliably beats a 60-second E2E that flakes.
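
A hermetic repro at step 2 might look like the sketch below: an in-memory store wired to a handler in-process, with assertions on codes. Everything here is hypothetical (`InMemoryNoteStore`, `getNoteHandler`, the `NOT_FOUND` code); only `AUTH_REQUIRED` and `AUTH_FORBIDDEN` are codes this playbook actually mentions.

```typescript
type Note = { id: string; workspaceId: string; text: string };
type Caller = { principalId?: string; workspaceId: string };
type GetNoteResult =
  | { ok: true; note: Note }
  | { ok: false; code: "AUTH_REQUIRED" | "AUTH_FORBIDDEN" | "NOT_FOUND" };

// In-memory store stands in for the real persistence adapter.
class InMemoryNoteStore {
  private readonly notes = new Map<string, Note>();

  put(note: Note): void {
    this.notes.set(note.id, note);
  }

  get(id: string): Note | undefined {
    return this.notes.get(id);
  }
}

// Hypothetical handler: checks auth, then existence, then tenancy.
function getNoteHandler(
  store: InMemoryNoteStore,
  caller: Caller,
  id: string,
): GetNoteResult {
  if (!caller.principalId) return { ok: false, code: "AUTH_REQUIRED" };
  const note = store.get(id);
  if (!note) return { ok: false, code: "NOT_FOUND" };
  if (note.workspaceId !== caller.workspaceId) {
    return { ok: false, code: "AUTH_FORBIDDEN" };
  }
  return { ok: true, note };
}

// Fresh store per test: no shared state, no network, no real database.
const noteStore = new InMemoryNoteStore();
noteStore.put({ id: "n1", workspaceId: "ws-a", text: "private" });

const anonymous = getNoteHandler(noteStore, { workspaceId: "ws-a" }, "n1");
const crossTenant = getNoteHandler(
  noteStore,
  { principalId: "u2", workspaceId: "ws-b" },
  "n1",
);
const owner = getNoteHandler(
  noteStore,
  { principalId: "u1", workspaceId: "ws-a" },
  "n1",
);
```

If the reported bug were a tenancy leak, the `crossTenant` case is the one that would pin it, and it runs in-process with no services to start.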

Tests assert on codes, not messages

// ✗ wrong — breaks on intl / copy change
expect(err.message).toContain("not authorized");

// ✓ right
expect(err.code).toBe("AUTH_FORBIDDEN");

// ✗ wrong — breaks on intl / copy change
expect(screen.getByText("Save")).toBeInTheDocument();

// ✓ right
expect(screen.getByRole("button", { name: /save/i })).toBeInTheDocument();

Determinism

  • No file system writes outside per-test temp dirs.
  • No network calls (mock the boundary).
  • No clock drift (inject the clock).
  • No global module state.

A test that passes in isolation and fails in parallel has hidden global state. Fix the test, not the order.
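
"Inject the clock" can be sketched as follows. The `Clock` interface, `isExpired`, and `FakeClock` are hypothetical names chosen for this example; the point is only that time-dependent code receives its time source as a parameter instead of calling `Date.now()` directly.

```typescript
// Hypothetical Clock port: production passes the system clock, tests pass a fake.
interface Clock {
  now(): number;
}

const systemClock: Clock = { now: () => Date.now() };

// Code under test reads time only through the injected clock.
function isExpired(issuedAtMs: number, ttlMs: number, clock: Clock): boolean {
  return clock.now() - issuedAtMs >= ttlMs;
}

// Test double: time moves only when the test advances it.
class FakeClock implements Clock {
  private current: number;

  constructor(startMs: number) {
    this.current = startMs;
  }

  now(): number {
    return this.current;
  }

  advance(ms: number): void {
    this.current += ms;
  }
}

const fakeClock = new FakeClock(1000);
const beforeTtl = isExpired(1000, 500, fakeClock); // no time has passed yet
fakeClock.advance(500);
const atTtl = isExpired(1000, 500, fakeClock); // exactly ttlMs has elapsed
```

Because each test constructs its own `FakeClock`, the result does not depend on wall-clock time or on which tests ran before it.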

Common failure modes

  • Inverted pyramid. Mostly E2E. Slow + flaky. → Push to lower tiers.
  • Flake "fixed" by setTimeout. Hidden flake. → Find the deterministic signal; expect.poll() / waitFor().
  • Coverage 95% but error codes never asserted. → Separate gate scans for asserted codes.
  • Tests share fixtures via mutation. Order-dependent. → Fresh fixtures per test.
  • Mock at every layer. End up testing the mocks. → Mock at the trust boundary.
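
The "separate gate" for asserted error codes could be as simple as a text scan over test sources. A minimal sketch, assuming assertions take one of the two textual forms used in this document (`code: "<CODE>"` or `.code).toBe("<CODE>")`); `findUnassertedCodes` and how the declared code list is obtained are assumptions, not the project's actual gate.

```typescript
// Returns the declared codes that no test source asserts in either
// recognised form. Inputs are plain strings so the scan stays hermetic.
function findUnassertedCodes(
  declaredCodes: string[],
  testSources: string[],
): string[] {
  const patterns: RegExp[] = [
    /code:\s*"([A-Z0-9_]+)"/g,
    /\.code\)\.toBe\("([A-Z0-9_]+)"\)/g,
  ];
  const asserted = new Set<string>();
  for (const source of testSources) {
    for (const pattern of patterns) {
      pattern.lastIndex = 0; // global regexes are stateful; start each scan fresh
      let match = pattern.exec(source);
      while (match !== null) {
        const code = match[1];
        if (code !== undefined) asserted.add(code);
        match = pattern.exec(source);
      }
    }
  }
  return declaredCodes.filter((code) => !asserted.has(code));
}

const unasserted = findUnassertedCodes(
  ["AUTH_REQUIRED", "AUTH_FORBIDDEN", "SECURITY_EGRESS_DENIED"],
  [
    'expect(err.code).toBe("AUTH_FORBIDDEN");',
    'expect(result).toMatchObject({ code: "AUTH_REQUIRED" });',
  ],
);
```

In CI the gate would read the real test files, then fail the build when the returned list is non-empty.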

Exit criteria

Test is continuous, like Build. Each cycle exits when:

  1. New behavior has its test in the same PR.
  2. Coverage thresholds hold.
  3. Suite runs deterministically in CI.

Pre-release adds: full mutation pass, full a11y pass, cold-prod walk of the demo script.
