SDLC Phases/04 test

Phase 04 — Test

How 'tests pass' stops being a feeling and starts being a contract.

Phase 04 — Test

How "tests pass" stops being a feeling and starts being a contract.

TL;DR (human)

Five tiers of tests, ordered by cost. Spend most budget on tiers 1–2 (schema parse + unit). Reserve E2E for golden paths. Tests assert on codes / structure, not rendered text. Hermetic over E2E for bug repro. Verify-first before "fixing" a flaky test.

For agents

Test layers (target distribution)

Tier	Type	Runtime	% of suite
1	Schema parse / contract	<1 ms	~30%
2	Unit (pure functions, single class)	<10 ms	~40%
3	Integration (handler + store + adapter, in-process)	<500 ms	~25%
4	Visual regression / a11y	seconds	~4%
5	E2E (real app, real services)	minutes	~1%

Inverted pyramids (mostly E2E) produce flaky, slow suites with poor signal.

Per pillar — Test-phase discipline

Architecture

Every contract has a parse test (happy + reject).
Every error code is asserted somewhere in the suite (a separate gate scans for code: "\<CODE\>" assertions).
Handler return values are parsed by the result schema (catches handler bugs at boundary).

Security

Auth tests: missing principalId → AUTH_REQUIRED.
Tenancy tests: caller cannot access other-workspace data.
Egress tests: blocked host produces SECURITY_EGRESS_DENIED.
Audit tests: privileged action writes intent before execute.
Secrets tests: logger redaction works on known patterns.

UI-UX

A11y: axe scan on every changed screen (@axe-core in CI).
Visual regression: per-primitive snapshot in default + test brand kit.
Intl parity: every key exists in every shipped locale.
Empty-state coverage: every list surface has at least one empty-state test.

Quality

Per-package coverage hits its threshold.
Mutation testing on stable utility modules.
Property-based tests for parsers / serializers / math.
No it("works") / it("test 1") — names read like sentences.

Governance

PR-intent gate passes (manifest matches diff).
ADR / RFC integrity gate passes.

AI-collaboration

Verify-first before "fixing" a red signal.
Honest test reporting (failed tests quoted, skipped tests stated).

Triage protocol — when a test fails

Reproduce locally. Confirm the failure on your machine.
Stash + verify red on origin/main. If main is red, the failure is pre-existing — file an issue; do not "fix" it in your branch.
Determine tier. Could a lower-tier test pin this? If yes, add the lower-tier test, fix the bug, both turn green.
Fix. The fix is the smallest diff that flips the test from red to green without changing other behavior.
Add a regression test if missing. If the failure was a real bug not previously tested.

Hermetic over E2E for bug repro

When a bug is reported:

Try to reproduce in a unit test against the suspect module. Pin it.
If that's not enough, integration test wiring stores + handlers in-process.
E2E only if cross-process / browser-only behavior.

A 2-second unit test that fails reliably beats a 60-second E2E that flakes.

Tests assert on codes, not messages

// ✗ wrong — breaks on intl / copy change
expect(err.message).toContain("not authorized");

// ✓ right
expect(err.code).toBe("AUTH_FORBIDDEN");

// ✗ wrong — breaks on intl / copy change
expect(screen.getByText("Save")).toBeInTheDocument();

// ✓ right
expect(screen.getByRole("button", { name: /save/i })).toBeInTheDocument();

Determinism

No file system writes outside per-test temp dirs.
No network calls (mock the boundary).
No clock drift (inject the clock).
No global module state.

A test that passes in isolation and fails in parallel has hidden global state. Fix the test, not the order.

Common failure modes

Inverted pyramid. Mostly E2E. Slow + flaky. → Push to lower tiers.
Flake "fixed" by setTimeout. Hidden flake. → Find the deterministic signal; expect.poll() / waitFor().
Coverage 95% but error codes never asserted. → Separate gate scans for asserted codes.
Tests share fixtures via mutation. Order-dependent. → Fresh fixtures per test.
Mock at every layer. End up testing the mocks. → Mock at the trust boundary.

Exit criteria

Test is continuous, like Build. Each cycle exits when:

New behavior has its test in the same PR.
Coverage thresholds hold.
Suite runs deterministically in CI.

Pre-release adds: full mutation pass, full a11y pass, cold-prod walk of the demo script.

Phase 04 — Test

Phase 04 — Test

TL;DR (human)

For agents

Test layers (target distribution)

Per pillar — Test-phase discipline

Triage protocol — when a test fails

Hermetic over E2E for bug repro

Tests assert on codes, not messages

Determinism

Common failure modes

Exit criteria

See also

On this page