Data Classification Pattern
How to label every field by sensitivity, then make the labels drive redaction, retention, residency, and access.
TL;DR (human)
Every field in the system has a classification tag (Public, Internal, Confidential, Restricted, PII, Sensitive-PII). The tag drives behavior: logger redaction, retention windows, storage encryption, region routing, DSAR exportability. Classification lives in the schema, not in a spreadsheet, so agents see it where they write the code.
For agents
The classification ladder
| Class | Examples | Storage | Logs | Retention default | Cross-region |
|---|---|---|---|---|---|
| Public | Product name, public docs, marketing | Anywhere | Logged in full | Indefinite | OK |
| Internal | Workspace names, non-sensitive metadata | App stores | Logged with sanitisation | Indefinite (with cleanup) | OK |
| Confidential | Business logic, internal metrics, API URLs | App stores | Logged with redaction | 1–7 years | Region-aware |
| Restricted | API keys, OAuth tokens, internal secrets | Vault only | Redacted always | Per rotation cycle | Vault per region |
| PII | Name, email, phone, address, IP | Encrypted at rest; access-audited | Redacted; never logged in full | 7 years default; DSAR-deletable | Region-pinned per sovereignty |
| Sensitive-PII | SSN, payment card, health | Tokenised via specialist provider | Never logged | Minimum retention; regulated | Per regulation |
Two-tier PII split: regular PII (commonly handled by most products) vs sensitive PII (PCI-DSS, HIPAA, PSD2 — often tokenised via specialist providers like Stripe so you don't store it directly).
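The ladder can also be held as data, so that tooling reads one source of truth instead of re-deriving the table. A minimal sketch, assuming hypothetical names (`Classification`, `HandlingPolicy`, `LADDER`, `mayLogInFull`) and covering only the Logs and Cross-region columns:

```typescript
// Hypothetical encoding of the ladder; every name here is illustrative.
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

type LogHandling = "full" | "sanitise" | "redact" | "never";
type CrossRegion =
  | "ok" | "region-aware" | "vault-per-region"
  | "region-pinned" | "per-regulation";

interface HandlingPolicy {
  logs: LogHandling;
  crossRegion: CrossRegion;
}

// Values mirror the Logs and Cross-region columns of the table.
const LADDER: Record<Classification, HandlingPolicy> = {
  public: { logs: "full", crossRegion: "ok" },
  internal: { logs: "sanitise", crossRegion: "ok" },
  confidential: { logs: "redact", crossRegion: "region-aware" },
  restricted: { logs: "redact", crossRegion: "vault-per-region" },
  pii: { logs: "redact", crossRegion: "region-pinned" },
  "sensitive-pii": { logs: "never", crossRegion: "per-regulation" },
};

// True only for classes whose values may appear unmodified in logs.
function mayLogInFull(cls: Classification): boolean {
  return LADDER[cls].logs === "full";
}
```

Because `LADDER` is typed as a `Record` over the full union, adding a class without a policy fails to compile.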
Where the classification lives
In the schema. Annotation alongside the field:
import { z } from "zod";

const User = z.object({
  id: z.string().uuid(), // Internal
  workspaceId: z.string().uuid(), // Internal
  email: z.string().email().describe("pii"), // PII
  hashedPassword: z.string().describe("restricted"), // Restricted (don't log; never return)
  preferences: z.record(z.unknown()), // Internal
  createdAt: z.string().datetime(), // Public
});

The describe() (or your schema metadata) carries the classification. Tools introspect it.
Alternative: a separate *.classification.ts file per schema package, mapping field paths to classes. Either works; consistency matters.
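For the separate-file alternative, the mapping might look like the sketch below. The file name `user.classification.ts`, the map shape, and the `fieldsOfClass` helper are assumptions for illustration; the field-to-class assignments follow the comments in the User schema above.

```typescript
// user.classification.ts (hypothetical file name and shape)
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

// Field path -> class, one entry per field of the User schema.
const userClassification: Record<string, Classification> = {
  id: "internal",
  workspaceId: "internal",
  email: "pii",
  hashedPassword: "restricted",
  preferences: "internal",
  createdAt: "public",
};

// Illustrative helper: list the field paths that carry a given class.
function fieldsOfClass(
  map: Record<string, Classification>,
  cls: Classification,
): string[] {
  return Object.keys(map).filter((field) => map[field] === cls);
}
```

Tools such as the logger or the DSAR walker can then call `fieldsOfClass(userClassification, "pii")` rather than hard-coding field names.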
What the classification drives
Logger redaction:
// The logger introspects the schema and redacts any field tagged PII or Restricted.
logger.info("user-created", User.parse(rawUser));
// → email redacted as "***@***", hashedPassword as "***", others passed through.

Database storage encryption:
- Restricted: column-level encryption (per-row key from vault).
- PII: per-tenant key encryption at rest (rotation across all PII tables on key rotation).
- Sensitive PII: typically not stored — tokenised externally.
Retention:
- Storage layer reads classification → applies retention default.
- Legal hold flag suspends.
- DSAR deletion respects classification (PII deletable; audit-ledger PII retained per regulator exemption).
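The retention rules can be sketched as a lookup plus a legal-hold check. This is an illustration under assumptions: `StoredRecord`, `RETENTION_DAYS`, and `isPastRetention` are hypothetical names, Confidential uses the upper end of its 1 to 7 year range, and classes without a time-based default (Public, Internal) or governed elsewhere (Restricted by rotation cycle, Sensitive-PII by regulation) are never expired by this function.

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

// Hypothetical record metadata as the storage layer might see it.
interface StoredRecord {
  cls: Classification;
  createdAt: Date;
  legalHold: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Time-based defaults only; other classes are handled outside this sketch.
const RETENTION_DAYS: Partial<Record<Classification, number>> = {
  confidential: 7 * 365, // assumption: upper end of the 1 to 7 year range
  pii: 7 * 365, // 7-year default from the ladder
};

function isPastRetention(rec: StoredRecord, now: Date): boolean {
  if (rec.legalHold) return false; // a legal hold suspends retention
  const days = RETENTION_DAYS[rec.cls];
  if (days === undefined) return false; // no time-based default here
  return now.getTime() - rec.createdAt.getTime() > days * DAY_MS;
}
```

A cleanup job would call `isPastRetention` per record and delete only when it returns true, which keeps the legal-hold exception in one place.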
Region routing:
- PII writes to user's region only.
- Cross-region replication for non-PII; pinned for PII.
- See ../architecture/multi-region-pattern.md.
Access control:
- PII fields require a specific capability (pii:read).
- Restricted fields require step-up auth.
- All sensitive-class accesses audit-logged.
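A sketch of how these three rules could combine in one authorisation check. `Caller`, `auditTrail`, and `authorizeRead` are hypothetical names, the in-memory array stands in for a real audit ledger, and denying Sensitive-PII reads outright is an assumption based on that data being held by a tokenising provider.

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

// Hypothetical caller context; a real system would derive it from the session.
interface Caller {
  id: string;
  capabilities: string[];
  steppedUp: boolean; // has the caller completed step-up auth?
}

// Stand-in for the audit ledger.
const auditTrail: string[] = [];

function authorizeRead(caller: Caller, field: string, cls: Classification): boolean {
  let allowed: boolean;
  switch (cls) {
    case "pii":
      allowed = caller.capabilities.indexOf("pii:read") !== -1;
      break;
    case "restricted":
      allowed = caller.steppedUp;
      break;
    case "sensitive-pii":
      allowed = false; // assumption: held by the tokenising provider
      break;
    default:
      allowed = true;
  }
  // Every access attempt on a sensitive class is audit-logged, allowed or not.
  if (cls === "pii" || cls === "restricted" || cls === "sensitive-pii") {
    auditTrail.push(`${caller.id} ${allowed ? "read" : "denied"} ${field} (${cls})`);
  }
  return allowed;
}
```

Logging denials as well as grants is a design choice of this sketch; it makes probing attempts visible in the ledger.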
Redaction discipline
Three redaction contexts, by class:
| Class | Redaction in logs | Redaction in error messages | Redaction in DSAR export |
|---|---|---|---|
| Public | none | none | none |
| Internal | none | none | exported as-is |
| Confidential | sanitise keys (no value leakage) | minimise | exported as-is |
| Restricted | always full redaction | never echoed | never exported |
| PII | always redacted to non-identifying form | never echoed | exported (user owns their data) |
| Sensitive-PII | never logged | never echoed | per regulation |
Redaction at the logger boundary, not at the call site. Call sites cannot forget; agents writing handlers cannot leak.
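A minimal, dependency-free sketch of redaction at the logger boundary. `redactForLog` and `logInfo` are hypothetical names; a real logger would read the classes from the schema metadata rather than take them as an argument. Two simplifications are assumptions of this sketch: Confidential values pass through without key sanitisation, and fields with no classification are dropped (fail closed).

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

function redactForLog(
  record: Record<string, unknown>,
  classes: Record<string, Classification>,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const field of Object.keys(record)) {
    const cls = classes[field];
    if (cls === "public" || cls === "internal" || cls === "confidential") {
      out[field] = record[field];
    } else if (cls === "restricted" || cls === "pii") {
      out[field] = "***";
    }
    // sensitive-pii and unclassified fields are omitted entirely.
  }
  return out;
}

// Call sites pass the raw record; redaction happens here, not in handlers.
function logInfo(
  eventName: string,
  record: Record<string, unknown>,
  classes: Record<string, Classification>,
): void {
  console.log(JSON.stringify({ event: eventName, ...redactForLog(record, classes) }));
}
```

Because handlers only ever call `logInfo`, a new PII field is protected as soon as it is tagged, with no change at any call site.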
DLP (Data Loss Prevention) controls
Beyond redaction, DLP scans for accidental classification breaches:
- Outbound network traffic: scan for known PII patterns (email regex, credit-card-with-Luhn, SSN format). Block + alert on match in a non-PII destination.
- Log streams: scan; alert if redactor missed a known-PII pattern.
- Database query results: optionally scan for cross-class leakage (Restricted appearing in a Confidential query).
DLP is a backstop, not a primary defense. The primary defense is classification-driven redaction.
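A sketch of the pattern scan described above: an email regex, a US SSN format check, and card-number candidates confirmed with the Luhn checksum. The regexes are deliberately simple and will miss or over-match some inputs; `scanForPii` and `shouldBlock` are hypothetical names.

```typescript
// Luhn checksum over a string of digits (rightmost digit is not doubled).
function passesLuhn(digits: string): boolean {
  let sum = 0;
  let doubleIt = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return digits.length > 0 && sum % 10 === 0;
}

const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
const SSN = /\b\d{3}-\d{2}-\d{4}\b/;
// 13 to 19 digits, optionally separated by spaces or hyphens.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

type Finding = "email" | "ssn" | "card";

function scanForPii(text: string): Finding[] {
  const findings: Finding[] = [];
  if (EMAIL.test(text)) findings.push("email");
  if (SSN.test(text)) findings.push("ssn");
  const candidates = text.match(CARD_CANDIDATE) || [];
  if (candidates.some((c) => passesLuhn(c.replace(/\D/g, "")))) {
    findings.push("card");
  }
  return findings;
}

// Block (and alert) when a PII pattern heads to a destination not cleared for PII.
function shouldBlock(text: string, destinationAllowsPii: boolean): boolean {
  return !destinationAllowsPii && scanForPii(text).length > 0;
}
```

The Luhn step matters because most 16-digit strings (order numbers, ids) fail the checksum, which keeps false positives down.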
Sovereignty + classification
When a customer's PII must stay in their region (GDPR / LGPD / regional law):
- PII fields are region-pinned at the storage layer.
- A tenant-id → region map drives routing (see multi-tenant isolation).
- Cross-region access enforced at the network / storage layer, not just app.
Sovereignty applies to PII; non-PII can replicate freely.
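The routing rule can be sketched as a function from tenant and class to the set of regions a write may land in. The region names, the `tenantRegion` map, and `writeTargets` are assumptions for illustration; this covers application-level routing only, and the network and storage layers would enforce the same rule independently. Treating Sensitive-PII as pinned and rejecting Restricted (which belongs in the vault) are choices of this sketch.

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

type Region = "eu" | "us" | "br";
const ALL_REGIONS: Region[] = ["eu", "us", "br"];

// Hypothetical tenant-id -> home-region map.
const tenantRegion: Record<string, Region> = {
  "tenant-a": "eu",
  "tenant-b": "br",
};

function writeTargets(tenantId: string, cls: Classification): Region[] {
  const home = tenantRegion[tenantId];
  if (home === undefined) {
    throw new Error(`unknown tenant: ${tenantId}`);
  }
  if (cls === "restricted") {
    throw new Error("restricted values go to the vault, not app storage");
  }
  // PII (and, in this sketch, sensitive PII) stays in the home region.
  if (cls === "pii" || cls === "sensitive-pii") {
    return [home];
  }
  return ALL_REGIONS.slice(); // non-PII may replicate
}
```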
Right to erasure (DSAR / GDPR Article 17)
When a user requests deletion:
- Identify all PII fields for that user (the schema classifications make this systematic).
- Delete from operational stores.
- Audit ledger entries about the user: retained per regulator exemption (audit logs are legal evidence; classification.legalRetention overrides DSAR).
- Anonymise where possible: usage metrics keyed by hashed user id; aggregated counts stay.
- Proof of completion: signed record of the deletion (DSAR proof), retained.
The classification metadata makes this tractable. Without it, every DSAR is an archaeology project.
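The steps above can be sketched as a planner that walks the classified fields and decides what happens to each on an erasure request. `ClassifiedField`, `ErasureAction`, and `planErasure` are hypothetical; the `legalRetention` flag stands in for the `classification.legalRetention` override mentioned above. Anonymisation and the signed proof of completion are outside this sketch.

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

// Hypothetical flattened view of the schema metadata.
interface ClassifiedField {
  path: string;
  cls: Classification;
  legalRetention?: boolean; // e.g. audit-ledger entries
}

type ErasureAction = "delete" | "retain-legal" | "per-regulation" | "keep";

interface ErasureStep {
  path: string;
  action: ErasureAction;
}

function planErasure(fields: ClassifiedField[]): ErasureStep[] {
  return fields.map((f): ErasureStep => {
    if (f.cls === "sensitive-pii") {
      return { path: f.path, action: "per-regulation" };
    }
    if (f.cls !== "pii") {
      return { path: f.path, action: "keep" };
    }
    return { path: f.path, action: f.legalRetention ? "retain-legal" : "delete" };
  });
}
```

The output is a plan rather than a side effect, so it can be reviewed, executed by a separate job, and attached to the deletion record.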
Sensitive PII (PCI / HIPAA / specialist regimes)
These deserve special treatment:
- PCI (payment card): do not store. Use a tokenising provider (Stripe, Braintree, Adyen). Your DB holds the token; the provider holds the card.
- HIPAA (PHI): classified; access-controlled; audit-logged per HIPAA requirements; BAA with cloud provider; encryption at rest mandatory.
- PSD2 (open banking): strong customer authentication; consent management distinct from your normal consent flow.
For each regime, document the scope (which fields fall under it), the compliance program (encryption, BAA, audits), and the data-handling boundary (often a separate microservice).
Gate
A classification gate scans the schema files and checks that:
- Every PII-classed field is associated with a known retention class.
- Every PII-classed field is redacted in logs (verified by static analysis of logger calls).
- No cross-class violations occur (for example, a Restricted field flowing into a Confidential-only logger sink).
This is harder than the basic gates; consider building it incrementally as the schema base grows.
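A first increment of such a gate might cover only the retention check and the cross-class flow check, leaving static analysis of logger calls for later. In this sketch `ClassifiedField`, `Sink`, `Flow`, and `gateViolations` are hypothetical, and the list of flows is assumed to come from some other analysis step.

```typescript
type Classification =
  | "public" | "internal" | "confidential"
  | "restricted" | "pii" | "sensitive-pii";

interface ClassifiedField {
  path: string;
  cls: Classification;
  retentionClass?: string;
}

// A destination (logger sink, analytics table) and the classes it may receive.
interface Sink {
  sinkName: string;
  accepts: Classification[];
}

interface Flow {
  field: ClassifiedField;
  sink: Sink;
}

function gateViolations(fields: ClassifiedField[], flows: Flow[]): string[] {
  const violations: string[] = [];
  for (const f of fields) {
    if (f.cls === "pii" && !f.retentionClass) {
      violations.push(`${f.path}: PII field has no retention class`);
    }
  }
  for (const { field, sink } of flows) {
    if (sink.accepts.indexOf(field.cls) === -1) {
      violations.push(`${field.path}: ${field.cls} flows into ${sink.sinkName}`);
    }
  }
  return violations;
}
```

A CI step would fail the build when `gateViolations` returns a non-empty list and print each entry.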
Common failure modes
- No classification at all. Everything treated as equally sensitive (paralysing) or equally non-sensitive (leak risk). → Even a rough Public / Confidential / PII split is a huge win.
- Classification in a spreadsheet. Drifts from code. → In-schema annotation.
- DSAR runs as an ad-hoc script per request. Slow + error-prone. → Tooling that walks the classification metadata.
- Audit logs scrubbed in DSAR. Lose evidence; regulator unhappy. → Audit logs retain per exemption.
- Cross-class flow (PII used as a primary key in a non-PII analytics table). → Schema gate forbids.
- Sensitive PII stored at all. PCI scope creeps because "we just need it for one feature". → Architectural ADR: do not store; use tokeniser.
Adoption path
- Pick the four classes that matter (Public, Internal, Confidential, PII as a starter).
- Tag the obvious fields (email, password, name, id).
- Wire the logger to read tags + redact.
- Wire DSAR to walk PII fields.
- Add Restricted + Sensitive-PII later as the regime requires.
Tagging incrementally beats trying to classify everything at once.
See also
- universal.md: Rule 8 (PII).
- audit-ledger-pattern.md: sensitive accesses logged.
- vault-pattern.md: Restricted lives in vault.
- multi-tenant-isolation-pattern.md: tenant + class scope.
- ../architecture/multi-region-pattern.md: region pinning for PII.