Pydantic passed. Types matched. The downstream system still got garbage.

Three production failures on one contract-extraction agent. They read as unrelated incidents and turned out to be one problem. The claim I'll defend: schema validation confirms the grammar is right and says nothing about whether the meaning is. Two jobs. Most teams build only the first.

Case 1: valid JSON, wrong semantics

Claude 3.5 Sonnet, Pydantic schemas. termination_clauses was typed list[str], so any list of strings passed validation. The model returned paraphrases instead of verbatim clause text, and the downstream tool matched exact strings against a database. The paraphrases matched nothing. A second-pass semantic check (a model call with a rubric: "are these strings verbatim from the source?") moved that field's accuracy from 61% to 94%.

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer you build deliberately.
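
A minimal sketch of what a separate semantic layer can look like for this one field. The check described above was a model call with a rubric; this is a cheaper deterministic stand-in, assuming "verbatim" means "appears in the source after collapsing whitespace". The function names and the sample contract text are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping alone never fails a clause."""
    return re.sub(r"\s+", " ", text).strip()


def non_verbatim_clauses(clauses: list[str], source_text: str) -> list[str]:
    """Return the clauses that do not appear word for word in the source."""
    haystack = _normalize(source_text)
    failures = []
    for clause in clauses:
        needle = _normalize(clause)
        # An empty string is trivially "in" any text, so flag it explicitly.
        if not needle or needle not in haystack:
            failures.append(clause)
    return failures


# Hypothetical source text and extraction output.
source = (
    "Either party may terminate this Agreement\n"
    "upon thirty (30) days written notice."
)
extracted = [
    "Either party may terminate this Agreement upon thirty (30) days written notice.",
    "Both sides can end the contract with a month's notice.",  # paraphrase
]

flagged = non_verbatim_clauses(extracted, source)
print(flagged)
```

Both strings satisfy list[str]. Only the second one is flagged, and it is flagged by a check that runs after schema validation, not by the schema.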

Case 2: the retry cost spike

Retries through tenacity, with no cap on attempts. A customer's documents had a dual-signatory clause with an optional co-signer. Schema expected co_signer: Optional[str]; the model kept emitting nested objects, so every attempt failed validation and triggered another. Each retry ran about $0.04, and the total climbed past $2 per document (50-plus attempts) on the bad cases. Fix: cap retries at 5 with escalation to human review, and audit new document types before production.

Lesson: unlimited retry logic on validation failures is a latent billing incident.
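
A sketch of the capped loop with a human-review handoff. It uses a plain loop rather than tenacity so the cap and the escalation path are both visible. ExtractionResult, extract_with_cap, and the two demo callables are hypothetical names; the 4-cent figure is the rough per-attempt cost quoted above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

MAX_ATTEMPTS = 5            # the cap from the fix above
COST_PER_ATTEMPT_CENTS = 4  # roughly $0.04 per attempt, per the incident


@dataclass
class ExtractionResult:
    data: Optional[dict[str, Any]]
    attempts: int
    needs_human_review: bool
    last_error: Optional[str] = None

    @property
    def cost_cents(self) -> int:
        return self.attempts * COST_PER_ATTEMPT_CENTS


def extract_with_cap(
    call_model: Callable[[], dict[str, Any]],
    validate: Callable[[dict[str, Any]], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> ExtractionResult:
    """Try up to max_attempts times, then hand off to a human instead of looping."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model()
        try:
            validate(raw)
        except ValueError as exc:  # pydantic.ValidationError subclasses ValueError
            last_error = str(exc)
            continue
        return ExtractionResult(raw, attempt, needs_human_review=False)
    return ExtractionResult(None, max_attempts, True, last_error)


# Hypothetical stand-ins for the model call and the schema check.
def always_nested() -> dict[str, Any]:
    return {"co_signer": {"name": "J. Doe", "optional": True}}


def co_signer_is_str_or_none(raw: dict[str, Any]) -> None:
    value = raw.get("co_signer")
    if value is not None and not isinstance(value, str):
        raise ValueError("co_signer must be a string or null")


result = extract_with_cap(always_nested, co_signer_is_str_or_none)
print(result.needs_human_review, result.attempts, result.cost_cents)
```

The worst case is now bounded at 5 attempts, about 20 cents, and the document lands in a review queue. If you stay on tenacity, stop_after_attempt(5) gives the same cap; the escalation path still has to be written by hand.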

Case 3: the model-switch regression

GPT-4o to GPT-4.5. The party_obligations field (three-level nesting for conditional logic) fell from 91% to 73%, because the new model produced flatter structures on ambiguous cases. The output was valid JSON with the wrong nesting, so Pydantic passed it and downstream broke silently. Fix: shadow-evaluate after any upgrade. Run both models on the same production documents and flag any field below 95% agreement before shipping.

Lesson: model upgrades are schema-compatibility events.
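
The shadow evaluation can be sketched as a per-field agreement rate between the two models' outputs on the same documents. This assumes outputs are already parsed into dicts and that "agreement" means exact equality of the field value, which is a strict choice; free-text fields would need a looser comparison. The function names and sample data are hypothetical.

```python
from typing import Any

AGREEMENT_THRESHOLD = 0.95  # the shipping bar from the fix above


def field_agreement(
    old_outputs: list[dict[str, Any]],
    new_outputs: list[dict[str, Any]],
) -> dict[str, float]:
    """Per-field share of documents where both models returned equal values."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both models must be run on the same documents")
    if not old_outputs:
        raise ValueError("need at least one document")
    fields: set[str] = set()
    for doc in old_outputs + new_outputs:
        fields.update(doc)
    rates = {}
    for name in sorted(fields):
        matches = sum(
            1
            for old, new in zip(old_outputs, new_outputs)
            if old.get(name) == new.get(name)
        )
        rates[name] = matches / len(old_outputs)
    return rates


def fields_below_threshold(
    rates: dict[str, float], threshold: float = AGREEMENT_THRESHOLD
) -> list[str]:
    """Names of fields whose agreement rate blocks the upgrade."""
    return sorted(name for name, rate in rates.items() if rate < threshold)


# Hypothetical data: the new model flattens one of four documents.
nested = {"if_late": {"then": {"pay": "fee"}}}
flat = {"if_late": "pay fee"}
old = [{"governing_law": "NY", "party_obligations": nested} for _ in range(4)]
new = [dict(doc) for doc in old]
new[3] = {"governing_law": "NY", "party_obligations": flat}

rates = field_agreement(old, new)
print(rates)
print(fields_below_threshold(rates))
```

Both the nested and the flattened value are valid JSON. The difference only shows up because the comparison is against the previous model's output, not against the schema.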

The common thread

Not one of these was a Pydantic bug. The schema did exactly what it was written to do every time. The real causes were semantic drift, an uncontrolled retry loop, and a model-specific regression. In two of the three cases the grammar was fine and the meaning was not, which is exactly what type validation cannot catch; in the third, validation caught the problem and nothing bounded the response to it. The stack now: Pydantic for syntax, a light evaluator for semantics, DeepEval's correctness metric on text fields, retries capped, an escalation field on every schema, and a 200-document shadow-eval checklist on model changes.

Objections I'd accept / wouldn't

Accept: "stricter schemas would have caught some of this." Enums, discriminated unions, and constrained types do shrink the semantic-validation surface when the domain is stable and bounded. If that describes you, use them.

Wouldn't accept: "therefore you can skip eval around structured output." Two customer escalations out of three failures say otherwise. Stricter types reduce the surface; they never erase it, and they turn brittle the moment a new document shape shows up.

Where I'd push back on this

Steelmanning against myself: maybe I under-specified my schemas and relabeled it a semantics problem to feel better. A verbatim-quote field could be a constrained type backed by a span reference into the source, not a free str. Some of what I call "semantic validation" is "validation I never bothered to encode structurally." The concession: if you have run high-volume extraction with no semantic eval layer and kept accuracy above 92% for six months or more, I want to see the schema design. What I hold onto: some field, somewhere, has to assert meaning. If it is not the schema doing it, it is something downstream, and that something is where your next silent failure lives.
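
The span-reference idea can be sketched in Pydantic v2 by passing the source document in as validation context. This is one possible encoding, not a tested production design: QuoteSpan, its fields, and the "source_text" context key are all hypothetical, and it assumes the model can emit reliable character offsets.

```python
from pydantic import BaseModel, Field, ValidationError, ValidationInfo, model_validator


class QuoteSpan(BaseModel):
    """A clause stored as character offsets into the source plus the quoted text."""

    start: int = Field(ge=0)
    end: int = Field(gt=0)
    text: str

    @model_validator(mode="after")
    def text_matches_source(self, info: ValidationInfo) -> "QuoteSpan":
        # The source document arrives through model_validate(..., context=...).
        source = (info.context or {}).get("source_text")
        if source is None:
            raise ValueError("source_text must be passed as validation context")
        if self.start >= self.end or self.end > len(source):
            raise ValueError("span falls outside the source")
        if source[self.start:self.end] != self.text:
            raise ValueError("text is not verbatim at the given offsets")
        return self


def is_verbatim(payload: dict, source_text: str) -> bool:
    try:
        QuoteSpan.model_validate(payload, context={"source_text": source_text})
    except ValidationError:
        return False
    return True


# Hypothetical source text.
source = "Either party may terminate this Agreement upon thirty (30) days written notice."
quote = "upon thirty (30) days"
start = source.index(quote)
end = start + len(quote)

print(is_verbatim({"start": start, "end": end, "text": quote}, source))
print(is_verbatim({"start": start, "end": end, "text": "with a month's notice"}, source))
```

Here the schema itself asserts that the text is verbatim, so the check for this field no longer needs a second model call. It still says nothing about whether the quoted span is the right clause.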