Pydantic passed. Types matched. The downstream system still got garbage.

Three production failures on one contract-extraction agent. They read as unrelated incidents and turned out to be one problem. The claim I'll defend: schema validation confirms the grammar is right and says nothing about whether the meaning is. Two jobs. Most teams build only the first.

Case 1: valid JSON, wrong semantics

Claude 3.5 Sonnet, Pydantic schemas. termination_clauses accepted list[str], so validation always passed. The model returned paraphrases instead of verbatim clause text, and the downstream tool matched exact strings against a database. The paraphrases matched nothing. A second-pass semantic check (a model call with a rubric: "are these strings verbatim from the source?") moved that field from 61% to 94%.

Lesson: structured-output validation is syntax validation. Semantic validation is a separate layer you build deliberately.

Case 2: the retry cost spike

Retries through tenacity. A customer's documents had a dual-signatory clause with an optional co-signer. Schema expected co_signer: Optional[str]; the model kept emitting nested objects. Each retry ran about $0.04 and compounded past $2 per document on the bad cases. Fix: cap retries at 5 with escalation to human review, and audit new document types before production.

Lesson: unlimited retry logic on validation failures is a latent billing incident.

Case 3: the model-switch regression

GPT-4o to GPT-4.5. The party_obligations field (three-level nesting for conditional logic) fell from 91% to 73%, because the new model produced flatter structures on ambiguous cases. Valid JSON, wrong nesting, Pydantic passed it, downstream broke silently. Fix: shadow-evaluate after any upgrade. Run both models on the same production documents, flag any field below 95% agreement before shipping.

Lesson: model upgrades are schema-compatibility events.

The common thread

Not one of these was a Pydantic error. The schema was valid every time. The real causes were semantic drift, an uncontrolled retry loop, and a model-specific regression. The grammar was fine and the meaning was not, which is exactly what type validation cannot catch. The stack now: Pydantic for syntax, a light evaluator for semantics, DeepEval's correctness metric on text fields, retries capped, an escalation field on every schema, and a 200-document shadow-eval checklist on model changes.

Objections I'd accept / wouldn't

Accept: "stricter schemas would have caught some of this." Enums, discriminated unions, and constrained types do shrink the semantic-validation surface when the domain is stable and bounded. If that describes you, use them.

Wouldn't accept: "therefore you can skip eval around structured output." Two customer escalations out of three failures say otherwise. Stricter types reduce the surface; they never erase it, and they turn brittle the moment a new document shape shows up.

Where I'd push back on this

Steelmanning against myself: maybe I under-specified my schemas and relabeled it a semantics problem to feel better. A verbatim-quote field could be a constrained type backed by a span reference into the source, not a free str. Some of what I call "semantic validation" is "validation I never bothered to encode structurally." The concession: if you have run high-volume extraction with no semantic eval layer and kept accuracy above 92% for six months or more, I want to see the schema design. What I hold onto: some field, somewhere, has to assert meaning. If it is not the schema doing it, it is something downstream, and that something is where your next silent failure lives.

Pydantic passed. Types matched. The downstream system still got garbage.

Case 1: valid JSON, wrong semantics

Case 2: the retry cost spike

Case 3: the model-switch regression

The common thread

Objections I'd accept / wouldn't

Where I'd push back on this

Comments

More from this blog

# Optional fields are where your structured output quietly goes wrong

Structured output broke on us three times. Here is what the failures had in common.

I put 6 LLM guardrail tools inline and measured what they cost me. Here is the latency-vs-recall tradeoff.

We version our tool schemas like an API contract, because the agent is a consumer

Command Palette

Case 1: valid JSON, wrong semantics

Case 2: the retry cost spike

Case 3: the model-switch regression

The common thread

Objections I'd accept / wouldn't

Where I'd push back on this

Comments

More from this blog