Code generation was never the bottleneck.

Josh · June 12, 2026

Werner Vogels used his final re:Invent keynote, in December 2025, to leave developers with a warning he called verification debt: AI generates code faster than anyone can understand it, so software moves toward production before a person has confirmed what it actually does. The debt accrues quietly. Every merged change that nobody fully read is a small loan against the day the system surprises you. Of everything Amazon’s CTO could have said on his way out, that was the thing he decided developers most needed to hear.

The numbers arrived within months. Sonar’s 2026 State of Code survey of more than 1,100 professional developers found that 96% don’t fully trust AI-generated code, while only 48% always verify it before committing. Sonar named the distance between those two numbers the verification gap, and the gap is the story: nearly everyone is suspicious, and roughly half act on the suspicion every time.

It would be comfortable to read that as a discipline problem, something a stricter review policy could fix. The June 2026 data says otherwise. New Relic’s State of AI Coding report found that 94% of engineering leaders rate AI-generated code as higher quality than human-written code at review time, and 78% report more incidents once it ships. New Relic’s name for this is unverified trust. The code looks better than ours and behaves worse, which means the looking is the part that’s broken. Review was always a sampling exercise that worked because a human had held the code in their head while writing it; point it at fluent machine output and it approves what it cannot actually vouch for.

Addy Osmani gave the same disease its other name in March 2026: comprehension debt, the growing gap between how much code exists in a system and how much of it any human genuinely understands. Unverified trust is the view from the incident channel; comprehension debt is the view from inside the team afterward, when the pager goes off and the engineer holding it is debugging something no one ever held in their head. Survey commentary through 2026 keeps returning to one image for that engineer: an auditor of unfamiliar code. Developers themselves have a folk term for the texture of the problem; respondents to Stack Overflow’s developer survey keep describing AI output as “almost right but not quite” (66% report the frustration), which is precisely the kind of wrong that sails through a tired review.

None of this is an argument against generation, and we’d be the wrong firm to make one; every line of software we ship is machine-written. It’s an argument about where the constraint actually sits. Typing was already the cheap part of software before the current tools existed. The expensive parts were knowing what a system does, agreeing on what it should do, and proving it still does it after a change. Generation tools made the cheap part nearly free while the expensive parts stayed expensive, so the constraint is now fully exposed: the industry industrialized generation and left verification artisanal.

Sonar’s prescribed workflow is “vibe, then verify,” and the slogan gets the order right. What it undersells is the scale. “Verify” in that sentence is still a human ritual appended to a machine process: open the diff, read what you can, approve. With AI-generated code at 42% of what gets committed (Sonar’s figure, same survey), the ritual is already losing; at the 65% the survey projects for 2027, it stops being a control at all. If generation runs at industrial scale and verification runs at reading speed, the gap widens by construction. Verification doesn’t need a better checklist; it needs its own factory.

That’s what we built. The Greenfield Production System treats verification as manufacturing rather than as review, and its spine is the behavior catalog: a typed, machine-checkable inventory of every observable behavior in a system, with each entry cited to file and line in the source it came from. A gate enforces the provenance by grepping the cited excerpt rather than trusting the line number. From the catalog, the factory deterministically projects test skeletons, documentation, and parity assertions; a model fills only the irreducible parts, and gates police the fill. Every gate a piece of work passes or fails is logged in a gate transcript you can inspect. And the machine that verifies is structurally separate from the machine that builds: the verification factory is read-only by construction, so builders never grade their own work.
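To make the provenance gate concrete, here is a minimal sketch of the idea in Python. It is an illustration under stated assumptions, not the production gate: the `CatalogEntry` and `GateResult` shapes, the `check_provenance` name, and the choice to pass an entry whose line number has drifted (while recording the drift) are all hypothetical. The only part taken from the description above is the rule itself: search for the cited excerpt instead of trusting the cited line.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One behavior, cited to the source it was read from (hypothetical shape)."""
    behavior_id: str
    file: str
    line: int      # cited line number; treated as a hint, never as proof
    excerpt: str   # verbatim source text the behavior was derived from


@dataclass(frozen=True)
class GateResult:
    passed: bool
    found_at: Optional[int]  # 1-based line where the excerpt actually starts
    note: str


def check_provenance(entry: CatalogEntry, sources: dict[str, str]) -> GateResult:
    """Pass only if the cited excerpt really occurs in the cited file.

    `sources` maps a file path to its full text, which keeps the sketch free
    of filesystem access. Assumption: a drifted line number still passes,
    because the excerpt, not the number, is the evidence.
    """
    text = sources.get(entry.file)
    if text is None:
        return GateResult(False, None, "cited file not found")
    needle = entry.excerpt.strip()
    if not needle:
        return GateResult(False, None, "empty excerpt cannot be verified")
    index = text.find(needle)
    if index < 0:
        return GateResult(False, None, "excerpt not found in cited file")
    # Count the newlines before the match to recover its real line number.
    found_at = text.count("\n", 0, index) + 1
    if found_at == entry.line:
        return GateResult(True, found_at, "excerpt found at cited line")
    return GateResult(True, found_at, f"excerpt found; cited line {entry.line} has drifted")


if __name__ == "__main__":
    # Hypothetical file name and contents, for illustration only.
    sources = {"Seats.cs": "class Seats {\n    if (used >= limit) throw new SeatLimitReached();\n}\n"}
    entry = CatalogEntry("seat-limit", "Seats.cs", 40, "if (used >= limit)")
    print(check_provenance(entry, sources))
```

In this sketch the stale line number (40) does not fail the entry, but an excerpt that no longer appears in the file does, which is the case a line-number check alone would miss.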

Here is the kind of thing this machinery finds that diff review never will. One projection from the catalog is the enforcement matrix: every rule the frontend enforces, crossed against every guard the backend enforces. Each cell lands in one of three classes. Both layers is the healthy case, the rule checked on the client and enforced on the server. Backend only is usually fine; the server refuses what the interface happens to allow. UI only is the finding class: a rule the interface promises that the server never checks, which means it isn’t a rule at all. It’s a suggestion that holds until the first script talks to the API directly.

| Rule | Frontend | Backend | Class |
| --- | --- | --- | --- |
| Status transition requires a resolution | `███Form.tsx:59` | `███Service.cs:212` | Both layers |
| Seat limit checked before assignment | `███Picker.tsx:118` | (none) | UI only |
| Attachment size cap | (none) | `███Controller.cs:74` | Backend only |

A UI-only cell can’t show up in code review because it isn’t in any diff. It’s a property of the whole system, the relationship between two codebases that no single change ever touches, and you can only compute it from an inventory of the whole system. That’s the practical case for the catalog: some of the most expensive defects aren’t in the code anyone wrote last sprint. A fuller matrix excerpt, alongside a catalog sample, is in the proof library. Findings in this class ship as probe candidates, traced in source but not yet confirmed against a running system, with the exact API call that settles each one. The epistemic label is part of the artifact; a matrix that promoted every traced finding to “confirmed bug” would be committing the same sin it exists to catch.
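The three-way classification is simple enough to sketch. The Python below is a hedged illustration, not the actual projection: `RuleRow`, `EnforcementClass`, `classify`, and `probe_candidates` are hypothetical names, and the rows reuse the redacted citations from the matrix excerpt as they appear there. The point it shows is that the finding class falls out of comparing two inventories, and that a UI-only finding is labeled a probe candidate instead of a confirmed bug.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EnforcementClass(Enum):
    BOTH_LAYERS = "Both layers"
    BACKEND_ONLY = "Backend only"
    UI_ONLY = "UI only"


@dataclass(frozen=True)
class RuleRow:
    """One matrix row: a rule plus the citation for each layer that enforces it."""
    rule: str
    frontend: Optional[str]  # file:line citation, or None if the UI has no check
    backend: Optional[str]   # file:line citation, or None if the server has no guard


def classify(row: RuleRow) -> EnforcementClass:
    """Map a row to its class from which layers cite an enforcement point."""
    if row.frontend and row.backend:
        return EnforcementClass.BOTH_LAYERS
    if row.backend:
        return EnforcementClass.BACKEND_ONLY
    if row.frontend:
        return EnforcementClass.UI_ONLY
    raise ValueError(f"rule {row.rule!r} is cited in neither layer")


def probe_candidates(rows: list[RuleRow]) -> list[dict[str, str]]:
    """Collect UI-only rows, labeled as traced in source but not yet confirmed."""
    return [
        {"rule": row.rule, "cited_at": row.frontend or "", "status": "probe candidate"}
        for row in rows
        if classify(row) is EnforcementClass.UI_ONLY
    ]


# Rows copied from the matrix excerpt; file names stay redacted as published.
ROWS = [
    RuleRow("Status transition requires a resolution", "███Form.tsx:59", "███Service.cs:212"),
    RuleRow("Seat limit checked before assignment", "███Picker.tsx:118", None),
    RuleRow("Attachment size cap", None, "███Controller.cs:74"),
]

if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.rule}: {classify(row).value}")
    print(probe_candidates(ROWS))
```

Nothing in `classify` looks at a diff. It needs both columns populated from a whole-system inventory, which is why this class of finding cannot come out of change-by-change review.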

All of it points at one standard of proof, and we use the plainest possible words for it: dual-green. It means the same behavioral test suite, projected from a catalog a human approved, runs green on two systems: the original and its replacement, or the system before a change and after it. Dual-green turns “behaves the same” from a promise into a test result, and a test result is something you can re-run without trusting anyone, including us.
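As a rough sketch of what a dual-green check reduces to, the Python below runs one suite against two implementations and reports green only if both pass every case. Everything named here is an assumption made for illustration: `BehaviorCase`, `run_suite`, `dual_green`, and the two toy seat-limit functions stand in for a real projected suite and real systems.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class BehaviorCase:
    """One behavioral expectation: call the system with args, expect a result."""
    name: str
    args: tuple
    expected: Any


def run_suite(system: Callable[..., Any], cases: list[BehaviorCase]) -> dict[str, bool]:
    """Run every case against one system; an exception counts as a failure."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = system(*case.args) == case.expected
        except Exception:
            results[case.name] = False
    return results


def dual_green(
    original: Callable[..., Any],
    replacement: Callable[..., Any],
    cases: list[BehaviorCase],
) -> tuple[bool, dict[str, tuple[bool, bool]]]:
    """Green only if the same non-empty suite passes on both systems."""
    before = run_suite(original, cases)
    after = run_suite(replacement, cases)
    detail = {name: (before[name], after[name]) for name in before}
    # An empty suite proves nothing, so it is never reported as green.
    green = bool(cases) and all(before.values()) and all(after.values())
    return green, detail


# Toy systems (hypothetical): may another seat be assigned?
def legacy_can_assign(used: int, limit: int) -> bool:
    return used < limit


def rewritten_can_assign(used: int, limit: int) -> bool:
    return limit - used > 0


CASES = [
    BehaviorCase("seat below limit is allowed", (3, 5), True),
    BehaviorCase("seat at limit is refused", (5, 5), False),
]

if __name__ == "__main__":
    print(dual_green(legacy_can_assign, rewritten_can_assign, CASES))
```

The per-case detail is kept alongside the verdict so a failure shows which system broke on which behavior, and the suite can be re-run by anyone holding the two systems.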

The market spent 2025 and 2026 naming this disease from every angle: verification debt, the verification gap, unverified trust, comprehension debt. The names are good, and what they share is the shape of the cure they imply: debt gets paid down with receipts. Code generation was never the bottleneck, and now nothing is left to hide that; the work in front of the industry is building verification that runs at the volume generation already does. Ours does, and the receipts are public.