Why we retired Gherkin.

Josh · June 11, 2026

For a while, our factory had a station that generated Gherkin. Feature files, Given/When/Then scenarios, step definitions: the full Cucumber apparatus, produced at machine speed from the specifications upstream. The output looked exactly like what BDD always promised and rarely delivered, which was complete, consistently phrased, business-readable coverage of the system’s behavior.

The team that was going to consume those features rejected them.

We could dress that up, but the plain version is more useful. They didn’t dispute the coverage and they didn’t find the scenarios wrong; they didn’t want to own them. A generated feature file is a maintenance commitment written in someone else’s prose, and the people on the receiving end could see what we, at that point, couldn’t: scenarios in business English, at generated volume, are not more reviewable than the code they describe. They’re a second codebase whose compiler is a regex.

The rejection stung less than it should have, because by then we were assembling our own case against the format. It comes down to the round-trip.

Gherkin in a human BDD process earns its prose layer. A product owner and an engineer negotiate the words; the words are the meeting. But look at what the format makes a machine do. The factory starts with structured data about behavior. To produce Gherkin, it renders that structure into natural-language prose. Then, for the prose to execute, step definitions parse it back into structure, binding sentences to code through pattern matching. Structure to prose to structure, with the load-bearing layer in the middle being the one layer nothing can type-check.

Every crossing of that boundary is a place to drift. A model paraphrases; that’s much of what makes it useful, and in a prose layer that gets re-parsed, it’s exactly what you can’t afford. The same behavior phrased two ways becomes two steps, or one step with a regex loosened until it matches both, and a loosened regex is a bug that looks like hygiene. When a scenario failed, the debugging forked three ways: wrong behavior, wrong sentence, or wrong binding between the sentence and the code. Some of our flakiest generated tests weren’t flaky at the level of behavior at all. The round-trip itself was the flake: meaning got negotiated twice on its way to execution, by a writer and a parser who never met.

When two parties keep miscommunicating through an intermediary, you remove the intermediary. With a machine writing both sides of the conversation, the prose wasn’t communication anymore; it was an intermediate representation with bad properties. So we retired it.

What replaced it is the behavior catalog: a typed, machine-checkable inventory of every observable behavior in a system, each entry carrying typed semantics and a provenance citation to file and line in the source it came from. The citation isn’t decorative. A gate greps the cited excerpt out of the source rather than trusting the line number, so a citation that drifts fails the build instead of quietly lying. From the catalog, the factory projects test skeletons deterministically: same catalog in, same skeletons out, every run, with no prose in between to negotiate. A model fills only the irreducible parts, the assertions and setup that genuinely require judgment, and gates police the fill; selectors-grounded rejects any selector not found in the source, and no-todo-fill rejects a skeleton handed back with its hard parts stubbed out. Determinism is the anti-flake property the round-trip could never have. The projection can’t paraphrase.

The honest objection is that we threw away the thing Gherkin was actually for. Feature files exist so humans can read what the system does without reading code, and a typed catalog looks, at first glance, like a retreat from that. It took us a while to see that the readability had moved rather than died. A catalog entry is one row: the behavior, its semantics, where it lives in source. An inventory turns out to be easier to review than a narrative, because review of a narrative means reading all of it, and nobody reads all of it. We’d built a station for producing artifacts whose entire value depended on a kind of reading that doesn’t survive contact with generated volume.

That last point is the one that generalizes beyond test formats. When a factory regenerates a suite, the diff might be thousands of lines of test code, and asking a human to review that is asking for unverified trust: the reviewer approves what they can’t hold. What a human should review instead is the behavioral diff. Which behaviors were added to the catalog, which were removed, which changed semantics, each with its citation into source. That diff is small. It’s typed, it’s provenance-checked by a gate before any human sees it, and it’s the actual decision being made; the regenerated code is a consequence, the way a compiled binary is a consequence. Review the decision, let the machinery prove the consequence follows, and check the transcript if you want to see the proving.

Gherkin asked people to read prose so they could trust code. The catalog asks them to read a diff of behaviors so the code can be regenerated underneath them without re-earning trust by hand each time. The team that turned down our feature files was reviewing at the right level before we were, and their rejection was the cheapest correct review we’ve ever received.