Factory
How the machine works
The Greenfield Production System is a framework of deterministic quality gates that turns an AI harness into a software factory. Work moves along tracks, station to station, and between stations sit gates: automated checks the work must pass before it advances. The transcript of those checks ships with the work. This page covers the line itself, the behavior catalog that feeds it, the separation between the factory that builds and the factory that verifies, and where people fit.
Overview
The production line
Five tracks: discovery, specification, backend, frontend, and integration. Each track is a sequence of stations, and a station's output advances only through a gate.
- 01 · Discovery: repo survey → gate → domain map → gate → bounded contexts
- 02 · Specification: journey specs → gate → API specs → gate → view specs → gate → ADRs
- 03 · Backend (parallel track): typed contracts → gate → commands & queries → gate → read models → gate → service tests
- 03 · Frontend (parallel track): views → gate → page objects → gate → component tests
- 04 · Integration: cross-service events → gate → e2e journeys → gate → staging run
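To make the station-and-gate mechanics concrete, here is a minimal sketch of a track runner. It is an illustration under assumptions, not the factory's implementation: the `Station` type, the `run_track` function, and the toy gate names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: a gate is a named check over a station's output.
Gate = Tuple[str, Callable[[str], bool]]


@dataclass
class Station:
    name: str
    produce: Callable[[str], str]  # turns upstream work into this station's output
    gates: List[Gate]              # checks the output must pass before it advances


def run_track(stations: List[Station], work: str) -> Tuple[bool, List[str]]:
    """Run stations in order and stop at the first failing gate.

    Returns (completed, transcript). The transcript records every gate
    that ran, so it can travel with the work.
    """
    transcript: List[str] = []
    for station in stations:
        work = station.produce(work)
        for gate_name, check in station.gates:
            passed = check(work)
            result = "pass" if passed else "fail"
            transcript.append(f"{station.name} / {gate_name}: {result}")
            if not passed:
                return False, transcript  # the work does not advance
    return True, transcript


# Toy discovery track; station bodies and gate names are placeholders.
discovery = [
    Station("repo survey", lambda w: w + " surveyed",
            [("non-empty", lambda out: bool(out.strip()))]),
    Station("domain map", lambda w: w + " mapped",
            [("mentions-map", lambda out: "mapped" in out)]),
    Station("bounded contexts", lambda w: w,
            [("has-contexts", lambda out: "contexts" in out)]),
]

completed, transcript = run_track(discovery, "repo")
```

In this toy run the third station produces nothing new, so its gate fails, the track stops, and the transcript still lists all three checks that ran.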
Deterministic gates and LLM-judged gates
Gates come in two kinds, and the transcript labels every row with which kind ran. A deterministic gate is a script; it passes or it doesn't. Among them:
- tsc-noemit · the TypeScript compiler completes with no errors.
- provenance:provenance-anchored · every cited excerpt resolves to a real line in the source, instead of trusting the line number the model wrote down.
- selectors-grounded · every selector a test uses exists in the code under test.
- no-todo-fill · no placeholder bodies left where the projection expected real work.
- no-racing-count-assert · no count assertions that race an eventually consistent read.
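As an illustration of what a deterministic gate can look like, the sketch below mimics the idea behind provenance:provenance-anchored. It assumes a citation carries a file path, a claimed line number, and an excerpt, and that "resolves" means the excerpt appears verbatim within a single source line. The `Citation` type and both function names are hypothetical, and the real gate's matching rules are not published.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Citation:
    path: str          # file the model cited
    claimed_line: int  # line number the model wrote down (not trusted)
    excerpt: str       # text the model says appears there


def resolve_excerpt(source_lines: List[str], excerpt: str) -> Optional[int]:
    """Return the 1-based line where the excerpt actually appears, or None."""
    needle = excerpt.strip()
    if not needle:
        return None
    for number, line in enumerate(source_lines, start=1):
        if needle in line:
            return number
    return None


def provenance_anchored(citation: Citation, source_text: str) -> bool:
    """Pass only if the excerpt resolves to a real line in the source.

    The claimed line number plays no part in the decision.
    """
    lines = source_text.splitlines()
    return resolve_excerpt(lines, citation.excerpt) is not None


source = "class Cart:\n    def add(self, item):\n        self.items.append(item)\n"
good = Citation("cart.py", claimed_line=99, excerpt="self.items.append(item)")
bad = Citation("cart.py", claimed_line=2, excerpt="self.items.remove(item)")
```

Here `good` passes even though its claimed line number is wrong, because the excerpt exists on line 3. `bad` fails even though its line number is plausible, because the excerpt appears nowhere in the source.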
An LLM-judged gate scores the work against a rubric that is fixed before the run, with a pass threshold of 90. One example is judge-score, which scores an implementation against its approved spec's rubric. A score below threshold blocks the line the same way a failing compile does. A full build passes 31 gates (view the artifact). We publish gate names and the principles behind them; prompts and orchestration internals stay inside the factory.
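The threshold rule can be sketched as follows. Only the threshold of 90 and the "below threshold blocks" behavior come from the text. The 0 to 100 scale, the weighted-average aggregation, and the names `Criterion`, `weighted_score`, and `judge_gate` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PASS_THRESHOLD = 90  # from the text; the 0-100 scale is an assumption


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int  # relative weight within the rubric (assumed scheme)


def weighted_score(rubric: Tuple[Criterion, ...], scores: Dict[str, float]) -> float:
    """Combine per-criterion scores into one weighted average."""
    if set(scores) != {c.name for c in rubric}:
        raise ValueError("scores must cover exactly the rubric's criteria")
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight


def judge_gate(rubric: Tuple[Criterion, ...], scores: Dict[str, float],
               threshold: float = PASS_THRESHOLD) -> bool:
    """Pass when the score meets the threshold; anything below blocks."""
    return weighted_score(rubric, scores) >= threshold


# The rubric is an immutable tuple, fixed before any scores exist.
rubric = (Criterion("matches-spec", 2), Criterion("error-handling", 1),
          Criterion("readability", 1))

passing = {"matches-spec": 95, "error-handling": 90, "readability": 80}
failing = {"matches-spec": 95, "error-handling": 90, "readability": 76}
```

With these weights `passing` averages exactly 90 and clears the gate, while `failing` averages 89 and blocks. How the judge produces the per-criterion scores is outside this sketch.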
Below, an excerpt from a verification run, including two failures and their fixes. The full transcript, annotated gate by gate, is at /proof/transcript.
- selectors-grounded det ✓ pass 412 ms
- no-todo-fill det ✓ pass 96 ms
- provenance:provenance-anchored det ✕ fail 380 ms
  cited excerpt does not resolve to a line in ███Service.cs; resubmitted with corrected span
  Annotation: The gate checks the cited excerpt resolves to a real line in the target source, rather than trusting the line number the model wrote down.
- provenance:provenance-anchored det ✓ pass 365 ms
- no-racing-count-assert det ✕ fail 88 ms
  count assertion races the read-model projection; rewritten to wait on the emitted event
  Annotation: Counting rows while a projection settles is a flake source; the gate rejects the pattern outright.
- no-racing-count-assert det ✓ pass 71 ms
- tsc-noemit det ✓ pass 4.2 s
- judge-score llm ✓ pass 9.1 s
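The no-racing-count-assert fix in the excerpt, waiting on the emitted event instead of counting immediately, can be illustrated with a toy projection. This is a sketch under assumptions: the `ReadModel` class, its artificial delay, and the use of `threading.Event` as the "emitted event" are stand-ins for whatever the real system uses.

```python
import threading
import time
from typing import List


class ReadModel:
    """Toy eventually consistent projection: rows appear after a delay."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.projected = threading.Event()  # set once the projection catches up

    def project_async(self, items: List[str], delay_s: float = 0.05) -> threading.Thread:
        def work() -> None:
            time.sleep(delay_s)      # simulated projection lag
            self.rows.extend(items)
            self.projected.set()     # the "emitted event" a test can wait on

        thread = threading.Thread(target=work)
        thread.start()
        return thread


model = ReadModel()
worker = model.project_async(["order-1", "order-2"])

# Racing form (the pattern the gate rejects): asserting len(model.rows) == 2
# right here may run before the projection has settled.

# Event-driven form: block until the projection signals, then count.
settled = model.projected.wait(timeout=2.0)
count_after_event = len(model.rows)
worker.join()
```

Because the event is set only after the rows are written, the count taken after `wait` returns is stable, which is the property the rewritten assertion relies on.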
The spine
Everything projects from the behavior catalog
The behavior catalog is a typed, machine-checkable inventory of every observable
behavior in a system, each entry cited to file and line. From it the factory
deterministically projects test skeletons, documentation, and parity assertions.
The LLM fills only the irreducible parts, and gates police the fill:
provenance:provenance-anchored
fails any citation whose excerpt can't be resolved to a real line in source, whatever
the model believed about it. Because the catalog is typed, the projection can't drift
from it.
The catalog replaced Gherkin. We generated Cucumber features for a while, the consuming team rejected them, and the structure-to-prose-back-to-structure round trip turned out to be its own flakiness source, so we retired the format in favor of deterministic projection. That decision is written up in Why we retired Gherkin.
A catalog excerpt, with typed semantics and provenance citations, is at /proof/catalog.
Architecture
Two factories, one rule
The production system runs as two machines. The build factory takes an approved specification and constructs software through the gated line above. The verification factory reads an existing system and produces the behavior catalog, the coverage-debt report, and the tests that close the gaps; in v3 we pointed it at our own prior work.
Builders never grade their own work. The verification factory is a separate, read-only machine: it never modifies the system it inspects, and the factory that builds is never the one that signs off.
That separation is a matter of construction, not policy. The verifier has no write path to the system it reads, so a finding can't be quietly fixed before it's reported, and a green result can't come from the thing being graded.
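One way to picture "no write path" is a verifier that only ever receives a read-only view of the system. The sketch below is an assumption-laden illustration: `ReadOnlySystem`, `verify`, and the TODO check are hypothetical. Python attribute privacy is only a convention, so this shows the shape of the idea and is not a security boundary.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping


class ReadOnlySystem:
    """Hypothetical view of a system under inspection: reads only."""

    def __init__(self, files: Dict[str, str]) -> None:
        # Copy, then wrap in a read-only proxy so the view cannot mutate files.
        self._files: Mapping[str, str] = MappingProxyType(dict(files))

    def paths(self) -> List[str]:
        return sorted(self._files)

    def read(self, path: str) -> str:
        return self._files[path]


def verify(system: ReadOnlySystem) -> List[str]:
    """Report findings; the view offers no method for fixing them."""
    return [f"{path}: contains TODO"
            for path in system.paths()
            if "TODO" in system.read(path)]


files = {"a.py": "x = 1\n", "b.py": "# TODO implement\n"}
system = ReadOnlySystem(files)
findings = verify(system)
```

The verifier can list and read files but has nothing to call that would change them, so every finding it sees ends up in the report.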
People
The human's place
An architect sets direction, reviews where judgment is irreducible, and owns the engagement. OTR Select was built on Factory v1: every line AI-written, with an architect reviewing at stations that are now automated gates. By v2 the line carried a full port of Bugzilla without those reviews. Each generation automates stations that previously needed a person, and the changelog records which ones.
What stays human is the part a gate can't decide: what should be built, and whether the catalog that defines acceptance is right. The architect's name is on the engagement, and the transcript shows what the machine did under that direction.
Reference
The record and the vocabulary
Release notes
Factory changelog
v1 built a production platform with an architect at the stations, v2 ported Bugzilla without one, and v3 verifies. Each entry carries its receipts.
Glossary
The working vocabulary
Behavior catalog, gate, dual-green, parity replay, and the rest, defined once with stable anchors so other pages can cite them.