Factory

How the machine works

The Greenfield Production System is a framework of deterministic quality gates that turns an AI harness into a software factory. Work moves along tracks, station to station, and between stations sit gates: automated checks the work must pass before it advances. The transcript of those checks ships with the work.

Overview

The production line

Five tracks: discovery, specification, backend, frontend, and integration. Each track is a sequence of stations, and a station's output advances only through a gate.

01 · Discovery
1. repo survey
2. [gate] domain map
3. [gate] bounded contexts
02 · Specification
1. journey specs
2. [gate] API specs
3. [gate] view specs
4. [gate] ADRs
03 · Backend (parallel track)
1. typed contracts
2. [gate] commands & queries
3. [gate] read models
4. [gate] service tests
03 · Frontend (parallel track)
1. views
2. [gate] page objects
3. [gate] component tests
04 · Integration
1. cross-service events
2. [gate] e2e journeys
3. [gate] staging run

Backend and frontend run in parallel from the same specifications and meet at integration. Every transition is a gate; work that fails one returns to the station that produced it, and both the failure and the fix stay on the transcript.
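The station-and-gate loop can be sketched in a few lines. This is a minimal illustration, not the factory's orchestration: the names Gate, Station, and runTrack are hypothetical, and the retry budget is an assumption the source does not state. It shows the one rule the text describes: a station's output advances only when its gate passes, failed work goes back to the station that produced it, and every check lands on the transcript.

```typescript
// All names here (Gate, Station, runTrack) are hypothetical, for illustration only.
type GateKind = "det" | "llm";

interface Gate<T> {
  name: string;
  kind: GateKind;
  check: (work: T) => boolean;
}

interface Station<T> {
  name: string;
  // attempt is 0 on the first pass and increments each time the gate sends work back.
  produce: (input: T, attempt: number) => T;
  exitGate: Gate<T>;
}

interface TranscriptRow {
  station: string;
  gate: string;
  kind: GateKind;
  pass: boolean;
}

function runTrack<T>(
  stations: Station<T>[],
  initial: T,
  maxAttempts = 3, // assumed retry budget; the source does not state one
): { output: T; transcript: TranscriptRow[] } {
  const transcript: TranscriptRow[] = [];
  let work = initial;
  for (const station of stations) {
    let advanced = false;
    for (let attempt = 0; attempt < maxAttempts && !advanced; attempt++) {
      const candidate = station.produce(work, attempt);
      const pass = station.exitGate.check(candidate);
      // Failures are recorded too, so the failure and the fix both stay on the transcript.
      transcript.push({
        station: station.name,
        gate: station.exitGate.name,
        kind: station.exitGate.kind,
        pass,
      });
      if (pass) {
        work = candidate;
        advanced = true;
      }
    }
    if (!advanced) {
      throw new Error(`station "${station.name}" did not clear gate "${station.exitGate.name}"`);
    }
  }
  return { output: work, transcript };
}

// Demo: the second station leaves a placeholder on its first attempt and fixes it on the second.
const demoStations: Station<string>[] = [
  {
    name: "typed contracts",
    produce: (input) => `${input}|contracts`,
    exitGate: { name: "always-pass", kind: "det", check: () => true },
  },
  {
    name: "commands & queries",
    produce: (input, attempt) => (attempt === 0 ? `${input}|TODO` : `${input}|commands`),
    exitGate: { name: "no-todo-fill", kind: "det", check: (w) => !w.includes("TODO") },
  },
];

const demoRun = runTrack(demoStations, "spec");
console.log(demoRun.transcript.map((r) => `${r.gate}:${r.pass ? "pass" : "fail"}`).join(" "));
```

The demo transcript carries three rows for two stations: the failed attempt is kept next to the pass that replaced it.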

Deterministic gates and LLM-judged gates

Gates come in two kinds, and the transcript labels every row with which kind ran. A deterministic gate is a script; it passes or it doesn't. Among them:

tsc-noemit: the TypeScript compiler completes with no errors.
provenance:provenance-anchored: every cited excerpt resolves to a real line in the source, instead of trusting the line number the model wrote down.
selectors-grounded: every selector a test uses exists in the code under test.
no-todo-fill: no placeholder bodies left where the projection expected real work.
no-racing-count-assert: no count assertions that race an eventually consistent read.
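To make "a script; it passes or it doesn't" concrete, here is a rough sketch of what a provenance check of this kind could look like. The Citation shape and the function name are hypothetical, and matching a single-line excerpt by substring is a simplifying assumption; the factory's actual gate is not published. The point it illustrates is the one stated above: the check searches the source for the excerpt and ignores the line number the model supplied.

```typescript
// Hypothetical shapes: the source does not publish the gate's real input format.
interface Citation {
  file: string;
  excerpt: string;
  claimedLine: number; // the line number the model wrote down; deliberately not trusted
}

interface ProvenanceResult {
  pass: boolean;
  resolvedLine: number | null; // 1-based line where the excerpt was actually found
}

function checkProvenanceAnchored(source: string, citation: Citation): ProvenanceResult {
  const needle = citation.excerpt.trim();
  if (needle.length === 0) {
    return { pass: false, resolvedLine: null }; // an empty excerpt anchors nothing
  }
  const lines = source.split(/\r?\n/);
  const index = lines.findIndex((line) => line.includes(needle));
  return index === -1
    ? { pass: false, resolvedLine: null }
    : { pass: true, resolvedLine: index + 1 };
}

const sampleSource = [
  "export function total(items: number[]): number {",
  "  return items.reduce((sum, n) => sum + n, 0);",
  "}",
].join("\n");

const anchored = checkProvenanceAnchored(sampleSource, {
  file: "total.ts", // hypothetical file name
  excerpt: "return items.reduce((sum, n) => sum + n, 0);",
  claimedLine: 40, // wrong, but the gate ignores it and searches the source
});

const unanchored = checkProvenanceAnchored(sampleSource, {
  file: "total.ts",
  excerpt: "return items.length;",
  claimedLine: 2, // plausible line number, but the excerpt is not in the source
});

console.log(anchored, unanchored);
```

The first citation passes despite its wrong line number, because the excerpt exists; the second fails despite a plausible line number, because it does not.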

An LLM-judged gate, like judge-score, scores the work against a rubric fixed before the run; below the pass threshold of 90, the line blocks the same way a failing compile does. A full build passes 31 gates (view the artifact). We publish gate names and the principles behind them; prompts and orchestration internals stay inside the factory.
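The blocking rule for a judged gate reduces to a comparison against a fixed number. The sketch below assumes a rubric of weighted criteria combined by weighted average; that scoring scheme, the criterion names, and the function name are all assumptions made for illustration, since the rubric and prompts are not published. Only the threshold of 90 comes from the text.

```typescript
// Hypothetical rubric shape; weighted averaging is an assumption, not the factory's documented method.
interface Criterion {
  id: string;
  weight: number;
}

const PASS_THRESHOLD = 90; // the threshold stated in the text

// Frozen so the rubric cannot change once the run starts.
const rubric: readonly Criterion[] = Object.freeze([
  { id: "matches-spec", weight: 2 },
  { id: "readable", weight: 1 },
]);

function judgeGate(
  criteria: readonly Criterion[],
  scores: Record<string, number>, // per-criterion scores on a 0-100 scale, as a judge might return them
): { score: number; pass: boolean } {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return { score: 0, pass: false };
  // A criterion the judge did not score counts as 0 rather than being skipped.
  const weighted = criteria.reduce((sum, c) => sum + c.weight * (scores[c.id] ?? 0), 0);
  const score = weighted / totalWeight;
  return { score, pass: score >= PASS_THRESHOLD };
}

const atThreshold = judgeGate(rubric, { "matches-spec": 95, readable: 80 });
const belowThreshold = judgeGate(rubric, { "matches-spec": 90, readable: 85 });
console.log(atThreshold, belowThreshold);
```

Once the score is in hand the decision is as mechanical as a compile result: a score at or above the threshold advances, anything below blocks.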

Below, an excerpt from a verification run, including two failures and their fixes. The full transcript, annotated gate by gate, is at /proof/transcript.

Gate transcript · excerpt
verification run · 8 of 31 gates

gate                            kind  result  time
selectors-grounded              det   ✓ pass  412 ms
no-todo-fill                    det   ✓ pass  96 ms
provenance:provenance-anchored  det   ✕ fail  380 ms
    cited excerpt does not resolve to a line in ███Service.cs; resubmitted with corrected span
    Annotation: The gate checks the cited excerpt resolves to a real line in the target source, rather than trusting the line number the model wrote down.
provenance:provenance-anchored  det   ✓ pass  365 ms
no-racing-count-assert          det   ✕ fail  88 ms
    count assertion races the read-model projection; rewritten to wait on the emitted event
    Annotation: Counting rows while a projection settles is a flake source; the gate rejects the pattern outright.
no-racing-count-assert          det   ✓ pass  71 ms
tsc-noemit                      det   ✓ pass  4.2 s
judge-score                     llm   ✓ pass  9.1 s

8 gates · 6 pass · 2 fail
det = deterministic · llm = judged against a rubric
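Because every row is labeled with its gate, kind, and result, the summary line can be recomputed from the rows rather than taken on trust. The sketch below does that for the eight rows of the excerpt; the GateRow type and function names are hypothetical, and the row data is copied from the excerpt above.

```typescript
type RowKind = "det" | "llm";
type Outcome = "pass" | "fail";

// Hypothetical row shape; the fields mirror the columns of the excerpt.
interface GateRow {
  gate: string;
  kind: RowKind;
  outcome: Outcome;
}

// Rows copied from the excerpt, in order.
const excerptRows: GateRow[] = [
  { gate: "selectors-grounded", kind: "det", outcome: "pass" },
  { gate: "no-todo-fill", kind: "det", outcome: "pass" },
  { gate: "provenance:provenance-anchored", kind: "det", outcome: "fail" },
  { gate: "provenance:provenance-anchored", kind: "det", outcome: "pass" },
  { gate: "no-racing-count-assert", kind: "det", outcome: "fail" },
  { gate: "no-racing-count-assert", kind: "det", outcome: "pass" },
  { gate: "tsc-noemit", kind: "det", outcome: "pass" },
  { gate: "judge-score", kind: "llm", outcome: "pass" },
];

function summarize(rows: GateRow[]) {
  const pass = rows.filter((r) => r.outcome === "pass").length;
  const det = rows.filter((r) => r.kind === "det").length;
  return { gates: rows.length, pass, fail: rows.length - pass, det, llm: rows.length - det };
}

// A failure counts as resolved when a later row for the same gate passes.
function unresolvedFailures(rows: GateRow[]): string[] {
  const last: Record<string, Outcome> = {};
  for (const row of rows) last[row.gate] = row.outcome;
  return Object.keys(last).filter((gate) => last[gate] === "fail");
}

const summary = summarize(excerptRows);
const unresolved = unresolvedFailures(excerptRows);
console.log(summary, unresolved);
```

Run against the excerpt, this reproduces the 8 / 6 / 2 summary and reports no unresolved failures, since both failing gates pass on a later row.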

The spine

Everything projects from the behavior catalog

The behavior catalog is a typed, machine-checkable inventory of every observable behavior in a system, each entry cited to the source it was read from. From it the factory deterministically projects test skeletons, documentation, and parity assertions; because the catalog is typed, the projection can't drift from it. The LLM fills only the irreducible parts, and gates police the fill: provenance:provenance-anchored fails any citation whose excerpt can't be resolved to a real line in source.
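A small sketch may help show what "deterministically projects" means in practice. The entry shape below (id, given/when/then, citation) and the emitted skeleton format are hypothetical; the real catalog schema is not reproduced on this page. What the sketch illustrates is the division of labor: a pure function turns a typed entry into a test skeleton, and the only part left open is the body the LLM is expected to fill.

```typescript
// Hypothetical entry shape; the real catalog schema is not published on this page.
interface BehaviorEntry {
  id: string;
  given: string;
  when: string;
  then: string;
  citation: { file: string; excerpt: string };
}

// Pure function of the entry: the same entry always yields the same skeleton.
function projectTestSkeleton(entry: BehaviorEntry): string {
  return [
    `// ${entry.id}`,
    `// cited from ${entry.citation.file}: ${JSON.stringify(entry.citation.excerpt)}`,
    `test(${JSON.stringify(entry.then)}, () => {`,
    `  // given: ${entry.given}`,
    `  // when: ${entry.when}`,
    `  throw new Error("unfilled"); // the part left for the LLM to fill`,
    `});`,
  ].join("\n");
}

const sampleEntry: BehaviorEntry = {
  id: "cart.total.sums-items", // hypothetical behavior id
  given: "a cart with two items",
  when: "the total is requested",
  then: "the total equals the sum of item prices",
  citation: {
    file: "cart.ts", // hypothetical file name
    excerpt: "return items.reduce((sum, n) => sum + n, 0);",
  },
};

const skeleton = projectTestSkeleton(sampleEntry);
console.log(skeleton);
```

Because the projection takes a typed entry and has no other inputs, a missing field is a compile error rather than a silently different test. A placeholder body that survives to the gates is what a check like no-todo-fill exists to catch.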

The catalog replaced Gherkin. We generated Cucumber features for a while, but the consuming team rejected them, and the structure-to-prose-back-to-structure round trip turned out to be its own source of flakiness, so we retired the format in favor of deterministic projection. That decision is written up in "Why we retired Gherkin".

A catalog excerpt, with typed semantics and provenance citations, is at /proof/catalog.

Architecture

Two factories, one rule

The production system runs as two machines. The build factory takes an approved specification and constructs software through the gated line above. The verification factory reads an existing system and produces the behavior catalog, the coverage-debt report, and the tests that close the gaps; in v3 we pointed it at our own prior work.

Builders never grade their own work. The verification factory is a separate, read-only machine: it never modifies the system it inspects, and the factory that builds is never the one that signs off.

That separation is construction, not policy. The verifier has no write path to the system it reads, so a finding can't be quietly fixed before it's reported, and a green result can't come from the thing being graded.
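One way to picture "construction, not policy" is a verifier that is handed an object with no write methods at all. The sketch below is an analogy at the scale of a single process, with hypothetical names throughout (SystemReader, readOnlyView, verify); how the real verification factory is isolated is not described here.

```typescript
interface SystemReader {
  listFiles(): string[];
  readFile(path: string): string | undefined;
}

interface SystemWriter {
  writeFile(path: string, content: string): void;
}

// Stand-in for the system under inspection; hypothetical.
class InMemorySystem implements SystemReader, SystemWriter {
  private files: Record<string, string> = {};

  listFiles(): string[] {
    return Object.keys(this.files).sort();
  }

  readFile(path: string): string | undefined {
    return this.files[path];
  }

  writeFile(path: string, content: string): void {
    this.files[path] = content;
  }
}

// Builds an object that carries only the read methods, so the write path is
// absent at runtime, not merely hidden by the type.
function readOnlyView(system: SystemReader): SystemReader {
  return Object.freeze({
    listFiles: () => system.listFiles(),
    readFile: (path: string) => system.readFile(path),
  });
}

// The verifier can report findings but has nothing to fix them with.
function verify(reader: SystemReader): string[] {
  return reader.listFiles().filter((path) => (reader.readFile(path) ?? "").includes("TODO"));
}

const builtSystem = new InMemorySystem();
builtSystem.writeFile("a.ts", "export const a = 1;");
builtSystem.writeFile("b.ts", "// TODO: implement");

const view = readOnlyView(builtSystem);
const findings = verify(view);
console.log(findings, "writeFile" in view);
```

The verifier's only handle on the system lacks a write method, so a finding cannot be patched on the way to the report; it can only be reported.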

The software the build factory produces runs on our own platform — greenfield-ts, and greenfield-wasm for regulated work. What it is, and what you own once it's delivered, is on the platform page.

People

The human's place

An architect sets direction, reviews where judgment is irreducible, and owns the engagement. OTR Select was built on Factory v1: every line AI-written, with an architect reviewing at stations that are now automated gates. By v2 the line carried a full port of Bugzilla without those reviews. Each generation automates stations that previously needed a person, and the changelog records which ones.

What stays human is the part a gate can't decide: what should be built, and whether the catalog that defines acceptance is right. The architect's name is on the engagement, and the transcript shows what the machine did under that direction.

Reference

The record and the vocabulary

Release notes

Factory changelog

v1 built a production platform with an architect at the stations, v2 ported Bugzilla without one, and v3 verifies. Each entry carries its receipts.

Glossary

The working vocabulary

Behavior catalog, gate, parity replay, and the rest, defined once with stable anchors so other pages can cite them.

The output is easier to judge than the description.
