Greenfield Production Systems

Verification & behavioral audit

Know what your software does. Prove what your tests miss.

The Greenfield verification factory reads your repositories without modifying anything, then produces a behavior catalog, a tiered coverage-debt report, and the regression tests to close the gaps. In your repo, in your idiom, with every claim cited to file and line.

The problem

Three events bring teams here

Amazon CTO Werner Vogels calls the underlying condition verification debt: code now gets generated faster than anyone rebuilds the comprehension to review it. Sonar's 2026 State of Code survey measured the result and named it the verification gap: 96% of developers don't fully trust AI-generated code to be functionally correct, and only 48% always verify it before committing. In practice the debt comes due as one of three events.

Your team turned on AI coding tools and review stopped keeping up.

More code lands each week than your reviewers can hold in their heads, so review quietly becomes sampling. The audit reads everything and writes down what the code actually does, behavior by behavior, each one cited to file and line.

Your test suite is green and you still don't trust releases.

A green suite proves the behaviors it covers and says nothing about the rest. The coverage-debt report enumerates that rest: every uncovered behavior, classified by the test level it actually needs.

The person who knew how it works left.

The rules still run; nobody can narrate them anymore. The behavior catalog turns that undocumented logic back into a document your team can read and approve, which is how institutional memory becomes an asset again instead of a retirement risk.

Deliverables

What you receive

Five artifacts, all of them yours to keep. Samples of each live in the evidence library; nothing below is described that you can't go inspect.

Behavior catalog

A typed, machine-checkable inventory of every observable behavior in the audited surface, with provenance citations to file and line. Browse a sample excerpt.
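As a rough illustration of what "typed, machine-checkable, cited to file and line" can mean, here is a minimal TypeScript sketch. The `Citation` and `Behavior` shapes, the `uncitedBehaviors` helper, and the sample paths are all hypothetical; the delivered catalog's schema may differ.

```typescript
// Hypothetical shapes for illustration only; not the delivered schema.
interface Citation {
  file: string;    // repository-relative path
  line: number;    // 1-based line number
  excerpt: string; // the cited source text
}

interface Behavior {
  id: string;
  description: string;
  citations: Citation[];
}

// Returns the ids of behaviors that lack a usable file-and-line citation.
function uncitedBehaviors(catalog: Behavior[]): string[] {
  return catalog
    .filter(
      (b) =>
        b.citations.length === 0 ||
        b.citations.some(
          (c) => c.file.trim() === "" || !Number.isInteger(c.line) || c.line < 1
        )
    )
    .map((b) => b.id);
}

// Made-up sample entries (paths and ids are placeholders).
const sampleCatalog: Behavior[] = [
  {
    id: "example-form.disable-on-expired-license",
    description: "Submission is disabled when the license is expired.",
    citations: [
      { file: "src/ExampleForm.tsx", line: 12, excerpt: "disabled={license.expired}" },
    ],
  },
  {
    id: "example-form.uncited",
    description: "An entry with no provenance recorded.",
    citations: [],
  },
];

const missingProvenance = uncitedBehaviors(sampleCatalog);
```

A check like this is what makes the inventory machine-checkable: an entry without provenance can be rejected mechanically rather than caught in review.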

If you ever modernize, the approved catalog becomes the contractual acceptance criteria for the rebuild. The audit is the first step of /rebuild whether or not you take the second.

Coverage-debt report

The honest backlog: every behavior your suite doesn't cover, tiered by the test level each one actually needs. Flat coverage tools inflate the browser-test number; the catalog routes most gaps to cheaper test levels and reserves end-to-end tests for the behaviors that genuinely need them.
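The tiering idea can be sketched in a few lines of TypeScript. The level names (`unit`, `api`, `e2e`), the `BehaviorCoverage` shape, and the sample ids are assumptions made for illustration, not the report's actual vocabulary.

```typescript
// Test levels, cheapest first. The names are assumptions for this sketch.
const TEST_LEVELS = ["unit", "api", "e2e"] as const;
type TestLevel = (typeof TEST_LEVELS)[number];

interface BehaviorCoverage {
  id: string;
  covered: boolean;       // does the existing suite exercise this behavior?
  neededLevel: TestLevel; // the cheapest level that can actually prove it
}

// Groups uncovered behaviors by the test level each one needs.
function coverageDebt(behaviors: BehaviorCoverage[]): Record<TestLevel, string[]> {
  const debt: Record<TestLevel, string[]> = { unit: [], api: [], e2e: [] };
  for (const b of behaviors) {
    if (!b.covered) debt[b.neededLevel].push(b.id);
  }
  return debt;
}

// Made-up sample data.
const sampleDebt = coverageDebt([
  { id: "price-rounding", covered: false, neededLevel: "unit" },
  { id: "expired-license-rejected", covered: false, neededLevel: "api" },
  { id: "checkout-happy-path", covered: true, neededLevel: "e2e" },
  { id: "tax-by-region", covered: false, neededLevel: "unit" },
]);
```

In this sample nothing lands in the `e2e` bucket: the only browser-level behavior is already covered, and the three gaps route to cheaper levels.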

Generated test suites

Delivered into your QA repository, matching your conventions, your page-object structure, and your helpers. Not a vendor sandbox: your team owns and runs these suites the day we hand them over.

Enforcement matrix

Every frontend rule crossed against every backend guard, exposing rules enforced in only one layer. See a matrix sample.
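A minimal sketch of the crossing, assuming each rule and guard has already been normalized to a shared string key (in practice that matching is the hard part, and this sketch does not attempt it). The function names and rule keys are hypothetical.

```typescript
type Enforcement = "both" | "frontend-only" | "backend-only";

interface MatrixRow {
  rule: string;
  frontend: boolean;
  backend: boolean;
  enforcement: Enforcement;
}

// Crosses frontend rules against backend guards, one row per distinct rule key.
function enforcementMatrix(frontendRules: string[], backendGuards: string[]): MatrixRow[] {
  const fe = new Set(frontendRules);
  const be = new Set(backendGuards);
  const allRules = Array.from(new Set(frontendRules.concat(backendGuards))).sort();
  return allRules.map((rule) => {
    const frontend = fe.has(rule);
    const backend = be.has(rule);
    const enforcement: Enforcement =
      frontend && backend ? "both" : frontend ? "frontend-only" : "backend-only";
    return { rule, frontend, backend, enforcement };
  });
}

// Rules enforced in only one layer are the rows worth a closer look.
function oneLayerOnly(rows: MatrixRow[]): MatrixRow[] {
  return rows.filter((r) => r.enforcement !== "both");
}

// Made-up rule keys.
const sampleMatrix = enforcementMatrix(
  ["license-not-expired", "quantity-positive"],
  ["quantity-positive", "tenant-match"]
);
const sampleGaps = oneLayerOnly(sampleMatrix);
```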

Findings

Every finding ships as a Probe candidate with the exact API call that settles it. We don't promote a finding to confirmed until it runs against a live environment; neither should anyone else.

The finding format

The form disables submission when the license is expired; no corresponding guard found on the endpoint it posts to. Probe candidate

serverless-████-service/███Form.tsx:59 (path partially redacted)

settles with: POST /api/████/quotes (expired license id), expect 4xx
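One way to read that format, sketched in TypeScript: a finding stays a probe candidate until the probe's observed response settles it. The types, the `settle` helper, and the `refuted` outcome are assumptions for illustration; the path below is a placeholder, not the redacted one.

```typescript
type FindingStatus = "probe-candidate" | "confirmed" | "refuted";

interface Probe {
  method: "GET" | "POST" | "PUT" | "PATCH" | "DELETE";
  path: string;
  // The status class a guarded endpoint would return, e.g. 4 for "expect 4xx".
  guardedStatusClass: 2 | 3 | 4 | 5;
}

interface Finding {
  claim: string;
  probe: Probe;
  status: FindingStatus;
}

// Settles a "missing guard" finding from a status observed on a live environment.
// If the endpoint answers as a guarded endpoint would, the guard exists and the
// finding is refuted; otherwise the finding is confirmed.
function settle(finding: Finding, observedStatus: number): Finding {
  const guardHeld = Math.floor(observedStatus / 100) === finding.probe.guardedStatusClass;
  return { ...finding, status: guardHeld ? "refuted" : "confirmed" };
}

const candidate: Finding = {
  claim: "No server-side guard rejects submissions for an expired license.",
  probe: { method: "POST", path: "/api/<redacted>/quotes", guardedStatusClass: 4 },
  status: "probe-candidate",
};

const afterAccepted = settle(candidate, 201); // the server accepted the request
const afterRejected = settle(candidate, 403); // the server rejected the request
```

Nothing in the sketch promotes a finding without an observed status, which mirrors the rule that confirmation requires a run against a live environment.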

The gates

Why machine-written tests can be trusted here

Not because we reviewed them carefully. Because every generated test has to pass a set of mechanical gates before it reaches your repo, and the transcript of those checks is available for inspection.

selectors-grounded
Rejects any selector not found in your source. The test can only reference UI that exists.
no-todo-fill
Rejects placeholder bodies and stubbed assertions. A test that asserts nothing never leaves the factory.
no-racing-count-assert
Rejects count assertions that race the UI, a common source of flaky failures in generated suites.
tsc-noemit
The generated suite must typecheck against your codebase before it counts as written.
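To make the first two gates concrete, here is a simplified sketch of how such checks could be expressed as pure functions over spec text. It assumes specs select elements with `getByTestId("...")` calls and assert with `expect(...)`; the regular expressions are deliberately naive and are not the factory's actual implementation.

```typescript
interface GateResult {
  gate: string;
  pass: boolean;
  detail: string;
}

// Collects every test id referenced through getByTestId("...") in a spec.
function referencedTestIds(specText: string): string[] {
  const pattern = /getByTestId\(\s*["']([^"']+)["']\s*\)/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(specText)) !== null) {
    const id = match[1];
    if (id !== undefined) ids.push(id);
  }
  return ids;
}

// selectors-grounded: every referenced test id must exist in the source.
function selectorsGrounded(specText: string, sourceTestIds: Set<string>): GateResult {
  const unknown = referencedTestIds(specText).filter((id) => !sourceTestIds.has(id));
  return {
    gate: "selectors-grounded",
    pass: unknown.length === 0,
    detail: unknown.length === 0 ? "all selectors found" : "not in source: " + unknown.join(", "),
  };
}

// no-todo-fill: reject placeholder markers and specs that assert nothing.
function noTodoFill(specText: string): GateResult {
  const hasPlaceholder = /\b(TODO|FIXME)\b/.test(specText);
  const hasAssertion = /\bexpect\s*\(/.test(specText);
  const pass = !hasPlaceholder && hasAssertion;
  return {
    gate: "no-todo-fill",
    pass,
    detail: pass ? "assertions present" : hasPlaceholder ? "placeholder found" : "no assertion found",
  };
}

// A made-up spec used to exercise both gates.
const sampleSpec = `
test("submit is disabled for an expired license", async ({ page }) => {
  await expect(page.getByTestId("submit-quote")).toBeDisabled();
});`;

const grounded = selectorsGrounded(sampleSpec, new Set(["submit-quote"]));
const ungrounded = selectorsGrounded(sampleSpec, new Set(["other-button"]));
const filled = noTodoFill(sampleSpec);
const stubbed = noTodoFill('test("pending", () => { /* TODO */ });');
```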

A provenance gate greps every cited excerpt against your source instead of trusting the line number, so a citation whose text no longer matches fails the run.
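The difference between trusting a line number and matching the excerpt can be sketched as follows. This assumes single-line excerpts and exact substring matching; `checkCitation` and its result shape are hypothetical names for illustration.

```typescript
interface CitationCheck {
  ok: boolean;               // did the excerpt resolve to a line in the source?
  actualLine: number | null; // 1-based line where the excerpt was found
  lineDrifted: boolean;      // found, but not on the cited line
}

// Searches the source for the cited excerpt instead of reading the cited line.
function checkCitation(source: string, excerpt: string, citedLine: number): CitationCheck {
  const needle = excerpt.trim();
  if (needle === "") return { ok: false, actualLine: null, lineDrifted: false };
  const lines = source.split(/\r?\n/);
  const index = lines.findIndex((line) => line.includes(needle));
  if (index === -1) return { ok: false, actualLine: null, lineDrifted: false };
  const actualLine = index + 1;
  return { ok: true, actualLine, lineDrifted: actualLine !== citedLine };
}

// Made-up source text.
const sampleSource = [
  "function submit(license) {",
  "  if (license.expired) return;",
  "  send();",
  "}",
].join("\n");

const resolved = checkCitation(sampleSource, "license.expired", 5); // excerpt found, line moved
const stale = checkCitation(sampleSource, "license.revoked", 2);    // text no longer matches
```

In this sketch a moved line still resolves and is reported as drift, while an excerpt whose text has changed fails outright.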

Every assertion also carries a typed oracle tier. Only cross-confirmed and backend-grounded facts get stated as invariants; everything below those tiers is hedged by machinery, not by editorial care.
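A small sketch of hedging by machinery: the wording of an assertion is derived from its tier rather than chosen by a writer. The `cross-confirmed` and `backend-grounded` tiers come from the text above; the `frontend-only` and `inferred` tiers and the output wording are assumptions added to make the example complete.

```typescript
// The last two tier names are hypothetical stand-ins for "everything below".
type OracleTier = "cross-confirmed" | "backend-grounded" | "frontend-only" | "inferred";

const INVARIANT_TIERS: ReadonlySet<OracleTier> = new Set<OracleTier>([
  "cross-confirmed",
  "backend-grounded",
]);

// Only the top tiers may be stated as invariants; the rest are hedged.
function phraseAssertion(claim: string, tier: OracleTier): string {
  return INVARIANT_TIERS.has(tier)
    ? "Invariant: " + claim
    : "Observed (" + tier + ", unconfirmed): " + claim;
}

const strong = phraseAssertion("expired licenses cannot submit quotes", "backend-grounded");
const hedged = phraseAssertion("expired licenses cannot submit quotes", "frontend-only");
```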

The discipline holds up against live systems. On the most recent estate verification, 23 of 24 generated specs ran green against staging on the first run, with real logins (view the artifact).

Gate transcript · run 2026-04-30 · estate-verify · surface 03
  1. selectors-grounded det ✓ pass 388 ms
  2. no-todo-fill det ✓ pass 121 ms
  3. no-racing-count-assert det ✓ pass 96 ms
  4. provenance:provenance-anchored det ✕ fail 540 ms
    cited excerpt does not resolve to a line in ███Service.cs; spec resubmitted with corrected span
    Annotation: The gate greps the cited excerpt rather than trusting the line number; a stale citation fails the run instead of shipping.
  5. provenance:provenance-anchored det ✓ pass 512 ms
  6. tsc-noemit det ✓ pass 2.1 s
6 gates · 5 pass · 1 fail
det = deterministic · llm = judged against a rubric

An excerpt, not a full run. The annotated full transcript lives at /proof/transcript.

Builders never grade their own work. The verification factory is a separate, read-only machine: it never modifies the system it inspects, and the factory that builds is never the one that signs off.

Boundary

Scope and boundaries

Every catalog ships with its boundary: what was read, what wasn't, and what that means for the findings. Absence of a finding is only a claim inside the boundary. The block below is the form it takes in a delivered report.

Pricing

Three ways to buy it

The cost of not verifying already has a market name and a number. Lightrun's 2026 engineering survey puts the reliability tax at roughly 38% of the developer week spent on debugging and verification, and finds that 43% of AI-generated code changes still need debugging in production after passing QA. Each tier below is outcome-based: a fixed scope, a defined set of artifacts, and a price agreed before work starts.

Audit

One bounded surface

The behavior catalog, the coverage-debt report, and findings as probe candidates, delivered for a single scoped surface of your system.

Fixed price — scoped in one call

Estate

The full system, all domains

Catalogs and coverage debt across every repository in scope, the generated suites in your QA repo, and the enforcement matrix across layers.

Quoted

Per-release verification

Subscription

The suite re-generated and re-run on every release, with re-runs included. You get findings, not red builds.

Quoted

For diligence buyers

If you evaluate target codebases professionally, the same machinery produces a behavior-grounded technical diligence artifact in days: what the system does, what its tests actually cover, and where the frontend's promises exceed the backend's guarantees.

Contact us about a diligence engagement

For regulated teams

Teams operating under change-management or model-validation controls get audit-grade evidence by construction: a provenance-cited record of what the system does, what its tests actually cover, and the transcript of every check that was run. The verifier is read-only and structurally separate from anything that builds. Per-release re-verification is available as a subscription. If your examiners ask for evidence rather than assertions, this is that.

Contact us about evidence requirements