
Once you can run AI coding agents at all, the interesting question changes. It stops being "can an agent write code?" and becomes something harder: how do you structure a pipeline that generates, reviews, and ships code without a human babysitting every step? Run that pipeline across a small cluster of commodity machines (a couple of orchestrator nodes and a handful of workers) and the architecture matters far more than the hardware.
This is a guide to that architecture: how an orchestrator decomposes a coding task, hands it to a worker agent, gates the result through a review step before anything touches a branch, and surfaces blocked work for a human instead of silently retrying forever. We'll show the control flow in pseudocode and be precise about which design choices are load-bearing.
The naive design is a single "do everything" coding agent: hand it a ticket, let it plan, write, test, review, and merge. It demos beautifully and falls apart in production. The more tools and responsibilities you stuff into one agent, the worse its decisions get.
Anthropic's guidance is to start with the simplest thing that works, a single agent, and only reach for a multi-agent system when one agent is genuinely overloaded and the task value justifies the cost (multi-agent setups can burn on the order of 10-15× the tokens of a single chat) (Anthropic, Building Effective AI Agents). A full software-delivery loop clears that bar easily. Generation, dependency resolution, test execution, security review, and merge approval are different jobs with different risk profiles. So you decompose.
The pattern to land on is the orchestrator/worker split, what the broader 2026 literature calls a software factory: a system that takes a specification and produces working, reviewed, deployed software through a multi-stage generate → test → review → ship pipeline (Mager, Software Factory). One orchestrator owns the plan and the gates; many single-responsibility workers do the narrow jobs.
The shape of it:
Here is the core control flow. Identifiers are deliberately generic; this is the shape rather than a config dump:

    # orchestrator node: task decomposition and dispatch
    def run_task(task):
        notes = SessionNotes.load_or_create(task.id)   # file-based, survives a crash
        if not notes.plan:
            notes.plan = decompose(task)               # -> typed subtasks
            notes.save()

        for sub in notes.pending(notes.plan):
            result = dispatch(
                queue="workers",
                agent=sub.role,                        # "generate" | "test" | "review"
                payload=sub,
                tools=least_privilege(sub.role),       # scoped per role
                retry=RetryPolicy(                     # typed, not infinite
                    transient={"timeout", "rate_limit"},
                    max_attempts=3,
                    backoff="exponential",
                ),
            )
            notes.record(sub, result)                  # decision trace -> disk
            notes.save()

            if result.confidence < THRESHOLD or result.status == "blocked":
                return escalate(task, reason=result, notes=notes)  # human, not retry-loop

        # hard stop: a worker can PROPOSE a merge; only the gate can allow one
        if review_gate(notes).passed:
            return open_pull_request(task, notes)      # still human-approved to merge
        return escalate(task, reason="review_gate_failed", notes=notes)


    # worker node: single responsibility, scoped tools, structured return
    def handle(payload, tools):
        out = do_one_job(payload, tools)               # generate OR test OR review
        return Result(
            status="ok" if out.ok else "blocked",
            confidence=out.confidence,                 # orchestrator gates on this
            artifact=out.artifact,                     # diff, test report, review notes
            trace=out.trace,                           # every action, logged
        )

Two design choices in that snippet are load-bearing:
Typed retry policies, not retry loops. A worker that hits a timeout or a rate limit should retry with backoff; that's transient. A worker that returns low-confidence output or a failing review should not retry until it accidentally passes; it should stop and escalate. We favor circuit-breaker stops over open-ended retry loops and enforce that distinction in the policy itself.
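To make the transient-versus-terminal distinction concrete, here is a minimal sketch of such a policy. The field names `transient` and `max_attempts` come from the pseudocode above; `TransientError`, `Escalate`, `call_with_policy`, and `base_delay` are hypothetical names introduced for illustration, and the jittered backoff is one common choice rather than something the pipeline prescribes.

```python
import random
import time
from dataclasses import dataclass


class TransientError(Exception):
    """A failure that may succeed on retry; kind is e.g. "timeout"."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


class Escalate(Exception):
    """Raised when the policy says: stop and hand this to a human."""


@dataclass(frozen=True)
class RetryPolicy:
    transient: frozenset = frozenset({"timeout", "rate_limit"})
    max_attempts: int = 3
    base_delay: float = 0.5  # seconds; hypothetical default

    def delay(self, attempt: int) -> float:
        # Exponential backoff with jitter: up to base_delay * 2^(attempt - 1).
        return random.uniform(0, self.base_delay * 2 ** (attempt - 1))


def call_with_policy(job, policy: RetryPolicy, sleep=time.sleep):
    """Run job(); retry only failures the policy lists as transient."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return job()
        except TransientError as err:
            out_of_budget = attempt == policy.max_attempts
            if err.kind not in policy.transient or out_of_budget:
                raise Escalate(f"{err.kind} after {attempt} attempt(s)") from err
            sleep(policy.delay(attempt))
```

Note what the sketch does not do: a job that returns normally with a low-confidence result is never retried here. That decision stays with the orchestrator's gate, which escalates instead.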
Structured output the orchestrator can validate. Workers don't return prose; they return a typed Result with a status, confidence score, artifact, and trace. The orchestrator gates on fields, not vibes. Matching response structure to what the model handles well measurably improves reliability (Anthropic, Writing tools for agents).
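A sketch of what gating on fields can look like, assuming the worker's output arrives as a plain dictionary. The four field names follow the `Result` in the pseudocode above; `parse_result`, `should_escalate`, the allowed status set, and the `0.8` threshold are illustrative assumptions.

```python
from dataclasses import dataclass

VALID_STATUSES = {"ok", "blocked"}
THRESHOLD = 0.8  # hypothetical confidence cutoff


@dataclass(frozen=True)
class Result:
    status: str
    confidence: float
    artifact: str
    trace: tuple = ()

    def __post_init__(self):
        # Reject anything the orchestrator cannot gate on.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence!r}")


def parse_result(raw: dict) -> Result:
    """Build a Result from raw worker output; malformed output raises ValueError."""
    try:
        return Result(
            status=raw["status"],
            confidence=float(raw["confidence"]),
            artifact=raw["artifact"],
            trace=tuple(raw.get("trace", ())),
        )
    except (KeyError, TypeError) as err:
        raise ValueError(f"malformed worker output: {err!r}") from err


def should_escalate(result: Result, threshold: float = THRESHOLD) -> bool:
    """Same condition as the orchestrator loop: blocked or below threshold."""
    return result.status == "blocked" or result.confidence < threshold
```

Treating a malformed result as an error rather than as a soft failure keeps a worker from slipping free-form prose past the gate.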
Autonomy without constraints is just a faster way to ship bad code. The constraints we hold to:
We are not the first to think about this. Two frameworks are the closest analogues, and it's worth being precise about where we overlap and where we don't.
CrewAI centers on role-based "crews": agents with explicit roles and tasks, run sequentially or hierarchically, where a manager agent can dynamically assign work (CrewAI on GitHub). The role-decomposition instinct is the same. The divergence is underneath: CrewAI is a Python framework with its own crew and flow abstractions; our split lives in a TypeScript monorepo with file-based state we control directly.
The OpenAI Agents SDK is the lightweight option: agents, tools, handoffs, guardrails, and built-in tracing, where a handoff transfers control to a specialist and guardrails validate input and output (OpenAI Agents SDK). Its trace-everything and guardrails-as-validation ideas map closely to our decision traces and structured-result gates. The divergence is the handoff model: the SDK's specialists hand a conversation to each other, while our workers return a typed artifact to an orchestrator that owns every gate, so gate logic stays centralized and inspectable rather than spread across handoffs. (Microsoft's Agent Framework, which absorbed AutoGen and Semantic Kernel, both now in maintenance mode (VentureBeat), is a real option too, but it is a .NET/Python enterprise platform, the opposite of a bare-metal TypeScript build.)
So why build it yourself rather than adopt a framework? Two reasons recur. First, a monorepo you own end to end, so no framework upgrade breaks your orchestration out from under you. Second, file-based state: plans, notes, and traces as plain files are debuggable, version-controllable, and durable across a crash. And a small naming discipline helps: name code by its domain role (orchestrator, review_gate, generation_worker), not by a framework's abstractions, so the system stays legible as it grows. If no framework gives you all of that, a focused DIY build that centralizes every gate in one inspectable orchestrator is a reasonable choice.
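As an illustration of file-based state that survives a crash, here is a minimal sketch of a notes store. It is written in Python to match the control-flow pseudocode earlier, not the TypeScript monorepo described above. The JSON layout, the `root` argument, and the simplified `pending()` signature are assumptions for this example; the one load-bearing idea is writing to a temporary file and renaming it into place, so a crash mid-write never leaves a half-written notes file.

```python
import json
import os
import tempfile
from pathlib import Path


class SessionNotes:
    """Plan and per-subtask results for one task, stored as a JSON file."""

    def __init__(self, path: Path, plan=None, done=None):
        self.path = path
        self.plan = plan or []   # ordered subtask ids
        self.done = done or {}   # subtask id -> recorded result

    @classmethod
    def load_or_create(cls, root: Path, task_id: str) -> "SessionNotes":
        path = Path(root) / f"{task_id}.json"
        if path.exists():
            data = json.loads(path.read_text(encoding="utf-8"))
            return cls(path, data["plan"], data["done"])
        return cls(path)

    def pending(self):
        # Subtasks with no recorded result yet, in plan order.
        return [sub for sub in self.plan if sub not in self.done]

    def record(self, sub: str, result: dict) -> None:
        self.done[sub] = result

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"plan": self.plan, "done": self.done}, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, self.path)  # rename into place; old file stays intact until then
```

After a restart, `load_or_create` returns the saved plan and `pending()` yields only the subtasks that had not been recorded, which is what lets a run resume instead of starting over.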
If you run such a pipeline on your own hardware, the cluster can do double duty: host a local, open-source LLM for routine generation, while reserving frontier API models for hard reasoning and orchestration. The split, in practice:
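A routing rule along these lines can be sketched in a few lines. The job kinds and model labels below are hypothetical names; the division follows the article's own description (routine boilerplate, scaffolding, and first-pass review on the local model; decomposition, ambiguous design calls, and final review of consequential changes on a frontier model). Sending unknown kinds to the frontier model and non-consequential final reviews to the local one are assumptions of this sketch.

```python
from dataclasses import dataclass

LOCAL = "local-open-model"        # hypothetical label for the in-house model
FRONTIER = "frontier-api-model"   # hypothetical label for the API model

# Routine, high-volume work that the local model absorbs.
LOCAL_KINDS = {"boilerplate", "scaffold", "first_pass_review"}


@dataclass(frozen=True)
class Job:
    kind: str
    consequential: bool = False


def pick_model(job: Job) -> str:
    if job.kind in LOCAL_KINDS:
        return LOCAL
    if job.kind == "final_review" and not job.consequential:
        return LOCAL  # assumption: low-stakes final reviews stay local
    # Decomposition, design calls, consequential reviews, and anything
    # unrecognized go to the frontier model (conservative default).
    return FRONTIER
```

Keeping the rule in one small function means the cost trade-off is visible in code review rather than buried in prompt text.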
The hardest parts to get right are rarely the happy path. They're the edges: backpressure when the queue fills with busy workers, and making file-based notes robust enough that a mid-run crash truly resumes cleanly instead of restarting. Build the pipeline so the routine work flows automatically and the genuinely hard cases raise their hand. That's the whole point.
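For the backpressure edge, one simple approach is a bounded queue that refuses new work when it is full, so the orchestrator has to slow down or surface the problem instead of piling up an unbounded backlog. This sketch assumes an in-process `queue.Queue` purely for illustration; the article does not specify its queue, and `Backpressure` and `submit` are hypothetical names.

```python
import queue


class Backpressure(Exception):
    """The worker queue is full; the caller should pause or escalate."""


def submit(work_queue: queue.Queue, subtask, timeout: float = 0.1) -> None:
    """Enqueue a subtask, waiting at most `timeout` seconds for a free slot."""
    try:
        work_queue.put(subtask, timeout=timeout)
    except queue.Full:
        raise Backpressure(f"queue full at {work_queue.maxsize} items") from None


# Usage: a queue sized to what the workers can realistically absorb.
workers_queue = queue.Queue(maxsize=2)
submit(workers_queue, "generate")
submit(workers_queue, "test")
```

A third `submit` on this queue raises `Backpressure` after the timeout, which gives the orchestrator an explicit signal to handle rather than a silently growing backlog.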
What is an autonomous software factory?
A multi-agent pipeline that moves a coding task through generate → test → review → ship stages with minimal human intervention. An orchestrator decomposes the work and owns the gates; single-responsibility worker agents do the narrow jobs. The goal isn't zero humans; it's routing routine work automatically and escalating the rest.
Why an orchestrator/worker split instead of one capable coding agent?
Because a single agent's decision quality degrades as you pile on tools and responsibilities. Anthropic's guidance is to keep things simple until one agent is genuinely overloaded and the task value justifies the higher token cost of going multi-agent. Code generation, testing, security review, and merge approval are different jobs with different risk profiles, so each gets a dedicated, narrowly scoped agent.
How do you stop an agent from merging bad code?
The merge gate is a hard stop enforced by branch protection, not a request in a prompt. A worker can propose a pull request but never merge one; every merge needs a passing review track and, for consequential changes, explicit human approval. Prompt instructions are not a security control.
What happens when a worker returns low-confidence or blocked output?
It escalates to a human. Typed retry policies retry transient failures (timeouts, rate limits) with backoff, but low-confidence or failed-review output is not retried into eventually passing. Silent retry loops are how autonomous systems ship subtle breakage.
Why build your own pipeline instead of adopting CrewAI or the OpenAI Agents SDK?
The role-decomposition idea is shared, but a DIY build can give you a monorepo you own end to end, file-based state you can debug and version directly, and code named by domain role rather than framework abstraction. If no framework matches all three, centralizing every gate in one inspectable orchestrator is a reasonable reason to build.
How does an in-house model change the economics?
A local open-source model absorbs the high-volume routine work (boilerplate, scaffolding, first-pass review) with no per-token API bill. Frontier API models stay reserved for decomposition, ambiguous design calls, and final review on consequential changes: expensive reasoning where it earns its cost, the rest on hardware we already own.