Building an Autonomous Software Factory: Wiring an Orchestrator and Worker Agents

Once you can run AI coding agents at all, the interesting question changes. It stops being "can an agent write code?" and becomes something harder: how do you structure a pipeline that generates, reviews, and ships code without a human babysitting every step? Run that pipeline across a small cluster of commodity machines (a couple of orchestrator nodes and a handful of workers) and the architecture matters far more than the hardware.

This is a guide to that architecture: how an orchestrator decomposes a coding task, hands it to a worker agent, gates the result through a review step before anything touches a branch, and surfaces blocked work for a human instead of silently retrying forever. We'll show the control flow in pseudocode and be precise about which design choices are load-bearing.

The design (and why not one giant agent)

The naive design is a single "do everything" coding agent: hand it a ticket, let it plan, write, test, review, and merge. It demos beautifully and falls apart in production. The more tools and responsibilities you stuff into one agent, the worse its decisions get.

Anthropic's guidance is to start with the simplest thing that works — a single agent — and only reach for a multi-agent system when one agent is genuinely overloaded and the task value justifies the cost (multi-agent setups can burn on the order of 10–15× the tokens of a single chat) (Anthropic, Building Effective AI Agents). A full software-delivery loop clears that bar easily. Generation, dependency resolution, test execution, security review, and merge approval are different jobs with different risk profiles. So you decompose.

The pattern to land on is the orchestrator/worker split — what the broader 2026 literature calls a software factory: a system that takes a specification and produces working, reviewed, deployed software through a multi-stage generate → test → review → ship pipeline (Mager, Software Factory). One orchestrator owns the plan and the gates; many single-responsibility workers do the narrow jobs.

How to wire the pipeline

The shape of it:

Orchestrator node — decomposes a task into typed subtasks, dispatches them to a worker queue, collects structured results, runs the review gate, and decides what to escalate. It never writes feature code itself.
Worker nodes — each runs a single-responsibility agent (generation, test, or review) with a least-privilege tool set scoped to exactly that job.
Async task queue — the orchestrator doesn't block on a worker; it enqueues a subtask, the worker picks it up, and the result returns as a structured payload the orchestrator can validate.
File-based session notes — every task writes its plan, decisions, and intermediate state to disk, so a process that dies mid-run resumes from the notes instead of starting over.
Approval gate — no code lands on main without passing a review pass, enforced in infrastructure (branch protection), not as a polite instruction in a prompt.
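
To make the session-notes idea concrete, here is a minimal sketch of notes that survive a crash. It is an illustration under assumptions, not the layout any particular system uses: the JSON file format, the `root` directory argument, and the simplified `pending()` signature are all hypothetical. The one technique it relies on is writing to a temporary file and swapping it in with `os.replace`, so a process that dies mid-write leaves the previous notes intact.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SessionNotes:
    """Hypothetical on-disk notes for one task: the plan plus recorded results."""
    task_id: str
    root: Path
    plan: list = field(default_factory=list)      # subtask ids, in order
    results: dict = field(default_factory=dict)   # subtask id -> result payload

    @property
    def path(self) -> Path:
        return self.root / f"{self.task_id}.json"

    @classmethod
    def load_or_create(cls, task_id: str, root: Path) -> "SessionNotes":
        path = root / f"{task_id}.json"
        if path.exists():
            data = json.loads(path.read_text(encoding="utf-8"))
            return cls(task_id, root, data["plan"], data["results"])
        return cls(task_id, root)

    def pending(self) -> list:
        # Subtasks with no recorded result yet; a resumed run only redoes these.
        return [sub for sub in self.plan if sub not in self.results]

    def record(self, sub: str, result: dict) -> None:
        self.results[sub] = result

    def save(self) -> None:
        # Write to a temp file in the same directory, then atomically swap it in,
        # so a crash mid-write never leaves a half-written notes file behind.
        self.root.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"plan": self.plan, "results": self.results}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, self.path)
```

A restarted process calls `load_or_create` with the same task id and picks up at the first subtask without a recorded result.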

Here is the core control flow — identifiers deliberately generic, the shape rather than a config dump:

# orchestrator node — task decomposition and dispatch
def run_task(task):
    notes = SessionNotes.load_or_create(task.id)   # file-based, survives a crash

    if not notes.plan:
        notes.plan = decompose(task)               # -> typed subtasks
        notes.save()

    for sub in notes.pending(notes.plan):
        result = dispatch(
            queue="workers",
            agent=sub.role,                         # "generate" | "test" | "review"
            payload=sub,
            tools=least_privilege(sub.role),        # scoped per role
            retry=RetryPolicy(                       # typed, not infinite
                transient={"timeout", "rate_limit"},
                max_attempts=3,
                backoff="exponential",
            ),
        )
        notes.record(sub, result)                   # decision trace -> disk
        notes.save()

        if result.confidence < THRESHOLD or result.status == "blocked":
            return escalate(task, reason=result, notes=notes)   # human, not retry-loop

    # hard stop: a worker can PROPOSE a merge; only the gate can allow one
    if review_gate(notes).passed:
        return open_pull_request(task, notes)       # still human-approved to merge
    return escalate(task, reason="review_gate_failed", notes=notes)

# worker node — single responsibility, scoped tools, structured return
def handle(payload, tools):
    out = do_one_job(payload, tools)                # generate OR test OR review
    return Result(
        status="ok" if out.ok else "blocked",
        confidence=out.confidence,                  # orchestrator gates on this
        artifact=out.artifact,                      # diff, test report, review notes
        trace=out.trace,                            # every action, logged
    )

Two design choices in that snippet are load-bearing:

Typed retry policies, not retry loops. A worker that hits a timeout or a rate limit should retry with backoff — that's transient. A worker that returns low-confidence output or a failing review should not retry until it accidentally passes; it should stop and escalate. We favor circuit-breaker stops over open-ended retry loops and enforce that distinction in the policy itself.
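
One way to encode that distinction is sketched below, under assumptions: `WorkerError`, `dispatch_with_retry`, and the `base_delay` field are hypothetical names, and the injectable `sleep` parameter exists only to make the backoff testable. The point is that only failure kinds named in the policy are retried. A low-confidence result is an ordinary return value, so it never enters the retry path at all.

```python
import time
from dataclasses import dataclass


class WorkerError(Exception):
    """Hypothetical error carrying a failure kind such as 'timeout'."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


@dataclass(frozen=True)
class RetryPolicy:
    transient: frozenset = frozenset({"timeout", "rate_limit"})
    max_attempts: int = 3
    base_delay: float = 0.5   # seconds; doubled after each failed attempt


def dispatch_with_retry(call, policy: RetryPolicy, sleep=time.sleep):
    """Retry only the failure kinds the policy names as transient."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            # Any returned result, including a low-confidence one, goes
            # straight back to the caller and is never retried here.
            return call()
        except WorkerError as err:
            # Non-transient kinds and the final attempt are re-raised.
            if err.kind not in policy.transient or attempt == policy.max_attempts:
                raise
            sleep(policy.base_delay * 2 ** (attempt - 1))
```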

Structured output the orchestrator can validate. Workers don't return prose; they return a typed Result with a status, confidence score, artifact, and trace. The orchestrator gates on fields, not vibes. Matching response structure to what the model handles well measurably improves reliability (Anthropic, Writing tools for agents).
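
A rough sketch of what "gating on fields" can look like follows. The field names come from the worker snippet; everything else is an assumption made for illustration, including the `parse_result` and `should_escalate` helpers, the 0.8 threshold, and the choice to treat confidence as a number between 0 and 1.

```python
from dataclasses import dataclass

VALID_STATUSES = {"ok", "blocked"}
THRESHOLD = 0.8   # assumed value; the article does not specify one


@dataclass(frozen=True)
class Result:
    status: str
    confidence: float
    artifact: str
    trace: tuple


def parse_result(raw: dict) -> Result:
    """Reject malformed worker payloads before any gate looks at them."""
    missing = {"status", "confidence", "artifact", "trace"} - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["status"] not in VALID_STATUSES:
        raise ValueError(f"unknown status: {raw['status']!r}")
    confidence = raw["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return Result(raw["status"], float(confidence), raw["artifact"], tuple(raw["trace"]))


def should_escalate(result: Result) -> bool:
    # Same condition as the orchestrator loop: low confidence or a blocked worker.
    return result.confidence < THRESHOLD or result.status == "blocked"
```

A payload that fails parsing never reaches the gate, which keeps "the worker said something odd" separate from "the worker said no".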

The security model: least privilege and a hard merge stop

Autonomy without constraints is just a faster way to ship bad code. The constraints we hold to:

Least-privilege tool access per worker. A generation worker can read the repo and write to a scratch workspace; it cannot push, reach production secrets, or call deployment APIs. A review worker can read a diff and a test report and nothing else. Each agent's tool set is the minimal set of actions it must perform, with everything else denied by default: the zero-trust posture Anthropic recommends, applied per role.
No worker can push to main. The merge gate is a hard stop enforced by branch protection, not a sentence in a system prompt. A worker can propose a pull request; it can never merge one. "Please don't merge without review" is not a security control — branch protection is.
Every action carries a decision trace. Each worker returns, and each orchestrator step records to disk, a trace of what was done and why — both an audit log and a debugging tool. When a task goes sideways, we can replay exactly which agent made which call.
Human escalation over silent failure. Low-confidence or blocked output routes to a person. The pipeline's job is to do the routine work and raise its hand on the rest — not grind in a loop pretending it's fine.
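
The deny-by-default rule can be sketched as a plain allowlist per role. The tool names here are hypothetical, and the test worker's tool set is an assumption since only the generation and review workers are described above. The merge stop itself still belongs in branch protection; this only illustrates the per-role scoping.

```python
# Hypothetical tool names; the role names match the ones used in the control flow.
ALLOWED_TOOLS = {
    "generate": frozenset({"read_repo", "write_scratch"}),
    "test": frozenset({"read_repo", "run_tests"}),          # assumed tool set
    "review": frozenset({"read_diff", "read_test_report"}),
}


def least_privilege(role: str) -> frozenset:
    # An unknown role gets an empty tool set rather than a permissive default.
    return ALLOWED_TOOLS.get(role, frozenset())


def authorize(role: str, tool: str) -> None:
    """Raise unless the tool is explicitly granted to the role (deny by default)."""
    if tool not in least_privilege(role):
        raise PermissionError(f"{role!r} may not call {tool!r}")
```

Because no role lists a push or merge tool, a call such as `authorize("generate", "push_main")` raises `PermissionError` instead of relying on the agent to behave.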

Where this diverges from CrewAI and the OpenAI Agents SDK

We are not the first to think about this. Two frameworks are the closest analogues, and it's worth being precise about where we overlap and where we don't.

CrewAI centers on role-based "crews" — agents with explicit roles and tasks, run sequentially or hierarchically, where a manager agent can dynamically assign work (CrewAI on GitHub). The role-decomposition instinct is the same. The divergence is underneath: CrewAI is a Python framework with its own crew and flow abstractions; our split lives in a TypeScript monorepo with file-based state we control directly.

The OpenAI Agents SDK is the lightweight option — agents, tools, handoffs, guardrails, and built-in tracing, where a handoff transfers control to a specialist and guardrails validate input and output (OpenAI Agents SDK). Its trace-everything and guardrails-as-validation ideas map closely to our decision traces and structured-result gates. The divergence is the handoff model: the SDK's specialists hand a conversation to each other, while our workers return a typed artifact to an orchestrator that owns every gate — gate logic centralized and inspectable, not spread across handoffs. (Microsoft's Agent Framework, which absorbed AutoGen and Semantic Kernel — both now in maintenance mode (VentureBeat) — is a real option too, but a .NET/Python enterprise platform, the opposite of a bare-metal TypeScript build.)

So why build it yourself rather than adopt a framework? Two reasons recur. First, a monorepo you own end to end, so no framework upgrade breaks your orchestration out from under you. Second, file-based state — plans, notes, and traces as plain files are debuggable, version-controllable, and durable across a crash. And a small naming discipline helps: name code by its domain role (orchestrator, review_gate, generation_worker), not by a framework's abstractions, so the system stays legible as it grows. If no framework gives you all of that, a focused DIY build that centralizes every gate in one inspectable orchestrator is a reasonable choice.

A model split that controls cost

If you run such a pipeline on your own hardware, the cluster can do double duty: host a local, open-source LLM for routine generation, while reserving frontier API models for hard reasoning and orchestration. The split, in practice:

Local open-source model — boilerplate, mechanical refactors, test scaffolding, first-pass review. High volume, low ambiguity, no per-token API bill.
Frontier API model — task decomposition, ambiguous design calls, the final review pass on anything consequential. Low volume, high stakes, worth the cost.
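
As a sketch, the split reduces to a small routing function. The task-kind labels below are hypothetical names for the categories in the two lists, and sending any unrecognized kind to the frontier tier is an assumption (the conservative default), not something the split itself dictates.

```python
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    FRONTIER = "frontier"


# Hypothetical labels for the high-volume, low-ambiguity work described above.
LOCAL_KINDS = frozenset({
    "boilerplate",
    "mechanical_refactor",
    "test_scaffolding",
    "first_pass_review",
})


def route(kind: str) -> Tier:
    # Anything not explicitly known to be routine goes to the frontier model.
    return Tier.LOCAL if kind in LOCAL_KINDS else Tier.FRONTIER
```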

The hardest parts to get right are rarely the happy path. They're the edges: backpressure when the queue fills faster than busy workers can drain it, and making file-based notes robust enough that a mid-run crash truly resumes cleanly instead of restarting. Build the pipeline so the routine work flows automatically and the genuinely hard cases raise their hand. That's the whole point.
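
For the backpressure edge, one simple approach is a bounded queue whose producer gets a clear "no" instead of blocking forever. The sketch below uses Python's standard `queue.Queue` as an in-process stand-in for whatever task queue you actually run; `try_enqueue` and its timeout are hypothetical.

```python
import queue


def try_enqueue(q: queue.Queue, subtask, timeout: float = 0.1) -> bool:
    """Return False instead of blocking indefinitely when the queue is full."""
    try:
        q.put(subtask, timeout=timeout)
        return True
    except queue.Full:
        # The caller decides what to do next: hold the subtask in the
        # session notes and try later, or escalate if the queue stays full.
        return False
```

With `queue.Queue(maxsize=2)`, a third `try_enqueue` returns `False` until a worker takes an item off the queue, which gives the orchestrator a signal it can act on.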

Frequently asked questions

What is an autonomous software factory?

A multi-agent pipeline that moves a coding task through generate → test → review → ship stages with minimal human intervention. An orchestrator decomposes the work and owns the gates; single-responsibility worker agents do the narrow jobs. The goal isn't zero humans — it's routing routine work automatically and escalating the rest.

Why an orchestrator/worker split instead of one capable coding agent?

Because a single agent's decision quality degrades as you pile on tools and responsibilities. Anthropic's guidance is to keep things simple until one agent is genuinely overloaded and the task value justifies the higher token cost of going multi-agent. Code generation, testing, security review, and merge approval are different jobs with different risk profiles, so each gets a dedicated, narrowly scoped agent.

How do you stop an agent from merging bad code?

The merge gate is a hard stop enforced by branch protection, not a request in a prompt. A worker can propose a pull request but never merge one; every merge has to clear a review pass and, for consequential changes, explicit human approval. Prompt instructions are not a security control.

What happens when a worker returns low-confidence or blocked output?

It escalates to a human. Typed retry policies retry transient failures (timeouts, rate limits) with backoff, but low-confidence or failed-review output is not retried into eventually passing. Silent retry loops are how autonomous systems ship subtle breakage.

Why build your own pipeline instead of adopting CrewAI or the OpenAI Agents SDK?

The role-decomposition idea is shared, but a DIY build can give you a monorepo you own end to end, file-based state you can debug and version directly, and code named by domain role rather than framework abstraction. If no framework matches all three, centralizing every gate in one inspectable orchestrator is a reasonable reason to build.

How does an in-house model change the economics?

A local open-source model absorbs the high-volume routine work — boilerplate, scaffolding, first-pass review — with no per-token API bill. Frontier API models stay reserved for decomposition, ambiguous design calls, and final review on consequential changes — expensive reasoning where it earns its cost, the rest on hardware we already own.