Where Autonomy Stops: Designing the Escalation Seams in an Autonomous Software Factory

A companion guide walked the happy path through an autonomous software factory — how a coding task flows from an orchestrator to a worker agent, through a review pass, and back as a structured result. Clean pipelines make for clean diagrams. But what decides whether a software factory is safe to leave running isn't the happy path. It's the seams: the exact points where the pipeline stops being autonomous and hands control back to a human.

This is a guide to those seams. The failure handling is the real product; the code generation is almost an afterthought by comparison. So the useful framing is honest accounting: what such a pipeline genuinely automates, what stays a human's job, and exactly where the hard line between "the agents handle this" and "a human looks at this" should live.

To make this stand alone: an orchestrator decomposes a task into typed subtasks, dispatches them to single-responsibility worker agents over an async queue, collects structured results, and runs a review gate before any code is proposed for a branch. Workers run with least-privilege tool access and cannot push to main. Anthropic frames this orchestrator-workers pattern as the right call "when you can't predict the subtasks needed" (Anthropic, Building Effective Agents). That's the skeleton. Everything below is about what happens when a worker doesn't return a clean result.

Why the seams are the whole game

The industry has a name for the system with no seams at all: the dark factory, where the full lifecycle is "managed by AI agents without requiring human sign-off on individual changes" (MindStudio). It's not our target, and the reason is one sentence from that same writeup: "An AI agent that can write and ship code without review can also write and ship bad code without review." They cite an agent with inappropriate access wiping a 1.9-million-row database — the failure mode you buy when you remove the last human checkpoint.

So treat autonomy as a ladder, not a switch: read-only suggestions → draft PRs a human merges → auto-merge for low-risk changes → scoped autonomy in a sandbox → full factory (Mager, Software Factory). The defensible place to sit is rung two — the agent proposes, a human merges — and not to climb until the rung below is boringly reliable. The bottleneck isn't model capability; as Mager puts it, "The bottleneck isn't the technology — it's the process and the test coverage that makes auto-merge safe."

The three seams, and what each one actually does

A task can leave the autonomous path in three ways. Each is a designed seam, not an accident.

Seam 1 — Low confidence: escalate, don't grind

When a worker returns output it isn't sure about, retrying until it accidentally passes is how autonomous systems ship subtle breakage. The right move is confidence thresholding with human escalation: flag any result below a defined confidence bar and route it to a person (Galileo). The orchestrator gates on a numeric confidence field in the worker's result; below threshold, the task stops and posts to the escalation queue with its full trace attached.
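
A minimal sketch of that gate, under stated assumptions: the worker result is modeled as a small dataclass, the escalation queue is an in-memory list, and the names WorkerResult, gate_on_confidence, and CONFIDENCE_THRESHOLD are hypothetical. The 0.70 default is a placeholder, not a recommended value.

```python
from dataclasses import dataclass, field
from typing import Any

CONFIDENCE_THRESHOLD = 0.70  # assumed placeholder; calibrate against real tasks


@dataclass
class WorkerResult:
    task_id: str
    confidence: float  # numeric confidence reported alongside the result
    trace: list[dict[str, Any]] = field(default_factory=list)


def gate_on_confidence(
    result: WorkerResult,
    escalation_queue: list[dict[str, Any]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> bool:
    """Return True if the task may continue; otherwise escalate and return False."""
    if result.confidence >= threshold:
        return True
    # Below the bar: stop the task and hand the full trace to a human.
    escalation_queue.append({
        "task_id": result.task_id,
        "seam": "low_confidence",
        "confidence": result.confidence,
        "threshold": threshold,
        "trace": result.trace,
        "retries_used": 0,  # low confidence is escalated, never retried
    })
    return False
```

The function deliberately has no retry branch. The only two outcomes are "continue" and "a human looks at this."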

Seam 2 — Repeated transient failure: the circuit breaker

Timeouts and rate limits are transient — they deserve a bounded retry with backoff, but "bounded" is the operative word. Open-ended retry loops are a denial-of-service attack on your own pipeline. We model this as a circuit breaker: normal flow while failures are rare, then fail fast once they cross a threshold rather than retrying forever (Galileo). Anthropic makes the same point — autonomous agents carry "the potential for compounding errors," so their guidance recommends "stopping conditions (such as a maximum number of iterations) to maintain control."
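
One way to sketch the bounded retry plus fail-fast behavior is below. It is a simplified illustration, not a production breaker: it counts consecutive transient failures and opens once they reach a threshold, and it omits the cooldown and half-open states a full circuit breaker would have. TransientError, CircuitOpenError, and all default values are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, max_retries: int = 2,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.failure_threshold = failure_threshold
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests need not wait
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpenError("circuit open: escalate instead of retrying")
        attempt = 0
        while True:
            try:
                result = fn()
            except TransientError as exc:
                self.consecutive_failures += 1
                if self.is_open:
                    raise CircuitOpenError("failure threshold reached") from exc
                if attempt >= self.max_retries:
                    raise  # retry budget spent for this call
                self.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
                attempt += 1
            else:
                self.consecutive_failures = 0  # any success closes the streak
                return result
```

Both limits are hard caps: max_retries bounds a single call, and failure_threshold bounds the streak across calls. Once the breaker is open, the wrapped function is not invoked at all.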

Seam 3 — Blast radius and ambiguity: the human-judgment gate

Some changes should never auto-proceed regardless of confidence, because the category of change is the risk. The safe default keeps humans involved for "novel business logic," "changes with high blast radius," and "ambiguous requirements" (MindStudio). This seam is policy, not a model decision — the orchestrator classifies the task up front and routes high-blast-radius work to a human checkpoint before a worker ever touches it. As Anthropic notes, "human review remains crucial for ensuring solutions align with broader system requirements."
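
Because this seam is policy, it can be plain deterministic code that runs before dispatch. The sketch below assumes a task already carries the paths it will touch and two flags set during intake. The Task fields, the route function, and the path prefixes are all hypothetical; which paths count as high blast radius is a team decision.

```python
from dataclasses import dataclass

# Hypothetical policy input: prefixes a team might treat as high blast radius.
HIGH_BLAST_PREFIXES = ("migrations/", "infra/", "auth/")


@dataclass(frozen=True)
class Task:
    task_id: str
    touched_paths: tuple[str, ...]
    novel_business_logic: bool = False
    ambiguous_requirements: bool = False


def route(task: Task) -> str:
    """Classify before dispatch: 'human_checkpoint' or 'dispatch'."""
    high_blast = any(p.startswith(HIGH_BLAST_PREFIXES) for p in task.touched_paths)
    if high_blast or task.novel_business_logic or task.ambiguous_requirements:
        return "human_checkpoint"  # no worker touches this until a human rules
    return "dispatch"
```

Nothing in route consults a model or a confidence score, which is the point: the category of change decides, not how sure anyone is about it.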

The constraint behind all three seams is the same: human-in-the-loop stops being a real strategy the moment the human can't keep up with the loop. The escalation queue has to be legible, low-volume, and carry enough context that a human can rule in seconds. If every task escalates, you haven't built a factory — you've built a slower way to write code by hand.

The artifact that makes a seam debuggable: the escalation record

A seam is only useful if a human can act on it without re-running anything. So every escalation writes a structured record to disk — the decision trace, not prose. The schema we're standardizing on:

// escalation record — written to the task's session dir on any seam trip
{
  "task_id": "t-2026-06-07-0413",
  "seam": "low_confidence",          // low_confidence | circuit_open | policy_gate
  "stage": "review",                 // generate | test | review
  "worker_role": "review",           // single-responsibility role that tripped it
  "confidence": 0.41,                // numeric, gated against threshold
  "threshold": 0.70,
  "blast_radius": "medium",          // set by orchestrator classification, not the worker
  "artifact_ref": "notes/t-.../review-001.diff",
  "trace": [                         // every tool call, in order — replayable
    { "tool": "read_repo",  "ok": true },
    { "tool": "run_tests",  "ok": false, "summary": "2 failing in auth/" }
  ],
  "recommended_action": "human_review",
  "retries_used": 0,                 // 0 here: low confidence is NOT retried
  "created_at": "2026-06-07T04:13:22Z"
}

Two fields are load-bearing. retries_used is zero on a low-confidence trip by design — we retry transient failures, never low-confidence reasoning. And blast_radius is set by the orchestrator's classification, never by the worker, so a worker can't talk its own risky change down to "low."
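
Those two invariants can be enforced at write time rather than by convention. The sketch below is one possible shape, with assumptions labeled: the function name, the file naming scheme, and passing the orchestrator's classification as a separate argument are all hypothetical choices.

```python
import json
from pathlib import Path
from typing import Any

SEAMS = {"low_confidence", "circuit_open", "policy_gate"}


def write_escalation_record(session_dir: Path, record: dict[str, Any],
                            orchestrator_blast_radius: str) -> Path:
    """Validate the load-bearing fields, then write the record as JSON."""
    if record["seam"] not in SEAMS:
        raise ValueError(f"unknown seam: {record['seam']!r}")
    if record["seam"] == "low_confidence" and record.get("retries_used", 0) != 0:
        raise ValueError("low-confidence trips must not be retried")
    # Overwrite anything the worker supplied: classification belongs
    # to the orchestrator.
    final = {**record, "blast_radius": orchestrator_blast_radius}
    session_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: one file per task in the session directory.
    path = session_dir / f"{final['task_id']}.escalation.json"
    path.write_text(json.dumps(final, indent=2), encoding="utf-8")
    return path
```

A worker that reports "blast_radius": "low" on its own change has that value replaced on the way to disk, and a low-confidence record with a nonzero retry count is rejected outright.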

The decision that locks the seams in: a file-based ADR

We don't keep these rules in someone's head or in a prompt. The policy is an Architecture Decision Record — a plain Markdown file in the repo, so the next session (human or agent) inherits the reasoning, not just the behavior:

# ADR-014: Escalation seams and the merge hard stop — Accepted
1. Low confidence escalates to a human; it is never retried into passing.
2. Transient failures retry under a circuit breaker, then fail fast.
3. High-blast-radius / ambiguous tasks route to a human BEFORE dispatch.
4. The merge gate is enforced by branch protection: no worker can push
   to main; a worker may only PROPOSE a pull request.

Writing the policy as a file isn't bureaucracy. It's the difference between a guardrail the system runs on and one that lives in a chat transcript nobody can find next week.

The maturity sequence: what lands first, and what lags

These pieces don't arrive at once, and it's worth being honest about the order — because the parts that land last are exactly the ones that make the system safe to leave running:

Lands first (the easy wins): the orchestrator/worker split, least-privilege tool scoping per role, branch protection as the hard merge stop, and the structured-result schema the seams gate on. This is mostly plumbing, and it's the part demos show off.
Lags (the hard part): calibrating the seams. The code paths for escalation are straightforward; knowing the right confidence threshold — and whether the queue stays low-volume under real tasks instead of crying wolf — only comes from running real work through it.
Lands last (and matters most): the escalation queue's human-facing surface. Early on, an escalation is just a record on disk that someone has to go look for. The notification-and-triage layer that lets a human actually keep up with the loop is the final, and most important, build.

The lesson: budget as if the happy path is roughly the quick 20% of the work and the escalation machinery is the remaining 80%, the part that determines whether the factory is trustworthy. A pipeline that can generate and review code but can't reliably get a human's attention when it's unsure isn't finished; it's dangerous.

What's next

The next entry is the first real task through the full pipeline, plus deliberately tripping each seam to confirm it fails the way it's supposed to. A software factory you trust isn't one where the happy path is fast — it's one where you've watched it fail safely, on purpose, before you ever leave it alone.

Frequently asked questions

What is an "escalation seam" in an autonomous software factory?

It's a designed point where the pipeline stops being autonomous and hands control to a human. We use three: low-confidence output escalates instead of retrying, repeated transient failures trip a circuit breaker that fails fast, and high-blast-radius or ambiguous tasks route to a human checkpoint before any worker runs. The seams — not the code generation — make the system safe to leave running.

Why not let agents merge their own code if the tests pass?

Because passing tests don't cover novel business logic, high-blast-radius changes, or ambiguous requirements — exactly the cases where a human should still look. The industry calls the no-review version a "dark factory," and the documented failure modes (including an agent wiping a 1.9-million-row database) are why the safe posture is "agent proposes, human merges."

How do you decide whether to retry a failure or escalate it?

By failure type. Transient infrastructure failures — timeouts, rate limits — retry under a circuit breaker with a hard cap, then fail fast. Low-confidence reasoning is never retried; the escalation record makes this explicit by logging zero retries used on a low-confidence trip, on purpose.

What stops a worker agent from pushing bad code to main?

Branch protection — a hard infrastructure stop, not a sentence in a prompt. A worker can propose a pull request; it can never merge one. "Please don't merge without review" is not a security control. The rule lives in a file-based Architecture Decision Record so the policy and its reasoning survive across sessions.

What's the most important metric for a system like this?

Escalation-queue volume. If only a small share of tasks escalates, and the ones that do are real problems, the autonomy is real and the human can keep up. If everything escalates, either the confidence thresholds are wrong or the task decomposition is too coarse, and you've built a slower way to code by hand, not a factory.
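
As a rough illustration of tracking that metric, the sketch below computes the escalation rate and a per-seam breakdown from a list of seam names. The function name and input shape are hypothetical, and it assumes at most one escalation per task.

```python
from collections import Counter
from typing import Any, Iterable


def escalation_summary(total_tasks: int, escalated_seams: Iterable[str]) -> dict[str, Any]:
    """Share of tasks that escalated, plus a count per seam."""
    by_seam = Counter(escalated_seams)
    escalated = sum(by_seam.values())
    rate = escalated / total_tasks if total_tasks else 0.0
    return {"escalation_rate": rate, "by_seam": dict(by_seam)}
```

The per-seam split matters as much as the total: a queue dominated by low_confidence points at threshold calibration, while one dominated by policy_gate points at how tasks are being classified or decomposed.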