🤖 Ghostwritten by Claude Opus 4.8 · Fact-checked & edited by GPT 5.5

Part of the "Building the Crew" series

Structured Error Recovery Across a 12-Node Agent Fleet

The most useful reliability upgrade to a 12-node agent fleet is not always a smarter model or a faster machine. It is often a stricter failure contract: every tool call must fail in a structured, machine-readable way.

That became urgent as the ESS Mac mini farm moved into live operation with 12 nodes: two orchestrators and 10 workers. Once orchestrator agents began routing autonomous coding tasks to workers around the clock, failures stopped being occasional interruptions. A flaky Git push, a rate-limited API, a missing dependency, a worker low on disk — these are normal distributed-systems events.

The core lesson is simple: on one laptop, a vague error is annoying; across a fleet, a vague error becomes a systemic failure mode. A worker that silently drops a task creates invisible data loss. An orchestrator that retries blindly burns API budget and repeats already-completed work. An agent that escalates every failure trains humans to ignore alerts.

The fix is to make errors first-class orchestration objects: typed payloads that carry enough context for the system to retry, back off, re-route, or escalate without guessing.

Why Silent Failures Don't Scale

TL;DR: Failure modes that look harmless on one machine become invisible, compounding problems when work is fanned out across distributed workers.

On a single development machine, when an agent tool throws or returns null, a human is often watching. Someone sees the stack trace, reruns the command, and moves on. That feedback loop disappears when work is distributed to 10 worker nodes executing in parallel overnight.

Three failure patterns become especially costly:

The blind retry. A worker hits a transient network error, returns a generic exception, and the orchestrator either gives up or retries the whole task — including expensive steps that already completed.
The dropped task. A tool swallows an error, returns success-shaped output, and the work never actually happened. The failure surfaces later when a downstream task references an artifact that does not exist.
The alarm flood. Every failure routes to a human, so genuine breakage drowns in routine, recoverable noise.

Recent agent engineering guidance has converged on the same pattern: structured error returns, atomic tools, and risk-based human-in-the-loop escalation. The fleet context makes the cost of ignoring that pattern immediate.

Error Recovery Is an Orchestration Decision

TL;DR: Structured errors let an orchestrator distinguish retryable transients from real breakage, which directly affects cost, alert quality, and data integrity.

Every agent failure is really a decision request: retry, back off, route elsewhere, resume from a checkpoint, or escalate? A bare exception forces the orchestrator to infer the answer. A structured error supplies it.

That matters because autonomous coding runs are not free. Planning, editing, verification, and review all consume model context and tool time. A retry that restarts from the beginning can waste completed work. A retry that resumes from the last successful step is materially cheaper and faster.

The same structure improves escalation quality. If an orchestrator can identify a rate limit as recoverable, it can wait and retry. If it sees a missing dependency, it can avoid repeating the same doomed run. If a supposedly recoverable error repeats past a cap, it can promote the issue to a higher severity instead of looping forever.

The error payload is not just a log entry. It is a control-plane message.

The Shared Error Contract

TL;DR: Worker tools return a shared AgentToolError carrying error_code, recoverable, suggested_action, and partial_success, giving the orchestrator enough information to act.

The contract lives in a shared TypeScript package so all 12 nodes import the same definition. Treating the schema like code — versioned, reviewed, and deployed consistently — is the point. A sanitized version of the core type looks like this:

export type ErrorCategory =
  | "transient"      // network blip, rate limit — retry with backoff
  | "resource"       // disk, memory, quota — route to a healthier node
  | "validation"     // bad input — do not retry; fix upstream
  | "dependency"     // missing tool/file/service — escalate to remediate
  | "fatal";         // unknown or unrecoverable — escalate

export interface AgentToolError {
  error_code: string;           // stable, greppable, e.g. "GIT_PUSH_REJECTED"
  category: ErrorCategory;
  recoverable: boolean;         // can the orchestrator act without a human?
  suggested_action: string;     // machine-and-human-readable next step
  partial_success?: {           // what completed before failure
    completed_steps: string[];
    artifacts: string[];        // safe references only — never secrets
  };
  retry_after_ms?: number;      // hint for backoff scheduling
  context: Record<string, string>; // sanitized diagnostic key/value pairs
  schema_version: string;       // detects contract drift across nodes
}

Workers should not throw raw exceptions across the orchestration boundary. They catch, classify, and return an AgentToolError. The partial_success field is especially important because it lets a retry resume from the point of failure instead of redoing completed steps.
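The catch-classify-return boundary can be pictured with the minimal sketch below. It assumes a hypothetical ToolResult envelope, a runGuarded wrapper, and a classify function whose message patterns and error codes (UPSTREAM_TRANSIENT, DISK_FULL, UNCLASSIFIED) are illustrative placeholders, not names from the fleet's actual package. It is synchronous for brevity; a real tool would likely be async.

```typescript
// Trimmed copy of the contract so this sketch compiles on its own.
type ErrorCategory = "transient" | "resource" | "validation" | "dependency" | "fatal";

interface AgentToolError {
  error_code: string;
  category: ErrorCategory;
  recoverable: boolean;
  suggested_action: string;
  partial_success?: { completed_steps: string[]; artifacts: string[] };
  retry_after_ms?: number;
  context: Record<string, string>;
  schema_version: string;
}

// Hypothetical envelope: a tool returns a value or a structured error.
type ToolResult<T> = { ok: true; value: T } | { ok: false; error: AgentToolError };

// Hypothetical classifier. The patterns and codes are placeholders; real
// rules would be specific to each tool and its failure modes.
function classify(err: unknown, completedSteps: string[]): AgentToolError {
  const message = err instanceof Error ? err.message : String(err);
  const common = {
    partial_success: { completed_steps: completedSteps.slice(), artifacts: [] as string[] },
    context: {} as Record<string, string>, // the raw message is deliberately not copied
    schema_version: "1.0.0", // assumed version string
  };
  if (/rate limit|ETIMEDOUT|ECONNRESET/i.test(message)) {
    return { ...common, error_code: "UPSTREAM_TRANSIENT", category: "transient",
             recoverable: true, suggested_action: "retry_with_backoff", retry_after_ms: 5000 };
  }
  if (/ENOSPC/.test(message)) {
    return { ...common, error_code: "DISK_FULL", category: "resource",
             recoverable: true, suggested_action: "reroute_to_healthier_node" };
  }
  // Anything unrecognized is treated as fatal so it reaches a human.
  return { ...common, error_code: "UNCLASSIFIED", category: "fatal",
           recoverable: false, suggested_action: "escalate_to_human" };
}

// Runs a tool body and guarantees nothing is thrown across the boundary.
// The body appends to `done` as each step completes.
function runGuarded<T>(body: (done: string[]) => T): ToolResult<T> {
  const done: string[] = [];
  try {
    return { ok: true, value: body(done) };
  } catch (err) {
    return { ok: false, error: classify(err, done) };
  }
}
```

Defaulting unrecognized failures to fatal is a deliberate choice in this sketch: an unknown error that escalates is noisy, but an unknown error that retries is silent.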

A typical decision tree looks like this:

transient → retry with retry_after_ms backoff and capped attempts.
resource → re-route the task to a healthier worker.
validation or dependency → avoid blind retries; route for triage.
fatal or repeated failures → escalate through the ladder to a human.
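The decision tree above can be sketched as a single pure function. This is an illustration under stated assumptions, not the fleet's actual orchestrator code: the Decision shape, the decide function, the default cap of three attempts, and the 1000 ms fallback delay are all hypothetical.

```typescript
type ErrorCategory = "transient" | "resource" | "validation" | "dependency" | "fatal";

// Only the fields this decision needs from an AgentToolError.
interface DecisionInput {
  category: ErrorCategory;
  retry_after_ms?: number;
}

// Hypothetical decision shape handed back to the scheduler.
type Decision =
  | { action: "retry"; delay_ms: number }
  | { action: "reroute" }
  | { action: "triage" }
  | { action: "escalate" };

const BASE_DELAY_MS = 1000; // assumed fallback when the worker gives no hint

// attemptsMade counts failures so far for this task, including this one.
function decide(error: DecisionInput, attemptsMade: number, maxAttempts: number = 3): Decision {
  // Repeated failures of any category go up the ladder instead of looping.
  if (attemptsMade >= maxAttempts) return { action: "escalate" };

  switch (error.category) {
    case "transient":
      // Prefer the worker's hint; otherwise fall back to exponential backoff.
      return { action: "retry", delay_ms: error.retry_after_ms ?? (BASE_DELAY_MS * 2 ** attemptsMade) };
    case "resource":
      return { action: "reroute" };
    case "validation":
    case "dependency":
      return { action: "triage" };
    case "fatal":
      return { action: "escalate" };
  }
}
```

Keeping the function free of I/O means the whole policy can be unit-tested without a worker, a queue, or a clock.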

An intermediate remediation agent can handle mechanical but non-trivial follow-up — re-provisioning a dependency, opening a tracking issue, or summarizing a failure cluster — so human attention is reserved for genuinely novel breakage.

Operational Guardrails

TL;DR: Error payloads must be sanitized, retry behavior must be capped, and schema drift across nodes must be treated as an operational risk.

Do not leak through error messages. Stack traces and context maps often contain absolute file paths, environment values, internal URLs, and occasionally credentials. The diagnostic context map should be sanitized before it crosses a tool boundary. An error surfaced to an agent, log store, or dashboard is an untrusted output channel.
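A minimal sketch of that sanitization step follows. It assumes a hypothetical allow-list of context keys and a handful of illustrative redaction patterns; neither the key names nor the patterns come from the fleet's real package. Pattern-based scrubbing is best-effort only, so the sketch drops unknown keys entirely.

```typescript
// Hypothetical allow-list: only these diagnostic keys may leave the tool.
const ALLOWED_KEYS = ["repo_path", "remote", "branch", "exit_code"];

// Best-effort value scrubbing. Pattern matching cannot catch every secret,
// which is why unknown keys are dropped instead of passed through.
function scrub(value: string): string {
  return value
    .replace(/(\/Users|\/home)\/[^\/\s]+/g, "$1/[user]") // home directory names
    .replace(/(https?:\/\/)[^\/\s:@]+:[^\/\s@]+@/g, "$1[REDACTED]@") // URL credentials
    .replace(/\b(Bearer|Basic)\s+[A-Za-z0-9._~+\/=-]+/g, "$1 [REDACTED]") // auth headers
    .slice(0, 200); // cap length so payloads stay small
}

// Builds the context map that is allowed to cross the tool boundary.
function sanitizeContext(raw: Record<string, unknown>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const key of Object.keys(raw)) {
    if (ALLOWED_KEYS.indexOf(key) === -1) continue; // drop anything not allow-listed
    clean[key] = scrub(String(raw[key]));
  }
  return clean;
}
```

For example, a raw map containing an api_token key would lose that key altogether, while a repo_path of /Users/alice/work/repo would cross the boundary as /Users/[user]/work/repo.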

Do not mark everything recoverable. Over-classifying failures as recoverable: true hides real defects. If a worker labels a logic bug as transient, the orchestrator may retry until it exhausts budget without surfacing the cause. Capped retry counts and a rule that repeated recoverable failures become fatal are essential guardrails.
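The promotion rule can be sketched as a small failure ledger. The createFailureLedger name, the per-task key format, and the promoted_from and failure_count context keys are assumptions made for illustration; the source only states that repeated recoverable failures should become fatal.

```typescript
type ErrorCategory = "transient" | "resource" | "validation" | "dependency" | "fatal";

// Only the fields the ledger reads or rewrites.
interface TrackedError {
  error_code: string;
  category: ErrorCategory;
  recoverable: boolean;
  suggested_action: string;
  context: Record<string, string>;
}

// Returns a `record` function that counts failures per task and error code.
// Once a recoverable error has been seen more than `cap` times for the same
// task, it is rewritten as fatal so the orchestrator stops retrying.
function createFailureLedger(cap: number) {
  // Prototype-free object so keys never collide with built-in properties.
  const counts: Record<string, number> = Object.create(null);

  return function record(taskId: string, error: TrackedError): TrackedError {
    const key = taskId + ":" + error.error_code;
    const seen = (counts[key] || 0) + 1;
    counts[key] = seen;

    if (!error.recoverable || seen <= cap) return error;

    return {
      ...error,
      category: "fatal",
      recoverable: false,
      suggested_action: "escalate_to_human",
      // Keep the original category so triage can see what was promoted.
      context: { ...error.context, promoted_from: error.category, failure_count: String(seen) },
    };
  };
}
```

Counting per task and error code, instead of per worker, means a logic bug mislabeled as transient is promoted even when each retry lands on a different node.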

Do not let contracts drift. In a 12-node fleet, schema skew is easy: one worker may emit a field the orchestrator does not understand, or an orchestrator may expect a field an older worker does not provide. Version the shared package like application code and stamp each error payload with a schema version so mismatches are detectable.
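One way to make skew detectable is a version comparison at the orchestrator, sketched below. It assumes schema_version follows a major.minor or major.minor.patch convention and that a major mismatch is incompatible; the source only says the field is a string, so both the convention and the compareSchemaVersion name are hypothetical.

```typescript
type SkewStatus = "match" | "minor_skew" | "incompatible";

// Parses "major.minor" or "major.minor.patch"; returns null for anything else.
function parseVersion(version: string): { major: number; minor: number } | null {
  const parts = /^(\d+)\.(\d+)(?:\.\d+)?$/.exec(version);
  if (parts === null) return null;
  return { major: Number(parts[1]), minor: Number(parts[2]) };
}

// Compares the version stamped on a payload with the version the
// orchestrator was built against. Unparseable versions count as incompatible.
function compareSchemaVersion(received: string, expected: string): SkewStatus {
  const r = parseVersion(received);
  const e = parseVersion(expected);
  if (r === null || e === null) return "incompatible";
  if (r.major !== e.major) return "incompatible";
  return r.minor === e.minor ? "match" : "minor_skew";
}
```

An orchestrator might log minor_skew as a warning and refuse to act on incompatible payloads, treating the skew itself as the failure to escalate.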

Keep tools atomic. A tool that performs five unrelated actions is difficult to recover. A tool that performs one clear action can report exactly what failed and what, if anything, completed. Atomic tools make partial_success useful instead of decorative.
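With atomic steps, resuming from partial_success reduces to a small planning function. The sketch below assumes a task is an ordered list of named steps and that completed steps are safe to skip on retry; the remainingSteps name and the step names in the usage note are hypothetical.

```typescript
// Mirrors the partial_success shape from the contract.
interface PartialSuccess {
  completed_steps: string[];
  artifacts: string[];
}

// Returns the steps still to run, starting at the first step in the plan
// that is not recorded as completed. Order is preserved, so a gap in the
// completed list causes everything from the gap onward to run again.
function remainingSteps(plan: string[], partial?: PartialSuccess): string[] {
  if (partial === undefined) return plan.slice();
  let index = 0;
  for (const step of plan) {
    if (partial.completed_steps.indexOf(step) === -1) break;
    index++;
  }
  return plan.slice(index);
}
```

Given a plan of clone, install, edit, test, push and a payload reporting clone, install, and edit as completed, the retry would run only test and push. Resuming at the first gap is the conservative choice: it may redo a step, but it never skips one that did not finish.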

Managed Frameworks and the Next Step

TL;DR: Microsoft Agent Framework 1.0 provides managed orchestration patterns, while a custom contract gives teams tighter control over fleet-specific recovery behavior.

Microsoft Agent Framework 1.0 reached general availability in April 2026 with multi-agent orchestration patterns including fan-in/fan-out and structured handoffs. For teams that need managed orchestration quickly, that kind of framework can reduce the amount of plumbing required to coordinate agents.

The trade-off is control. A custom contract kept in a shared package requires more engineering effort, but it keeps the error schema small, auditable, and shaped to the fleet's actual failure modes.

Concern	Custom shared contract	Managed framework
Error schema	Owned and versioned in-repo	Shaped by framework conventions
Retry/backoff	Explicit policy per category	Implemented through framework orchestration primitives
Escalation	Custom ladder and routing rules	Configured through framework patterns
Observability	Built to match local metrics	Often easier to integrate through built-in hooks
Portability	Lower framework coupling	Higher framework coupling

The next logical step is observability. Once every error carries a stable error_code, category, and schema version, failures can aggregate cleanly. A spike in GIT_PUSH_REJECTED, repeated resource failures on one worker class, or a rising escalation rate becomes visible as an operational signal instead of a late-night log search.
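As a rough illustration of that aggregation, the sketch below counts error records by an arbitrary key and lists the top offenders. The FleetErrorRecord shape is an assumption: error_code and category come from the contract, while the node field is hypothetical and would be attached by the orchestrator, not the worker.

```typescript
// Hypothetical record shape as stored by the orchestrator.
interface FleetErrorRecord {
  error_code: string;
  category: string;
  node: string; // assumed field identifying the worker that reported the error
}

// Counts records grouped by whatever key the caller extracts.
function countBy(
  records: FleetErrorRecord[],
  keyOf: (record: FleetErrorRecord) => string
): Record<string, number> {
  // Prototype-free object so keys never collide with built-in properties.
  const counts: Record<string, number> = Object.create(null);
  for (const record of records) {
    const key = keyOf(record);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Returns the `limit` most frequent keys as [key, count] pairs, highest first.
function topOffenders(counts: Record<string, number>, limit: number): Array<[string, number]> {
  return Object.keys(counts)
    .map((key) => [key, counts[key]] as [string, number])
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```

Grouping the same records by error_code, then by node, is often enough to tell a fleet-wide upstream problem from a single unhealthy worker.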

Frequently Asked Questions

TL;DR: Structured error recovery is about giving the orchestrator enough information to take the safest next action without hiding real failures.

Q: Why return structured errors instead of just throwing exceptions?

Exceptions cross process and network boundaries poorly and usually lack decision metadata. A structured AgentToolError tells the orchestrator what kind of failure happened and what action is safe: retry, back off, re-route, resume, or escalate.

Q: What does partial_success actually buy you?

It records which steps completed before the failure. That lets a retry resume from the failure point instead of re-running expensive completed work, which reduces wasted tool time and model context in a token-metered fleet.

Q: How do you prevent secrets from leaking into error payloads?

Treat errors as untrusted output. Sanitize diagnostic fields before they cross the tool boundary, strip credentials and host-specific paths, and allow only safe artifact references in partial_success.

Q: How is the contract kept in sync across 12 nodes?

The schema lives in a shared package, is versioned like application code, and is stamped into each error payload. The orchestrator can detect version skew instead of silently misinterpreting an outdated worker response.

Q: When should a team use a managed framework instead?

Use a managed framework when the priority is getting multi-agent orchestration running quickly and the framework's conventions fit the use case. A custom contract makes more sense when the fleet needs a tightly controlled recovery policy, bespoke escalation ladder, or minimal framework coupling.

Key Takeaways

TL;DR: Reliable agent fleets need explicit failure contracts, not vague exceptions and hope.

On a distributed fleet, vague errors cause blind retries, dropped work, and alert fatigue.
Structured errors turn failures into machine-actionable orchestration decisions.
The core payload should include error_code, category, recoverable, suggested_action, partial_success, and a schema version.
partial_success enables resume-from-failure behavior and avoids redoing completed work.
Error payloads must be sanitized because they can reach agents, logs, dashboards, and humans.
Retry caps and fatal promotion rules keep recoverable failures from masking real bugs.
Shared contracts should be versioned and deployed like code to prevent drift across nodes.

Conclusion

TL;DR: Agent reliability is built at the tool boundary as much as in the model layer.

A distributed agent fleet fails in distributed ways. The practical answer is not to eliminate failure; it is to make failure legible enough for the orchestrator to act safely.

Structured errors create that legibility. They tell the system what happened, whether recovery is safe, what action to try next, and what work has already completed. As fleets grow from one machine to many, the teams that treat error contracts as first-class engineering artifacts — versioned, sanitized, and observable — will recover faster than teams still relying on raw exceptions and log trawls.