
🤖 Ghostwritten by Claude Opus 4.8 · Fact-checked & edited by GPT 5.5
Part of the "Building the Crew" series
The most useful reliability upgrade to a 12-node agent fleet is not always a smarter model or a faster machine. It is often a stricter failure contract: every tool call must fail in a structured, machine-readable way.
That became urgent as the ESS Mac mini farm moved into live operation with 12 nodes: two orchestrators and 10 workers. Once orchestrator agents began routing autonomous coding tasks to workers around the clock, failures stopped being occasional interruptions. A flaky Git push, a rate-limited API, a missing dependency, and a worker low on disk are all normal distributed-systems events.
The core lesson is simple: on one laptop, a vague error is annoying; across a fleet, a vague error becomes a systemic failure mode. A worker that silently drops a task creates invisible data loss. An orchestrator that retries blindly burns API budget and repeats already-completed work. An agent that escalates every failure trains humans to ignore alerts.
The fix is to make errors first-class orchestration objects: typed payloads that carry enough context for the system to retry, back off, re-route, or escalate without guessing.
TL;DR: Failure modes that look harmless on one machine become invisible, compounding problems when work is fanned out across distributed workers.
On a single development machine, when an agent tool throws or returns null, a human is often watching. Someone sees the stack trace, reruns the command, and moves on. That feedback loop disappears when work is distributed to 10 worker nodes executing in parallel overnight.
Three failure patterns become especially costly: tasks dropped silently, retries issued blindly, and escalations raised for every failure.
Recent agent engineering guidance has converged on the same pattern: structured error returns, atomic tools, and risk-based human-in-the-loop escalation. The fleet context makes the cost of ignoring that pattern immediate.
TL;DR: Structured errors let an orchestrator distinguish retryable transients from real breakage, which directly affects cost, alert quality, and data integrity.
Every agent failure is really a decision request: retry, back off, route elsewhere, resume from a checkpoint, or escalate? A bare exception forces the orchestrator to infer the answer. A structured error supplies it.
That matters because autonomous coding runs are not free. Planning, editing, verification, and review all consume model context and tool time. A retry that restarts from the beginning can waste completed work. A retry that resumes from the last successful step is materially cheaper and faster.
The same structure improves escalation quality. If an orchestrator can identify a rate limit as recoverable, it can wait and retry. If it sees a missing dependency, it can avoid repeating the same doomed run. If a supposedly recoverable error repeats past a cap, it can promote the issue to a higher severity instead of looping forever.
The error payload is not just a log entry. It is a control-plane message.
TL;DR: Worker tools return a shared AgentToolError carrying error_code, recoverable, suggested_action, and partial_success, giving the orchestrator enough information to act.
The contract lives in a shared TypeScript package so all 12 nodes import the same definition. Treating the schema like code (versioned, reviewed, and deployed consistently) is the point. A sanitized version of the core type looks like this:
```typescript
export type ErrorCategory =
  | "transient"   // network blip, rate limit → retry with backoff
  | "resource"    // disk, memory, quota → route to a healthier node
  | "validation"  // bad input → do not retry; fix upstream
  | "dependency"  // missing tool/file/service → escalate to remediate
  | "fatal";      // unknown or unrecoverable → escalate

export interface AgentToolError {
  error_code: string;               // stable, greppable, e.g. "GIT_PUSH_REJECTED"
  category: ErrorCategory;
  recoverable: boolean;             // can the orchestrator act without a human?
  suggested_action: string;         // machine-and-human-readable next step
  partial_success?: {               // what completed before failure
    completed_steps: string[];
    artifacts: string[];            // safe references only, never secrets
  };
  retry_after_ms?: number;          // hint for backoff scheduling
  context: Record<string, string>;  // sanitized diagnostic key/value pairs
  schema_version: string;           // detects contract drift across nodes
}
```

Workers should not throw raw exceptions across the orchestration boundary. They catch, classify, and return an AgentToolError. The partial_success field is especially important because it lets a retry resume from the point of failure instead of redoing completed steps.
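The catch, classify, and return boundary might look roughly like the sketch below. It is illustrative only: `runSteps`, `classify`, the message patterns, the error codes, and the `"1.0.0"` version string are all assumptions made for the example, and the steps are synchronous for brevity where a real worker tool would likely be async.

```typescript
// Trimmed copy of the contract above, repeated so the sketch is self-contained.
type ErrorCategory = "transient" | "resource" | "validation" | "dependency" | "fatal";

interface AgentToolError {
  error_code: string;
  category: ErrorCategory;
  recoverable: boolean;
  suggested_action: string;
  partial_success?: { completed_steps: string[]; artifacts: string[] };
  retry_after_ms?: number;
  context: Record<string, string>;
  schema_version: string;
}

// A tool returns either a value or a structured error; it never throws.
type ToolResult<T> = { ok: true; value: T } | { ok: false; error: AgentToolError };

// Hypothetical classifier. The message patterns and error codes are
// illustrative assumptions, not the fleet's real rules.
function classify(err: unknown, completedSteps: string[]): AgentToolError {
  const message = err instanceof Error ? err.message : String(err);
  const shared = {
    partial_success: { completed_steps: completedSteps.slice(), artifacts: [] },
    context: {},             // attach sanitized diagnostics only
    schema_version: "1.0.0", // assumed version string
  };
  if (/rate limit|ETIMEDOUT|ECONNRESET/i.test(message)) {
    return { ...shared, error_code: "UPSTREAM_TRANSIENT", category: "transient",
      recoverable: true, suggested_action: "retry_with_backoff", retry_after_ms: 5000 };
  }
  if (/ENOSPC|no space left/i.test(message)) {
    return { ...shared, error_code: "DISK_FULL", category: "resource",
      recoverable: true, suggested_action: "reroute_to_healthier_worker" };
  }
  // Anything unrecognized is treated as fatal rather than optimistically retried.
  return { ...shared, error_code: "UNCLASSIFIED", category: "fatal",
    recoverable: false, suggested_action: "escalate_to_human" };
}

interface Step<T> { name: string; run: () => T }

// Runs steps in order and records which ones finished before any failure.
function runSteps<T>(steps: Step<T>[]): ToolResult<T[]> {
  const completed: string[] = [];
  const values: T[] = [];
  try {
    for (const step of steps) {
      values.push(step.run());
      completed.push(step.name);
    }
    return { ok: true, value: values };
  } catch (err) {
    return { ok: false, error: classify(err, completed) };
  }
}
```

Defaulting unknown failures to fatal is a deliberate choice in this sketch: a misclassified bug then surfaces quickly instead of being retried.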
A typical decision tree looks like this:
- transient → retry with retry_after_ms backoff and capped attempts.
- resource → re-route the task to a healthier worker.
- validation or dependency → avoid blind retries; route for triage.
- fatal or repeated failures → escalate through the ladder to a human.

An intermediate remediation agent can handle mechanical but non-trivial follow-up (re-provisioning a dependency, opening a tracking issue, or summarizing a failure cluster) so human attention is reserved for genuinely novel breakage.
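The decision tree above can be sketched as a single routing function. This is a hedged illustration rather than the fleet's actual orchestrator code: `decide`, `NextAction`, `MAX_ATTEMPTS`, and `DEFAULT_BACKOFF_MS` are hypothetical names and values chosen for the example.

```typescript
type ErrorCategory = "transient" | "resource" | "validation" | "dependency" | "fatal";

// Only the fields the routing decision needs; the full contract carries more.
interface RoutableError {
  category: ErrorCategory;
  retry_after_ms?: number;
}

type NextAction =
  | { kind: "retry"; delay_ms: number }
  | { kind: "reroute" }
  | { kind: "triage" }
  | { kind: "escalate" };

const MAX_ATTEMPTS = 3;          // assumed cap on attempts per task
const DEFAULT_BACKOFF_MS = 1000; // assumed fallback when the worker gives no hint

// `attempt` counts attempts already made, including the one that just failed.
function decide(error: RoutableError, attempt: number): NextAction {
  // Repeated failures escalate regardless of category.
  if (attempt >= MAX_ATTEMPTS) return { kind: "escalate" };
  switch (error.category) {
    case "transient":
      return { kind: "retry", delay_ms: error.retry_after_ms ?? DEFAULT_BACKOFF_MS };
    case "resource":
      return { kind: "reroute" };
    case "validation":
    case "dependency":
      return { kind: "triage" };
    case "fatal":
      return { kind: "escalate" };
  }
}
```

Because the switch covers every member of the category union, adding a new category to the type makes the compiler flag this function until a policy is chosen for it.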
TL;DR: Error payloads must be sanitized, retry behavior must be capped, and schema drift across nodes must be treated as an operational risk.
Do not leak through error messages. Stack traces and context maps often contain absolute file paths, environment values, internal URLs, and occasionally credentials. The diagnostic context map should be sanitized before it crosses a tool boundary. An error surfaced to an agent, log store, or dashboard is an untrusted output channel.
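A minimal sanitization pass over the context map could look like the following. The key patterns, the home-directory pattern, and the placeholder strings are assumptions for illustration and are far from exhaustive; for example, this sketch does not catch a secret embedded inside an otherwise innocent-looking value.

```typescript
// Key names that suggest a secret. Illustrative, not exhaustive.
const SENSITIVE_KEY = /token|secret|password|api[_-]?key|authorization/i;
// Absolute home paths such as /Users/alice/... or /home/alice/...
const HOME_PATH = /\/(Users|home)\/[^\/\s]+/g;

// Returns a new map: secret-looking keys are redacted and user names are
// stripped from home-directory paths in the remaining values.
function sanitizeContext(raw: Record<string, string>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const key of Object.keys(raw)) {
    clean[key] = SENSITIVE_KEY.test(key)
      ? "[REDACTED]"
      : raw[key].replace(HOME_PATH, "/$1/[USER]");
  }
  return clean;
}
```

An allowlist of permitted context keys would be a stricter alternative to this denylist approach, at the cost of dropping diagnostics nobody anticipated.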
Do not mark everything recoverable. Over-classifying failures as recoverable: true hides real defects. If a worker labels a logic bug as transient, the orchestrator may retry until it exhausts budget without surfacing the cause. Capped retry counts and a rule that repeated recoverable failures become fatal are essential guardrails.
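The "repeated recoverable failures become fatal" rule can be sketched as a small guard the orchestrator consults before acting on an error. `RepeatGuard`, `MAX_RECOVERABLE_REPEATS`, and the in-memory counter are hypothetical; a real fleet would presumably persist these counts somewhere shared.

```typescript
interface ClassifiedError {
  error_code: string;
  category: "transient" | "resource" | "validation" | "dependency" | "fatal";
  recoverable: boolean;
  suggested_action: string;
}

const MAX_RECOVERABLE_REPEATS = 3; // assumed cap before promotion

// Counts how often each (task, error_code) pair has been reported as recoverable.
class RepeatGuard {
  private counts: Record<string, number> = {};

  // Returns the error unchanged until the cap is hit, then a fatal copy of it.
  check(taskId: string, error: ClassifiedError): ClassifiedError {
    if (!error.recoverable) return error;
    const key = `${taskId}:${error.error_code}`;
    const seen = (this.counts[key] ?? 0) + 1;
    this.counts[key] = seen;
    if (seen < MAX_RECOVERABLE_REPEATS) return error;
    return {
      ...error,
      category: "fatal",
      recoverable: false,
      suggested_action: "escalate_to_human",
    };
  }
}
```

Keeping the original error_code on the promoted error means the escalation still points at the failure that kept recurring.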
Do not let contracts drift. In a 12-node fleet, schema skew is easy: one worker may emit a field the orchestrator does not understand, or an orchestrator may expect a field an older worker does not provide. Version the shared package like application code and stamp each error payload with a schema version so mismatches are detectable.
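One way to make skew detectable is a version check at the orchestrator boundary. The sketch below assumes schema_version is a semver-style `major.minor.patch` string and that only a major-version mismatch is breaking; the source only says the field is a string, so both are assumptions, as are the names `checkSchemaVersion` and `SUPPORTED_MAJOR`.

```typescript
const SUPPORTED_MAJOR = 1; // assumed: this orchestrator was built against 1.x

type VersionCheck =
  | { compatible: true }
  | { compatible: false; reason: string };

// Accepts any payload whose major version matches the orchestrator's.
function checkSchemaVersion(schemaVersion: string): VersionCheck {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(schemaVersion);
  if (match === null) {
    return { compatible: false, reason: `unparseable schema_version "${schemaVersion}"` };
  }
  const major = parseInt(match[1], 10);
  if (major !== SUPPORTED_MAJOR) {
    return {
      compatible: false,
      reason: `worker uses major ${major}, orchestrator supports ${SUPPORTED_MAJOR}`,
    };
  }
  return { compatible: true };
}
```

An incompatible payload is probably best handled as its own escalation rather than parsed on a best-effort basis, since a misread field could trigger the wrong recovery action.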
Keep tools atomic. A tool that performs five unrelated actions is difficult to recover. A tool that performs one clear action can report exactly what failed and what, if anything, completed. Atomic tools make partial_success useful instead of decorative.
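When each step is atomic and named, resuming becomes a simple set difference between the plan and what partial_success reports. The helper below is a sketch under the assumption that step names are unique within a plan and that completed steps are safe to skip; `remainingSteps` is a hypothetical name.

```typescript
interface PartialSuccess {
  completed_steps: string[];
  artifacts: string[];
}

// Returns the planned steps that still need to run, preserving plan order.
function remainingSteps(plan: string[], partial?: PartialSuccess): string[] {
  if (partial === undefined) return plan.slice();
  const completed = partial.completed_steps;
  return plan.filter((step) => completed.indexOf(step) === -1);
}
```

For a plan of clone, edit, test, push where the worker reports clone and edit as completed, the retry would run only test and push.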
TL;DR: Microsoft Agent Framework 1.0 provides managed orchestration patterns, while a custom contract gives teams tighter control over fleet-specific recovery behavior.
Microsoft Agent Framework 1.0 reached general availability in April 2026 with multi-agent orchestration patterns including fan-in/fan-out and structured handoffs. For teams that need managed orchestration quickly, that kind of framework can reduce the amount of plumbing required to coordinate agents.
The trade-off is control. A custom shared/ contract requires more engineering effort, but it keeps the error schema small, auditable, and shaped to the fleet's actual failure modes.
| Concern | Custom shared contract | Managed framework |
|---|---|---|
| Error schema | Owned and versioned in-repo | Shaped by framework conventions |
| Retry/backoff | Explicit policy per category | Implemented through framework orchestration primitives |
| Escalation | Custom ladder and routing rules | Configured through framework patterns |
| Observability | Built to match local metrics | Often easier to integrate through built-in hooks |
| Portability | Lower framework coupling | Higher framework coupling |
The next logical step is observability. Once every error carries a stable error_code, category, and schema version, failures can aggregate cleanly. A spike in GIT_PUSH_REJECTED, repeated resource failures on one worker class, or a rising escalation rate becomes visible as an operational signal instead of a late-night log search.
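As a rough illustration of that aggregation, the sketch below counts error events by an arbitrary key. The `FleetErrorEvent` shape (including the `worker` field) and the `countBy` helper are assumptions for the example; a production setup would more likely emit these as metrics than count them in memory.

```typescript
// Hypothetical flattened event, as it might arrive in a log or metrics pipeline.
interface FleetErrorEvent {
  worker: string;
  error_code: string;
  category: string;
  schema_version: string;
}

// Counts events per key, e.g. per error_code or per worker-and-category pair.
function countBy(
  events: FleetErrorEvent[],
  keyOf: (e: FleetErrorEvent) => string,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of events) {
    const key = keyOf(item);
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

Calling `countBy(events, (e) => e.error_code)` would surface a spike in a single code, while `countBy(events, (e) => e.worker + ":" + e.category)` would highlight one worker accumulating resource failures.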
TL;DR: Structured error recovery is about giving the orchestrator enough information to take the safest next action without hiding real failures.
Exceptions cross process and network boundaries poorly and usually lack decision metadata. A structured AgentToolError tells the orchestrator what kind of failure happened and what action is safe: retry, back off, re-route, resume, or escalate.
The partial_success field records which steps completed before the failure. That lets a retry resume from the failure point instead of re-running expensive completed work, which reduces wasted tool time and model context in a token-metered fleet.
Treat errors as untrusted output. Sanitize diagnostic fields before they cross the tool boundary, strip credentials and host-specific paths, and allow only safe artifact references in partial_success.
The schema lives in a shared package, is versioned like application code, and is stamped into each error payload. The orchestrator can detect version skew instead of silently misinterpreting an outdated worker response.
Use a managed framework when the priority is getting multi-agent orchestration running quickly and the framework's conventions fit the use case. A custom contract makes more sense when the fleet needs a tightly controlled recovery policy, bespoke escalation ladder, or minimal framework coupling.
TL;DR: Reliable agent fleets need explicit failure contracts, not vague exceptions and hope.
- Each error carries error_code, category, recoverable, suggested_action, partial_success, and a schema version.
- partial_success enables resume-from-failure behavior and avoids redoing completed work.

TL;DR: Agent reliability is built at the tool boundary as much as in the model layer.
A distributed agent fleet fails in distributed ways. The practical answer is not to eliminate failure; it is to make failure legible enough for the orchestrator to act safely.
Structured errors create that legibility. They tell the system what happened, whether recovery is safe, what action to try next, and what work has already completed. As fleets grow from one machine to many, the teams that treat error contracts as first-class engineering artifacts (versioned, sanitized, and observable) will recover faster than teams still relying on raw exceptions and log trawls.