
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
If your agent dashboard is green but work is still disappearing, your health model is probably measuring the wrong thing. That was our failure mode: the orchestrator confirmed that agents were alive, but not that they could complete useful work. One agent kept reconnecting and dropping messages. Another silently routed work through a legacy fallback path the dashboard never watched. From the operator's perspective, everything looked healthy. In reality, the platform had lost a trustworthy source of truth.
This is the incident pattern that forced us to stop adding agents and restart the entire platform around a single control plane, explicit degraded states, and end-to-end validation. If you're running an agent fleet—whether it's three workers or thirty—the core lesson is simple: liveness is not health, and silent fallbacks are not resilience.
TL;DR: Our health checks proved agents were running, not that they were doing useful work, which created false confidence and hid cascading failures.
The original health model was simple: the orchestrator pinged each agent on a schedule, the agent responded with a heartbeat, and the orchestrator marked it healthy. That's a liveness check. In distributed systems, liveness only tells you that a process can answer. It does not tell you whether the process can accept work, reach dependencies, or complete a task successfully.
Here's what our health check actually verified:
```python
# What we had — a liveness check masquerading as health
async def health_check(agent_name: str) -> HealthStatus:
    try:
        response = await agent_client.ping(agent_name, timeout=5)
        return HealthStatus.HEALTHY  # Process responded. Ship it.
    except TimeoutError:
        return HealthStatus.UNHEALTHY
```

And here's what was happening beneath that green status:
| Agent | Orchestrator Status | Actual Behavior |
|---|---|---|
| Sparkles | ✅ Healthy | Falling back to a file inbox, bypassing the control plane |
| Soundwave | ✅ Healthy | Reconnecting to an email provider every 60–90 seconds, dropping messages |
| Concierge | ✅ Healthy | Responding to pings but failing on downstream API calls |
| Harvest | ✅ Healthy | Running, but sending errors to an observability path operators were not actively using |
Kubernetes documentation has long distinguished liveness probes from readiness probes for exactly this reason: a service can be alive but unable to serve traffic. We made that category error across the fleet.
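To make the distinction concrete, here is a minimal readiness-style check in Python. This is an illustrative sketch, not our orchestrator's actual code: `StubAgent` and the `check_*` method names are invented. The point is that an agent that answers pings but cannot reach a downstream dependency reports as degraded, not healthy:

```python
import asyncio
from enum import Enum

class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class StubAgent:
    """Stand-in agent: answers pings, but a downstream dependency is broken."""
    async def ping(self) -> bool:
        return True
    async def check_control_plane(self) -> bool:
        return True
    async def check_downstream_apis(self) -> bool:
        return False  # e.g. Concierge: pings fine, downstream calls fail

async def readiness_check(agent) -> HealthStatus:
    # 1. Liveness: the old bar — can the process answer at all?
    try:
        await asyncio.wait_for(agent.ping(), timeout=5)
    except (asyncio.TimeoutError, ConnectionError):
        return HealthStatus.UNHEALTHY
    # 2. Readiness: are dependencies reachable so work can actually complete?
    checks = await asyncio.gather(
        agent.check_control_plane(),
        agent.check_downstream_apis(),
        return_exceptions=True,
    )
    if any(isinstance(c, Exception) or c is False for c in checks):
        return HealthStatus.DEGRADED  # alive, but cannot do useful work
    return HealthStatus.HEALTHY

status = asyncio.run(readiness_check(StubAgent()))
print(status.value)  # a liveness-only check would have reported "healthy"
```

A liveness-only check stops after step 1; the readiness check fails loudly at step 2, which is exactly the failure our fleet was hiding.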
The worst case was Sparkles. When the database-backed control plane was slow or unreachable, the shared runtime silently fell back to a legacy file-based inbox. That meant commands issued through Slack could follow two different execution paths depending on transient database conditions.
This is a split-brain pattern in the practical sense: two paths can accept work as if each were authoritative, while the operator sees only one of them clearly. The dashboard stayed green, but commands were diverging into systems with different monitoring, retry behavior, and auditability.
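The remedy can be sketched in a few lines. This is a conceptual illustration with invented names, not the real ESS runtime: when the control plane is unreachable, the inbox refuses work and declares degraded mode instead of quietly switching to a second path.

```python
class ControlPlaneUnavailable(Exception):
    """Raised instead of silently routing work to a legacy file inbox."""

class Inbox:
    """Sketch of a single-path inbox: one authoritative route, no fallback."""
    def __init__(self, control_plane_up: bool):
        self.control_plane_up = control_plane_up
        self.degraded = False
        self.queued = []

    def submit(self, command: str) -> str:
        if not self.control_plane_up:
            # Old behavior: quietly write to the file inbox (split-brain).
            # New behavior: declare degraded mode and surface it to operators.
            self.degraded = True
            raise ControlPlaneUnavailable(
                f"refusing to accept {command!r} outside the control plane"
            )
        self.queued.append(command)
        return "accepted"
```

Refusing work sounds worse than degrading gracefully, but a loud refusal keeps one source of truth; a silent fallback creates two.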
I've written more about how this twelve-repo sprawl collapsed into one, but the health-reporting lie was the trigger. You can tolerate a surprising amount of architectural debt. You cannot tolerate a system that misleads its operator.
TL;DR: Agent frameworks help with tool use and orchestration inside an agent, but they do not replace a reliability-focused control plane.
When I audited the framework landscape in March 2026, I hoped an off-the-shelf solution had already solved this class of problem. Modern agent SDKs are getting better at tool integration, session handling, handoffs, and guardrails.
But those are capability features. Our failure was a reliability failure.
| Concern | Typical agent SDK support | ESS platform direction |
|---|---|---|
| Tool orchestration | ✅ Commonly supported | Uses SDKs where helpful |
| Session state | ✅ Often supported | Durable, platform-owned state |
| Health vs. readiness distinction | ❌ Usually left to the application | Synthetic probes + readiness checks |
| Silent degradation detection | ❌ Usually left to the application | Explicit degraded-state reporting |
| Split-brain prevention | ❌ Architectural concern outside the SDK | Single authoritative control plane |
| Dead-letter handling | ❌ Usually custom | Dead-letter queue with alerting |
| Operator audit trail | ❌ Usually custom | End-to-end action lineage |
As I wrote in the framework debates piece, frameworks are solving real problems. They just are not solving this one. The hard question is not "How do I give an agent tools?" It's "How do I know the agent can still do its job, and what happens when it can't?"
TL;DR: We restarted in one monorepo so the platform kernel—control plane, contracts, probes, and operator visibility—could stabilize before more agents were added.
The restart follows one rule: the platform kernel earns the right to host agents, not the other way around.
The old approach was to build an interesting agent, wire it into Slack, add some logs, and assume the orchestrator would catch failures. That produced twelve repos, overlapping responsibilities, and a system where adding agent number seven made agents one through six less reliable.
The new project is a single monorepo, ess-agent-platform, and it stays unified until four conditions are met:
The platform kernel—what ships before any business-specific agent—includes:
```python
# Platform kernel components (must be stable before agent migration)
PLATFORM_KERNEL = {
    "task_event_api": "Ingest, queue, and route work items",
    "run_tracking": "Every execution has a run ID, status, and lineage",
    "heartbeat_system": "Readiness probes, not just liveness pings",
    "dead_letter_queue": "Failed items go somewhere visible, not /dev/null",
    "retry_model": "Explicit retry with backoff, not silent re-execution",
    "alert_model": "Degraded mode is declared, not inferred after the fact",
    "synthetic_probes": "Known test work items validate end-to-end paths",
    "operator_audit_trail": "Every operator action is logged and queryable",
    "typed_worker_contract": "Input schema, output schema, no exceptions",
}
```

Synthetic probes are the key upgrade. Instead of asking, "Are you alive?" they send a known test payload through the full execution path and verify the result. If the probe fails, the agent is marked degraded even if it still answers pings.
That approach aligns with standard reliability practice: black-box checks often catch failures that process-level checks miss because they test the user-visible path, not just the process boundary.
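A synthetic probe can be this small. The sketch below uses invented names (run_synthetic_probe, EchoWorker) and assumes a worker whose execution path carries a probe ID end to end; it compares the user-visible output, not just whether the process answered:

```python
import uuid

def run_synthetic_probe(worker, timeout_s: float = 30.0) -> str:
    """Send a known test work item through the full path and verify the result."""
    probe_id = str(uuid.uuid4())
    expected = {"probe_id": probe_id, "status": "done"}
    try:
        result = worker.execute(
            {"probe_id": probe_id, "kind": "synthetic"}, timeout_s=timeout_s
        )
    except Exception:
        return "degraded"  # the path failed outright
    # Verify the user-visible output, not just "the process answered".
    return "healthy" if result == expected else "degraded"

class EchoWorker:
    """Stand-in worker whose execution path works end to end."""
    def execute(self, item, timeout_s):
        return {"probe_id": item["probe_id"], "status": "done"}

class BrokenWorker:
    """Stand-in worker that still runs but cannot complete real work."""
    def execute(self, item, timeout_s):
        raise RuntimeError("dependency unreachable")

print(run_synthetic_probe(EchoWorker()))   # → healthy
print(run_synthetic_probe(BrokenWorker())) # → degraded
```

Note that BrokenWorker would pass a heartbeat check; only the probe through the real execution path exposes it.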
TL;DR: We replaced chat-thread memory with version-controlled markdown so decisions, failure modes, and handoffs survive beyond one person's context window.
The other lesson from this failure was about memory. Not agent memory—platform memory. The decisions that led to the fallback path lived in chat threads. The reasoning behind the original health model lived in someone's head. The workarounds were tribal knowledge.
So we adopted a file-based operating model with four rules:
- 02-CURRENT-STATE.md is a living document; entries/2026-03-14-02-silent-fallbacks-created-split-brain.md is a dated record.
- The repo is ess-agent-platform, not a codename. Logs say email-triage-worker, not Soundwave. Human conversation can stay informal; production systems should not.

This sounds like overhead until you need to reconstruct a failure. The old system failed partly because nobody could quickly explain why the file fallback existed. Now that reasoning belongs in a file, with a date, a decision, and an owner.
Agent SDKs are useful for tool use, conversation state, and handoffs inside an agent. They usually do not provide the operational layer needed to run a fleet: readiness checks, degraded-state signaling, dead-letter handling, audit trails, and a single authoritative control plane. If your problem is reliability across many workers, you still need platform engineering.
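As one concrete piece of that operational layer, here is a hedged sketch of retry-with-backoff feeding a dead-letter queue. The names are invented for illustration and do not correspond to any SDK's API:

```python
import time

def process_with_retries(item, handler, dead_letters: list,
                         max_attempts: int = 3, base_delay_s: float = 0.01):
    """Retry with exponential backoff; route terminal failures to a visible DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(item)
        except Exception as exc:
            if attempt == max_attempts:
                # Failed items go somewhere visible, not /dev/null.
                dead_letters.append({
                    "item": item,
                    "error": repr(exc),
                    "attempts": attempt,
                })
                return None
            time.sleep(base_delay_s * 2 ** (attempt - 1))

dead_letters = []
print(process_with_retries("item-1", lambda _: "ok", dead_letters))  # → ok
```

The dead-letter list here stands in for whatever durable, alerting-backed store a real platform would use; the point is that failure is recorded and queryable, never swallowed.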
In this context, split-brain means work can be accepted or routed through more than one path that behaves like a source of truth. That creates inconsistent monitoring, retries, and auditability. Even if both paths are technically functioning, operators lose confidence because they cannot tell which path handled which request.
A monorepo does not magically make software reliable, but it does force shared contracts, runtime changes, and control-plane updates to evolve together. That reduces version drift and makes integration testing easier. For a platform still defining its boundaries, that discipline matters more than repo purity.
A ping only proves that a process can answer. A synthetic probe validates the full path by submitting a known input and checking whether the expected output appears in the right place. That means it can catch dependency failures, queueing issues, schema mismatches, and storage problems that a heartbeat will never see.
It turns operational memory into something durable, searchable, and reviewable. That matters during incidents, handoffs, and rebuilds. Good file-based documentation also creates accountability: you can see when a decision was made, why it was made, and whether the assumptions behind it still hold.
A green dashboard is only useful if it tells the truth. Ours didn't, and that forced a rebuild around a stricter idea of health: one control plane, explicit degraded states, typed contracts, and end-to-end probes that test real work.
If you're seeing suspiciously healthy agents in a system that still drops tasks, start by tracing the middle of the pipeline. That's where the lies usually live. And if you're planning a similar rebuild, ESS can help you design the control plane, reliability model, and operating practices that make agent fleets trustworthy at production scale.