
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
If your agent dashboard is green but work is still disappearing, your health model is probably measuring the wrong thing. That was our failure mode: the orchestrator confirmed that agents were alive, but not that they could complete useful work. One agent kept reconnecting and dropping messages. Another silently routed work through a legacy fallback path the dashboard never watched. From the operator's perspective, everything looked healthy. In reality, the platform had lost a trustworthy source of truth.
This is the incident pattern that forced us to stop adding agents and restart the entire platform around a single control plane, explicit degraded states, and end-to-end validation. If you're running an agent fleet—whether it's three workers or thirty—the core lesson is simple: liveness is not health, and silent fallbacks are not resilience.
TL;DR: Our health checks proved agents were running, not that they were doing useful work, which created false confidence and hid cascading failures.
The original health model was simple: the orchestrator pinged each agent on a schedule, the agent responded with a heartbeat, and the orchestrator marked it healthy. That's a liveness check. In distributed systems, liveness only tells you that a process can answer. It does not tell you whether the process can accept work, reach dependencies, or complete a task successfully.
Here's what our health check actually verified:
```python
# What we had — a liveness check masquerading as health
async def health_check(agent_name: str) -> HealthStatus:
    try:
        response = await agent_client.ping(agent_name, timeout=5)
        return HealthStatus.HEALTHY  # Process responded. Ship it.
    except TimeoutError:
        return HealthStatus.UNHEALTHY
```

And here's what was happening beneath that green status:
| Agent | Orchestrator Status | Actual Behavior |
|---|---|---|
| Sparkles | ✅ Healthy | Falling back to a file inbox, bypassing the control plane |
| Soundwave | ✅ Healthy | Reconnecting to an email provider every 60–90 seconds, dropping messages |
| Concierge | ✅ Healthy | Responding to pings but failing on downstream API calls |
| Harvest | ✅ Healthy | Running, but sending errors to an observability path operators were not actively using |
Kubernetes documentation has long distinguished liveness probes from readiness probes for exactly this reason: a service can be alive but unable to serve traffic. We made that category error across the fleet.
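To make the distinction concrete, here is a minimal readiness-style check in Python. This is an illustrative sketch, not our orchestrator's actual code: `StubAgent` and the `check_*` method names are invented. The point is that an agent that answers pings but cannot reach a downstream dependency reports as degraded, not healthy:

```python
import asyncio
from enum import Enum

class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class StubAgent:
    """Stand-in agent: answers pings, but a downstream dependency is broken."""
    async def ping(self) -> bool:
        return True
    async def check_control_plane(self) -> bool:
        return True
    async def check_downstream_apis(self) -> bool:
        return False  # e.g. Concierge: pings fine, downstream calls fail

async def readiness_check(agent) -> HealthStatus:
    # 1. Liveness: the old bar — can the process answer at all?
    try:
        await asyncio.wait_for(agent.ping(), timeout=5)
    except (asyncio.TimeoutError, ConnectionError):
        return HealthStatus.UNHEALTHY
    # 2. Readiness: are dependencies reachable so work can actually complete?
    checks = await asyncio.gather(
        agent.check_control_plane(),
        agent.check_downstream_apis(),
        return_exceptions=True,
    )
    if any(isinstance(c, Exception) or c is False for c in checks):
        return HealthStatus.DEGRADED  # alive, but cannot do useful work
    return HealthStatus.HEALTHY

status = asyncio.run(readiness_check(StubAgent()))
print(status.value)  # a liveness-only check would have reported "healthy"
```

A liveness-only check stops after step 1; the readiness check fails loudly at step 2, which is exactly the failure our fleet was hiding.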
The worst case was Sparkles. When the database-backed control plane was slow or unreachable, the shared runtime silently fell back to a legacy file-based inbox. That meant commands issued through Slack could follow two different execution paths depending on transient database conditions.
This is a split-brain pattern in the practical sense: two paths can accept work as if each were authoritative, while the operator sees only one of them clearly. The dashboard stayed green, but commands were diverging into systems with different monitoring, retry behavior, and auditability.
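The remedy can be sketched in a few lines. This is a conceptual illustration with invented names, not the real ESS runtime: when the control plane is unreachable, the inbox refuses work and declares degraded mode instead of quietly switching to a second path.

```python
class ControlPlaneUnavailable(Exception):
    """Raised instead of silently routing work to a legacy file inbox."""

class Inbox:
    """Sketch of a single-path inbox: one authoritative route, no fallback."""
    def __init__(self, control_plane_up: bool):
        self.control_plane_up = control_plane_up
        self.degraded = False
        self.queued = []

    def submit(self, command: str) -> str:
        if not self.control_plane_up:
            # Old behavior: quietly write to the file inbox (split-brain).
            # New behavior: declare degraded mode and surface it to operators.
            self.degraded = True
            raise ControlPlaneUnavailable(
                f"refusing to accept {command!r} outside the control plane"
            )
        self.queued.append(command)
        return "accepted"
```

Refusing work sounds worse than degrading gracefully, but a loud refusal keeps one source of truth; a silent fallback creates two.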
I've written more about how this twelve-repo sprawl collapsed into one, but the health-reporting lie was the trigger. You can tolerate a surprising amount of architectural debt. You cannot tolerate a system that misleads its operator.
TL;DR: Agent frameworks help with tool use and orchestration inside an agent, but they do not replace a reliability-focused control plane.
When I audited the framework landscape in March 2026, I hoped an off-the-shelf solution had already solved this class of problem. Modern agent SDKs are getting better at tool integration, session handling, handoffs, and guardrails.
But those are capability features. Our failure was a reliability failure.
| Concern | Typical agent SDK support | ESS platform direction |
|---|---|---|
| Tool orchestration | ✅ Commonly supported | Uses SDKs where helpful |
| Session state | ✅ Often supported | Durable, platform-owned state |
| Health vs. readiness distinction | ❌ Usually left to the application | Synthetic probes + readiness checks |
| Silent degradation detection | ❌ Usually left to the application | Explicit degraded-state reporting |
| Split-brain prevention | ❌ Architectural concern outside the SDK | Single authoritative control plane |
| Dead-letter handling | ❌ Usually custom | Dead-letter queue with alerting |
| Operator audit trail | ❌ Usually custom | End-to-end action lineage |
As I wrote in the framework debates piece, frameworks are solving real problems. They just are not solving this one. The hard question is not "How do I give an agent tools?" It's "How do I know the agent can still do its job, and what happens when it can't?"
TL;DR: We restarted in one monorepo so the platform kernel—control plane, contracts, probes, and operator visibility—could stabilize before more agents were added.
The restart follows one rule: the platform kernel earns the right to host agents, not the other way around.
The old approach was to build an interesting agent, wire it into Slack, add some logs, and assume the orchestrator would catch failures. That produced twelve repos, overlapping responsibilities, and a system where adding agent number seven made agents one through six less reliable.
The new project is a single monorepo, ess-agent-platform, and it stays unified until four conditions are met:
The platform kernel—what ships before any business-specific agent—includes:
```python
# Platform kernel components (must be stable before agent migration)
PLATFORM_KERNEL = {
    "task_event_api": "Ingest, queue, and route work items",
    "run_tracking": "Every execution has a run ID, status, and lineage",
    "heartbeat_system": "Readiness probes, not just liveness pings",
    "dead_letter_queue": "Failed items go somewhere visible, not /dev/null",
    "retry_model": "Explicit retry with backoff, not silent re-execution",
    "alert_model": "Degraded mode is declared, not inferred after the fact",
    "synthetic_probes": "Known test work items validate end-to-end paths",
    "operator_audit_trail": "Every operator action is logged and queryable",
    "typed_worker_contract": "Input schema, output schema, no exceptions",
}
```

Synthetic probes are the key upgrade. Instead of asking, "Are you alive?" they send a known test payload through the full execution path and verify the result. If the probe fails, the agent is marked degraded even if it still answers pings.
That approach aligns with standard reliability practice: black-box checks often catch failures that process-level checks miss because they test the user-visible path, not just the process boundary.
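A synthetic probe can be this small. The sketch below uses invented names (run_synthetic_probe, EchoWorker) and assumes a worker whose execution path carries a probe ID end to end; it compares the user-visible output, not just whether the process answered:

```python
import uuid

def run_synthetic_probe(worker, timeout_s: float = 30.0) -> str:
    """Send a known test work item through the full path and verify the result."""
    probe_id = str(uuid.uuid4())
    expected = {"probe_id": probe_id, "status": "done"}
    try:
        result = worker.execute(
            {"probe_id": probe_id, "kind": "synthetic"}, timeout_s=timeout_s
        )
    except Exception:
        return "degraded"  # the path failed outright
    # Verify the user-visible output, not just "the process answered".
    return "healthy" if result == expected else "degraded"

class EchoWorker:
    """Stand-in worker whose execution path works end to end."""
    def execute(self, item, timeout_s):
        return {"probe_id": item["probe_id"], "status": "done"}

class BrokenWorker:
    """Stand-in worker that still runs but cannot complete real work."""
    def execute(self, item, timeout_s):
        raise RuntimeError("dependency unreachable")

print(run_synthetic_probe(EchoWorker()))   # → healthy
print(run_synthetic_probe(BrokenWorker())) # → degraded
```

Note that BrokenWorker would pass a heartbeat check; only the probe through the real execution path exposes it.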
TL;DR: We replaced chat-thread memory with version-controlled markdown so decisions, failure modes, and handoffs survive beyond one person's context window.
The other lesson from this failure was about memory. Not agent memory—platform memory. The decisions that led to the fallback path lived in chat threads. The reasoning behind the original health model lived in someone's head. The workarounds were tribal knowledge.
So we adopted a file-based operating model with four rules:
- 02-CURRENT-STATE.md is a living document; entries/2026-03-14-02-silent-fallbacks-created-split-brain.md is a dated record.
- The repo is ess-agent-platform, not a codename. Logs say email-triage-worker, not Soundwave. Human conversation can stay informal; production systems should not.

This sounds like overhead until you need to reconstruct a failure. The old system failed partly because nobody could quickly explain why the file fallback existed. Now that reasoning belongs in a file, with a date, a decision, and an owner.
Agent SDKs are useful for tool use, conversation state, and handoffs inside an agent. They usually do not provide the operational layer needed to run a fleet: readiness checks, degraded-state signaling, dead-letter handling, audit trails, and a single authoritative control plane. If your problem is reliability across many workers, you still need platform engineering.
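As one concrete piece of that operational layer, here is a hedged sketch of retry-with-backoff feeding a dead-letter queue. The names are invented for illustration and do not correspond to any SDK's API:

```python
import time

def process_with_retries(item, handler, dead_letters: list,
                         max_attempts: int = 3, base_delay_s: float = 0.01):
    """Retry with exponential backoff; route terminal failures to a visible DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(item)
        except Exception as exc:
            if attempt == max_attempts:
                # Failed items go somewhere visible, not /dev/null.
                dead_letters.append({
                    "item": item,
                    "error": repr(exc),
                    "attempts": attempt,
                })
                return None
            time.sleep(base_delay_s * 2 ** (attempt - 1))

dead_letters = []
print(process_with_retries("item-1", lambda _: "ok", dead_letters))  # → ok
```

The dead-letter list here stands in for whatever durable, alerting-backed store a real platform would use; the point is that failure is recorded and queryable, never swallowed.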
In this context, split-brain means work can be accepted or routed through more than one path that behaves like a source of truth. That creates inconsistent monitoring, retries, and auditability. Even if both paths are technically functioning, operators lose confidence because they cannot tell which path handled which request.
A monorepo does not magically make software reliable, but it does force shared contracts, runtime changes, and control-plane updates to evolve together. That reduces version drift and makes integration testing easier. For a platform still defining its boundaries, that discipline matters more than repo purity.
A ping only proves that a process can answer. A synthetic probe validates the full path by submitting a known input and checking whether the expected output appears in the right place. That means it can catch dependency failures, queueing issues, schema mismatches, and storage problems that a heartbeat will never see.
It turns operational memory into something durable, searchable, and reviewable. That matters during incidents, handoffs, and rebuilds. Good file-based documentation also creates accountability: you can see when a decision was made, why it was made, and whether the assumptions behind it still hold.
A green dashboard is only useful if it tells the truth. Ours didn't, and that forced a rebuild around a stricter idea of health: one control plane, explicit degraded states, typed contracts, and end-to-end probes that test real work.
If you're seeing suspiciously healthy agents in a system that still drops tasks, start by tracing the middle of the pipeline. That's where the lies usually live. And if you're planning a similar rebuild, ESS can help you design the control plane, reliability model, and operating practices that make agent fleets trustworthy at production scale.