
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
We paused agent development because the platform underneath our agents was giving us false confidence. Processes looked healthy while work stalled, fallback paths created conflicting sources of truth, and observability failures disappeared quietly instead of triggering alerts. In practice, “working” had come to mean “not obviously broken.”
So we stopped adding agents and restarted the platform instead. The new direction is simple: one monorepo, one authoritative control plane, one worker contract, and one reliability model. Agents only return after the foundation can prove it is healthy.
This post explains why we made that call, what the replacement architecture looks like, and how a file-based operating model is keeping the rebuild honest. If you run your own agent fleet and everything seems fine, this is a useful gut-check.
TL;DR: Our fleet had five structural failures: split-brain control, shallow health checks, silent degradation, an operator surface mistaken for a control plane, and too many repos to manage safely.
I documented this in our internal crew-building journal on March 14, 2026, after discovering that Sparkles, our Slack bot, had been routing work to an agent stuck in a reconnect loop for days. The orchestrator still reported the agent as healthy because the reconnect cycle itself counted as activity.
Here are the five findings that made me stop building features and start rebuilding infrastructure:
**Finding 1: Split-brain control.** Our shared runtime had a fallback: if it could not reach the database-backed control plane, it switched to a legacy file inbox. That created two sources of truth. A task could be queued in the database, picked up through the file fallback, and completed without the primary system ever reflecting the result. Instead of a clean incident boundary — “the control plane is down, stop accepting work” — we got a ghost system that looked functional while losing state.
**Finding 2: Shallow health checks.** The orchestrator's health checks mostly answered “can I reach this process?” That is liveness, not service health. A process can respond to pings and still be unable to do useful work. Google's SRE guidance distinguishes shallow liveness checks from checks that verify real readiness or end-to-end capability. Our checks were closer to “returns 200” than “can accept, process, and complete work.”
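To make the distinction concrete, here is a minimal sketch, not our actual orchestrator code, contrasting a liveness check with a readiness check that asks whether the worker has recently finished anything; the `Worker` fields and the 300-second idle threshold are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that tracks real progress, not just uptime."""
    last_completed_at: float = field(default_factory=time.monotonic)
    queue_depth: int = 0

    def liveness(self) -> bool:
        # Shallow check: the process is up and can answer a ping.
        # This stays True during a reconnect loop that never completes work.
        return True

    def readiness(self, max_idle_s: float = 300.0) -> bool:
        # Deeper check: has this worker completed work recently, or is its
        # backlog empty? A reconnect-looping worker with queued tasks fails.
        idle = time.monotonic() - self.last_completed_at
        return idle < max_idle_s or self.queue_depth == 0
```

Under this sketch, the stuck agent Sparkles kept routing to would have passed `liveness()` and failed `readiness()`, which is exactly the signal the old checks never surfaced.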
**Finding 3: Silent degradation.** Sentry integration, metrics publishing, and event broadcasting all had error handling that swallowed failures and continued. The intent was resilience. The result was blindness. When your observability pipeline fails silently, you no longer have observability.
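As a sketch of the fix, here is an illustrative pattern (the `MetricsPipeline` class and its threshold are assumptions, not our real Sentry or metrics code) in which a publish failure is logged loudly and escalates after repeated failures instead of being swallowed:

```python
import logging

logger = logging.getLogger("metrics")


class MetricsPipeline:
    """Sketch: observability failures stay visible instead of vanishing."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def publish(self, emit, name: str, value: float) -> bool:
        try:
            emit(name, value)
        except Exception:
            # Old pattern: `except Exception: pass` -- silent blindness.
            # New pattern: log with traceback, then escalate if it persists.
            self.consecutive_failures += 1
            logger.exception("metric publish failed: %s", name)
            if self.consecutive_failures >= self.max_consecutive_failures:
                raise RuntimeError("observability pipeline is down")
            return False
        self.consecutive_failures = 0
        return True
```

The point is the shape, not the details: a transient failure is tolerated and visible, and a persistent one becomes an incident rather than quiet blindness.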
**Finding 4: An operator surface mistaken for a control plane.** I had been treating Sparkles as the operator surface, and for intake and routing it worked. But it depended on local subprocess invocation and machine-specific state. If the host machine rebooted, session context vanished. It was a front door attached to a house without a frame. As I argued in our earlier comparison of DIY agent fleets versus 2026 frameworks, the gap between a working demo and a dependable platform is usually control-plane integrity.
**Finding 5: Too many repos.** We had separate repos for Sparkles, Concierge, Soundwave, Harvest, the orchestrator, the shared runtime, and more. Each repo had its own conventions, test posture, and deployment story. The highest-blast-radius repos — the shared runtime and orchestrator — had the least test coverage. Lower-risk repos had more. That is backwards.
TL;DR: We are consolidating into a single ess-agent-platform monorepo and rebuilding the platform kernel before migrating any specialty agent.
The broader market narrative still pushes toward “more autonomous agents, faster.” That makes it easy to skip the boring parts: governance, observability, contracts, and failure handling. We learned the hard way that those are the parts that determine whether an agent system can survive production.
Our restart plan is deliberately boring.
The monorepo stays a monorepo until three conditions are met: the control plane is stable, the worker contract is stable, and the first three production agent migrations are complete. Only then do we consider splitting repositories. The current sprawl taught us that multiple repos are a privilege you earn, not a default.
The worker contract is the key constraint. Every agent — whether it handles email triage, bookkeeping, or Slack messages — must implement the same interface:
```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class WorkerResult:
    status: Literal["success", "failure", "needs_human"]
    output: dict
    trace_id: str
    execution_ms: int
    retry_eligible: bool
```

No silent fallbacks. No swallowed exceptions. If an agent cannot produce a typed result, it produces a typed failure. The dead-letter queue catches everything else.
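As a sketch of how the contract can be enforced, a thin wrapper can guarantee that every handler exits through a typed result; `run_worker` and its retry rule are illustrative assumptions rather than our production code, and `WorkerResult` is repeated so the snippet runs standalone:

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass
class WorkerResult:
    # Repeated here so the sketch is self-contained.
    status: Literal["success", "failure", "needs_human"]
    output: dict
    trace_id: str
    execution_ms: int
    retry_eligible: bool


def run_worker(handler: Callable[[dict], dict], task: dict) -> WorkerResult:
    """Wrap any handler so it can only exit through the typed contract."""
    trace_id = task.get("trace_id", str(uuid.uuid4()))
    start = time.monotonic()
    try:
        output = handler(task)
    except Exception as exc:
        # An exception becomes a typed failure -- nothing is swallowed.
        return WorkerResult(
            status="failure",
            output={"error": type(exc).__name__, "detail": str(exc)},
            trace_id=trace_id,
            execution_ms=int((time.monotonic() - start) * 1000),
            # Assumed policy for illustration: bad input is not retried.
            retry_eligible=not isinstance(exc, ValueError),
        )
    return WorkerResult("success", output, trace_id,
                        int((time.monotonic() - start) * 1000), True)
```

Anything that still slips through, such as a worker that crashes before returning, is the dead-letter queue's job.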
TL;DR: We replaced chat-thread memory and tribal knowledge with version-controlled markdown files as the source of truth for decisions, status, and handoffs.
This is the practice that changed my day-to-day work the most. The old pattern was simple: discuss a decision in Slack, make the call, implement it, and move on. Two weeks later, neither I nor any agent could reconstruct why we made that choice. The context lived in a thread that was effectively write-only.
The new rule is simple: files are the memory system. Every decision, lesson, handoff, and status update gets written to a tracked file under version control. We separate stable truth, such as CURRENT_STATUS.md and AGENT_ROSTER.md, from temporal truth, such as dated journal entries and incident notes.
The practical impact:
| Old Pattern | New Pattern |
|---|---|
| Decision context in Slack threads | Decision context in ADR files under /docs/adrs/ |
| Status in one engineer's head | Status in CURRENT_STATUS.md, updated every session |
| Handoffs via “let me explain what I was doing” | Handoffs via session notes with breadcrumbs |
| Agent memory in chat-only context windows | Agent memory in file-based project state |
| Business logic mixed with codenames | Business names in code, codenames for humans only |
That last row matters more than it sounds. We use Transformer-themed codenames for the crew, but in code, file paths, logs, and APIs, everything uses the business name. `ess-agents-slack-sparkles`, not a codename. When you are debugging late at night, clarity beats personality.
If you are running agents on Mac mini hardware, this file-based approach has another benefit: state survives reboots, hardware swaps, and future hardware upgrades without a migration project.
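As a sketch of the mechanics, a session-note helper might append dated markdown under the repo so both humans and agents can write breadcrumbs; the `docs/journal` layout and file naming here are illustrative assumptions, not our exact structure:

```python
from datetime import date
from pathlib import Path


def write_session_note(repo_root: Path, summary: str,
                       breadcrumbs: list[str]) -> Path:
    """Append a dated, version-controllable session note (illustrative layout)."""
    notes_dir = repo_root / "docs" / "journal"
    notes_dir.mkdir(parents=True, exist_ok=True)
    note = notes_dir / f"{date.today().isoformat()}-session.md"
    lines = [f"## Session summary\n\n{summary}\n", "\n## Breadcrumbs\n"]
    lines += [f"- {b}\n" for b in breadcrumbs]
    # Append-only: repeated sessions on the same day stack in one file,
    # and Git history preserves who changed what and when.
    with note.open("a", encoding="utf-8") as f:
        f.writelines(lines)
    return note
```

Because the output is plain markdown in the repo, the same note is readable by the next engineer, the next agent session, and `git log`.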
TL;DR: Building specialty agents before the platform kernel is like hiring employees before you have payroll, HR, or an office: they may do useful work, but you cannot manage them reliably.
The temptation is always to build the interesting thing first: a payroll agent that reads invoices, an email triage system that drafts responses, a Slack bot that orchestrates code reviews. Every one of those sounds more exciting than implementing a dead-letter queue or synthetic health probes.
But here is what we learned: every specialty agent we built before the platform was solid became a maintenance liability that hid failures. The agent itself might work in isolation. The problem is everything around it: how it receives work, how it reports status, how it handles partial failures, and how it hands off to a human when it gets stuck.
Our rebuild order is explicit:
Only after those nine are stable do we migrate the first agent. Soundwave, our email agent, is the likely first candidate because it has a relatively clear input-output contract and lower coordination complexity.
A monorepo forces shared standards. When the worker contract, health checks, and control-plane APIs live in the same repo, every agent uses the same versions and conventions. Separate repos make drift easier: different test patterns, different error handling, different deployment assumptions. You earn the right to split repos only after the shared contracts are stable.
Split-brain happens when two subsystems both act as the authoritative source of truth. In our case, the database-backed control plane and a legacy file inbox could both accept and track tasks. Once they diverged, completed work was no longer reflected consistently, and the operator saw a fleet state that did not match reality. The fix is straightforward in concept: remove the fallback path. If the control plane is down, stop accepting work and alert the operator.
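A minimal sketch of that rule, with `healthy()` and `enqueue()` as assumed stand-ins for real control-plane calls:

```python
class ControlPlaneDown(RuntimeError):
    """Raised instead of silently falling back to a secondary queue."""


class TaskIntake:
    """Sketch of the 'no fallback' rule: exactly one authoritative queue."""

    def __init__(self, control_plane):
        # `control_plane` is any object with healthy() and enqueue(task);
        # both method names are assumptions for illustration.
        self.control_plane = control_plane

    def submit(self, task: dict) -> str:
        if not self.control_plane.healthy():
            # Old behavior: write to a legacy file inbox -> split brain.
            # New behavior: refuse work with a clean incident boundary.
            raise ControlPlaneDown("control plane unreachable; not accepting work")
        return self.control_plane.enqueue(task)
```

Refusing work is a visible, pageable event; a fallback queue is an invisible fork in the state of the system.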
File-based management under version control gives you three things SaaS tools usually do not: Git history on every decision, a format that agents can read and write as part of their workflow, and no dependency on an external service being available. The tradeoff is obvious: you lose dashboards, drag-and-drop boards, and collaborative editing. For a small platform team where both humans and agents consume project state, markdown files are often the simplest durable format.
Own the platform layer where your reliability, routing, approvals, and audit requirements are specific to your business. Adopt commodity infrastructure where differentiation is low: model APIs, browser automation, vector stores, or workflow primitives. The exact framework choice will vary, but the principle is stable: do not outsource the parts that define operational trust.
Three practical steps help. First, do not swallow failures in observability code; if your metrics or error pipeline breaks, that should be loud. Second, implement synthetic probes that exercise the full task lifecycle, not just endpoint pings. Third, require a typed result contract so agents cannot return an unstructured “it worked.” If failures are not explicit, they are hard to detect and harder to recover from.
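The second step, a synthetic probe, can be sketched as a function that pushes a test task through the full lifecycle and only reports healthy if it completes; `enqueue` and `get_status` are assumed stand-ins for real control-plane calls, and the status names are illustrative:

```python
import time


def synthetic_probe(enqueue, get_status, timeout_s: float = 5.0) -> bool:
    """Exercise the full task lifecycle, not just an endpoint ping."""
    task_id = enqueue({"kind": "synthetic", "payload": "ping"})
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(task_id)
        if status == "completed":
            return True   # a worker accepted, processed, and finished it
        if status in ("failed", "dead_letter"):
            return False  # the lifecycle broke somewhere; alert
        time.sleep(0.05)
    return False          # stuck in queue: liveness is not health
```

A probe like this would have caught our reconnect-looping agent, because a task routed to it would never reach `completed` within the timeout.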
We did not stop building agents because agents were a bad idea. We stopped because the platform underneath them could not be trusted. Rebuilding around one control plane, one worker contract, and one reliability model is slower in the short term, but it is the only path that makes the next generation of agents worth operating.
If you are facing similar issues in your own agent stack, this is the moment to audit the foundation before you add more automation on top. And if you want a practical comparison point, start with our earlier post on DIY agent fleets versus 2026 frameworks.
If you are building something similar — whether it is three agents or thirty — get in touch with Elegant Software Solutions. We help teams turn promising agent demos into platforms they can actually run.