
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
We paused agent development because the platform underneath our agents was giving us false confidence. Processes looked healthy while work stalled, fallback paths created conflicting sources of truth, and observability failures disappeared quietly instead of triggering alerts. In practice, “working” had come to mean “not obviously broken.”
So we stopped adding agents and restarted the platform instead. The new direction is simple: one monorepo, one authoritative control plane, one worker contract, and one reliability model. Agents only return after the foundation can prove it is healthy.
This post explains why we made that call, what the replacement architecture looks like, and how a file-based operating model is keeping the rebuild honest. If you run your own agent fleet and everything seems fine, this is a useful gut-check.
TL;DR: Our fleet had five structural failures: split-brain control, shallow health checks, silent degradation, an operator surface mistaken for a control plane, and too many repos to manage safely.
I documented this in our internal crew-building journal on March 14, 2026, after discovering that Sparkles, our Slack bot, had been routing work to an agent stuck in a reconnect loop for days. The orchestrator still reported the agent as healthy because the reconnect cycle itself counted as activity.
Here are the five findings that made me stop building features and start rebuilding infrastructure:
**Finding 1: Split-brain control.** Our shared runtime had a fallback: if it could not reach the database-backed control plane, it switched to a legacy file inbox. That created two sources of truth. A task could be queued in the database, picked up through the file fallback, and completed without the primary system ever reflecting the result. Instead of a clean incident boundary — “the control plane is down, stop accepting work” — we got a ghost system that looked functional while losing state.
**Finding 2: Shallow health checks.** The orchestrator's health checks mostly answered “can I reach this process?” That is liveness, not service health. A process can respond to pings and still be unable to do useful work. Google's SRE guidance distinguishes shallow liveness checks from checks that verify real readiness or end-to-end capability. Our checks were closer to “returns 200” than “can accept, process, and complete work.”
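To make the distinction concrete, here is a minimal sketch, not our actual orchestrator code, contrasting a liveness check with a readiness check that asks whether the worker has recently finished anything; the `Worker` fields and the 300-second idle threshold are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that tracks real progress, not just uptime."""
    last_completed_at: float = field(default_factory=time.monotonic)
    queue_depth: int = 0

    def liveness(self) -> bool:
        # Shallow check: the process is up and can answer a ping.
        # This stays True during a reconnect loop that never completes work.
        return True

    def readiness(self, max_idle_s: float = 300.0) -> bool:
        # Deeper check: has this worker completed work recently, or is its
        # backlog empty? A reconnect-looping worker with queued tasks fails.
        idle = time.monotonic() - self.last_completed_at
        return idle < max_idle_s or self.queue_depth == 0
```

Under this sketch, the stuck agent Sparkles kept routing to would have passed `liveness()` and failed `readiness()`, which is exactly the signal the old checks never surfaced.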
**Finding 3: Silent degradation.** Sentry integration, metrics publishing, and event broadcasting all had error handling that swallowed failures and continued. The intent was resilience. The result was blindness. When your observability pipeline fails silently, you no longer have observability.
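As a sketch of the fix, here is an illustrative pattern (the `MetricsPipeline` class and its threshold are assumptions, not our real Sentry or metrics code) in which a publish failure is logged loudly and escalates after repeated failures instead of being swallowed:

```python
import logging

logger = logging.getLogger("metrics")


class MetricsPipeline:
    """Sketch: observability failures stay visible instead of vanishing."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def publish(self, emit, name: str, value: float) -> bool:
        try:
            emit(name, value)
        except Exception:
            # Old pattern: `except Exception: pass` -- silent blindness.
            # New pattern: log with traceback, then escalate if it persists.
            self.consecutive_failures += 1
            logger.exception("metric publish failed: %s", name)
            if self.consecutive_failures >= self.max_consecutive_failures:
                raise RuntimeError("observability pipeline is down")
            return False
        self.consecutive_failures = 0
        return True
```

The point is the shape, not the details: a transient failure is tolerated and visible, and a persistent one becomes an incident rather than quiet blindness.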
**Finding 4: An operator surface mistaken for a control plane.** I had been treating Sparkles as the operator surface, and for intake and routing it worked. But it depended on local subprocess invocation and machine-specific state. If the host machine rebooted, session context vanished. It was a front door attached to a house without a frame. As I argued in our earlier comparison of DIY agent fleets versus 2026 frameworks, the gap between a working demo and a dependable platform is usually control-plane integrity.
**Finding 5: Too many repos.** We had separate repos for Sparkles, Concierge, Soundwave, Harvest, the orchestrator, the shared runtime, and more. Each repo had its own conventions, test posture, and deployment story. The highest-blast-radius repos — the shared runtime and orchestrator — had the least test coverage. Lower-risk repos had more. That is backwards.
TL;DR: We are consolidating into a single ess-agent-platform monorepo and rebuilding the platform kernel before migrating any specialty agent.
The broader market narrative still pushes toward “more autonomous agents, faster.” That makes it easy to skip the boring parts: governance, observability, contracts, and failure handling. We learned the hard way that those are the parts that determine whether an agent system can survive production.
Our restart plan is deliberately boring.
The monorepo stays a monorepo until three conditions are met: the control plane is stable, the worker contract is stable, and the first three production agent migrations are complete. Only then do we consider splitting repositories. The current sprawl taught us that multiple repos are a privilege you earn, not a default.
The worker contract is the key constraint. Every agent — whether it handles email triage, bookkeeping, or Slack messages — must implement the same interface:
```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class WorkerResult:
    status: Literal["success", "failure", "needs_human"]
    output: dict
    trace_id: str
    execution_ms: int
    retry_eligible: bool
```

No silent fallbacks. No swallowed exceptions. If an agent cannot produce a typed result, it produces a typed failure. The dead-letter queue catches everything else.
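As a sketch of how the contract can be enforced, a thin wrapper can guarantee that every handler exits through a typed result; `run_worker` and its retry rule are illustrative assumptions rather than our production code, and `WorkerResult` is repeated so the snippet runs standalone:

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass
class WorkerResult:
    # Repeated here so the sketch is self-contained.
    status: Literal["success", "failure", "needs_human"]
    output: dict
    trace_id: str
    execution_ms: int
    retry_eligible: bool


def run_worker(handler: Callable[[dict], dict], task: dict) -> WorkerResult:
    """Wrap any handler so it can only exit through the typed contract."""
    trace_id = task.get("trace_id", str(uuid.uuid4()))
    start = time.monotonic()
    try:
        output = handler(task)
    except Exception as exc:
        # An exception becomes a typed failure -- nothing is swallowed.
        return WorkerResult(
            status="failure",
            output={"error": type(exc).__name__, "detail": str(exc)},
            trace_id=trace_id,
            execution_ms=int((time.monotonic() - start) * 1000),
            # Assumed policy for illustration: bad input is not retried.
            retry_eligible=not isinstance(exc, ValueError),
        )
    return WorkerResult("success", output, trace_id,
                        int((time.monotonic() - start) * 1000), True)
```

Anything that still slips through, such as a worker that crashes before returning, is the dead-letter queue's job.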
TL;DR: We replaced chat-thread memory and tribal knowledge with version-controlled markdown files as the source of truth for decisions, status, and handoffs.
This is the practice that changed my day-to-day work the most. The old pattern was simple: discuss a decision in Slack, make the call, implement it, and move on. Two weeks later, neither I nor any agent could reconstruct why we made that choice. The context lived in a thread that was effectively write-only.
The new rule is simple: files are the memory system. Every decision, lesson, handoff, and status update gets written to a tracked file under version control. We separate stable truth, such as CURRENT_STATUS.md and AGENT_ROSTER.md, from temporal truth, such as dated journal entries and incident notes.
The practical impact:
| Old Pattern | New Pattern |
|---|---|
| Decision context in Slack threads | Decision context in ADR files under /docs/adrs/ |
| Status in one engineer's head | Status in CURRENT_STATUS.md, updated every session |
| Handoffs via “let me explain what I was doing” | Handoffs via session notes with breadcrumbs |
| Agent memory in chat-only context windows | Agent memory in file-based project state |
| Business logic mixed with codenames | Business names in code, codenames for humans only |
That last row matters more than it sounds. We use Transformer-themed codenames for the crew, but in code, file paths, logs, and APIs, everything uses the business name. `ess-agents-slack-sparkles`, not a codename. When you are debugging late at night, clarity beats personality.
If you are running agents on Mac mini hardware, this file-based approach has another benefit: state survives reboots, hardware swaps, and future hardware upgrades without a migration project.
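As a sketch of the mechanics, a session-note helper might append dated markdown under the repo so both humans and agents can write breadcrumbs; the `docs/journal` layout and file naming here are illustrative assumptions, not our exact structure:

```python
from datetime import date
from pathlib import Path


def write_session_note(repo_root: Path, summary: str,
                       breadcrumbs: list[str]) -> Path:
    """Append a dated, version-controllable session note (illustrative layout)."""
    notes_dir = repo_root / "docs" / "journal"
    notes_dir.mkdir(parents=True, exist_ok=True)
    note = notes_dir / f"{date.today().isoformat()}-session.md"
    lines = [f"## Session summary\n\n{summary}\n", "\n## Breadcrumbs\n"]
    lines += [f"- {b}\n" for b in breadcrumbs]
    # Append-only: repeated sessions on the same day stack in one file,
    # and Git history preserves who changed what and when.
    with note.open("a", encoding="utf-8") as f:
        f.writelines(lines)
    return note
```

Because the output is plain markdown in the repo, the same note is readable by the next engineer, the next agent session, and `git log`.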
TL;DR: Building specialty agents before the platform kernel is like hiring employees before you have payroll, HR, or an office: they may do useful work, but you cannot manage them reliably.
The temptation is always to build the interesting thing first: a payroll agent that reads invoices, an email triage system that drafts responses, a Slack bot that orchestrates code reviews. Every one of those sounds more exciting than implementing a dead-letter queue or synthetic health probes.
But here is what we learned: every specialty agent we built before the platform was solid became a maintenance liability that hid failures. The agent itself might work in isolation. The problem is everything around it: how it receives work, how it reports status, how it handles partial failures, and how it hands off to a human when it gets stuck.
Our rebuild order is explicit:
Only after those nine are stable do we migrate the first agent. Soundwave, our email agent, is the likely first candidate because it has a relatively clear input-output contract and lower coordination complexity.
A monorepo forces shared standards. When the worker contract, health checks, and control-plane APIs live in the same repo, every agent uses the same versions and conventions. Separate repos make drift easier: different test patterns, different error handling, different deployment assumptions. You earn the right to split repos only after the shared contracts are stable.
Split-brain happens when two subsystems both act as the authoritative source of truth. In our case, the database-backed control plane and a legacy file inbox could both accept and track tasks. Once they diverged, completed work was no longer reflected consistently, and the operator saw a fleet state that did not match reality. The fix is straightforward in concept: remove the fallback path. If the control plane is down, stop accepting work and alert the operator.
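A minimal sketch of that rule, with `healthy()` and `enqueue()` as assumed stand-ins for real control-plane calls:

```python
class ControlPlaneDown(RuntimeError):
    """Raised instead of silently falling back to a secondary queue."""


class TaskIntake:
    """Sketch of the 'no fallback' rule: exactly one authoritative queue."""

    def __init__(self, control_plane):
        # `control_plane` is any object with healthy() and enqueue(task);
        # both method names are assumptions for illustration.
        self.control_plane = control_plane

    def submit(self, task: dict) -> str:
        if not self.control_plane.healthy():
            # Old behavior: write to a legacy file inbox -> split brain.
            # New behavior: refuse work with a clean incident boundary.
            raise ControlPlaneDown("control plane unreachable; not accepting work")
        return self.control_plane.enqueue(task)
```

Refusing work is a visible, pageable event; a fallback queue is an invisible fork in the state of the system.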
File-based management under version control gives you three things SaaS tools usually do not: Git history on every decision, a format that agents can read and write as part of their workflow, and no dependency on an external service being available. The tradeoff is obvious: you lose dashboards, drag-and-drop boards, and collaborative editing. For a small platform team where both humans and agents consume project state, markdown files are often the simplest durable format.
Own the platform layer where your reliability, routing, approvals, and audit requirements are specific to your business. Adopt commodity infrastructure where differentiation is low: model APIs, browser automation, vector stores, or workflow primitives. The exact framework choice will vary, but the principle is stable: do not outsource the parts that define operational trust.
Three practical steps help. First, do not swallow failures in observability code; if your metrics or error pipeline breaks, that should be loud. Second, implement synthetic probes that exercise the full task lifecycle, not just endpoint pings. Third, require a typed result contract so agents cannot return an unstructured “it worked.” If failures are not explicit, they are hard to detect and harder to recover from.
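The second step, a synthetic probe, can be sketched as a function that pushes a test task through the full lifecycle and only reports healthy if it completes; `enqueue` and `get_status` are assumed stand-ins for real control-plane calls, and the status names are illustrative:

```python
import time


def synthetic_probe(enqueue, get_status, timeout_s: float = 5.0) -> bool:
    """Exercise the full task lifecycle, not just an endpoint ping."""
    task_id = enqueue({"kind": "synthetic", "payload": "ping"})
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(task_id)
        if status == "completed":
            return True   # a worker accepted, processed, and finished it
        if status in ("failed", "dead_letter"):
            return False  # the lifecycle broke somewhere; alert
        time.sleep(0.05)
    return False          # stuck in queue: liveness is not health
```

A probe like this would have caught our reconnect-looping agent, because a task routed to it would never reach `completed` within the timeout.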
We did not stop building agents because agents were a bad idea. We stopped because the platform underneath them could not be trusted. Rebuilding around one control plane, one worker contract, and one reliability model is slower in the short term, but it is the only path that makes the next generation of agents worth operating.
If you are facing similar issues in your own agent stack, this is the moment to audit the foundation before you add more automation on top. And if you want a practical comparison point, start with our earlier post on DIY agent fleets versus 2026 frameworks.
If you are building something similar — whether it is three agents or thirty — get in touch with Elegant Software Solutions. We help teams turn promising agent demos into platforms they can actually run.