
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
We paused new agent development because the platform underneath our agents had become unreliable. Twelve repositories, nine agents, and no authoritative control plane meant we could no longer answer basic operational questions with confidence: What is running? What is healthy? What failed? The fix was not another agent. It was a restart.
We consolidated the work into a single monorepo, ess-agent-platform, and defined the first release narrowly: a platform kernel with task APIs, run tracking, heartbeats, dead-letter queues, retries, and a typed worker contract. That kernel matters more than any new Slack bot feature or workflow automation because it gives operators one source of truth.
This is a practical lesson, not a theory piece. Repo sprawl had created split-brain fallbacks between database-backed and file-based inboxes, health checks that looked green while agents flapped, and silent degradation across metrics and event publishing. If your agent fleet is starting to feel harder to trust than to build, consolidation may be the right next step.
TL;DR: Repo sprawl did more than slow delivery; it created operational blind spots that made failures harder to detect and fix.
We got here the way many teams do: each new agent started as its own repo because that felt tidy at the time. Sparkles, our Slack bot, had one repo. Soundwave, our email triage worker, had another. The orchestrator, shared runtime, blog pipeline, insurance monitoring, and bookkeeping flows all evolved on separate tracks, each with its own dependency versions, deployment cadence, and definition of health.
The result was fragmentation where we needed consistency.
Several agents imported a shared runtime package, but each repo pinned different versions. A retry fix in the shared runtime could land in one agent quickly and miss others for days or weeks. Config parsing, secret loading, and database client setup were also duplicated across repos, which invited drift.
With twelve repos, there was no reliable place to answer a simple question: what is actually running right now? The orchestrator tried to fill that role, but it depended on heartbeats that agents reported inconsistently. In practice, the control plane reflected what agents claimed, not what operators needed to know.
The most damaging pattern was a silent fallback from the database-backed control plane to a legacy file-based inbox. An agent could lose access to the primary queue, continue reading old work from a local directory, and still appear healthy in the operator view. We wrote more about this in why we stopped building agents and restarted the platform, but the short version is simple: silent fallback hid real incidents.
| Problem | Symptom | Root Cause |
|---|---|---|
| Split-brain inbox | Agent processes old tasks from filesystem while DB queue has newer work | Silent fallback in shared runtime |
| False health reports | Orchestrator says "all healthy" while Slack bot is flapping | Heartbeat model lacked dependency checks |
| Config drift | Agent A retries 3x, Agent B retries 0x for the same error class | Copy-pasted config with no central policy |
| Silent metric loss | Monitoring gaps during restarts with no alert | Telemetry failures were caught and suppressed |
| Deployment lag | Critical runtime fix takes too long to propagate | Each repo pins its own dependency versions |
TL;DR: We chose a monorepo because our shared platform contracts are still changing, and splitting them across repos was making that instability worse.
The monorepo-versus-multi-repo debate is context-dependent. Large companies have succeeded with both models. What matters is whether your shared abstractions are mature enough to distribute safely.
Ours were not.
So the restart plan is explicit: one monorepo until the control plane, worker contract, and operator surface are stable. Only after those pieces settle do we reconsider splitting repos again. The new repo is ess-agent-platform because the name should tell operators exactly what it contains.
That naming choice also reflects a broader ESS principle: business names in code, codenames for humans. Codenames can be useful shorthand in conversation, but production systems benefit from plain language. ess-agent-platform/worker/email-triage is easier to search, debug, and explain than a clever internal nickname.
TL;DR: The first milestone is infrastructure, not features: one control plane, one worker contract, and one reliable way to observe failures.
Here is the build order for the platform kernel. No new agents until these are in place:

1. Task APIs and run tracking, so every unit of work has a recorded lifecycle.
2. Heartbeats, so the control plane knows which agents are actually alive.
3. Dead-letter queues and retries, so failures are captured instead of dropped.
4. A typed worker contract, so every worker reports status in the same shape.
The worker contract is the linchpin. In practice, it looks like this:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal


@dataclass
class WorkerResult:
    run_id: str
    worker_name: str
    status: Literal["success", "failure", "needs_approval", "degraded"]
    output: dict
    started_at: datetime
    completed_at: datetime
    retry_count: int
    error_detail: str | None = None
    downstream_health: dict[str, bool] | None = None
```

Every worker, whether it handles email triage, bookkeeping, or Slack interactions, returns the same shape. The control plane does not need to understand each worker's domain logic to monitor it. It needs to know whether the run succeeded, failed, needs a human, or is degraded, and whether dependencies are healthy.
That kind of standardization is a well-established reliability practice even if the exact impact varies by team and tooling. Standard interfaces make alerting, retries, dashboards, and incident response more predictable.
TL;DR: Frameworks can help at the application layer, but they do not replace the control plane, observability, and operator discipline a production platform still needs.
There is no shortage of agent frameworks promising faster orchestration. We evaluated several categories of tools and kept the decision grounded in our actual problem.
| Framework | Strengths | Why ESS Passed or Adopted |
|---|---|---|
| CrewAI | Clean multi-agent orchestration and role-based definitions | Useful patterns, but it does not solve runtime authority or fleet observability |
| LangGraph | Durable workflows, stateful graphs, long-running task support | Promising for later, but premature before the control plane is stable |
| LangChain | Broad ecosystem and many integrations | Too much abstraction for a platform that still needs tighter operational discipline |
| OpenAI Agents SDK | Tool integration, session handling, pragmatic app-layer foundation | Adopted as a primary application-layer building block |
| Claude Agent SDK | Strong fit for persistent coding and research workers | Approved selectively for specialist workers, not as the platform standard |
So the decision was to build the ESS platform layer ourselves: control plane, worker contracts, operator surface, and reliability model. On top of that, we can use application-layer SDKs where they fit.
The key distinction is this: our bottleneck was not getting agents to call models. It was giving operators a trustworthy picture of whether the fleet was working. Frameworks can accelerate workflows, but they do not automatically create operational truth.
TL;DR: Durable operations require durable context; if a decision or handoff is not written to a tracked file, it is too easy to lose.
A less obvious cost of twelve repos was fragmented knowledge. Decisions lived in chat threads. Context lived in one engineer's memory. Plans were discussed but not recorded. Returning to a repo after two weeks often meant reconstructing context that should already have existed.
The new operating model sets a hard rule: files are the memory system. We covered the full approach in file-based documentation for agent platform memory, but the core principles are straightforward:
- CURRENT_STATUS.md is a living snapshot.
- A dated journal entry records what changed and why.

This is more than process hygiene. If agents will eventually hand off work to one another, the handoff context cannot live in ephemeral chat history. File-based documentation is part of the platform, not just a team habit.
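The "files are the memory system" rule can be sketched as a single helper that refreshes the status snapshot and appends a dated journal entry in one step. The paths and entry format here are illustrative, not the actual ESS layout.

```python
import tempfile
from datetime import date
from pathlib import Path


def record(status: str, journal_note: str, root: Path) -> None:
    """Overwrite the living status snapshot and append a dated journal entry."""
    (root / "CURRENT_STATUS.md").write_text(status + "\n")
    with (root / "JOURNAL.md").open("a") as f:
        f.write(f"## {date.today().isoformat()}\n{journal_note}\n\n")


# Demo against a temporary directory standing in for the repo root.
root = Path(tempfile.mkdtemp())
record("Kernel heartbeats wired; DLQ pending.", "Merged heartbeat tracking.", root)
print((root / "CURRENT_STATUS.md").read_text())
```

Because both files live in version control, the handoff context survives restarts, reassignments, and chat history expiry.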
Consolidate when the shared pieces of your system are still unstable: control plane behavior, worker contracts, configuration policy, and observability conventions. If teams are repeatedly fixing the same runtime issue in multiple repos or struggling to answer what is healthy, a monorepo can reduce drift. Separate services make more sense once boundaries are clear and the shared platform layer changes less often.
A typed contract gives every worker the same reporting shape for status, timing, retries, and dependency health. That lets the control plane apply consistent alerting, dashboards, and retry logic without custom parsing for each worker. It also makes incident review easier because operators compare like with like instead of interpreting ad hoc outputs.
Application-layer SDKs still solve useful problems such as tool invocation, session handling, and model integration. The mistake is expecting them to solve platform governance on their own. ESS uses SDKs where they help, but keeps control-plane authority, observability, and auditability inside the platform layer.
It changes the definition of done. Engineers and agents are expected to leave behind status updates, decisions, and handoff notes in version-controlled files. That reduces repeated context gathering, improves onboarding, and creates a durable record for audits and future automation.
Do not allow silent fallback paths for core coordination mechanisms. If the primary queue or state store is unavailable, the system should enter an explicit degraded mode, emit alerts, and make the incident visible. Hidden fallback may preserve partial activity, but it destroys operator trust because the dashboard no longer matches reality.
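The "fail loudly" rule above can be sketched in a few lines: when the primary queue is unreachable, enter an explicit degraded mode and surface it, instead of silently reading a legacy file inbox. The function and alert names here are illustrative.

```python
import enum


class Mode(enum.Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


def poll_tasks(fetch_from_db, alert):
    """Fetch tasks from the primary queue; on failure, degrade visibly."""
    try:
        return Mode.NORMAL, fetch_from_db()
    except ConnectionError as exc:
        alert(f"primary queue unavailable: {exc}")  # operators see it immediately
        return Mode.DEGRADED, []  # no hidden fallback to the file inbox


def broken_fetch():
    raise ConnectionError("db down")  # simulated outage


alerts: list[str] = []
mode, tasks = poll_tasks(broken_fetch, alerts.append)
print(mode.value, tasks, alerts)
# degraded [] ['primary queue unavailable: db down']
```

The agent does less work during the outage, but the dashboard matches reality, which is the property the split-brain inbox destroyed.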
The restart was less about choosing a monorepo and more about choosing operational truth over feature velocity. We stopped adding agents because the system underneath them had not earned more complexity. Consolidating into ess-agent-platform gives us one place to define contracts, one place to observe failures, and one place to rebuild trust in the platform.
If your team is seeing the same warning signs (repo drift, inconsistent health signals, silent fallback, or too much context trapped in chat) it may be time to stabilize the platform before expanding the fleet. If you want help designing that control plane, ESS can help you map the kernel, contracts, and operating model needed to make agent systems dependable.
The next step is wiring the first version of heartbeat and run tracking into the new monorepo. The goal is simple: when an agent stops responding, operators should know quickly and unambiguously. After that come the dead-letter queue and the first migrated worker contract, likely starting with email triage because it is one of our most stable, high-value workflows.
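The goal of "know quickly and unambiguously" reduces to a staleness check over last-seen heartbeat timestamps. This is a minimal sketch under assumed names: the 90-second window, the `stale_agents` helper, and the agent names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# If an agent has not reported within this window, treat it as unresponsive.
STALE_AFTER = timedelta(seconds=90)


def stale_agents(heartbeats: dict[str, datetime], now: datetime) -> list[str]:
    """Return agents whose last heartbeat is older than the staleness window."""
    return sorted(
        name
        for name, last_seen in heartbeats.items()
        if now - last_seen > STALE_AFTER
    )


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
beats = {
    "sparkles": now - timedelta(seconds=30),   # Slack bot, reporting normally
    "soundwave": now - timedelta(minutes=10),  # email worker, silent too long
}
print(stale_agents(beats, now))  # ['soundwave']
```

Run on a schedule against the control plane's heartbeat table, this turns "the orchestrator says healthy" into "the agent was actually heard from within the last 90 seconds."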