
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
We paused new agent development because the platform underneath our agents had become unreliable. Twelve repositories, nine agents, and no authoritative control plane meant we could no longer answer basic operational questions with confidence: What is running? What is healthy? What failed? The fix was not another agent. It was a restart.
We consolidated the work into a single monorepo, ess-agent-platform, and defined the first release narrowly: a platform kernel with task APIs, run tracking, heartbeats, dead-letter queues, retries, and a typed worker contract. That kernel matters more than any new Slack bot feature or workflow automation because it gives operators one source of truth.
This is a practical lesson, not a theory piece. Repo sprawl had created split-brain fallbacks between database-backed and file-based inboxes, health checks that looked green while agents flapped, and silent degradation across metrics and event publishing. If your agent fleet is starting to feel harder to trust than to build, consolidation may be the right next step.
TL;DR: Repo sprawl did more than slow delivery; it created operational blind spots that made failures harder to detect and fix.
We got here the way many teams do: each new agent started as its own repo because that felt tidy at the time. Sparkles, our Slack bot, had one repo. Soundwave, our email triage worker, had another. The orchestrator, shared runtime, blog pipeline, insurance monitoring, and bookkeeping flows all evolved on separate tracks, each with its own dependency versions, deployment cadence, and definition of health.
The result was fragmentation where we needed consistency.
Several agents imported a shared runtime package, but each repo pinned different versions. A retry fix in the shared runtime could land in one agent quickly and miss others for days or weeks. Config parsing, secret loading, and database client setup were also duplicated across repos, which invited drift.
With twelve repos, there was no reliable place to answer a simple question: what is actually running right now? The orchestrator tried to fill that role, but it depended on heartbeats that agents reported inconsistently. In practice, the control plane reflected what agents claimed, not what operators needed to know.
The most damaging pattern was a silent fallback from the database-backed control plane to a legacy file-based inbox. An agent could lose access to the primary queue, continue reading old work from a local directory, and still appear healthy in the operator view. We wrote more about this in why we stopped building agents and restarted the platform, but the short version is simple: silent fallback hid real incidents.
| Problem | Symptom | Root Cause |
|---|---|---|
| Split-brain inbox | Agent processes old tasks from filesystem while DB queue has newer work | Silent fallback in shared runtime |
| False health reports | Orchestrator says "all healthy" while Slack bot is flapping | Heartbeat model lacked dependency checks |
| Config drift | Agent A retries 3x, Agent B retries 0x for the same error class | Copy-pasted config with no central policy |
| Silent metric loss | Monitoring gaps during restarts with no alert | Telemetry failures were caught and suppressed |
| Deployment lag | Critical runtime fix takes too long to propagate | Each repo pins its own dependency versions |
TL;DR: We chose a monorepo because our shared platform contracts are still changing, and splitting them across repos was making that instability worse.
The monorepo-versus-multi-repo debate is context-dependent. Large companies have succeeded with both models. What matters is whether your shared abstractions are mature enough to distribute safely.
Ours were not.
So the restart plan is explicit: one monorepo until the control plane, worker contract, and operator surface are stable. Only after those pieces settle do we reconsider splitting repos again. The new repo is ess-agent-platform because the name should tell operators exactly what it contains.
That naming choice also reflects a broader ESS principle: business names in code, codenames for humans. Codenames can be useful shorthand in conversation, but production systems benefit from plain language. ess-agent-platform/worker/email-triage is easier to search, debug, and explain than a clever internal nickname.
TL;DR: The first milestone is infrastructure, not features: one control plane, one worker contract, and one reliable way to observe failures.
Here is the build order for the platform kernel. No new agents until these are in place:

1. Task APIs and run tracking, so every unit of work has a recorded lifecycle.
2. Heartbeats, so the control plane knows which agents are actually alive.
3. Dead-letter queues and retries, so failures are captured instead of dropped.
4. A typed worker contract, so every worker reports status in the same shape.
The worker contract is the linchpin. In practice, it looks like this:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Literal


@dataclass
class WorkerResult:
    run_id: str
    worker_name: str
    status: Literal["success", "failure", "needs_approval", "degraded"]
    output: dict
    started_at: datetime
    completed_at: datetime
    retry_count: int
    error_detail: str | None = None
    downstream_health: dict[str, bool] | None = None
```

Every worker, whether it handles email triage, bookkeeping, or Slack interactions, returns the same shape. The control plane does not need to understand each worker's domain logic to monitor it. It needs to know whether the run succeeded, failed, needs a human, or is degraded, and whether dependencies are healthy.
That kind of standardization is a well-established reliability practice even if the exact impact varies by team and tooling. Standard interfaces make alerting, retries, dashboards, and incident response more predictable.
TL;DR: Frameworks can help at the application layer, but they do not replace the control plane, observability, and operator discipline a production platform still needs.
There is no shortage of agent frameworks promising faster orchestration. We evaluated several categories of tools and kept the decision grounded in our actual problem.
| Framework | Strengths | Why ESS Passed or Adopted |
|---|---|---|
| CrewAI | Clean multi-agent orchestration and role-based definitions | Useful patterns, but it does not solve runtime authority or fleet observability |
| LangGraph | Durable workflows, stateful graphs, long-running task support | Promising for later, but premature before the control plane is stable |
| LangChain | Broad ecosystem and many integrations | Too much abstraction for a platform that still needs tighter operational discipline |
| OpenAI Agents SDK | Tool integration, session handling, pragmatic app-layer foundation | Adopted as a primary application-layer building block |
| Claude Agent SDK | Strong fit for persistent coding and research workers | Approved selectively for specialist workers, not as the platform standard |
So the decision was to build the ESS platform layer ourselves: control plane, worker contracts, operator surface, and reliability model. On top of that, we can use application-layer SDKs where they fit.
The key distinction is this: our bottleneck was not getting agents to call models. It was giving operators a trustworthy picture of whether the fleet was working. Frameworks can accelerate workflows, but they do not automatically create operational truth.
TL;DR: Durable operations require durable context; if a decision or handoff is not written to a tracked file, it is too easy to lose.
A less obvious cost of twelve repos was fragmented knowledge. Decisions lived in chat threads. Context lived in one engineer's memory. Plans were discussed but not recorded. Returning to a repo after two weeks often meant reconstructing context that should already have existed.
The new operating model sets a hard rule: files are the memory system. We covered the full approach in file-based documentation for agent platform memory, but the core principles are straightforward:
- CURRENT_STATUS.md is a living snapshot.
- A dated journal entry records what changed and why.

This is more than process hygiene. If agents will eventually hand off work to one another, the handoff context cannot live in ephemeral chat history. File-based documentation is part of the platform, not just a team habit.
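The "files are the memory system" rule can be sketched as a single helper that refreshes the status snapshot and appends a dated journal entry in one step. The paths and entry format here are illustrative, not the actual ESS layout.

```python
import tempfile
from datetime import date
from pathlib import Path


def record(status: str, journal_note: str, root: Path) -> None:
    """Overwrite the living status snapshot and append a dated journal entry."""
    (root / "CURRENT_STATUS.md").write_text(status + "\n")
    with (root / "JOURNAL.md").open("a") as f:
        f.write(f"## {date.today().isoformat()}\n{journal_note}\n\n")


# Demo against a temporary directory standing in for the repo root.
root = Path(tempfile.mkdtemp())
record("Kernel heartbeats wired; DLQ pending.", "Merged heartbeat tracking.", root)
print((root / "CURRENT_STATUS.md").read_text())
```

Because both files live in version control, the handoff context survives restarts, reassignments, and chat history expiry.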
Consolidate when the shared pieces of your system are still unstable: control plane behavior, worker contracts, configuration policy, and observability conventions. If teams are repeatedly fixing the same runtime issue in multiple repos or struggling to answer what is healthy, a monorepo can reduce drift. Separate services make more sense once boundaries are clear and the shared platform layer changes less often.
A typed contract gives every worker the same reporting shape for status, timing, retries, and dependency health. That lets the control plane apply consistent alerting, dashboards, and retry logic without custom parsing for each worker. It also makes incident review easier because operators compare like with like instead of interpreting ad hoc outputs.
Application-layer SDKs still solve useful problems such as tool invocation, session handling, and model integration. The mistake is expecting them to solve platform governance on their own. ESS uses SDKs where they help, but keeps control-plane authority, observability, and auditability inside the platform layer.
It changes the definition of done. Engineers and agents are expected to leave behind status updates, decisions, and handoff notes in version-controlled files. That reduces repeated context gathering, improves onboarding, and creates a durable record for audits and future automation.
Do not allow silent fallback paths for core coordination mechanisms. If the primary queue or state store is unavailable, the system should enter an explicit degraded mode, emit alerts, and make the incident visible. Hidden fallback may preserve partial activity, but it destroys operator trust because the dashboard no longer matches reality.
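The "fail loudly" rule above can be sketched in a few lines: when the primary queue is unreachable, enter an explicit degraded mode and surface it, instead of silently reading a legacy file inbox. The function and alert names here are illustrative.

```python
import enum


class Mode(enum.Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


def poll_tasks(fetch_from_db, alert):
    """Fetch tasks from the primary queue; on failure, degrade visibly."""
    try:
        return Mode.NORMAL, fetch_from_db()
    except ConnectionError as exc:
        alert(f"primary queue unavailable: {exc}")  # operators see it immediately
        return Mode.DEGRADED, []  # no hidden fallback to the file inbox


def broken_fetch():
    raise ConnectionError("db down")  # simulated outage


alerts: list[str] = []
mode, tasks = poll_tasks(broken_fetch, alerts.append)
print(mode.value, tasks, alerts)
# degraded [] ['primary queue unavailable: db down']
```

The agent does less work during the outage, but the dashboard matches reality, which is the property the split-brain inbox destroyed.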
The restart was less about choosing a monorepo and more about choosing operational truth over feature velocity. We stopped adding agents because the system underneath them had not earned more complexity. Consolidating into ess-agent-platform gives us one place to define contracts, one place to observe failures, and one place to rebuild trust in the platform.
If your team is seeing the same warning signs (repo drift, inconsistent health signals, silent fallback, or too much context trapped in chat) it may be time to stabilize the platform before expanding the fleet. If you want help designing that control plane, ESS can help you map the kernel, contracts, and operating model needed to make agent systems dependable.
The next step is wiring the first version of heartbeat and run tracking into the new monorepo. The goal is simple: when an agent stops responding, operators should know quickly and unambiguously. After that come the dead-letter queue and the first migrated worker contract, likely starting with email triage because it is one of our most stable, high-value workflows.
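The goal of "know quickly and unambiguously" reduces to a staleness check over last-seen heartbeat timestamps. This is a minimal sketch under assumed names: the 90-second window, the `stale_agents` helper, and the agent names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# If an agent has not reported within this window, treat it as unresponsive.
STALE_AFTER = timedelta(seconds=90)


def stale_agents(heartbeats: dict[str, datetime], now: datetime) -> list[str]:
    """Return agents whose last heartbeat is older than the staleness window."""
    return sorted(
        name
        for name, last_seen in heartbeats.items()
        if now - last_seen > STALE_AFTER
    )


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
beats = {
    "sparkles": now - timedelta(seconds=30),   # Slack bot, reporting normally
    "soundwave": now - timedelta(minutes=10),  # email worker, silent too long
}
print(stale_agents(beats, now))  # ['soundwave']
```

Run on a schedule against the control plane's heartbeat table, this turns "the orchestrator says healthy" into "the agent was actually heard from within the last 90 seconds."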