
Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
I made a pretty unglamorous decision this week: stop chasing orchestration sophistication and go back to platform fundamentals. The short version is simple. If your agent fleet cannot tell the truth about state, cannot fail loudly, and cannot share one authoritative control plane, you do not have an agent platform architecture. You have a pile of hopeful automations.
That is the ESS situation I am rebuilding from. We already have useful agents: Sparkles as the Slack control surface, Concierge for general-purpose work, Soundwave for email, plus other domain-specific pieces. But the fleet got too wide for its maturity. We had framework-shaped conversations while the real problems were split-brain behavior, silent degradation, duplicated logic, and local-machine assumptions.
So while the market rewards heavier orchestration stories, ESS is doing something more boring and, I think, more operationally honest. We are building one platform kernel first: one control plane, one worker contract, one monorepo, one reliability model, and file-based memory under version control. If you read Framework Debates Are Over: Production Engineering Won, this is that thesis made concrete.
TL;DR: We did not need more agents; we needed fewer moving parts, stronger contracts, and clearer operational truth.
The most important diagnosis in the current ESS rebuild is not that any one agent was bad. It is that the overall system expanded faster than its operational backbone. That is a classic agent fleet failure mode: adding capability before hardening authority, observability, and state.
The current-state review on 2026-03-14 called out the same pattern repeatedly. The control plane was not truly authoritative. Health reporting was too optimistic. Slack-facing agents duplicated logic and lost state on restart. High-blast-radius repos lacked meaningful tests. Those are not polish issues. Those are structural issues.
If you are a developer, here is the quotable version: multi-agent sophistication built on a weak runtime is just distributed ambiguity.
Many teams hit this because frameworks make breadth feel cheap. Spinning up a new tool-using worker, channel adapter, or routing graph is usually easier than cleaning up your retry semantics or rewriting your heartbeat contract. The result is impressive demos and lousy operator trust.
Both langgraph and pyautogen remain actively maintained on PyPI, which tells you where ecosystem attention is going. But active ecosystem attention does not erase platform debt. A fast-moving tool landscape is useful for application-layer productivity, but it is not evidence that your control plane design is sound.
What I needed to admit was that our fleet had crossed the line where new agents were creating more uncertainty than leverage. That is why the restart plan explicitly recommends a monorepo development model and a platform kernel first. Not because monorepos are fashionable, but because repo sprawl had become part of the reliability problem.
This is also why I linked our thinking back to We Stopped Building Agents and Restarted the Platform. The stop was the feature. It created the space to decide what the real system boundary should be.
TL;DR: Silent fallback elimination is not a purity test; it is how you prevent split-brain systems from pretending they are healthy.
The ugliest problem in the old ESS setup was fallback behavior from the database-backed control plane to legacy file-based paths. On paper, that sounded resilient. In practice, it created split-brain behavior. One part of the system believed the database was authoritative. Another part quietly kept work moving elsewhere. Operators got neither a clean outage nor a clean success.
That is operational poison.
When a control plane becomes optional, it stops being a control plane. It becomes a best-effort coordination hint. That was one of the most important lessons in the rebuild, and one of the reasons the new platform work is centered on explicit degraded mode instead of hidden continuity.
Here is the anti-pattern in simplified form:
```python
async def enqueue_task(task: Task) -> str:
    try:
        return await db_control_plane.insert_task(task)
    except Exception:
        logger.warning("control plane unavailable, falling back")
        return await file_inbox.write(task)
```

That code feels pragmatic right up until you need auditability, dead-letter handling, operator visibility, consistent retries, or accurate run tracking. Then it becomes a lie.
The replacement pattern is intentionally harsher:
```python
class ControlPlaneUnavailable(RuntimeError):
    pass


async def enqueue_task(task: Task) -> str:
    try:
        return await db_control_plane.insert_task(task)
    except Exception as exc:
        metrics.increment("control_plane.unavailable")
        sentry_sdk.capture_exception(exc)
        raise ControlPlaneUnavailable(
            "Authoritative control plane unavailable; task rejected"
        ) from exc
```

That changes the contract in a few important ways: the failure is counted and reported instead of whispered into a warning log, the caller receives a typed error it must handle, and no shadow queue quietly accumulates work the control plane never saw.
If you want the deeper version of that lesson, Silent Fallbacks Are Lies: Building Explicit Failure Boundaries goes directly at it.
TL;DR: Monorepo development and business names in code reduce ambiguity โ exactly what a rebuilding platform needs.
I am normally tolerant of repo splitting when boundaries are earned. Ours were not. We had duplication without clean ownership, codenames without operational clarity, and enough machine-specific assumptions to make restarts feel like archaeology.
That is why the restart recommendation is blunt: one new canonical project, ess-agent-platform, and keep it together until the control plane, worker contract, and operator surface stabilize.
The monorepo decision is less about Git preference and more about system comprehension. When platform code, worker contracts, adapter code, schemas, probes, and operator tooling live together, you can reason about the whole system as one thing. That matters when you are trying to eliminate hidden branches in behavior.
Here is the rough shape:
```
ess-agent-platform/
  apps/
    operator-console/
    control-plane-api/
    worker-supervisor/
  agents/
    messaging/
    email/
    finance/
  adapters/
    slack/
    email/
    browser/
  packages/
    runtime-contracts/
    observability/
    policy/
    storage/
  docs/
    roadmap/
    adr/
    journal/
```

The other decision that seems small until you run a fleet is business names in code. Human-facing codenames are fine. We still talk about Sparkles and Soundwave. But code, logs, APIs, and file paths need boring names. "email-triage-worker" is better than an internal codename when someone is debugging a failed run at 6:30 in the morning.
Operational clarity improves when naming is literal: the log line, the alert, the API path, and the runbook all point at the same boring, searchable name.
GitHub's own documentation positions monorepos as a practical fit for shared tooling, coordinated changes, and centralized policy enforcement. That is not proof every agent system should be a monorepo forever. It is evidence that, during a stabilization phase, the tradeoff is reasonable.
TL;DR: We rejected orchestration-heavy standardization at the core and chose OpenAI Agents SDK as an app-layer tool, not as the platform itself.
The external market signal right now is clear. LangGraph is widely discussed for durable multi-agent workflows. Microsoft's AutoGen has consolidated around a more integrated story. Cursor keeps pushing autonomous coding with tighter editor behavior. If you want an impressive 2026 demo stack, you have options.
We are still not making one of those the center of ESS.
That is not because they are bad. It is because our primary problem is not graph expressiveness. It is runtime authority. We do not need more orchestration until the base system can answer basic questions reliably: What is running right now? Which worker owns this task? When did we last hear from it? What happened the last time this run failed?
Here is the comparison table that clarified the decision:
| Layer | What we need | Selected approach | Why |
|---|---|---|---|
| Control plane | Durable tasking, runs, heartbeats, audit, DLQ | Build in-house | ESS-specific operating model |
| Operator surface | Human intake, approvals, visibility | Build in-house via Sparkles successor | Operator trust is product-critical |
| Worker contract | Typed I/O, idempotency, retries, failure semantics | Build in-house | Needs one fleet-wide runtime contract |
| Application-layer agent behavior | Tool use, sessions, model interaction | OpenAI Responses API + Agents SDK | Pragmatic foundation without surrendering architecture |
| Specialist coding/research workers | Persistent coding and research tasks | Claude Agent SDK selectively | Useful exception, not the platform center |
| Long-running workflow watchlist | Complex graphs if later required | LangGraph watchlist | Interesting, but not day-one foundation |
| Broad framework abstraction | Generic multi-tool abstraction | Not standardizing on LangChain or CrewAI | Adds abstraction before discipline |
That is the framework versus platform distinction. A framework can help a worker think, call tools, or maintain session semantics. A platform decides authority, visibility, policy, and truth.
OpenAI Agents SDK is probably the most useful reference point here as an architectural benchmark, especially around gateway and control-surface ideas. But we still need the ESS-specific control plane design because we are optimizing for dependable internal business operations, not just agent capability breadth.
TL;DR: If the memory is not in files under version control, it is not reliable enough for a platform rebuild.
This is one of those lessons that sounds obvious only after you get burned by the opposite. The old system depended too much on chat-thread continuity, repo-local tribal knowledge, and whatever I happened to remember from the previous week. That is not memory. That is atmosphere.
So the rebuild adopted a file-based operating model. Durable truth lives in tracked files. Stable truth and temporal truth are separated. Sessions are required to leave breadcrumbs. If something matters to platform behavior, it needs a home in version control.
That gives us a few practical wins: durable truth survives restarts and machine changes, version control gives us history and diffs for free, and a handoff no longer depends on what I happened to remember from the previous week.
A simple example is how I now want a session to end:
```markdown
# Session handoff

## Implemented
- Added typed worker result envelope
- Rejected legacy file inbox fallback in runtime
- Added explicit degraded-mode error path

## Broken
- Heartbeat reconciliation still too optimistic
- Slack adapter still owns too much routing logic

## Next
- Move routing metadata into control plane
- Add synthetic probes for task enqueue and heartbeat freshness
```

It is not glamorous, but it beats fake continuity.
The broader point is this: production agent memory is not a bigger context window; it is a better externalized system of record. That is one reason file-based documentation remains part of the platform and not just project hygiene.
**Why not adopt a durable workflow engine like LangGraph now?** Because the control plane is the thing that decides what "later" even means. LangGraph may become useful for specific durable workflow patterns, but if task ownership, heartbeats, retries, and audit trails are weak, a graph runtime mostly gives you a more elaborate way to hide operational ambiguity.
**What does an authoritative control plane actually mean?** It means there is exactly one system of record for task ingestion, run tracking, heartbeat state, workflow status, and operator-visible history. Workers do not silently switch to alternate state stores when that system is unavailable; they fail explicitly or enter a clearly declared degraded mode.
**Why a monorepo?** Because the platform kernel is still being stabilized. Keeping the operator surface, control plane, worker contracts, adapters, and early agents in one repo makes cross-cutting changes easier, reduces duplicated runtime logic, and improves operational clarity while the core contracts are still moving.
**Do literal business names in code really matter?** Yes, especially once multiple engineers or operators touch the system. Literal names improve logs, alerts, APIs, runbooks, and onboarding. Codenames are memorable for conversation, but business names in code make failures easier to understand under pressure.
**Is ESS standardizing on a single agent framework?** No. ESS rejected making a broad orchestration framework the center of the platform. The current direction uses OpenAI Responses API and Agents SDK as the primary application-layer stack, Claude Agent SDK selectively for specialist coding or research workers, and keeps broader workflow frameworks on a watchlist rather than making them the foundation.
The real decision this week was to stop pretending framework selection would solve platform weakness. It will not. The next durable step for ESS is still the boring one: one authoritative control plane, one worker contract, one operator surface, and one memory system we can actually trust.
If you are building something similar, the uncomfortable question is not "which agent framework should I adopt?" It is "what part of my system is allowed to lie?" Start there.
If you want help applying these patterns with your team, Elegant Software Solutions runs AI implementation and dev-team training around production agent systems, RAG, and control plane design. Schedule a conversation. And if you are following the rebuild, come back tomorrow: I will keep writing about the parts that broke, not just the parts that sounded smart in planning.