
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
We had twelve repos, a Slack bot masquerading as a control plane, and health checks that routinely overstated reality. So we stopped building specialist agents and built a platform kernel first: the minimum infrastructure required to make future agents observable, governable, and dependable. In practice, that meant central task intake, authoritative run tracking, explicit failure handling, and a file-based operating model that treats version-controlled documents as the system of record.
If you've been following along, you know we already covered why we consolidated twelve repos into one and why our health reporting couldn't be trusted. This entry covers what came next: the first capabilities we built inside ess-agent-platform, why the kernel came before any specialist agent, and what changed once we made the control plane authoritative.
The short version: the platform kernel is nine capabilities that every future agent depends on. Not a framework. Not an SDK wrapper. A control plane that is actually authoritative, a worker contract that is actually enforced, and a documentation model that treats files as the only durable memory.
TL;DR: We stopped adding agents because the real problem was missing platform discipline, not missing agent capabilities.
This is the mistake we made for months, and it's common across early agent projects. You get one LLM call working. You wrap it in a script. You schedule it with cron or a local service manager. You wire it into Slack. You call it an agent. Then you build another one.
Soon you have a fleet of services that each work in isolation but share no runtime contract, no heartbeat model, no dead-letter queue, and no unified operator view. The result is capability sprawl without operational discipline.
That old pattern, repeated across twelve repos, produced a system where Sparkles could report everything healthy while Soundwave was silently dropping emails, and the orchestrator's fallback to a legacy file inbox created split-brain behavior. The problem was never "we need more agents." The problem was "we don't have a platform."
The kernel-first approach inverts that sequence. No new specialist agent gets built until the platform can answer basic operational questions with evidence: Is this agent alive? Did this task complete? Where did failed work go? Who approved this action?
TL;DR: Before we add specialist agents, the monorepo needs authoritative task intake, run tracking, heartbeats, failure handling, retries, alerts, session state, typed contracts, and synthetic probes.
Here's what lives in the monorepo before any specialist agent code:
| # | Capability | What It Solves | Status |
|---|---|---|---|
| 1 | Task and Event APIs | Central intake instead of scattered Slack commands | Implemented |
| 2 | Run and Heartbeat Tracking | Actual liveness, not stale "last seen" guesses | Implemented |
| 3 | Dead-Letter Queue | Failed tasks go somewhere visible, not nowhere | Implemented |
| 4 | Retry Model | Explicit retry with backoff, not silent re-execution | Implemented |
| 5 | Alert Model | Failures surface to the operator, not just logs | In progress |
| 6 | Thread and Session Model | Conversation state survives restarts | In progress |
| 7 | Typed Worker Result Contract | Every agent returns the same shape | Implemented |
| 8 | Synthetic Probes | Canary tasks that test the pipeline end to end | Planned |
| 9 | Operator Audit Trail | Who approved what, when, and why | Planned |
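To make the retry model (capability 4) concrete, here is a minimal sketch of explicit retry with exponential backoff and a dead-letter handoff. The function names, attempt limit, and delays are illustrative, not our production code:

```python
import time


def run_with_retry(task, execute, dead_letter, can_retry,
                   max_attempts=3, base_delay=1.0):
    """Explicit retry: bounded attempts, visible backoff, dead-letter on exhaustion.

    Nothing is retried implicitly; every failed attempt either retries with a
    doubled delay or lands in the dead-letter queue with a recorded reason.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(task)
        except Exception as exc:
            if attempt == max_attempts or not can_retry(str(exc)):
                # Out of attempts or non-retryable: failed work goes somewhere visible.
                dead_letter(task, reason=str(exc), attempts=attempt)
                return None
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```

The important property is not the backoff curve; it's that exhaustion has exactly one destination (the dead-letter queue) instead of a silent re-execution loop.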
Every worker in the new system implements the same interface. This is what prevents the "12 repos, 12 different shapes" problem:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import datetime


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    DEAD_LETTER = "dead_letter"


@dataclass
class WorkerResult:
    task_id: str
    status: TaskStatus
    output: Optional[dict]
    error: Optional[str]
    retryable: bool
    duration_ms: int
    completed_at: datetime.datetime


class BaseWorker:
    """Every agent worker implements this contract.
    No silent fallbacks. No untyped returns."""

    def execute(self, task: dict) -> WorkerResult:
        raise NotImplementedError

    def heartbeat(self) -> dict:
        """Return current health state."""
        raise NotImplementedError

    def can_retry(self, error: str) -> bool:
        """Explicit retry decision. No implicit retries."""
        return False
```

The key constraint is simple: no silent fallback to hidden storage paths. In the old system, if the control plane was unreachable, the runtime could quietly fall back to a local file inbox. That created split-brain behavior: two sources of truth and no reliable reconciliation path. In the new system, if the control plane is unreachable, the task goes to the dead-letter queue and an alert fires. Degraded mode is explicit, not invisible.
TL;DR: Frameworks can help at the application layer, but they do not replace an authoritative control plane, operator visibility, or explicit failure handling.
The agent framework landscape is moving quickly. OpenAI's Agents tooling continues to evolve. LangGraph is widely used for long-running workflows and stateful orchestration. CrewAI remains popular for multi-agent demos and role-based coordination. Microsoft has also continued consolidating its agent tooling around broader AI platform offerings.
We evaluated the major options. Our conclusion was straightforward: none of them solved the operational problems that were actually breaking our system.
| Framework | What It's Good At | What It Doesn't Solve For Us |
|---|---|---|
| OpenAI Agents SDK | Tool calling, structured outputs, session-oriented app patterns | Not a control plane; no built-in dead-letter queue or heartbeat authority |
| LangGraph | Durable workflows, graph-based orchestration, state transitions | Useful orchestration layer, but it doesn't replace runtime governance |
| CrewAI | Multi-agent coordination patterns and role-based tasking | Doesn't solve observability, operator controls, or silent degradation |
| Microsoft agent tooling | Enterprise integration and broad ecosystem support | Still not a substitute for our own control plane and reliability model |
Our decision, documented in an ADR, was to adopt the OpenAI Responses API and related SDK tooling as application-layer building blocks while building our own platform kernel on top. We use Claude-based tooling selectively for specialist coding and research workflows. We also borrow architectural ideas from gateway-style patterns used elsewhere. But the control plane, operator surface, worker contract, and reliability model are ours.
As I wrote in framework debates versus production engineering, the real question is not which framework you pick first. It's whether you have an authoritative control plane, honest health reporting, and explicit degraded modes. No framework gives you those for free.
TL;DR: If a decision, lesson, or handoff is not written to a tracked file, we do not treat it as durable knowledge.
This sounds mundane, but it has been one of the highest-leverage changes we've made. Our file-based documentation model is not about tidiness. It's about making the rebuild survivable across sessions, engineers, and limited-context tools.
The old system depended on chat history and individual memory: decisions buried in Slack threads, context that left when an engineer did.
The monorepo enforces a different model. Here's the current directory structure:
```
ess-agent-platform/
├── docs/
│   ├── roadmap/
│   │   ├── 01-VISION.md
│   │   ├── 02-CURRENT-STATE.md
│   │   └── 03-TARGET-ARCH.md
│   ├── decisions/
│   │   ├── ADR-001-monorepo.md
│   │   ├── ADR-002-no-silent-fallback.md
│   │   └── ADR-003-openai-sdk-adoption.md
│   ├── lessons/
│   │   ├── 2026-03-14-health-checks-lied.md
│   │   └── 2026-03-14-split-brain-fallback.md
│   └── sessions/
│       └── 2026-03-14-session-01.md
├── platform/
│   ├── control_plane/
│   ├── worker_runtime/
│   └── operator_surface/
└── agents/
    └── (empty until kernel is stable)
```

Two rules make this work.
Stable truth vs. temporal truth. 01-VISION.md has a stable filename and gets updated in place. Session notes have dated filenames and accumulate over time. You always know where to find the current canonical answer versus the historical record.
Business names in code, codenames for humans. The directory is control_plane/, not a nickname. The worker contract is BaseWorker, not a joke name. In conversation, codenames are fine. In code, file paths, logs, and APIs, we use business names. That keeps grep results, onboarding, and machine-readable context much cleaner.
We can't honestly count how many times this prevented confusion or re-litigation, so the safer claim is this: the practice has already reduced repeated debates and made handoffs faster.
TL;DR: Heartbeats took multiple iterations, the dead-letter queue exposed failures we had been masking, and session state remains the hardest part.
Honesty section. Here's what did not work on the first try.
Heartbeats were too chatty. Our first iteration had workers sending heartbeats every 5 seconds. Even with a small fleet, that created more noise than signal. We moved to 30-second heartbeats with a 90-second staleness threshold. If a worker misses three consecutive heartbeats, it's marked degraded. Miss five, it's marked dead. The exact thresholds may still change, but the principle is stable: liveness checks should be informative, not noisy.
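The thresholds above translate into a small classification function. A minimal sketch, assuming heartbeats arrive every 30 seconds and liveness is judged purely from the last heartbeat timestamp (the state names and exact cutoffs are illustrative and, as noted, may still change):

```python
HEARTBEAT_INTERVAL_S = 30  # workers send a heartbeat every 30 seconds


def classify_liveness(last_heartbeat_s, now_s):
    """Map missed heartbeats to a liveness state.

    Fewer than 3 missed -> healthy; 3-4 missed (the 90-second staleness
    threshold) -> degraded; 5 or more missed -> dead.
    """
    missed = int((now_s - last_heartbeat_s) // HEARTBEAT_INTERVAL_S)
    if missed >= 5:
        return "dead"
    if missed >= 3:
        return "degraded"
    return "healthy"
```

Keeping the classification in one pure function makes the thresholds easy to tune later without touching the heartbeat transport itself.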
The dead-letter queue exposed hidden bugs. As soon as failed tasks had a visible destination, we started seeing work items we had not realized were failing intermittently. In the old system, some tasks were retried until they eventually succeeded, which masked downstream instability. Now those failures are visible. That's the point.
Session state is still the hardest problem. Thread and session handling is still in progress, and that label is generous. Conversation state that survives restarts, persists across channels, and stays cost-efficient is genuinely difficult. We're currently using Supabase for session storage with TTL-based cleanup, but the schema is still evolving.
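To illustrate TTL-based cleanup independent of the Supabase specifics (which, as noted, are still evolving), here is a toy in-memory version. The `Session` record shape and TTL value are guesses for illustration, not our schema:

```python
import time
from dataclasses import dataclass, field

SESSION_TTL_S = 24 * 3600  # illustrative TTL, not our production value


@dataclass
class Session:
    thread_id: str
    messages: list = field(default_factory=list)
    last_active_s: float = field(default_factory=time.time)


def expire_sessions(sessions, now_s, ttl_s=SESSION_TTL_S):
    """Drop sessions idle longer than the TTL; return the survivors.

    In a real store this would be a deletion query, but the rule is the same:
    conversation state persists across restarts until the TTL expires it.
    """
    return {tid: s for tid, s in sessions.items()
            if now_s - s.last_active_s <= ttl_s}
```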
Why didn't a framework solve this? Because our failures were not primarily orchestration failures. They were failures of control plane authority, health reporting, and explicit degradation handling. LangGraph and CrewAI can help structure workflows, but they do not give us an authoritative operator model or a reliability boundary by themselves.
The monorepo holds the platform kernel first: control plane, worker runtime, operator surface, and shared documentation. Specialist agents live alongside that kernel only after the runtime contract is stable. That keeps patterns consistent and makes cross-cutting changes easier to enforce.
Every decision, lesson, session handoff, and current-state update goes into version-controlled markdown. Stable documents are updated in place; temporal records accumulate with dated filenames. That makes context transfer deterministic and reviewable in a way chat history is not.
Directories, APIs, log keys, and file paths use descriptive names such as control_plane/, worker_runtime/, and BaseWorker. Human conversation can still use codenames. The separation keeps operational artifacts understandable to new engineers and to tools that rely on literal naming.
Every worker returns a WorkerResult with explicit status, output, error information, and retryability. That forces failures into a known shape. Combined with dead-letter handling and alerts, it removes the hidden paths where work can fail without leaving an operational trace.
The first thing we built in the monorepo was not a flashy specialist agent. It was the platform kernel those agents will depend on: authoritative task intake, honest liveness, explicit failure paths, and durable documentation. That work is less exciting than a demo, but it's what makes future demos trustworthy.
Next up: synthetic probes, the canary tasks that test the pipeline end to end before real work hits it.
If you're rebuilding an agent platform and want a second set of eyes on control plane design, worker contracts, or operational reliability, talk to Elegant Software Solutions. These patterns show up earlier than most teams expect, and fixing them early is much cheaper than untangling them later.