
Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
I made a pretty unglamorous decision this week: stop chasing orchestration sophistication and go back to platform fundamentals. The short version is simple. If your agent fleet cannot tell the truth about state, cannot fail loudly, and cannot share one authoritative control plane, you do not have an agent platform architecture. You have a pile of hopeful automations.
That is the ESS situation I am rebuilding from. We already have useful agents: Sparkles as the Slack control surface, Concierge for general-purpose work, Soundwave for email, plus other domain-specific pieces. But the fleet got too wide for its maturity. We had framework-shaped conversations while the real problems were split-brain behavior, silent degradation, duplicated logic, and local-machine assumptions.
So while the market rewards heavier orchestration stories, ESS is doing something more boring and, I think, more operationally honest. We are building one platform kernel first: one control plane, one worker contract, one monorepo, one reliability model, and file-based memory under version control. If you read Framework Debates Are Over: Production Engineering Won, this is that thesis made concrete.
TL;DR: We did not need more agents; we needed fewer moving parts, stronger contracts, and clearer operational truth.
The most important diagnosis in the current ESS rebuild is not that any one agent was bad. It is that the overall system expanded faster than its operational backbone. That is a classic agent fleet failure mode: adding capability before hardening authority, observability, and state.
The current-state review on 2026-03-14 called out the same pattern repeatedly. The control plane was not truly authoritative. Health reporting was too optimistic. Slack-facing agents duplicated logic and lost state on restart. High-blast-radius repos lacked meaningful tests. Those are not polish issues. Those are structural issues.
If you are a developer, here is the quotable version: multi-agent sophistication built on a weak runtime is just distributed ambiguity.
Many teams hit this because frameworks make breadth feel cheap. Spinning up a new tool-using worker, channel adapter, or routing graph is usually easier than cleaning up your retry semantics or rewriting your heartbeat contract. The result is impressive demos and lousy operator trust.
Both langgraph and pyautogen remain actively maintained on PyPI, which tells you where ecosystem attention is going. But active ecosystem attention does not erase platform debt. A fast-moving tool landscape is useful for application-layer productivity, but it is not evidence that your control plane design is sound.
What I needed to admit was that our fleet had crossed the line where new agents were creating more uncertainty than leverage. That is why the restart plan explicitly recommends a monorepo development model and a platform kernel first. Not because monorepos are fashionable, but because repo sprawl had become part of the reliability problem.
This is also why I linked our thinking back to We Stopped Building Agents and Restarted the Platform. The stop was the feature. It created the space to decide what the real system boundary should be.
TL;DR: Silent fallback elimination is not a purity test; it is how you prevent split-brain systems from pretending they are healthy.
The ugliest problem in the old ESS setup was fallback behavior from the database-backed control plane to legacy file-based paths. On paper, that sounded resilient. In practice, it created split-brain behavior. One part of the system believed the database was authoritative. Another part quietly kept work moving elsewhere. Operators got neither a clean outage nor a clean success.
That is operational poison.
When a control plane becomes optional, it stops being a control plane. It becomes a best-effort coordination hint. That was one of the most important lessons in the rebuild, and one of the reasons the new platform work is centered on explicit degraded mode instead of hidden continuity.
Here is the anti-pattern in simplified form:
```python
async def enqueue_task(task: Task) -> str:
    try:
        return await db_control_plane.insert_task(task)
    except Exception:
        logger.warning("control plane unavailable, falling back")
        return await file_inbox.write(task)
```

That code feels pragmatic right up until you need auditability, dead-letter handling, operator visibility, consistent retries, or accurate run tracking. Then it becomes a lie.
The replacement pattern is intentionally harsher:
```python
class ControlPlaneUnavailable(RuntimeError):
    pass


async def enqueue_task(task: Task) -> str:
    try:
        return await db_control_plane.insert_task(task)
    except Exception as exc:
        metrics.increment("control_plane.unavailable")
        sentry_sdk.capture_exception(exc)
        raise ControlPlaneUnavailable(
            "Authoritative control plane unavailable; task rejected"
        ) from exc
```

That changes the contract in a few important ways: the failure is counted and reported instead of whispered into a warning log, the caller receives a typed error it must handle, and no shadow queue quietly accumulates work the control plane never saw.
If you want the deeper version of that lesson, Silent Fallbacks Are Lies: Building Explicit Failure Boundaries goes directly at it.
TL;DR: Monorepo development and business names in code reduce ambiguity โ exactly what a rebuilding platform needs.
I am normally tolerant of repo splitting when boundaries are earned. Ours were not. We had duplication without clean ownership, codenames without operational clarity, and enough machine-specific assumptions to make restarts feel like archaeology.
That is why the restart recommendation is blunt: one new canonical project, ess-agent-platform, and keep it together until the control plane, worker contract, and operator surface stabilize.
The monorepo decision is less about Git preference and more about system comprehension. When platform code, worker contracts, adapter code, schemas, probes, and operator tooling live together, you can reason about the whole system as one thing. That matters when you are trying to eliminate hidden branches in behavior.
Here is the rough shape:
```
ess-agent-platform/
  apps/
    operator-console/
    control-plane-api/
    worker-supervisor/
  agents/
    messaging/
    email/
    finance/
  adapters/
    slack/
    email/
    browser/
  packages/
    runtime-contracts/
    observability/
    policy/
    storage/
  docs/
    roadmap/
    adr/
    journal/
```

The other decision that seems small until you run a fleet is business names in code. Human-facing codenames are fine. We still talk about Sparkles and Soundwave. But code, logs, APIs, and file paths need boring names. "email-triage-worker" is better than an internal codename when someone is debugging a failed run at 6:30 in the morning.
Operational clarity improves when naming is literal: the log line, the alert, the API path, and the runbook all point at the same boring, searchable name.
GitHub's own documentation positions monorepos as a practical fit for shared tooling, coordinated changes, and centralized policy enforcement. That is not proof every agent system should be a monorepo forever. It is evidence that, during a stabilization phase, the tradeoff is reasonable.
TL;DR: We rejected orchestration-heavy standardization at the core and chose OpenAI Agents SDK as an app-layer tool, not as the platform itself.
The external market signal right now is clear. LangGraph is widely discussed for durable multi-agent workflows. Microsoft's AutoGen has consolidated around a more integrated story. Cursor keeps pushing autonomous coding with tighter editor behavior. If you want an impressive 2026 demo stack, you have options.
We are still not making one of those the center of ESS.
That is not because they are bad. It is because our primary problem is not graph expressiveness. It is runtime authority. We do not need more orchestration until the base system can answer basic questions reliably: What is running right now? Which worker owns this task? When did we last hear from it? What happened the last time this run failed?
Here is the comparison table that clarified the decision:
| Layer | What we need | Selected approach | Why |
|---|---|---|---|
| Control plane | Durable tasking, runs, heartbeats, audit, DLQ | Build in-house | ESS-specific operating model |
| Operator surface | Human intake, approvals, visibility | Build in-house via Sparkles successor | Operator trust is product-critical |
| Worker contract | Typed I/O, idempotency, retries, failure semantics | Build in-house | Needs one fleet-wide runtime contract |
| Application-layer agent behavior | Tool use, sessions, model interaction | OpenAI Responses API + Agents SDK | Pragmatic foundation without surrendering architecture |
| Specialist coding/research workers | Persistent coding and research tasks | Claude Agent SDK selectively | Useful exception, not the platform center |
| Long-running workflow watchlist | Complex graphs if later required | LangGraph watchlist | Interesting, but not day-one foundation |
| Broad framework abstraction | Generic multi-tool abstraction | Not standardizing on LangChain or CrewAI | Adds abstraction before discipline |
That is the framework versus platform distinction. A framework can help a worker think, call tools, or maintain session semantics. A platform decides authority, visibility, policy, and truth.
OpenAI Agents SDK is probably the most useful reference point here as an architectural benchmark, especially around gateway and control-surface ideas. But we still need the ESS-specific control plane design because we are optimizing for dependable internal business operations, not just agent capability breadth.
TL;DR: If the memory is not in files under version control, it is not reliable enough for a platform rebuild.
This is one of those lessons that sounds obvious only after you get burned by the opposite. The old system depended too much on chat-thread continuity, repo-local tribal knowledge, and whatever I happened to remember from the previous week. That is not memory. That is atmosphere.
So the rebuild adopted a file-based operating model. Durable truth lives in tracked files. Stable truth and temporal truth are separated. Sessions are required to leave breadcrumbs. If something matters to platform behavior, it needs a home in version control.
That gives us a few practical wins: durable truth survives restarts and machine changes, version control gives us history and diffs for free, and a handoff no longer depends on what I happened to remember from the previous week.
A simple example is how I now want a session to end:
```markdown
# Session handoff

## Implemented
- Added typed worker result envelope
- Rejected legacy file inbox fallback in runtime
- Added explicit degraded-mode error path

## Broken
- Heartbeat reconciliation still too optimistic
- Slack adapter still owns too much routing logic

## Next
- Move routing metadata into control plane
- Add synthetic probes for task enqueue and heartbeat freshness
```

It is not glamorous, but it beats fake continuity.
The broader point is this: production agent memory is not a bigger context window; it is a better externalized system of record. That is one reason file-based documentation remains part of the platform and not just project hygiene.
**Why not adopt a durable workflow engine like LangGraph now?** Because the control plane is the thing that decides what "later" even means. LangGraph may become useful for specific durable workflow patterns, but if task ownership, heartbeats, retries, and audit trails are weak, a graph runtime mostly gives you a more elaborate way to hide operational ambiguity.
**What does an authoritative control plane actually mean?** It means there is exactly one system of record for task ingestion, run tracking, heartbeat state, workflow status, and operator-visible history. Workers do not silently switch to alternate state stores when that system is unavailable; they fail explicitly or enter a clearly declared degraded mode.
**Why a monorepo?** Because the platform kernel is still being stabilized. Keeping the operator surface, control plane, worker contracts, adapters, and early agents in one repo makes cross-cutting changes easier, reduces duplicated runtime logic, and improves operational clarity while the core contracts are still moving.
**Do literal business names in code really matter?** Yes, especially once multiple engineers or operators touch the system. Literal names improve logs, alerts, APIs, runbooks, and onboarding. Codenames are memorable for conversation, but business names in code make failures easier to understand under pressure.
**Is ESS standardizing on a single agent framework?** No. ESS rejected making a broad orchestration framework the center of the platform. The current direction uses OpenAI Responses API and Agents SDK as the primary application-layer stack, Claude Agent SDK selectively for specialist coding or research workers, and keeps broader workflow frameworks on a watchlist rather than making them the foundation.
The real decision this week was to stop pretending framework selection would solve platform weakness. It will not. The next durable step for ESS is still the boring one: one authoritative control plane, one worker contract, one operator surface, and one memory system we can actually trust.
If you are building something similar, the uncomfortable question is not "which agent framework should I adopt?" It is "what part of my system is allowed to lie?" Start there.
If you want help applying these patterns with your team, Elegant Software Solutions runs AI implementation and dev-team training around production agent systems, RAG, and control plane design. Schedule a conversation. And if you are following the rebuild, come back tomorrow: I will keep writing about the parts that broke, not just the parts that sounded smart in planning.