
🤖 Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
The short version: the industry's move toward the software factory validates the rebuild we started at Elegant Software Solutions. If autonomous development is becoming real, the winning architecture is not "more agents." It is one authoritative control plane, typed worker contracts, durable state, explicit degraded modes, and production reliability that survives restarts, retries, and human handoffs.
This week I spent time comparing what the industry is shipping against what our current fleet actually does. OpenAI's Agents SDK and GPT-5.4 computer-use capabilities make autonomous development more plausible at the application layer. Microsoft's Agent Framework reaching release-candidate status is another reminder that framework churn is still churn. And production testing platforms are getting much more serious about end-to-end reliability — which happens to be the exact part of our old stack that was weakest.
This entry is not "look, we were right." It is more like: the market finally caught up to the thing we learned the painful way. A software factory needs platform discipline first. Otherwise you just automate your chaos faster.
TL;DR: A software factory is not a pile of agents; it is an AI agent platform with authoritative state, repeatable contracts, and operational truth.
The term software factory is getting used for everything from code generation to fully automated delivery pipelines. That muddies the water. For developers, the practical definition is simpler: a software factory is a system that can intake work, route it, execute parts of it autonomously, validate outcomes, and surface exceptions to humans — without losing context.
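That intake-route-execute-validate-escalate loop can be sketched as a small state machine. This is an illustrative sketch only, not our implementation; the names (`TaskState`, `advance`) are hypothetical.

```python
from enum import Enum

class TaskState(Enum):
    INTAKE = "intake"
    ROUTED = "routed"
    EXECUTING = "executing"
    VALIDATING = "validating"
    NEEDS_HUMAN = "needs_human"   # exception surfaced, context preserved
    DONE = "done"

# Legal transitions: the control plane rejects anything else,
# so a run can never silently drift into an undefined state.
TRANSITIONS = {
    TaskState.INTAKE: {TaskState.ROUTED},
    TaskState.ROUTED: {TaskState.EXECUTING},
    TaskState.EXECUTING: {TaskState.VALIDATING, TaskState.NEEDS_HUMAN},
    TaskState.VALIDATING: {TaskState.DONE, TaskState.NEEDS_HUMAN},
    TaskState.NEEDS_HUMAN: {TaskState.ROUTED},  # human re-queues with context intact
}

def advance(current: TaskState, target: TaskState) -> TaskState:
    """Move a task forward, failing loudly on an illegal transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

The point of the sketch is the failure mode: an undefined transition raises instead of being absorbed, which is the difference between a pipeline and a demo.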
That is why our rebuild is centered on the platform kernel instead of specialty agents. The target architecture is boring on purpose:

- One authoritative control plane that owns workflow state and routing
- Typed worker contracts for every task type
- Durable state that survives restarts, retries, and human handoffs
- Explicit degraded modes instead of silent fallbacks
- Thin channel adapters that never become workflow engines
That sounds less exciting than a swarm demo, but it is the foundation autonomous development actually needs. The old pattern at ESS was much less disciplined: add an agent, wire it into Slack or local scheduling, bolt on some logs, and trust fallbacks when infrastructure got weird. It worked often enough to encourage bad behavior.
The hard lesson was that useful is not the same as dependable. Our baseline review on 2026-03-14 found split-brain behavior between database-backed control paths and legacy file inbox fallbacks, shallow health reporting, and silent degradation across telemetry and event publishing. Those are not cosmetic bugs. Those are software-factory killers.
A definitive statement: you do not have autonomous development if your runtime can silently fork reality. You have an expensive demo with side effects.
GitHub's Octoverse reporting has consistently shown Python and JavaScript/TypeScript among the most-used languages on the platform, which matters because those are the ecosystems where agent tooling is moving fastest. Microsoft has publicly reported broad enterprise adoption of GitHub Copilot across millions of developers, a real sign that AI-assisted software delivery is no longer experimental. But assistance is not autonomy. The gap between those two is operations.
For the broader framing, I covered that in Software Factory vs Agent Platform Rebuild. This post is narrower: what changed in 2026 that makes the rebuild direction feel less contrarian and more inevitable.
TL;DR: OpenAI's Agents SDK gives us application-layer agent primitives without taking away control of the control plane.
Our build-vs-buy position is explicit: adopt commodity model and tool infrastructure where it helps, but do not outsource the platform layer. That means we use the OpenAI Responses API and Agents SDK as the primary application-layer stack, while still owning workflow state, routing, approvals, retries, audit trails, and degraded-mode behavior ourselves.
That distinction matters. A lot of agent framework evaluation still collapses two separate questions into one:

1. Which SDK should run the model-facing application layer?
2. Which system is the authority for workflow state, approvals, and operational truth?
For ESS, the answer to the first can be "use OpenAI's stack where it is pragmatic." The answer to the second is always "our platform owns truth."
Here is the pattern I am converging on in the new monorepo:
```typescript
export type WorkerInput = {
  run_id: string;
  task_type: "codegen" | "review" | "research" | "triage";
  payload: Record<string, unknown>;
  session_id?: string;
  approval_required?: boolean;
};

export type WorkerResult = {
  status: "completed" | "failed" | "needs_approval" | "degraded";
  output?: Record<string, unknown>;
  error_code?: string;
  error_detail?: string;
  retryable: boolean;
  events: Array<{ type: string; detail: string }>;
};
```

This is intentionally boring. The model-facing layer can use the OpenAI Agents SDK. The worker runtime still has to produce typed output, heartbeat updates, explicit failures, and idempotent execution. No hidden inboxes. No magical side channels. No local subprocess heroics pretending to be platform behavior.
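To make "idempotent execution" concrete, here is a minimal Python sketch of deduplicating by `run_id`. The names are hypothetical: `WorkerResult` here mirrors the TypeScript contract, and `run_store` is an in-memory stand-in for the durable state the real platform would use.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerResult:
    # Mirrors the TypeScript contract; names are illustrative.
    status: str
    retryable: bool
    output: dict = field(default_factory=dict)

def run_once(run_store: dict, run_id: str, task) -> WorkerResult:
    """Execute a task at most once per run_id.

    Retries and restarts replay the stored result instead of
    re-running side effects: the platform, not the worker,
    decides whether work happens again.
    """
    if run_id in run_store:            # already executed: replay, don't redo
        return run_store[run_id]
    try:
        output = task()
        result = WorkerResult(status="completed", retryable=False, output=output)
    except Exception as exc:           # explicit failure, never swallowed
        result = WorkerResult(status="failed", retryable=True,
                              output={"error_detail": str(exc)})
    run_store[run_id] = result         # durable storage in the real system
    return result
```

The dedupe check is what lets a restarted control plane retry safely: replaying the same `run_id` returns the recorded result instead of firing the side effect twice.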
GPT-5.4 computer-use capabilities also matter here, but mostly as an adapter problem. If a model can interact with a browser or desktop environment more reliably, that strengthens the channel-adapter layer. It does not eliminate the need for run tracking, approvals, and policy gates. In fact, it increases the need, because the blast radius is larger.
That is why I keep coming back to a line from our roadmap: thin adapters should not become ad hoc workflow engines. Computer use is powerful, but it should sit behind the same runtime contract as everything else.
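The "thin adapter" rule can also be shown in a few lines: the adapter translates a channel event into a platform task and does nothing else. This is a hypothetical sketch; `browser_adapter`, the event shape, and `submit_task` are assumptions, not our actual API.

```python
def browser_adapter(event: dict, submit_task) -> dict:
    """Thin adapter: map a computer-use event onto the worker contract.

    No retries, no approvals, no state here -- those belong to the
    control plane. The adapter only translates channel input and
    hands it off.
    """
    task = {
        "run_id": event["run_id"],
        "task_type": "triage",
        "payload": {"action": event["action"], "target": event.get("target")},
        # Larger blast radius => always route through approval.
        "approval_required": True,
    }
    return submit_task(task)
```

If an adapter grows retry loops or approval logic of its own, it has quietly become a second workflow engine, which is exactly the split-brain failure the rebuild exists to kill.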
TL;DR: The Microsoft Agent Framework reaching RC is useful market validation, but it reinforces our choice to avoid making any framework the center of the system.
I do not mean that as a shot at Microsoft. RC status is a meaningful maturity signal. It says the ecosystem is stabilizing enough that serious teams can start evaluating it without assuming the ground will move every week. That is good for everybody.
But it also confirms the deeper point: major vendors are still in the framework-consolidation phase. The wrong move for a mid-market engineering team is to rebuild its operating model every time a new orchestration abstraction lands.
Our best-of-breed tracker:
| Category | ESS direction | Why |
|---|---|---|
| Primary application-layer stack | OpenAI Responses API + Agents SDK | Strong fit for tools, sessions, and pragmatic orchestration |
| Specialist coding workers | Claude Agent SDK | Useful exception for coding and research tasks |
| Durable workflow watchlist | LangGraph | Interesting, but not our day-one foundation |
| Broad abstraction layer | LangChain | Too much abstraction before runtime discipline exists |
| Crew-style orchestration | CrewAI | Does not solve our control-plane and observability gap |
| Architectural reference | OpenClaw | Good ideas to borrow, not our platform |
That table has held up well under recent market movement. If anything, the RC milestone strengthens the argument for platform boundaries. We can integrate frameworks. We should not emotionally merge with them.
What broke for us before was not lack of tools. It was lack of authority. Sparkles acted like a control plane in conversation, but too much of its behavior still depended on local invocation and machine-specific state. That is why When Healthy Means Lying: Rebuilding Agent Trust exists. We learned the ugly version first.
A second definitive statement: framework choice is reversible; silent operational assumptions are not. If your platform truth lives in undocumented fallbacks and repo-local behavior, no SDK will save you.
For developers doing agent framework evaluation, my current filter is simple: can you adopt the framework while still owning workflow state, approvals, retries, audit trails, and degraded-mode behavior yourself, and can you remove it later without rebuilding your operating model?
If not, you are probably buying short-term speed with long-term confusion.
TL;DR: The biggest blocker to autonomous development is not model intelligence; it is the reliability layer around execution, testing, and rollback.
This is where the software factory conversation gets interesting. Everyone wants to talk about code generation, planning agents, or computer use. The less glamorous truth is that production reliability decides whether any of that can touch real systems.
Our current-state review called out failures common in early agent stacks:

- Split-brain behavior between database-backed control paths and legacy file-inbox fallbacks
- Shallow health reporting that says "healthy" without verifying dependencies
- Silent degradation across telemetry and event publishing
That list is a recipe for operator mistrust. Once operators stop believing your health signals, they start building shadow processes around the system. At that point, your autonomous development pipeline is already compromised.
This is also why I pay attention to production testing vendors like ACCELQ. I am not claiming their tooling solves the whole problem, but the broader category shift matters: the market is rewarding systems that verify end-to-end behavior, not just produce impressive demos. The same trend is visible in the growth of evals, agent QA patterns, and workflow-level regression suites.
Google's DORA research has repeatedly shown that software delivery performance depends on reliability practices, feedback loops, and operational excellence more than raw development speed. And pytest remains one of the most widely used testing frameworks in Python because boring, composable testing primitives still win in production. None of that is new. What is new is that AI stacks are rediscovering those truths after a long detour through agent theater.
In our rebuild, the practical response looks like this:
```python
from pydantic import BaseModel

class Heartbeat(BaseModel):
    run_id: str
    worker_name: str
    status: str
    step: str
    degraded: bool = False

def publish_heartbeat(bus, hb: Heartbeat) -> None:
    # No silent failure: publisher exceptions must propagate.
    bus.publish("worker.heartbeat", hb.model_dump())
```

That comment is doing a lot of work. The old system normalized "best effort" telemetry. The rebuild does not. If the platform cannot observe the run truthfully, it should fail loudly enough for an operator to know.
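To show what "fail loudly" means in practice, here is a usage sketch with two toy buses. Both classes are hypothetical illustrations: a healthy bus records the event, and a broken bus raises instead of pretending the heartbeat was delivered.

```python
class InMemoryBus:
    """Toy bus that records published events."""
    def __init__(self):
        self.events = []

    def publish(self, topic: str, body: dict) -> None:
        self.events.append((topic, body))

class BrokenBus:
    """Toy bus whose transport is down: it raises instead of dropping events."""
    def publish(self, topic: str, body: dict) -> None:
        raise ConnectionError("event bus unreachable")

def publish_heartbeat(bus, hb: dict) -> None:
    # Same rule as above: publisher exceptions must propagate.
    bus.publish("worker.heartbeat", hb)
```

With `InMemoryBus` the heartbeat lands in `events`; with `BrokenBus` the `ConnectionError` reaches the operator instead of vanishing into a bare `except`. The contrast is the whole point: the old stack would have caught that exception and kept reporting healthy.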
If this topic is your rabbit hole, Evals or Die: Unit Testing for Stochastic Systems is the adjacent piece to read next.
TL;DR: File-based documentation and business-first naming give autonomous systems the durable context that chat threads and codenames never can.
One of the least flashy decisions in our roadmap may end up being one of the most important: files are the memory system. Not chats. Not half-remembered architecture discussions. Not whatever context happened to be in the last model thread.
The file-based operating model exists because we kept re-learning the same lessons. If a decision was not recorded in a tracked file, it effectively did not exist for the next session, the next engineer, or the next agent. That created fake continuity, weak handoffs, and endless re-litigation.
For a human team, that is annoying. For autonomous development, it is fatal.
A software factory needs durable project memory in forms that both humans and agents can consume: decision records, roadmap files, and architecture notes, all tracked in version control rather than trapped in chat threads.
The same logic drove our naming rule: business names in code, codenames for humans. Sparkles can stay Sparkles in conversation. But code, file paths, logs, APIs, and infrastructure should optimize for operational clarity, not personality.
That sounds trivial until you try to route work automatically across agents, queue handlers, eval jobs, and incident traces. Ambiguous naming creates friction for both humans and machines. Clear naming is infrastructure.
I wrote more about the documentation side in File-Based Agent Platform Documentation That Works. The short version is that a software factory without durable files is just a context leak with a budget.
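A minimal sketch of what "files as memory" can look like from an agent's side: loading tracked decision records into structured context. Everything here is a hypothetical illustration, not our actual schema; it assumes one markdown file per decision, with the first line as the title.

```python
from pathlib import Path

def load_decisions(root: Path) -> list[dict]:
    """Parse tracked decision files into records an agent can consume.

    Assumed format: one markdown file per decision; the first line is
    the title, the rest is the body. Durable, diffable, and identical
    for the next engineer, session, or agent.
    """
    records = []
    for path in sorted(root.glob("*.md")):
        lines = path.read_text().splitlines()
        if not lines:
            continue
        records.append({
            "id": path.stem,
            "title": lines[0].lstrip("# ").strip(),
            "body": "\n".join(lines[1:]).strip(),
        })
    return records
```

The design choice worth noting: the loader reads only what is in version control, so the "memory" an agent sees is exactly the memory a human reviewer sees in a diff.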
Q: What is the difference between a software factory and an AI agent platform?
A: A software factory is the broader operating model for turning requests into shipped changes with validation and escalation. An AI agent platform is the underlying system that makes that possible: control plane, worker runtime, adapters, state, approvals, and observability. You need the platform first or the factory becomes a fragile demo pipeline.
Q: Why standardize on OpenAI's Agents SDK instead of a broader framework?
A: Because we wanted application-layer agent primitives without giving up control of workflow state and reliability behavior. OpenAI's Agents SDK fits our build-vs-buy decisions better than broader abstraction layers that can hide operational truth. We adopt it as a tool layer, not as the authority for the whole platform.
Q: Does computer use eliminate the need for a platform layer?
A: No. Computer use expands what an agent can do in browsers and desktop environments, but it does not replace approvals, run tracking, retries, audit trails, or degraded-mode handling. The more capable the action layer becomes, the more important platform controls become.
Q: Why use files instead of chat history as the system of record?
A: Autonomous systems need durable context that survives sessions, restarts, and personnel changes. Chat history is useful for local reasoning, but it is a poor system of record. Files under version control create stable truth that both engineers and agents can rely on.
Q: What is the biggest mistake teams make when building agent platforms?
A: Letting failures degrade silently while dashboards still imply everything is healthy. That destroys operator trust faster than almost anything else. Once humans stop trusting the platform's health model, they work around it, and the automation loses its authority.
The 2026 shift toward software factories does not make me want to move faster in the reckless sense. It makes me want to get the platform kernel right before we widen the fleet again. That means one monorepo for now, one control plane, one worker contract, explicit degraded mode, and less tolerance for anything that only works because Tom remembers how it works.
The market is validating the direction. But the real work is still local and unglamorous: deleting silent fallbacks, tightening contracts, improving tests, and making Sparkles a real operator surface instead of a clever router with good manners.
If your team is building something similar, I'd love to hear what broke first for you. And if you want help accelerating this kind of platform work inside your dev organization, Elegant Software Solutions runs hands-on AI implementation and dev team AI training engagements. Schedule a conversation here.