
🤖 Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
Last week I had the slightly uncomfortable realization that our diy-agent-fleet is no longer competing with toy demos. It is competing with real production-ai-agents frameworks that matured fast: CrewAI added crewai-streaming tool calls in January 2026, Microsoft is consolidating AutoGen into a unified microsoft-agent-framework targeting general availability in Q1 2026, and LangGraph keeps showing up in serious enterprise builds. The short answer: I do not think we should throw away our bare-metal-agents stack. I do think the frameworks have caught up in the places that actually matter in production: memory, checkpoints, observability, and deterministic control.
So my current verdict is simple. Keep the core fleet. Do not rush into a rewrite. But steal aggressively from the best ideas in CrewAI, LangGraph, and Microsoft’s stack. Sparkles, Concierge, and Soundwave already prove that launchd orchestration on our Mac mini fleet works. What we’re missing is less about raw capability and more about production hardening: better state management, human-in-the-loop gates, and a first-class execution graph.
This post is my honest agent-framework-comparison after building the thing the hard way.
TL;DR: Bare-metal-agents still win on control, debuggability, cost visibility, and the ability to shape execution around your actual business instead of a framework’s assumptions.
The reason we built this stack ourselves was not ideology. It was irritation. I wanted Sparkles, Concierge, and Soundwave to behave like software systems, not prompt soup with a mascot. So we went with a simple architecture: Python services, launchd-managed processes on Mac minis, a queue-driven handoff pattern, explicit tool wrappers, and logs I can actually trace without opening five dashboards.
That decision has aged better than I expected.
In our stack, a tool call is just code. Sparkles receives a Slack request, Concierge decides whether it needs retrieval or execution, and the tool layer runs a sanitized adapter with schema validation around inputs and outputs. There is no mystery hidden inside an opaque orchestrator.
A simplified version looks like this:
```python
from pydantic import BaseModel, Field

class SearchDocsInput(BaseModel):
    query: str = Field(min_length=3)
    top_k: int = Field(default=5, ge=1, le=20)

class SearchDocsOutput(BaseModel):
    snippets: list[str]

async def search_docs_tool(args: SearchDocsInput) -> SearchDocsOutput:
    # `retriever` is the app's retrieval client, wired up elsewhere
    results = await retriever.search(args.query, k=args.top_k)
    return SearchDocsOutput(snippets=[r.text for r in results])
```

That is boring in the best possible way. When it fails, I know where to look.
I would not call launchd elegant. I would call it stubborn. And in production, stubborn is underrated. If a worker dies, it comes back. If a queue consumer wedges, I can inspect it like any other daemonized service.
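As a sketch of what "stubborn" looks like in practice, here is a minimal launchd job definition. The label, paths, and log locations are illustrative, not our real fleet config; the point is that `KeepAlive` alone buys you automatic restarts:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.soundwave-worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/python3</string>
    <string>/opt/agents/soundwave/worker.py</string>
  </array>
  <!-- Restart the worker whenever it exits, cleanly or not -->
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/var/log/agents/soundwave.out.log</string>
  <key>StandardErrorPath</key>
  <string>/var/log/agents/soundwave.err.log</string>
</dict>
</plist>
```

Because launchd treats the worker like any other daemon, `launchctl list` and plain log files are the whole observability story at the process level.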
That matters more now that the fleet is scaling across the hardware I described in Mac Mini Fleet Upgrade for AI Agent Hardware. Fancy orchestration is great right up until you need to explain a cascading failure at 2:13 a.m.
When you run bare-metal-agents, you see the real cost structure. Model calls are still the expensive part, but orchestration overhead stays low and predictable. You also get flexibility to optimize local preprocessing, embeddings, retries, and background jobs close to the metal.
According to the Stack Overflow Developer Survey 2024, Docker and Python remain among the most common tools in professional developer workflows, which matters because boring infrastructure usually wins in maintainability. According to the 2024 CNCF Annual Survey, containers and platform standardization continue to dominate production operations, reinforcing the same point: teams want repeatable control, not magical abstractions.
The frameworks are getting better, but our core instinct was right: production-ai-agents need software engineering discipline first and framework convenience second.
TL;DR: CrewAI, LangGraph, and Microsoft’s unified agent stack are ahead in stateful orchestration, guardrails, and enterprise-ready operational patterns.
This is the part where I stop congratulating myself.
I think we got lulled into believing that because our agents worked, our architecture was mature. Those are not the same thing. What 2026 frameworks have done well is package a bunch of painful lessons into defaults.
CrewAI’s January 2026 crewai-streaming release matters because tool execution no longer feels like a silent hang between “thinking” and “done.” For any agent touching external systems, streaming intermediate tool progress is not fluff. It is operational feedback.
Soundwave is the clearest example. When it triages a mailbox, fetches context, drafts a response, and waits for approval, the worst part today is the dead air during external calls. We can replicate streaming ourselves, but CrewAI made it a first-class concept instead of an afterthought.
If you read our piece on State Management: Why Chatbots Forget (And How to Fix It), you already know my bias: vector memory is not state. LangGraph’s big advantage in langgraph-enterprise settings is that it treats workflows like graphs with persisted transitions, not just conversational loops.
That is exactly where our current stack is weak. Concierge can recover context through retrieval, but recovery is not the same as resumability. If a workflow pauses at “await human review,” state should be a durable machine-readable object, not a reconstructed guess from logs and embeddings.
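To make "durable machine-readable object" concrete, here is a minimal sketch of what a pause-and-resume checkpoint could look like in a Python stack like ours. The `WorkflowCheckpoint` name, field set, and JSON-file persistence are illustrative assumptions, not our actual schema:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class WorkflowCheckpoint:
    workflow_id: str
    step: str      # e.g. "await_human_review"
    payload: dict  # everything needed to resume; no log scraping required

    def save(self, directory: Path) -> Path:
        # One file per workflow: trivially inspectable, trivially durable
        path = directory / f"{self.workflow_id}.json"
        path.write_text(json.dumps(asdict(self)))
        return path

    @classmethod
    def load(cls, directory: Path, workflow_id: str) -> "WorkflowCheckpoint":
        data = json.loads((directory / f"{workflow_id}.json").read_text())
        return cls(**data)
```

Resuming after a restart becomes `WorkflowCheckpoint.load(...)` instead of a reconstructed guess from logs and embeddings.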
Microsoft’s move to merge AutoGen into a more unified microsoft-agent-framework is not interesting because of branding. It is interesting because enterprise buyers hate fragmented ecosystems. A unified framework with stronger identity, governance, and integration stories will appeal to IT leaders who already live in Microsoft infrastructure.
If your environment is deeply tied to Azure, Entra, Microsoft 365, and enterprise compliance controls, that matters. We do not have to love it for it to be a real market force.
Here’s the comparison I wish I had six months ago:
| Capability | Our DIY fleet | CrewAI 2026 | LangGraph enterprise patterns | Microsoft agent framework |
|---|---|---|---|---|
| Tool calling | Explicit wrappers, fully customizable | Strong, now improved with streaming | Strong when embedded in graph nodes | Strong for enterprise integration scenarios |
| Stateful workflows | Partial, mostly app-managed | Moderate | Excellent, graph-first | Likely strong for governed enterprise flows |
| Human checkpoints | Ad hoc but workable | Supported through workflow design | Excellent fit | Strong potential in enterprise review chains |
| Debuggability | Very high if you own the stack | Good, framework-dependent | Good, especially for graph inspection | Good, likely best in Microsoft-heavy shops |
| Infrastructure control | Highest | Moderate | Moderate | Lowest if you want full bare-metal control |
| Migration cost | None | Medium | Medium to high | High if your stack is not Microsoft-centric |
A real stat worth noting: LangChain reported in 2024 that LangGraph was being adopted for production use cases requiring durable execution and controllable agent state. Separately, Microsoft’s public documentation and roadmap signals around AutoGen’s consolidation show the market is converging on fewer, more governed agent abstractions rather than more experimental ones.
The frameworks are not winning because they are smarter. They are winning because they are codifying boring production lessons.
TL;DR: Our agents are useful, but compared to modern frameworks they rely too much on custom glue for memory, retries, and recovery.
Let me be unfair to my own system for a minute.
Sparkles works because Slack interactions are naturally bounded. A user asks for something, Sparkles routes it, and we send back a result. But when requests span multiple tools, the execution trace is still too hand-built. We log the path, but we do not expose a first-class workflow state model.
That is manageable at small scale. It gets ugly when concurrency rises and someone asks, “Why did this branch execute before approval arrived?”
Concierge is our Swiss Army knife, which is another way of saying it is where sloppiness hides. It can retrieve docs, call internal services, chain tasks, and hand off to other agents. That flexibility is useful, but it also creates too many implicit transitions.
The anti-pattern is simple: if an agent can do everything, it starts deciding too much. Frameworks like LangGraph force you to externalize transitions as nodes and edges. We let too much live inside prompt logic and Python branching.
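As a sketch of what "externalize transitions" means without adopting LangGraph wholesale, here is the shape of a tiny graph runner. The node names and dict-based state are illustrative stand-ins, not real Concierge logic:

```python
from typing import Callable, Optional

State = dict  # workflow state flows through named nodes

def retrieve(state: State) -> State:
    return {**state, "context": ["doc snippet"]}  # stand-in for real retrieval

def draft(state: State) -> State:
    return {**state, "draft": f"Reply using {len(state['context'])} snippets"}

def await_approval(state: State) -> State:
    return {**state, "status": "pending_review"}

# Transitions are data you can inspect, not branching buried in prompt logic
NODES: dict[str, Callable[[State], State]] = {
    "retrieve": retrieve,
    "draft": draft,
    "await_approval": await_approval,
}
EDGES: dict[str, Optional[str]] = {
    "retrieve": "draft",
    "draft": "await_approval",
    "await_approval": None,  # terminal: wait for a human
}

def run(start: str, state: State) -> State:
    node: Optional[str] = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node]
    return state
```

The payoff is that "why did this branch execute before approval arrived?" becomes a question you answer by reading `EDGES`, not by replaying prompts.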
This is the same architectural pressure I talked about in Designing Agent Workflows: Architecture for AI Automation. The more business-critical the workflow becomes, the less you want invisible agent reasoning controlling state transitions.
Soundwave handles email workflows, and email is where edge cases breed like fruit flies. Message threading, missing context, attachment handling, draft approvals, and retryable delivery failures all benefit from explicit state machines.
If I were going to pilot a framework migration, Soundwave would be first. Not because it is broken, but because its workflow shape already looks like a graph:
```python
from typing import TypedDict

class EmailJobState(TypedDict):
    message_id: str
    thread_id: str
    retrieved_context: list[str]
    draft: str | None
    approval_status: str
    send_status: str
    error: str | None
```

That data model should exist independently of the model prompt. Right now, too much of it is scattered across task records, retrieval artifacts, and callback handlers.
Our retries are sensible, but they are not systematic enough. Some tools back off cleanly. Some recover by replaying the task. Some just fail loudly and need human babysitting.
That is not a framework problem; that is an architecture maturity problem. It also connects directly to Debugging AI Agents: Monitoring and Observability Guide, because production-ai-agents fail in ways normal applications do not. You need traceable intent, not just stack traces.
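One way to make retries systematic instead of per-tool folklore is a single shared backoff policy applied at the tool boundary. This is a sketch under stated assumptions, not our production code: the exception tuple, jitter-free exponential backoff, and `max_attempts` default are all placeholders:

```python
import time
from functools import wraps

def retryable(max_attempts: int = 3, base_delay: float = 0.5,
              retry_on: tuple = (TimeoutError, ConnectionError)):
    """Retry a tool call with exponential backoff; re-raise anything else."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # loud failure, but only after the policy ran
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Every tool inheriting one decorator means one place to tune, one place to log, and no more guessing which tools back off cleanly.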
TL;DR: We should keep the diy-agent-fleet core, add graph-based orchestration for selected workflows, and avoid a full rewrite unless a framework clearly reduces operational pain.
I think there are three realistic options.
Option one: rewrite everything onto a framework. I do not recommend this right now. Rewriting a working fleet into a framework usually feels clean in architecture diagrams and terrible in month three. You pay migration cost before you collect reliability gains.
Option two: stay fully custom. Also not ideal. This path protects our control, but it risks rebuilding every lesson the ecosystem is now standardizing. That is engineer catnip and business debt.
Option three: hybridize. This is my current recommendation. Keep our transport, process supervision, hardware layout, and tool contracts. Introduce a graph runtime where the workflow complexity justifies it.
The three pillars of production RAG agents are durable state, deterministic guardrails, and observable tool execution. That is the line I keep coming back to.
A practical migration plan follows from that: pilot a graph runtime on one workflow, keep transport and supervision exactly as they are, and expand only if recovery and debugging measurably improve.
According to GitHub’s 2024 Octoverse reporting, Python remains one of the most used languages on the platform, which supports staying close to a Python-first control plane. And as Elegant Software Solutions has seen in our own builds, the most expensive failures are rarely model failures alone; they are coordination failures between tools, state, and humans.
That also lines up with where our broader stack is headed in From Agent Fleet to Software Factory: Building the Prompt-to-Production Pipeline. The future is not one giant super-agent. It is a disciplined system of smaller agents with explicit contracts.
TL;DR: No rewrite, no hype cycle panic; just targeted hardening where frameworks have exposed our weak spots.
Here is the concrete to-do list I’m carrying into next week:
Concierge and Soundwave both need durable typed state objects that survive restarts, handoffs, and approvals.
Even if we do not adopt CrewAI directly, crewai-streaming highlighted the usability gap. Users and operators should see retrieval, tool start, tool finish, approval wait, and retry events in real time.
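Those lifecycle events can be modeled as one small emitter that any sink (Slack, logs, SSE) can consume. The `EventStream` name and the JSON-lines format are illustrative assumptions, not a CrewAI API:

```python
import json
import time
from typing import Callable

# The lifecycle we want users and operators to see, in order of appearance
EVENTS = ("retrieval", "tool_start", "tool_finish", "approval_wait", "retry")

class EventStream:
    """Publish structured lifecycle events to any sink callable."""

    def __init__(self, sink: Callable[[str], None]):
        self.sink = sink

    def emit(self, event: str, **fields) -> None:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        record = {"event": event, "ts": time.time(), **fields}
        self.sink(json.dumps(record))
```

Wrapping each tool call in `emit("tool_start", ...)` / `emit("tool_finish", ...)` is what turns dead air during external calls into visible progress.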
For code, send paths through a reviewer stage similar to the pattern we described in The 'Reviewer Pattern': Automated QA for Agent Code. For communications and external actions, require explicit approval thresholds.
Embeddings are retrieval aids. State is workflow truth. We knew this intellectually; now it needs to be enforced structurally.
I am leaning toward a LangGraph pilot for Soundwave because email workflows map naturally to graph execution. If it reduces failure handling and makes recovery more legible, great. If not, we keep the lesson and move on.
Should you rewrite a working agent stack onto a framework? Not by default. If your current system works and your team can debug it, a full rewrite is usually a bad trade. The better move is to identify where your pain lives—state, observability, approvals, retries—and adopt framework ideas or components only where they reduce operational friction.
What is the biggest weakness of DIY agent stacks? Usually not model quality; it is implicit state. Teams often mistake vector retrieval for durable memory, then discover their agents cannot reliably resume, pause, or recover long-running workflows. Production systems need explicit state transitions and auditable checkpoints.
Where does streaming tool execution actually help? Anywhere tool latency creates ambiguity. If an agent is searching documents, calling APIs, waiting on approval, or retrying a failing action, streaming intermediate events gives both users and operators a clear picture of progress. That reduces support burden and makes debugging faster.
Is LangGraph worth adopting over a custom orchestrator? For many stateful workflows, yes. LangGraph is especially strong when you need durable execution, resumability, branching logic, and human-in-the-loop checkpoints. A custom orchestrator still makes sense when you need extreme control over infrastructure and execution semantics.
When does Microsoft's unified agent framework make sense? Mostly in organizations already committed to Microsoft infrastructure and governance patterns. If identity, compliance, Microsoft 365 integration, and Azure-native operations dominate your requirements, the unified Microsoft direction becomes much more attractive than a standalone custom stack.
So that’s where I’ve landed after this round of agent-framework-comparison: our bare-metal-agents stack still deserves to exist, but it no longer gets a free pass just because we built it ourselves. CrewAI, LangGraph, and the emerging microsoft-agent-framework are forcing a healthier standard for production-ai-agents, and honestly, that is good for all of us.
We’re going to keep the fleet, add stronger state and checkpointing, and test graph orchestration where the workflow shape demands it. If that works, great. If it doesn’t, we’ll know exactly why because we still control the substrate.
If your team is building something similar and wants a second set of eyes on architecture, observability, or workflow design, ESS helps development teams implement production-ready AI systems. You can schedule a conversation here: https://www.elegantsoftwaresolutions.com/schedule. Otherwise, follow along tomorrow—I’ll probably break Soundwave in the name of progress.