
When your agent ecosystem reaches more than a dozen agents and you can't remember which codename maps to which business function, you don't have a naming problem; you have an architecture problem. That realization is what led us to scrap our existing agent monorepo and rebuild from scratch with business-aligned naming, a cleaner project structure, and infrastructure patterns designed for the fleet we're deploying, not the prototype we started with.
This isn't a story about doing it right the first time. It's about hitting the wall where clever codenames, organic folder growth, and "we'll fix it later" decisions compound into a system that fights you every time you onboard a new agent.
TL;DR: Cute internal codenames created a constant cognitive translation layer that slowed down every infrastructure task across secret management, monitoring, and scheduling.
Every engineering team loves codenames. We were no different. Our agents had evocative names that meant something to the people who built them, but meant nothing to anyone looking at a Prometheus dashboard, a 1Password vault, or a launchd job list for the first time.
When you're debugging why an agent's health check is failing at 2 AM, you don't want to mentally translate between three naming layers: the codename in the codebase, the business function it actually serves, and the infrastructure identifier in your monitoring stack.
With four or five agents, this was manageable. Past a dozen, it became a genuine operational hazard. We found ourselves maintaining an informal glossary, a spreadsheet that mapped codenames to functions, which is a classic sign that your naming convention has failed.
| Infrastructure Layer | Problem with Codenames | Fix with Business Names |
|---|---|---|
| 1Password Business | Items like agents/sparkles/... required tribal knowledge to audit | Items like agents/content-writer/... are self-documenting |
| Prometheus Metrics | Dashboards showed codenames new team members couldn't interpret | Labels like agent="blog-publisher" read naturally in alert rules |
| launchd Jobs | Job names like com.ess.sparkles.heartbeat meant nothing in launchctl list | com.ess.content-writer.heartbeat is immediately clear in system logs |
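To make the Prometheus row concrete, here is a hedged sketch of an alert rule keyed on a business-named label. The metric name agent_last_heartbeat_seconds and the thresholds are assumptions we've made for illustration; the agent="blog-publisher" label is the pattern from the table.

```yaml
# Hypothetical Prometheus alerting rule; metric name and thresholds are illustrative
groups:
  - name: agent-fleet-health
    rules:
      - alert: AgentHeartbeatStale
        # agent_last_heartbeat_seconds is an assumed gauge each agent emits
        expr: time() - agent_last_heartbeat_seconds{agent="blog-publisher"} > 300
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "blog-publisher heartbeat is stale"
```

Whoever gets paged knows which business function is failing without consulting a glossary.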
The naming change isn't cosmetic. It's a force multiplier for every operational task: secret rotation, incident response, onboarding a new engineer who needs to understand the fleet.
TL;DR: The naming problem was a symptom of organic growth patterns that couldn't be fixed with find-and-replace; folder structure, config patterns, and the dependency graph all needed a clean break.
The obvious question: why not just rename things in place? We considered it. The problem was deeper than names.
Our original monorepo grew organically. Each agent was added when we needed it, with whatever folder structure made sense at the time. Some had their own config directories, others shared a common config with overrides, and the dependency graph between shared utilities and agent-specific code had become a web of implicit assumptions.
A rename would have preserved all of that structural debt. What we needed was a clean architectural template that every agent, current and future, would conform to.
Every agent in the new monorepo follows an identical skeleton:
```
agents/
  content-writer/
    agent.yaml    # Identity, model config, permissions
    prompts/      # System prompts, versioned
    tools/        # Agent-specific tool definitions
    tests/        # Required; no exceptions
    docs/         # Operational runbook
  blog-publisher/
    agent.yaml
    ...
shared/
  secrets-client/   # 1Password Business integration
  metrics/          # Prometheus instrumentation
  file-manager/     # File-based project management utilities
```

The agent.yaml file is the key innovation. It's a single source of truth for each agent's identity: what model it uses, which 1Password items it can access, what Prometheus metrics it emits, and what launchd schedule it runs on. When we provision infrastructure for a new agent, we read from this file. No more scattered config.
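As a sketch of what that single source of truth can look like, here's a hypothetical agent.yaml. The field names are illustrative rather than a published schema, but they cover the concerns provisioning reads: model, secrets, metrics, and schedule.

```yaml
# agents/content-writer/agent.yaml -- field names are hypothetical
name: content-writer              # business-function name used everywhere downstream
model: claude-sonnet-4-5          # assumed model identifier; swappable in one line
prompt_version: 7                 # assumed field versioning the prompts/ directory
permissions:
  secrets:
    # 1Password secret reference, resolved by the shared secrets-client at boot
    anthropic_api_key: op://Agents/content-writer/anthropic_api_key
monitoring:
  prometheus:
    labels:
      agent: content-writer       # appears directly in dashboards and alert rules
schedule:
  launchd:
    label: com.ess.content-writer.heartbeat
    interval_seconds: 300
```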
A concrete example: where the old code did something like vault kv get secret/agents/sparkles/anthropic, the new shared client reads op read "op://Agents/content-writer/anthropic_api_key", using 1Password's secret reference syntax, declared once in agent.yaml and resolved at boot.
That declarative posture also pays off when models change underneath us. When OpenAI shipped GPT-5.5 "Spud" on April 23, swapping a writer agent over was a one-line edit to the model: field in agent.yaml plus a prompt-version bump; no shell scripts, no environment surgery. Frequent model rotation is a forcing function for declarative agent identity, and we'd rather pay that tax once in YAML than every time a new frontier model lands.
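In practice the swap is a diff you can read at a glance; using the hypothetical field names from the sketch above:

```yaml
# agents/content-writer/agent.yaml -- the only lines that change
model: gpt-5.5-spud    # previously the old default writer model
prompt_version: 8      # bumped so prompts are re-reviewed against the new model
```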
TL;DR: Our agents coordinate through structured files in shared directories rather than message queues, and the restructure makes the pattern dramatically cleaner.
Instead of wiring agents together through message queues or a shared database, our agents coordinate work by reading and writing structured files: project briefs, status updates, handoff documents.
Files are inspectable. You can cat a project brief and see exactly what one agent handed to another. There's no broker to crash, no queue to back up, no schema migration when the handoff format changes. Version control gives you a full audit trail for free.
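A handoff document can be as plain as this (a hypothetical brief; the exact schema is whatever the receiving agent declares):

```yaml
# projects/q2-launch/handoff/content-writer-to-blog-publisher.yaml (hypothetical path)
from: content-writer
to: blog-publisher
status: ready-for-publish
artifacts:
  - drafts/q2-launch-announcement.md
notes: "Draft approved in review pass 2; publish after the pricing page ships."
```

A git log on that file tells you who handed off what, and when.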
The restructure standardized how agents declare their file interfaces. Each agent.yaml specifies inputs (file types and directories the agent watches), outputs (files it produces and where they land), and the schema for inter-agent handoff documents. It's contract-first design, where the contracts are file schemas instead of OpenAPI specs.
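The declaration itself might look like this, again with hypothetical field names:

```yaml
# agents/blog-publisher/agent.yaml -- hypothetical file-interface declaration
interfaces:
  inputs:
    - watch: projects/*/handoff/*-to-blog-publisher.yaml
      schema: schemas/handoff-v1.yaml   # assumed schema file for handoff documents
  outputs:
    - path: published/
      produces: published-post-manifest
```

Validating inbound files against the declared schema can turn a malformed handoff into a loud failure instead of a silent misroute.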
TL;DR: OpenClaw's plugin-and-orchestrator model and our business-naming, file-handoff model are optimizing for different workloads, not competing on the same axis.
OpenClaw's growth on GitHub has spotlighted multi-agent orchestration patterns. Their approach, a central orchestrator coordinating dozens of app integrations through a plugin system, shares DNA with what we're building, but the design priorities differ.
| Dimension | OpenClaw | Mac Mini Fleet (this approach) |
|---|---|---|
| Scale model | Many integrations, single orchestrator | Many specialized agents, shared infrastructure |
| Naming | App/plugin names from third parties | Business-function names we control |
| Coordination | Central message passing | File-based handoffs with audit trails |
| Deployment | Cloud-native containers | Mac mini cluster with launchd scheduling |
| Monitoring | Built-in dashboard | Prometheus + Grafana per-agent |
OpenClaw optimizes for breadth of integration. We optimize for depth of autonomy: each agent in our fleet operates independently, owns its work products, and coordinates through explicit file contracts rather than a central bus. Neither approach is universally better; OpenClaw's plugin model is built for connecting external services, while ours is built for sustained, multi-step creative and operational work on our own hardware.
TL;DR: Business naming actually improves security posture by making access policies, metric permissions, and job schedules auditable by anyone, not just the engineer who picked the codenames.
There's a counterintuitive security argument here. Codenames feel more secure because they obscure function. But security through obscurity in naming is worthless compared to the operational security gained from auditable infrastructure.
When every 1Password access policy, every Prometheus alert rule, and every launchd job is named after its business function, security reviews become dramatically faster. You can look at an agent's secret-access scope and immediately assess whether it has permissions it shouldn't.
For the incoming Mac mini cluster, this matters. Each physical machine will run a subset of agents, and the ability to quickly audit "which agents run on which hardware with what permissions" is a prerequisite for responsible fleet management.
TL;DR: The restructure is built for multi-machine deployment today and is deliberately portable to the deployment targets we expect tomorrow.
The timing of this rebuild isn't accidental. We're preparing for a Mac mini cluster that will distribute agents across multiple physical machines. The old monorepo would have made multi-machine deployment a nightmare of per-machine overrides and manual coordination.
The new structure supports a clean model: each machine gets a manifest declaring which agents it runs, and the provisioning system reads each agent's agent.yaml to configure 1Password access, Prometheus scrape targets, and launchd schedules automatically. Add a new agent to a machine? Add one line to the manifest. The infrastructure follows.
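A machine manifest under this model can stay as small as a list; the format here is a hypothetical sketch:

```yaml
# machines/mini-03/manifest.yaml -- hypothetical per-machine manifest
machine: mini-03
agents:
  - content-writer
  - blog-publisher
# Provisioning reads each listed agent's agent.yaml to configure
# 1Password access, Prometheus scrape targets, and launchd jobs.
```

Adding an agent to mini-03 really is one line; everything else derives from agent.yaml.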
Designing for the next deployment target, not the current one, is the whole point. NVIDIA's April 14 launch of the Ising open quantum AI models alongside NVQLink is a reminder that the substrate beneath agent fleets is not stable: the next-next deployment target may not be a Mac mini or a Linux container at all. The agent.yaml declaration pattern is deployment-target agnostic, which is the only honest way to plan for hardware we can't yet buy.
That single declarative file (agent.yaml) is the foundation that makes automated provisioning, and frequent model swaps, survivable.

Next in the series: the actual Mac mini cluster provisioning, covering how the agent.yaml pattern automates 1Password secret distribution, Prometheus target registration, and launchd job installation across multiple machines.