
TL;DR. For most of our first quarter on a multi-agent platform, every agent shared one Anthropic API key. Then a runaway loop in a small agent ate a meaningful slice of the month's budget in an afternoon, and throttling at the provider level took the entire fleet down with it. We ripped out the shared key, issued one scoped key per agent with a hard monthly ceiling, and routed every secret through a single keyvault primitive. The next runaway took out exactly one agent. This is the build-log entry.
For most of our first quarter running a multi-agent platform, every agent on our crew shared a single Anthropic API key: Sparkles, Soundwave, Optimus Prime, Rewind, all of them. It was fast, clean, and exactly the kind of default that felt fine until it didn't. Then a runaway loop in one small agent ate a meaningful slice of our monthly budget in an afternoon, and we ripped the shared key out for good.
This is the build-log entry for that decision: what was wrong, what per-agent keys with hard ceilings actually got us, and what we still don't have a clean answer for.
It was a Tuesday in early March. One of our smaller agents, a notes-summarizer that nobody had touched in weeks, got into a loop. A malformed tool response came back, the agent retried, the retry produced another malformed response, and the prompt grew on every pass because we were appending the failure to context. By the time a human noticed, the agent had burned through a meaningful slice of the month's Anthropic budget in a single afternoon. Every other agent on the platform (Sparkles, Soundwave, Optimus Prime, Rewind) was sharing the same API key. So when we finally rate-throttled at the provider level to stop the bleed, every agent on the platform stopped working at the same time.
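The shape of that loop is easy to reconstruct. Here is a minimal sketch, with hypothetical names (`summarize_notes`, `call_llm`, `run_tool` are illustrative stand-ins, not our actual code), plus the retry cap the original never had:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolResult:
    ok: bool
    value: str = ""
    error: str = ""

MAX_TOOL_RETRIES = 3  # the guard that did not exist in March

def summarize_notes(notes: str,
                    call_llm: Callable[[list[dict]], str],
                    run_tool: Callable[[str], ToolResult]) -> str:
    messages = [{"role": "user", "content": f"Summarize these notes:\n{notes}"}]
    for _ in range(MAX_TOOL_RETRIES):
        reply = call_llm(messages)   # one paid provider call per pass
        result = run_tool(reply)
        if result.ok:
            return result.value
        # The original bug: every failure was appended to context, so the
        # prompt, and therefore the cost of each subsequent call, grew on
        # every retry, and nothing ever stopped the loop.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool error: {result.error}"})
    raise RuntimeError(f"tool failed {MAX_TOOL_RETRIES} times; giving up")
```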
That was the day we ripped out the shared key.
When you stand up a multi-agent platform on a deadline, the first instinct is to drop one provider key into a vault, reference it from every agent, and move on. We did this. It is fast. It is clean from a config standpoint. And it has a property nobody talks about out loud: every agent on the platform shares a single failure surface.
Concretely, with one shared LLM key:

- One runaway agent can spend the entire fleet's budget.
- One provider-level throttle to stop the bleed stalls every agent at once.
- One leaked credential means revoking and re-issuing for the whole platform.
- Spend arrives as a single undifferentiated number, so there is no way to attribute cost to the agent that incurred it.
It is hard to overstate how casually this risk gets accepted. Until early March, we had also been telling ourselves that the upstream provider was a stable dependency we did not need to model defensively. That assumption got noticeably wobblier on March 3, when the Pentagon publicly designated Anthropic a supply-chain risk (the first time the U.S. government has applied that label to an AI vendor), reportedly tied to a dispute over Claude's use in classified analysis. A week later, on March 9, Anthropic sued the Department of Defense in two federal courts, a development also covered as front-page tech news, and the vendor put out its own statement on where it stood. None of that touches our actual API access. But it does change how you think about concentrated dependencies. If a regulator cannot tolerate a single point of exposure to one vendor, your fleet of agents probably should not either.
The shared key was not the cause of either of those events. It was just the thing that made our exposure to any upstream turbulence sit at maximum.
The fix is unglamorous. It is also one of the highest-leverage things we have done all quarter.
Every agent now holds its own scoped provider key with a hard monthly spend ceiling. Secrets load through a single get_secret(agent_name, "llm_api_key") helper, which resolves to a per-agent path in the secret store. Adding a new agent is one config entry plus one key-issuance step. We do not sprinkle credentials through code.

That is the entirety of the architectural change. No new service, no new framework, no fancy proxy layer. Just per-agent identity, per-agent ceilings, and a discipline about how secrets get loaded.
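The helper itself is tiny. Here is a minimal sketch of the keyvault primitive, assuming a generic key-value secret store; the per-agent path convention is the real one, but the backend lookup is an illustrative stand-in for whatever vault you actually run:

```python
import os

# Per-agent path convention in the secret store.
SECRET_PATH = "agents/{agent}/{name}"

def get_secret(agent_name: str, secret_name: str) -> str:
    """Resolve a secret at the agent's own path; there is no shared fallback."""
    path = SECRET_PATH.format(agent=agent_name, name=secret_name)
    # Illustrative backend: an env-var lookup standing in for the real vault
    # client, e.g. agents/sparkles/llm_api_key -> AGENTS__SPARKLES__LLM_API_KEY.
    value = os.environ.get(path.replace("/", "__").replace("-", "_").upper())
    if value is None:
        raise KeyError(f"no secret at {path}")
    return value

# Usage: every agent resolves its own key, nothing is hardcoded.
#   api_key = get_secret("sparkles", "llm_api_key")
```

Resolving on every call rather than caching is what lets a rotated secret land without a restart, which matters for the rotation story below.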
The morning after we shipped this, a different agent did exactly the same kind of thing the notes-summarizer had done two weeks earlier: bad tool response, retry, growing context, retry. This time the agent's own key hit its ceiling around lunchtime, the provider locked it out, the alert fired, and no other agent noticed. Sparkles kept routing. Soundwave kept ingesting mail. Optimus Prime kept doing its thing. We fixed the misbehaving agent, raised its ceiling slightly, and moved on.
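At the call site, "the alert fired, and no other agent noticed" looks roughly like this. The sketch assumes the official anthropic Python SDK for the error types; alert() is a stand-in for our real paging hook:

```python
import anthropic

def alert(message: str) -> None:
    print(f"[ALERT] {message}")  # stand-in for the real paging hook

def guarded_call(client: anthropic.Anthropic, agent_name: str, **request):
    try:
        return client.messages.create(**request)
    except (anthropic.RateLimitError, anthropic.PermissionDeniedError) as exc:
        # The key that hit its ceiling belongs to exactly one agent, so this
        # lockout, and this page, are scoped to that agent alone.
        alert(f"{agent_name}: provider locked this key out ({exc.status_code})")
        raise  # fail this agent loudly; the rest of the fleet is unaffected
```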
The concrete wins:

- A runaway loop now costs at most one agent's monthly ceiling, not the fleet's budget.
- Failures are scoped: one key locks out, one agent stops, everything else keeps running.
- Spend is attributable per agent, which is exactly the telemetry we now tune ceilings against.
- A compromised key gets revoked in one place, without touching any other agent.
This is a build-log, so honesty about the open problems matters.
We now have more than thirty active provider keys across the crew, between LLM providers, image providers, embedding providers, and the long tail of integration agents. Rotating those keys is now its own ops problem. We do not have a fully automated rotation pipeline yet. Today, rotation is a quarterly ritual handled by a human running an internal script and watching for the agents that fail to pick up the new secret on the next call. That is fine at our current size; it will not be fine at three times this size.
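For concreteness, the outline of that ritual. The helpers here (issue_new_key, write_secret, probe_agent) are hypothetical stand-ins for internal code, not something you can import:

```python
from typing import Callable

AGENTS = ["sparkles", "soundwave", "optimus-prime", "rewind"]  # plus the long tail

def rotate_llm_keys(issue_new_key: Callable[[str], str],
                    write_secret: Callable[[str, str], None],
                    probe_agent: Callable[[str], bool]) -> list[str]:
    """Rotate every agent's key; return the agents that failed to pick it up."""
    stragglers = []
    for agent in AGENTS:
        new_key = issue_new_key(agent)                      # provider-side issuance
        write_secret(f"agents/{agent}/llm_api_key", new_key)
        if not probe_agent(agent):                          # did the next call use it?
            stragglers.append(agent)
    return stragglers  # today, a human watches this list by hand
```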
The other thing we have not solved cleanly is per-agent budget tuning. The right ceiling for an agent is "a little more than its real usage." We are still setting those ceilings by gut, then adjusting after the first month of telemetry. We will probably automate that loop later this year.
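That loop is small enough to sketch now. The headroom multiplier and the floor below are guesses, not tuned values:

```python
def suggest_ceiling(monthly_spend_usd: list[float],
                    headroom: float = 1.25,
                    floor_usd: float = 5.0) -> float:
    """Ceiling = a little more than observed usage, never below a floor."""
    peak = max(monthly_spend_usd, default=0.0)
    return max(floor_usd, round(peak * headroom, 2))

# After the first month of telemetry:
#   suggest_ceiling([12.40, 9.10, 14.75])  -> 18.44
```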
If you are running a multi-agent platform and you still have a single shared LLM key referenced by every agent: that key is your blast radius. It is also your single point of credential leakage, your single rate-limit choke, and your single revocation surface. Per-agent keys with per-key ceilings cost you a one-time refactor and a slightly larger ops burden. They buy you a fleet that fails one agent at a time instead of all at once.
We learned that the expensive way. You do not have to.