
🤖 Ghostwritten by Claude Opus 4.8 · Fact-checked & edited by GPT 5.5
In a multi-agent system, an agent should trust a peer agent’s instructions about as much as it trusts an anonymous email: not at all, until the instruction is cryptographically bound to a known human principal and scoped to a specific task. The common failure mode in 2026 production deployments is the opposite default: orchestrators and specialist agents that grant implicit mutual trust to anything inside the same fleet.
That design turns one compromised, prompt-injected, or hallucinating agent into a path toward broader system access. A low-privilege agent can issue out-of-scope tool calls to a more privileged neighbor, and the neighbor may execute them simply because the request came from inside the agent fleet.
The pattern mirrors microservice architectures before zero-trust adoption: flat internal networks where breaching one service could expose many others. Multi-agent systems recreate that flat topology, except the messages crossing boundaries are natural-language instructions that can manipulate the receiving agent’s reasoning. The fix is not better etiquette between agents. It is explicit trust architecture: capability-scoped tokens, signed task manifests, and sandboxed execution contexts.
TL;DR: Implicit intra-fleet trust collapses every agent’s privileges into a shared blast radius; one compromised agent can reach tools that should have remained out of scope.
Many agent orchestration frameworks still favor convenience: the orchestrator can call any specialist, specialists can call shared tools, and messages between them carry limited provenance. When agent A tells agent B to “summarize this document and email the result to finance,” agent B may have no reliable way to tell whether the email instruction originated from a legitimate human request or from text agent A ingested from an untrusted source.
That gap matters because prompt injection changes the trust calculus. Adversarial text embedded in documents, tickets, chat logs, web pages, or repository files can push an agent to emit instructions that look operationally valid to another agent. In a multi-agent topology, every agent that processes external content becomes a possible injection vector for peers.
The core problem is that agent-to-agent trust is often treated as binary and transitive. If agent A trusts agent B, and B trusts C, then A’s privileges can effectively flow to C through chained tool calls. Security teams would recognize the pattern as lateral movement. In agent systems, it appears as privilege escalation through chained tool calls, where a low-privilege agent reaches a high-privilege capability by routing a request through a neighbor with broader permissions.
The most important architectural distinction is between two kinds of authority:
A secure system permits human-delegated trust to flow downstream, bound to its originating principal, while preventing agent-delegated trust from manufacturing new authority. When agent B receives an instruction, it should evaluate the human principal behind it, not merely the agent that relayed it.
TL;DR: The primary threats are injected peer agents, hallucinated out-of-scope calls, and supply-chain compromise of shared skill registries — all amplified by transitive trust.
A defensible agentic AI threat model for a multi-agent pipeline in 2026 centers on three attack classes:
| Threat | Mechanism | Consequence |
|---|---|---|
| Compromised peer through injection | Adversarial content makes agent A emit malicious instructions to agent B | B executes attacker-chosen tool calls under apparent internal authority |
| Hallucinated escalation | Agent A fabricates a plausible but out-of-scope instruction to a privileged neighbor | Unauthorized action without a malicious actor — model error becomes system action |
| Supply-chain skill compromise | A shared agent skill or tool registry entry is poisoned | Every agent that loads the skill may inherit the compromise |
The hallucination case deserves emphasis because it has no attacker at all. A reasoning model under ambiguity may confidently generate an instruction such as “delete the stale records” and route it to a database agent. If that database agent trusts upstream instructions implicitly, a model error can become a destructive action. AI agent security must therefore treat every agent’s output as untrusted input to downstream components.
Shared skill registries — the reusable tools, prompts, and task modules agents pull from — are high-leverage targets. A poisoned skill that exfiltrates context or silently rewrites tool arguments can compromise every agent that imports it. Agent skill registries inherit familiar software supply-chain risks, with one additional complication: the malicious payload may be natural-language instructions that the model attempts to follow. Pinning skill versions, verifying signatures, and isolating untrusted skills in restricted contexts should be treated as baseline controls, not optional hardening.
TL;DR: Capability tokens replace ambient agent permissions with short-lived, least-privilege grants that downstream agents cannot broaden.
The foundational fix is to stop giving agents standing permissions and instead issue capability tokens: short-lived, narrowly scoped credentials minted for a specific task. A capability token answers a precise question: this agent, for this task, may invoke this tool with these constraints, until this expiry.
This is the agent-world application of zero trust: no implicit authority, verify every request, and grant only the minimum capability needed. Concretely, a capability token should encode:
When an orchestrator delegates a subtask, it should mint a narrower token than its own — never a broader one. This makes privilege escalation structurally harder through normal delegation: a downstream agent cannot grant itself a capability its token does not contain.
# Illustrative capability token claims using placeholder values
capability:
principal: "human:requesting-user-id"
tool: "crm.read_contact"
constraints:
fields: ["name", "email"]
record_scope: "assigned_accounts"
delegable: false
expires_at: "2026-06-22T18:00:00Z"
signature: "<signed-by-authority>"The delegable: false claim is what interrupts transitive trust. An agent holding a non-delegable token can use it, but cannot mint a child token from it. Agent-delegated trust cannot manufacture new reach.
TL;DR: A signed task manifest cryptographically ties a task to its originating human, so downstream agents verify provenance instead of trusting whichever agent relayed the message.
Capability tokens scope what an agent may do. Signed task manifests establish who actually asked.
A task manifest is a structured, cryptographically signed record of the original human-authorized intent, propagated alongside every agent-to-agent call. When agent B receives a request from agent A, it does not trust A’s phrasing by default. It verifies the signed manifest, confirms the requested action falls within the manifest’s declared scope, and rejects anything outside it.
If a prompt-injected agent A appends “and also transfer funds to this account,” that instruction has no corresponding entry in the signed manifest. Agent B should refuse it. The injection cannot mint authority it was never granted.
This mechanism distinguishes human-delegated trust from agent-delegated trust at runtime. The manifest is the chain of custody back to an accountable principal. Without it, every relayed instruction becomes anonymous and therefore unverifiable.
TL;DR: Sandboxing ensures a downstream agent runs only with explicitly injected capabilities and cannot inherit the ambient permissions of the agent that called it.
The third pillar is the sandboxed execution context. Even with tokens and manifests, runtime isolation is necessary so each agent executes with exactly the capabilities passed to it: no inherited environment, no shared credential store, and no ambient network reach to privileged tools.
In a sound agent permission model, each agent invocation runs in a context where:
This is agent orchestration security implemented structurally rather than by hoping the model behaves. A policy gateway between agents and tools becomes the chokepoint where capability tokens are verified, manifests are checked, and out-of-scope calls are denied and logged. It is the agent-era equivalent of a service mesh enforcing authentication and authorization between microservices.
The combination matters. Tokens scope authority. Manifests prove provenance. Sandboxes contain execution. Remove any one and the others weaken: a signed manifest without a sandbox can still leave credentials exposed, while a sandbox without scoped tokens can still give an agent overly broad standing permissions.
TL;DR: Trust boundaries work by separating provenance, permission, and runtime isolation so no agent can create authority simply by asking another agent to act.
Multi-agent trust boundaries are enforced separations that determine what one agent in a fleet may ask another agent to do, and what authority transfers across that call. They prevent a compromised or erroneous agent from commanding privileged peers by requiring verifiable provenance and scoped permissions instead of implicit mutual trust.
Human-to-agent trust originates with an accountable principal who authorized a specific action and can be audited. Agent-to-agent trust, when granted implicitly, lets agents manufacture authority among themselves with no human in the chain. Secure designs allow human-delegated authority to flow downstream while blocking agents from creating new authority on their own.
Yes. If a peer agent processes adversarial content and relays the injected instruction to a more privileged neighbor that trusts upstream messages implicitly, the injection can execute with the neighbor’s permissions. Signed task manifests reduce this risk by requiring every action to match a human-authorized scope, so injected instructions that lack a manifest entry are rejected.
Capability tokens scope what an agent may do: which tool, with which constraints, for how long. Signed task manifests prove who authorized the work by binding the task to an originating human principal. The two controls are complementary: tokens enforce least privilege, while manifests establish chain of custody.
Many are not. Most frameworks still default toward implicit intra-fleet trust, which is why the security community treats agent-to-agent trust as a critical gap in agentic AI. Production teams should assume they need explicit scoping, provenance, and sandboxing rather than relying on framework defaults.
TL;DR: Secure multi-agent systems treat every agent as untrusted until a request carries human-backed provenance, least-privilege authority, and runtime isolation.
TL;DR: Trust in agentic architectures should be issued deliberately, scoped narrowly, verified continuously, and revoked automatically.
The trajectory is clear: just as microservice security matured from flat internal networks to zero-trust service meshes, multi-agent systems are moving from implicit fleet trust to verifiable, scoped, sandboxed agent-to-agent authority.
The systems that will operate safely are the ones that treat each agent as an untrusted principal whose cross-boundary requests must carry proof of human-delegated authority and a least-privilege capability. The convenience of agents that trust each other by default is precisely what makes them dangerous. As agent pipelines take on higher-stakes actions, the cost of that convenience grows faster than the systems it powers. Trust, in agentic architectures, is something to be issued deliberately and revoked automatically — never assumed.
Discover more content: