
🤖 Ghostwritten by Claude Opus 4.8 · Fact-checked & edited by GPT 5.5
A green CI check can still ship a broken agent.
Unit tests catch wrong return values, type errors, and broken integrations. They do not reliably catch a code-review agent that starts inventing struct field names after a prompt edit, a worker agent that needs triple the tool calls after a model swap, or a planning agent whose task-completion rate slips without throwing an exception.
That gap is why agent-specific CI is becoming its own discipline. Traditional CI validates code. Eval-driven CI validates behavior: prompts, task traces, tool-call patterns, schema conformance, and end-to-end task success. Before an updated agent reaches the farm, it should run against representative golden tasks on a canary node, produce structured scores, and pass a promotion gate.
The goal is not to make agent behavior perfectly deterministic. It is to keep probabilistic systems from regressing silently at fleet speed.
TL;DR: Standard CI assumes predictable outputs; agents fail probabilistically, so a green pipeline can still promote a regressed agent.
During pre-flight testing, ordinary code gates can look reassuring: lint passes, type checks pass, unit tests pass, integration tests pass. But those checks say little about whether the agent still performs its role well.
The failures that matter for agents are behavioral:
A unit test can assert expect(x).toBe(y). An agent response is a distribution, not a single value. Galileo has argued that traditional CI misses non-deterministic agent failures, which is why teams are moving toward eval-driven gates that score behavior as a release artifact.
That distinction matters. Agent CI is not just another test file in the repository. It is a promotion system for probabilistic workers.
TL;DR: The same throughput that makes an agent farm valuable also multiplies the cost of silent behavioral regressions.
A human engineer who picks up a bad habit produces bad work slowly enough for review to catch patterns before they spread. An autonomous agent with a regressed prompt can produce flawed work quickly, repeatedly, and in parallel.
That is the asymmetry of the software factory. Throughput is the advantage when agents behave well. Throughput becomes the risk when a weak prompt, bad tool policy, or poorly matched model is promoted broadly.
Post-deploy observability is still necessary, but it arrives late. It can tell a team that quality dropped, retries increased, or cost spiked. A promotion gate acts earlier: it stops a suspect agent before the rest of the fleet sees it.
Eval-driven CI turns agent behavior into something release engineering can reason about. The updated agent either clears the behavioral gate or it does not ship.
TL;DR: Golden tasks, canary tests, structured scoring, and human review form the practical evaluation stack for agent promotion.
The pattern has four parts: golden task sets, a canary node, structured scoring, and human review. That mirrors the emerging agent-evaluation stack identified in agent-engineering research.
Each agent owns a small set of representative tasks with known-good outputs. For a code-review agent, that might mean real diffs paired with the review verdicts a senior engineer would expect. For a planning agent, it might mean issue descriptions paired with acceptable implementation plans and required constraints.
The set should be small enough to run frequently and meaningful enough to catch regressions. It is not a comprehensive benchmark. It is a release gate.
A canary node runs the updated agent before promotion to the broader farm. The canary executes the golden task set, collects traces, records tool calls, and emits a score object for the gate.
This applies the canary-deployment pattern to agent behavior instead of server code. The question is not only whether the process starts. The question is whether the agent still completes the job safely and efficiently.
Each run should produce structured metrics, not a pass/fail vibe check.
| Metric | What it catches |
|---|---|
| Task success rate | End-to-end regressions in the agent's core function |
| Tool-call count | Efficiency drift, loops, and unnecessary tool use |
| Schema conformance | Malformed structured outputs and broken contracts |
| Hallucination rate on known fields | Invented identifiers against a known schema |
End-to-end task success rate, tool-call count, and schema conformance are the primary metrics. Hallucination checks are a useful diagnostic for agents that operate against known code, database, or API schemas.
Fleet-management hygiene also matters here. Infrastructure-as-code and version-controlled configuration are now baseline practices for deploying agent fleets. Microsoft Agent Framework 1.0 GA and CrewAI 1.14.6 both treat agent configurations as versioned artifacts. Eval-driven CI extends the same discipline to behavior: the configuration is versioned, and the observed behavior is gated.
A simplified TypeScript harness might look like this:
// shared/src/evals/harness.ts
export interface GoldenTask {
id: string;
input: string; // sanitized, injection-checked
knownGoodFields: string[]; // expected schema identifiers
expectedOutcome: string;
}
export interface AgentScore {
successRate: number;
meanToolCalls: number;
schemaConformance: number;
hallucinationRate: number;
}
export interface PromotionPolicy {
minSuccessRate: number;
maxMeanToolCalls: number;
minSchemaConformance: number;
maxHallucinationRate: number;
}
export async function runGoldenEval(
agentId: string,
tasks: GoldenTask[],
policy: PromotionPolicy,
): Promise<AgentScore> {
// Run the agent on the canary node, collect traces, and score behavior.
// The promotion gate compares the returned score with the policy.
}The gate compares AgentScore against the promotion policy. If the score falls outside the allowed range, rollout stops. If the score passes, the agent can move to the next stage, with human review reserved for ambiguous or high-risk changes.
TL;DR: Eval datasets drift, golden tasks can be poisoned, agents can overfit to the benchmark, and frontier-model evals on every PR can become expensive.
Eval-driven CI is useful, but it has sharp edges.
Eval dataset drift. Golden tasks go stale as the real workload changes. A task set captured from last quarter's work may stop representing the farm's current tasks. The mitigation is to refresh golden tasks from recent production traces on a schedule and track why each task belongs in the set.
Overfitting to the eval. Agents can be tuned to pass the golden tasks while getting worse at real work. A high eval score paired with declining production quality is the warning sign. Rotating tasks, keeping a holdout set, and preserving human review help keep the benchmark honest.
Prompt injection in eval inputs. Golden tasks are inputs, and inputs are untrusted data. A poisoned task can make a dangerous tool-call pattern appear correct, training the gate to accept behavior it should reject. Eval inputs need sanitization, injection checks, and review before they enter the harness.
Frontier-model eval cost. Full evals are not free. The Claude Agent SDK billing change on June 15, 2026 added direct cost pressure to running frontier-model evals on every pull request. The practical answer is tiering: cheap smoke evals on every PR, full golden runs at promotion time, and human review for high-risk changes.
TL;DR: The next maturity step is automated promotion through the canary, with hard stops on behavioral regression.
The near-term target is an automated promotion path: an agent update lands, the canary runs the golden task set, the scoring harness emits metrics, and the promotion gate either approves rollout or blocks it.
The happy path should be automated. The regression path should be absolute. No agent should reach the broader fleet because the code compiled while the behavior quietly degraded.
That is the larger shift: agent releases need the same rigor as software releases, plus evaluation machinery designed for non-deterministic behavior.
TL;DR: Eval-driven CI complements normal CI by measuring whether agents still complete real tasks safely, efficiently, and in the expected format.
Normal CI validates deterministic code with assertions that expect exact outputs. Eval-driven CI scores probabilistic agent behavior against representative tasks, using metrics such as task success rate, tool-call count, and schema conformance.
A golden task set is a small corpus of representative tasks paired with known-good outcomes. Each agent should have its own set, because a code-review agent, planning agent, and implementation agent fail in different ways.
A full golden run can be slower and more expensive than ordinary CI. A canary node lets the team reserve complete behavioral evaluation for promotion time while still running cheaper smoke checks on every pull request.
A poisoned golden task can make unsafe behavior appear acceptable. Eval inputs should be treated as untrusted data, sanitized before execution, and reviewed for prompt-injection attempts or suspicious tool-use patterns.
Yes. An agent can learn to perform well on a small benchmark while getting worse on real work. Rotating golden tasks, using holdout tasks, reviewing production traces, and keeping human review in the loop reduce that risk.
TL;DR: Agent-specific CI is release engineering for probabilistic workers, not a replacement for normal code tests.
TL;DR: The most important boundary in an agent farm is the gate between an updated agent and fleet-wide execution.
Agent-specific CI is becoming required infrastructure for autonomous development pipelines. Traditional tests still matter, but they only prove that the surrounding software behaves as expected. They do not prove that the agent still makes good decisions.
Eval-driven CI closes that gap by scoring behavior before promotion. As agent fleets grow, the teams that treat behavioral evaluation as part of release engineering will ship faster with less cleanup. The teams that rely on green code checks alone will discover regressions after the farm has already multiplied them.
Discover more content: