
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4
An agent fleet can pass every unit test and still fail the only test that matters: a real human sending a real message from a real phone and getting the right agent to do the right thing. Ahead of go-live on June 4, 2026, the crew ran 27 live end-to-end Telegram route checks, plus native field tests and Jira field tests through the PM agent, Ultra Magnus. Those tests exercised the actual operator path, probed real capabilities, and in some cases created controlled scratch state that had to be cleaned up afterward.
That is the point of field testing. It closes the gap between “the code works” and “the operator’s message works.” It also forces discipline: if a self-test leaves behind a queued message, a stray file, or an uncleaned ticket, the test has introduced risk instead of reducing it. The result is a more honest readiness check for an agent fleet that has to perform outside the lab.
TL;DR: Unit and integration tests validate code paths; field tests validate the real operator experience from phone to gateway to agent.
The crew already had unit and integration coverage for routing logic and agent behavior. That mattered, but it was not sufficient. The full path from an operator’s phone to the final agent response includes failure modes that do not show up in isolated tests:
The practical lesson is simple: passing code-level tests does not guarantee that a real operator message will land on the right desk, reach the right agent, and produce the right outcome. Field tests exist to answer that question directly.
TL;DR: The live harness uses real Telegram messages for route regression and separate conversation-based field tests for real capability checks, with mutating cases gated behind an explicit flag.
The field-testing setup has two distinct jobs.
| Harness | Scope | Mutates external state? | Flag required |
|---|---|---|---|
| `telegram:e2e` | Live route regression | No | None |
| `telegram:conversations` | Operator-usefulness and field tests | Read-only by default; some cases can mutate scratch targets | `--mutating` for mutating cases |
This harness sends real messages to the chat bot and checks that they route correctly. The goal is not synthetic coverage. The goal is to confirm that the live operator path behaves as expected under real conditions.
The test-case shape is straightforward:

    # Illustrative test-case shape (values sanitized)
    - input: "What's the status of the current sprint?"
      expected_route: pm-agent
      assertion: response reflects the expected project context
      cleanup: none

Each case follows the same pattern: input message → expected route → assertion → cleanup. For route-regression cases, cleanup is usually minimal because the tests are read-only. What matters is that the message goes through the real path and lands where it should.
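The input → expected route → assertion → cleanup pattern can be sketched in Python. This is a minimal illustration under stated assumptions, not the crew's actual harness: `Reply`, `RouteCase`, `run_case`, and the injected `send` callable are hypothetical names, and `fake_send` stands in for the real Telegram path so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    """Hypothetical reply shape: which agent answered, and what it said."""
    route: str
    text: str


@dataclass
class RouteCase:
    input: str
    expected_route: str
    assertion: Callable[[Reply], bool]
    cleanup: Optional[Callable[[], None]] = None  # None for read-only cases


def run_case(case: RouteCase, send: Callable[[str], Reply]) -> bool:
    """Send one message, check route and assertion, and always run cleanup."""
    try:
        reply = send(case.input)
        return reply.route == case.expected_route and case.assertion(reply)
    finally:
        # Cleanup runs even when the send or the assertion raises.
        if case.cleanup is not None:
            case.cleanup()


def fake_send(message: str) -> Reply:
    """Stand-in transport (assumption); a live harness sends a real message."""
    return Reply(route="pm-agent", text="Sprint status: on track")


case = RouteCase(
    input="What's the status of the current sprint?",
    expected_route="pm-agent",
    assertion=lambda r: "sprint" in r.text.lower(),
)
passed = run_case(case, fake_send)
```

Putting cleanup in a `finally` block is one way to keep a failed assertion from skipping the cleanup step.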
Route regression proves that the message reached the right place. Native field tests prove that the agent can actually do the work once it gets there.
The verified field-test mix included read-only cases such as:
It also included controlled mutating cases, including Jira field tests through Ultra Magnus. Those mutating tests are intentionally limited to scratch targets and require explicit opt-in.
TL;DR: A field test that leaves residue behind is a failed field test, even if the main assertion passed.
Live field tests are messy by nature. They send real messages. They can create queued work. They can generate files. Mutating cases can create scratch state in external systems. That means cleanup is not an afterthought; it is part of the pass criteria.
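The post-run hygiene sequence is explicit:

1. Cancel any queued self-test sends.
2. Remove scratch artifacts created during the run.
3. Move generated test assets out of the repo worktree.
4. Confirm the working tree is clean before considering the run complete.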
This discipline matters because residue becomes operational noise. A queued message that fires late can look like a real operator request. A leftover scratch ticket can pollute later queries. A generated file left in the worktree can create confusion in the next development cycle. In practice, cleanup is part of correctness.
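Treating cleanup as part of the pass criteria can be sketched as a verdict function that fails the run whenever residue remains. The names here (`Residue`, `run_verdict`) are hypothetical, and the clean-worktree check assumes the repo is tracked in Git, where empty `git status --porcelain` output means a clean tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Residue:
    """What a run left behind; every field is empty after a clean run."""
    queued_sends: List[str] = field(default_factory=list)
    scratch_artifacts: List[str] = field(default_factory=list)
    git_porcelain: str = ""  # captured output of `git status --porcelain`

    def findings(self) -> List[str]:
        found = []
        if self.queued_sends:
            found.append(f"{len(self.queued_sends)} queued self-test send(s)")
        if self.scratch_artifacts:
            found.append(f"{len(self.scratch_artifacts)} scratch artifact(s)")
        if self.git_porcelain.strip():
            found.append("working tree is not clean")
        return found


def run_verdict(assertions_passed: bool, residue: Residue) -> bool:
    """A run passes only if its assertions passed AND nothing was left behind."""
    return assertions_passed and not residue.findings()


clean = run_verdict(True, Residue())
dirty = run_verdict(True, Residue(git_porcelain="?? generated-report.json\n"))
```

In this sketch `dirty` is a failed run even though its assertions passed, because an untracked file remains in the worktree.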
TL;DR: A live test harness has production-level access by design, so mutating tests must be explicit, tightly scoped, and followed by verified cleanup.
A live harness is powerful because it tests the real system. That also makes it risky. If it can send real messages and touch real external systems, it needs production-grade guardrails.
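The core controls are straightforward: mutating cases are limited to scratch targets, they run only when the operator explicitly passes `--mutating`, and every run ends with verified cleanup.
That framing is less about cautionary language than about system design. A safe field-test harness is one that can exercise real capabilities without leaving operational residue behind.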
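The opt-in gate can be sketched with Python's standard `argparse` module. Only the `--mutating` flag name comes from the harness described here; the case list, the `mutating` field, and `select_cases` are hypothetical and exist only to illustrate the idea.

```python
import argparse

# Hypothetical case list; "mutating" marks cases that write to scratch targets.
CASES = [
    {"name": "sprint-status", "mutating": False},
    {"name": "jira-scratch-ticket", "mutating": True},
]


def select_cases(argv):
    """Return the names of cases to run; mutating cases need explicit opt-in."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mutating",
        action="store_true",
        help="also run cases that write to scratch targets",
    )
    args = parser.parse_args(argv)
    return [c["name"] for c in CASES if args.mutating or not c["mutating"]]


default_run = select_cases([])
opt_in_run = select_cases(["--mutating"])
```

Because the flag defaults to off, a routine run in this sketch selects only the read-only case, and writes happen only when the operator asks for them.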
TL;DR: The June 4 run demonstrated that the crew could handle real Telegram routing and real field-use cases, not just pass isolated test suites.
The key result from June 4 was not just that individual components behaved correctly. It was that the full system held up under live conditions.
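The verified milestone included:

- 27 live end-to-end Telegram route checks
- native field tests
- Jira field tests through the PM agent, Ultra Magnus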
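That combination matters because it validates both halves of readiness: routing (the message reaches the right agent) and capability (the agent can do the work once the message arrives).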
This is the teaching kernel behind the whole exercise: an agent fleet that passes unit tests can still fail the operator’s real message. Field tests close that gap.
Live route-regression testing sends real messages through the actual messaging path and checks that they route correctly. It validates the full operator path rather than a mocked version of it.
Route-regression tests answer, “Did the message reach the right place?” Native field tests answer, “Can the agent actually do the job once it gets there?” Both are necessary because correct routing alone does not prove operational usefulness.
Mutating tests require an explicit flag because they can create real state in external systems. Requiring `--mutating` prevents accidental writes during routine test runs and makes the operator's intent explicit.
Post-run hygiene means cancelling queued self-test sends, removing any scratch artifacts created during the run, moving generated test assets out of the repo worktree, and confirming the working tree is clean before considering the run complete.
The June 4 run established that the crew passed 27 live Telegram end-to-end route checks and completed native field tests and Jira field tests through Ultra Magnus, providing a real-world readiness check ahead of activation.
Field testing is where an agent fleet stops being a software project and starts proving it can operate in the real world. The June 4 run showed why that matters: the meaningful test is not whether isolated components pass, but whether a real message from a real phone reaches the right agent and produces the right result without leaving operational residue behind.