Field-Testing the Crew: From a Phone to the Gateway

🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4

An agent fleet can pass every unit test and still fail the only test that matters: a real human sending a real message from a real phone and getting the right agent to do the right thing. Ahead of go-live on June 4, 2026, the crew ran 27 live end-to-end Telegram route checks, plus native field tests and Jira field tests through the PM agent, Ultra Magnus. Those tests exercised the actual operator path, probed real capabilities, and in some cases created controlled scratch state that had to be cleaned up afterward.

That is the point of field testing. It closes the gap between “the code works” and “the operator’s message works.” It also forces discipline: if a self-test leaves behind a queued message, a stray file, or an uncleaned ticket, the test has introduced risk instead of reducing it. The result is a more honest readiness check for an agent fleet that has to perform outside the lab.

Why mocks were not enough

TL;DR: Unit and integration tests validate code paths; field tests validate the real operator experience from phone to gateway to agent.

The crew already had unit and integration coverage for routing logic and agent behavior. That mattered, but it was not sufficient. The full path from an operator’s phone to the final agent response includes failure modes that do not show up in isolated tests:

Message handling across the real Telegram gateway
Route conflicts that only appear when the full rule set is loaded
Capability checks that behave differently with live access than they do in a local test environment
Response behavior that is correct in a test harness but wrong in the actual chat flow

The practical lesson is simple: passing code-level tests does not guarantee that a real operator message will land on the right desk, reach the right agent, and produce the right outcome. Field tests exist to answer that question directly.

How the live harness works

TL;DR: The live harness uses real Telegram messages for route regression and separate conversation-based field tests for real capability checks, with mutating cases gated behind an explicit flag.

The field-testing setup has two distinct jobs.

| Harness | Scope | Mutates external state? | Flag required |
| --- | --- | --- | --- |
| `telegram:e2e` | Live route regression | No | None |
| `telegram:conversations` | Operator-usefulness and field tests | Read-only by default; some cases can mutate scratch targets | `--mutating` for mutating cases |

Route regression with `telegram:e2e`

This harness sends real messages to the chat bot and checks that they route correctly. The goal is not synthetic coverage. The goal is to confirm that the live operator path behaves as expected under real conditions.

The test-case shape is straightforward:

## Illustrative test-case shape (values sanitized)
- input: "What's the status of the current sprint?"
  expected_route: pm-agent
  assertion: response reflects the expected project context
  cleanup: none

Each case follows the same pattern: input message → expected route → assertion → cleanup. For route-regression cases, cleanup is usually minimal because the tests are read-only. What matters is that the message goes through the real path and lands where it should.
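As a rough illustration, the input → expected route → assertion → cleanup pattern could be modeled as below. This is a minimal sketch, not the crew's actual harness: `RouteCase`, `Reply`, `run_case`, and the `fake_send` stand-in are all hypothetical names, and a live run would send through the real Telegram gateway rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reply:
    route: str  # which agent handled the message (hypothetical field)
    text: str   # what the agent answered


@dataclass(frozen=True)
class RouteCase:
    """One case: input message -> expected route -> assertion -> cleanup."""
    message: str
    expected_route: str
    assertion: Callable[[str], bool]              # checks the reply text
    cleanup: Optional[Callable[[], None]] = None  # None for read-only cases


def run_case(case: RouteCase, send: Callable[[str], Reply]) -> bool:
    """Send the message, check route and assertion, and always run cleanup."""
    try:
        reply = send(case.message)
        return reply.route == case.expected_route and bool(case.assertion(reply.text))
    finally:
        if case.cleanup is not None:
            case.cleanup()


def fake_send(message: str) -> Reply:
    # Stand-in transport for this example only; not a real gateway call.
    return Reply(route="pm-agent", text="Sprint status: on track")


sprint_case = RouteCase(
    message="What's the status of the current sprint?",
    expected_route="pm-agent",
    assertion=lambda text: "sprint" in text.lower(),
)
print(run_case(sprint_case, fake_send))
```

Running cleanup in a `finally` block means a failed assertion or a transport error still triggers it, which matches the idea that cleanup belongs to the case rather than to the happy path.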

Native field tests with `telegram:conversations`

Route regression proves that the message reached the right place. Native field tests prove that the agent can actually do the work once it gets there.

The verified field-test mix included read-only cases such as:

Email read access
Finance-readiness checks, including systems used for payroll, time tracking, and reconciliation
Dev-team field cases such as source readiness and git preflight

It also included controlled mutating cases, including Jira field tests through Ultra Magnus. Those mutating tests are intentionally limited to scratch targets and require explicit opt-in.
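One way to express that opt-in is to filter mutating cases out unless the flag is present. The sketch below assumes a simple case list and uses Python's standard `argparse`. Only the `--mutating` flag and the `telegram:conversations` name come from the description above; `FieldCase`, `select_cases`, and the case names are hypothetical.

```python
import argparse
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FieldCase:
    name: str
    mutating: bool  # True if the case writes to a scratch target


def select_cases(cases: Sequence[FieldCase], argv: Sequence[str]) -> List[FieldCase]:
    """Return read-only cases by default; include mutating ones only with --mutating."""
    parser = argparse.ArgumentParser(prog="telegram:conversations")
    parser.add_argument(
        "--mutating",
        action="store_true",
        help="also run cases that write to scratch targets",
    )
    args = parser.parse_args(list(argv))
    return [case for case in cases if args.mutating or not case.mutating]


# Hypothetical case names, for illustration only.
CASES = [
    FieldCase("email-read-access", mutating=False),
    FieldCase("git-preflight", mutating=False),
    FieldCase("jira-scratch-ticket", mutating=True),
]

print([case.name for case in select_cases(CASES, [])])
print([case.name for case in select_cases(CASES, ["--mutating"])])
```

Because the default is read-only, a routine run cannot write anything by accident; the operator has to type the flag to get the mutating cases.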

Cleanup is part of the test

TL;DR: A field test that leaves residue behind is a failed field test, even if the main assertion passed.

Live field tests are messy by nature. They send real messages. They can create queued work. They can generate files. Mutating cases can create scratch state in external systems. That means cleanup is not an afterthought; it is part of the pass criteria.

The post-run hygiene sequence is explicit:

Cancel scheduled self-test queue entries. If a queued self-test send has not fired yet, it must be cancelled.
Clean up scratch state. Any controlled writes created during mutating tests must be removed from the scratch target.
Move generated test assets out of the repo worktree. Test outputs should not remain mixed into the working tree.
Confirm the working tree is clean. The run is not complete until the repo is back to a clean state.

This discipline matters because residue becomes operational noise. A queued message that fires late can look like a real operator request. A leftover scratch ticket can pollute later queries. A generated file left in the worktree can create confusion in the next development cycle. In practice, cleanup is part of correctness.
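The hygiene sequence above can be sketched as an ordered list of checks in which any failed step fails the run. This is an assumption-laden illustration: `run_hygiene`, `move_generated_assets`, the `selftest-*` file pattern, and the placeholder queue and scratch steps are hypothetical. The one real external command used is `git status --porcelain`, which prints nothing when the working tree has no changes or untracked files.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def move_generated_assets(worktree: Path, archive: Path, pattern: str) -> List[str]:
    """Move files matching `pattern` out of the worktree; return the moved names."""
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(worktree.glob(pattern)):
        shutil.move(str(path), str(archive / path.name))
        moved.append(path.name)
    return moved


def worktree_is_clean(repo: Path) -> bool:
    """True when `git status --porcelain` prints nothing for `repo`."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""


def run_hygiene(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run every step in order; return the names of steps that did not succeed."""
    return [name for name, step in steps if not step()]


# Demo on throwaway directories; a real run would end with worktree_is_clean(repo).
with tempfile.TemporaryDirectory() as tmp:
    worktree, archive = Path(tmp) / "repo", Path(tmp) / "archive"
    worktree.mkdir()
    (worktree / "selftest-report.json").write_text("{}")
    steps = [
        ("cancel queued self-tests", lambda: True),  # placeholder, no real queue here
        ("clean scratch state", lambda: True),       # placeholder, no real scratch target
        ("move generated assets",
         lambda: move_generated_assets(worktree, archive, "selftest-*") == ["selftest-report.json"]),
        ("no assets left in worktree", lambda: not list(worktree.glob("selftest-*"))),
    ]
    failed = run_hygiene(steps)
    print("hygiene failures:", failed)
```

An empty failure list is what lets the run count as a pass; anything else means the test left residue behind, regardless of how its main assertion went.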

Security guardrails for production-level access

TL;DR: A live test harness has production-level access by design, so mutating tests must be explicit, tightly scoped, and followed by verified cleanup.

A live harness is powerful because it tests the real system. That also makes it risky. If it can send real messages and touch real external systems, it needs production-grade guardrails.

The core controls are straightforward:

Explicit opt-in for mutating tests. Mutating self-tests require `--mutating`.
Scratch-only scope. Writes are limited to controlled scratch targets rather than live operational records.
Cleanup and verification. The harness must verify that cleanup succeeded rather than assuming it did.
No lasting modifications. A test harness should never leave a real channel or external system in a modified state.

These controls are a matter of system design rather than cautionary language. A safe field-test harness is one that can exercise real capabilities without leaving operational residue behind.
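Scratch-only scope and verified cleanup can be combined in a single guard. The sketch below uses an in-memory stand-in instead of a real ticket system; `InMemoryTracker`, `scratch_ticket`, and the `SCRATCH` project name are hypothetical and do not reflect any real Jira API.

```python
from contextlib import contextmanager

# Hypothetical allowlist of targets that self-tests may write to.
SCRATCH_PROJECTS = frozenset({"SCRATCH"})


class InMemoryTracker:
    """Stand-in for an external ticket system, for illustration only."""

    def __init__(self):
        self._tickets = {}
        self._next_id = 1

    def create(self, project, summary):
        key = f"{project}-{self._next_id}"
        self._next_id += 1
        self._tickets[key] = summary
        return key

    def delete(self, key):
        self._tickets.pop(key, None)

    def exists(self, key):
        return key in self._tickets


@contextmanager
def scratch_ticket(tracker, project, summary):
    """Create a ticket in a scratch project, then delete it and verify it is gone."""
    if project not in SCRATCH_PROJECTS:
        raise PermissionError(f"refusing to write outside scratch scope: {project}")
    key = tracker.create(project, summary)
    try:
        yield key
    finally:
        tracker.delete(key)
        # Verify rather than assume: residue counts as a failure.
        if tracker.exists(key):
            raise RuntimeError(f"cleanup failed, residue left behind: {key}")


tracker = InMemoryTracker()
with scratch_ticket(tracker, "SCRATCH", "self-test ticket") as key:
    print("exists during test:", tracker.exists(key))
print("exists after test:", tracker.exists(key))

try:
    with scratch_ticket(tracker, "OPS", "should never be created"):
        pass
except PermissionError as error:
    print("blocked:", error)
```

The scope check runs before any write, and the existence check runs after the delete, so the guard covers both ends of a mutating case.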

What the June 4 field tests proved

TL;DR: The June 4 run demonstrated that the crew could handle real Telegram routing and real field-use cases, not just pass isolated test suites.

The key result from June 4 was not just that individual components behaved correctly. It was that the full system held up under live conditions.

The verified milestone included:

27 out of 27 live Telegram end-to-end route checks passed
Native field tests passed
Jira field tests passed through the PM agent, Ultra Magnus

That combination matters because it validates both halves of readiness:

The message reaches the correct route through the real gateway
The receiving agent can perform the expected work in a live environment

This is the core lesson of the whole exercise: an agent fleet that passes unit tests can still fail the operator’s real message. Field tests close that gap.

Frequently Asked Questions

Q: What is live route-regression testing for an agent fleet?

Live route-regression testing sends real messages through the actual messaging path and checks that they route correctly. It validates the full operator path rather than a mocked version of it.

Q: Why are native field tests separate from route-regression tests?

Route-regression tests answer, “Did the message reach the right place?” Native field tests answer, “Can the agent actually do the job once it gets there?” Both are necessary because correct routing alone does not prove operational usefulness.

Q: Why do mutating self-tests require a separate flag?

Because they can create real state in external systems. Requiring `--mutating` prevents accidental writes during routine test runs and makes the operator’s intent explicit.

Q: What does cleanup look like in practice?

It means cancelling queued self-test sends, removing any scratch artifacts created during the run, moving generated test assets out of the repo worktree, and confirming the working tree is clean before considering the run complete.

Q: What did the June 4 run establish before go-live?

It established that the crew passed 27 live Telegram end-to-end route checks and completed native field tests and Jira field tests through Ultra Magnus, providing a real-world readiness check ahead of activation.

Key Takeaways

Field tests validate the operator’s real path, not just the code path
27 of 27 live Telegram route checks passed ahead of go-live
Native field tests and Jira field tests confirmed real capability, not just routing
Mutating tests require explicit opt-in and scratch-only scope
Cleanup is part of the test: cancel queues, remove scratch state, move artifacts, and confirm a clean tree

Conclusion

Field testing is where an agent fleet stops being a software project and starts proving it can operate in the real world. The June 4 run showed why that matters: the meaningful test is not whether isolated components pass, but whether a real message from a real phone reaches the right agent and produces the right result without leaving operational residue behind.