Part 4 of 4
🤖 Ghostwritten by Claude · Curated by Tom Hundley
Software engineering principles for the age of probability.
For 50 years, software testing was binary.
`2 + 2 = 4`? PASS. If the output was `4.00001`, the test failed.
Welcome to AI Engineering, where 2 + 2 might equal "4", "Four", or "Here is a poem about the number 4."
If you try to use traditional unit-testing assertions (`assert result == expected`) on an LLM, you will fail. But if you don't test, you are deploying a slot machine to production.
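A toy illustration of the mismatch (the outputs here are invented for the example):

```python
# Three semantically correct answers; exact-match accepts only one.
responses = ["4", "Four", "Here is a poem about the number 4..."]
for r in responses:
    print(r, "->", r == "4")  # True, False, False
```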
Since the output is semantic (meaning-based) rather than syntactic (character-based), we need a semantic evaluator.
We use a stronger model (e.g., GPT-4o or Claude 3.5 Sonnet) to grade the homework of the production model.
```python
EVAL_PROMPT = """
You are an expert grader.

Question: {input}
Expected Answer Key: {reference}
Actual Model Answer: {output}

Grade the Actual Answer on a scale of 1-5 based on factual alignment with the Answer Key.
Ignore tone differences. Focus on facts.
"""
```
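In practice, the judge is just one API call plus some score parsing. Here is a minimal sketch, assuming the OpenAI Python SDK (v1+) and the `EVAL_PROMPT` above; the model choice and the digit-extraction step are illustrative assumptions, not requirements:

```python
# Minimal LLM-as-judge sketch. Assumes OPENAI_API_KEY is set in the environment.
import re
from openai import OpenAI

client = OpenAI()

def llm_grade(input_text: str, reference: str, output: str) -> int:
    """Ask a stronger model to grade the production model's answer on a 1-5 scale."""
    prompt = EVAL_PROMPT.format(input=input_text, reference=reference, output=output)
    completion = client.chat.completions.create(
        model="gpt-4o",   # the grader model; swap in your judge of choice
        temperature=0,    # keep grading as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    reply = completion.choices[0].message.content
    match = re.search(r"[1-5]", reply)  # pull the first 1-5 digit out of the reply
    if not match:
        raise ValueError(f"Grader returned no usable score: {reply!r}")
    return int(match.group())
```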
At Elegant Software Solutions, we don't reinvent the wheel. We use Pytest, the standard Python testing framework, but we add an Eval layer.
- **Test the tools, not the LLM.** Does the `search_tool` crash if the API is down?
- **Test the agent's reasoning.** We use DeepEval or Ragas integrated into Pytest:

```python
def test_paris_trip_planning():
    agent = Agent()
    response = agent.run("Plan a trip to Paris")

    # LLM-based assertion
    assert_llm_eval(
        response,
        criteria="Must include a hotel recommendation and a museum visit"
    )
```
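For a concrete version of that conceptual `assert_llm_eval`, here is roughly how the same test looks with DeepEval's G-Eval metric. Treat it as a sketch: parameter names can vary across DeepEval versions, and `Agent` stands in for your own agent class:

```python
# Sketch using DeepEval's G-Eval metric inside a Pytest test.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_paris_trip_planning():
    agent = Agent()  # your agent class, as above
    question = "Plan a trip to Paris"

    test_case = LLMTestCase(
        input=question,
        actual_output=agent.run(question),
    )
    completeness = GEval(
        name="Trip completeness",
        criteria="Must include a hotel recommendation and a museum visit",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,  # pass/fail cutoff on G-Eval's 0-1 score
    )
    assert_test(test_case, [completeness])
```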
You cannot evaluate without a baseline. Every client engagement begins with building the Golden Dataset: 50-100 examples of Perfect Interactions.
We run our agent against these 50 examples every single night (CI/CD). If the score drops from 4.8 to 4.5, we block the deployment.
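As a sketch of that nightly gate, assuming the golden examples live in a JSONL file with `input` and `reference` fields, and reusing the hypothetical `Agent` and `llm_grade` helpers from the earlier snippets (the file name and threshold are illustrative):

```python
# Hypothetical nightly eval gate for CI/CD.
import json
import statistics
import sys

MIN_AVG_SCORE = 4.6  # block the deploy if the golden-set average drops below this

def run_golden_suite(path: str = "golden_dataset.jsonl") -> float:
    agent = Agent()  # hypothetical agent class from the earlier examples
    scores = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            answer = agent.run(example["input"])
            scores.append(llm_grade(example["input"], example["reference"], answer))
    return statistics.mean(scores)

if __name__ == "__main__":
    avg = run_golden_suite()
    print(f"Golden dataset average: {avg:.2f}")
    if avg < MIN_AVG_SCORE:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the deployment
```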
The difference between a Demo and Enterprise Software is a test suite.
If you can't measure it, you can't improve it. And in AI, measuring means building a rigorous, automated evaluation pipeline that runs while you sleep.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. Human; it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →