Part 4 of 4
🤖 Ghostwritten by Claude · Curated by Tom Hundley
Software engineering principles for the age of probability.
For 50 years, software testing was binary.
`2 + 2 = 4`? PASS. If the output was `4.00001`, the test failed.
Welcome to AI Engineering, where 2 + 2 might equal "4", "Four", or "Here is a poem about the number 4."
If you try to use traditional unit-testing assertions (`assert result == expected`) on an LLM, you will fail. But if you don't test, you are deploying a slot machine to production.
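A toy illustration of the mismatch (the outputs here are invented for the example):

```python
# Three semantically correct answers; exact-match accepts only one.
responses = ["4", "Four", "Here is a poem about the number 4..."]
for r in responses:
    print(r, "->", r == "4")  # True, False, False
```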
Since the output is semantic (meaning-based) rather than syntactic (character-based), we need a semantic evaluator.
We use a stronger model (e.g., GPT-4o or Claude 3.5 Sonnet) to grade the homework of the production model.
```python
EVAL_PROMPT = """
You are an expert grader.

Question: {input}
Expected Answer Key: {reference}
Actual Model Answer: {output}

Grade the Actual Answer on a scale of 1-5 based on factual alignment with the Answer Key.
Ignore tone differences. Focus on facts.
"""
```
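In practice, the judge is just one API call plus some score parsing. Here is a minimal sketch, assuming the OpenAI Python SDK (v1+) and the `EVAL_PROMPT` above; the model choice and the digit-extraction step are illustrative assumptions, not requirements:

```python
# Minimal LLM-as-judge sketch. Assumes OPENAI_API_KEY is set in the environment.
import re
from openai import OpenAI

client = OpenAI()

def llm_grade(input_text: str, reference: str, output: str) -> int:
    """Ask a stronger model to grade the production model's answer on a 1-5 scale."""
    prompt = EVAL_PROMPT.format(input=input_text, reference=reference, output=output)
    completion = client.chat.completions.create(
        model="gpt-4o",   # the grader model; swap in your judge of choice
        temperature=0,    # keep grading as repeatable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    reply = completion.choices[0].message.content
    match = re.search(r"[1-5]", reply)  # pull the first 1-5 digit out of the reply
    if not match:
        raise ValueError(f"Grader returned no usable score: {reply!r}")
    return int(match.group())
```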
At Elegant Software Solutions, we don't reinvent the wheel. We use Pytest, the standard Python testing framework, but we add an Eval layer.
- **Test the tools, not the LLM.** Does the `search_tool` crash if the API is down?
- **Test the agent's reasoning.** We use DeepEval or Ragas integrated into Pytest:

```python
def test_paris_trip_planning():
    agent = Agent()
    response = agent.run("Plan a trip to Paris")

    # LLM-based assertion
    assert_llm_eval(
        response,
        criteria="Must include a hotel recommendation and a museum visit"
    )
```
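For a concrete version of that conceptual `assert_llm_eval`, here is roughly how the same test looks with DeepEval's G-Eval metric. Treat it as a sketch: parameter names can vary across DeepEval versions, and `Agent` stands in for your own agent class:

```python
# Sketch using DeepEval's G-Eval metric inside a Pytest test.
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_paris_trip_planning():
    agent = Agent()  # your agent class, as above
    question = "Plan a trip to Paris"

    test_case = LLMTestCase(
        input=question,
        actual_output=agent.run(question),
    )
    completeness = GEval(
        name="Trip completeness",
        criteria="Must include a hotel recommendation and a museum visit",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,  # pass/fail cutoff on G-Eval's 0-1 score
    )
    assert_test(test_case, [completeness])
```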
You cannot evaluate without a baseline. Every client engagement begins with building the Golden Dataset: 50-100 examples of Perfect Interactions.
We run our agent against these 50 examples every single night (CI/CD). If the score drops from 4.8 to 4.5, we block the deployment.
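As a sketch of that nightly gate, assuming the golden examples live in a JSONL file with `input` and `reference` fields, and reusing the hypothetical `Agent` and `llm_grade` helpers from the earlier snippets (the file name and threshold are illustrative):

```python
# Hypothetical nightly eval gate for CI/CD.
import json
import statistics
import sys

MIN_AVG_SCORE = 4.6  # block the deploy if the golden-set average drops below this

def run_golden_suite(path: str = "golden_dataset.jsonl") -> float:
    agent = Agent()  # hypothetical agent class from the earlier examples
    scores = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            answer = agent.run(example["input"])
            scores.append(llm_grade(example["input"], example["reference"], answer))
    return statistics.mean(scores)

if __name__ == "__main__":
    avg = run_golden_suite()
    print(f"Golden dataset average: {avg:.2f}")
    if avg < MIN_AVG_SCORE:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the deployment
```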
The difference between a Demo and Enterprise Software is a test suite.
If you can't measure it, you can't improve it. And in AI, measuring means building a rigorous, automated evaluation pipeline that runs while you sleep.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. Human; it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →