Part 10 of 12
Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
The hardest part of RAG is not building it. It is knowing whether it works.
You have built a RAG system. It retrieves documents, generates answers, and the demo went great. Then it goes to production, and three months later you discover the problems the demo never surfaced: confident answers that are not grounded in your documents, retrievals that miss the obvious source, and no way to tell which stage is at fault.
If you have followed this series through LangChain (Part 2), LlamaIndex (Part 3), Haystack (Part 4), and the platform-specific implementations, you know how to build RAG systems. This article is about how to know if they work.
RAG evaluation is fundamentally harder than evaluating traditional software. A unit test can verify that add(2, 2) returns 4. But how do you verify that "Explain our refund policy" retrieves the right documents and generates an accurate, helpful response? The answer involves a combination of automated metrics, human evaluation, and continuous monitoring.
Before diving into solutions, we need to understand the unique challenges of evaluating RAG systems.
RAG has two distinct stages that can fail independently:
Retrieval failures:
- Wrong documents retrieved
- Right documents ranked poorly
- Missing relevant context
- Retrieved but not used
- Semantic drift in the query

Generation failures:
- Correct context, wrong answer
- Hallucination despite context
- Over-reliance on parametric knowledge
- Misinterpretation of context
- Incomplete synthesis

The same wrong answer can result from:
1. Retrieval found irrelevant docs → generation did its best with them
2. Retrieval found perfect docs → generation ignored or misused them
3. Partial retrieval failure → partial generation failure

When a RAG system gives a wrong answer, you need to diagnose where it failed before you can fix it. This requires evaluating each stage independently.
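To make that diagnosis concrete, here is a minimal triage sketch. It assumes you already have per-stage scores (for example, context recall for retrieval and faithfulness for generation, both covered later in this article); the threshold values are illustrative, not prescriptive.

from dataclasses import dataclass

@dataclass
class StageScores:
    context_recall: float   # did retrieval surface the documents the answer needed?
    faithfulness: float     # is the answer grounded in what was retrieved?

def diagnose_failure(
    scores: StageScores,
    recall_floor: float = 0.7,        # illustrative threshold
    faithfulness_floor: float = 0.8,  # illustrative threshold
) -> str:
    """Route a bad answer to the stage most likely responsible."""
    retrieval_ok = scores.context_recall >= recall_floor
    generation_ok = scores.faithfulness >= faithfulness_floor
    if not retrieval_ok and not generation_ok:
        return "both stages failed: fix retrieval first, then re-check generation"
    if not retrieval_ok:
        return "retrieval failure: the right documents never reached the prompt"
    if not generation_ok:
        return "generation failure: context was retrieved but the answer is not grounded in it"
    return "scores look healthy: re-examine the question or the evaluation itself"

# Example: retrieval did its job, but the answer is not grounded
print(diagnose_failure(StageScores(context_recall=0.9, faithfulness=0.4)))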
Traditional ML evaluation relies on labeled datasets: input X should produce output Y. RAG breaks this assumption in several ways.
For retrieval: which documents count as "relevant" to a query is often a judgment call, and several documents may each cover part of the answer.
For generation: there is rarely a single correct answer; many phrasings, levels of detail, and tones can all be acceptable.
Creating ground truth datasets for RAG is expensive, subjective, and quickly outdated as your knowledge base evolves.
Consider two responses to "What is our return policy?":
Response A: "Our return policy allows returns within 30 days of purchase with original receipt. Items must be unused and in original packaging. Electronics have a 15-day return window."
Response B: "You can return stuff within a month if you have the receipt. Keep the packaging. Tech stuff is two weeks."
Both are technically correct. Response A is more professional. Response B is more conversational. Which is "better" depends on your use case, brand voice, and user expectations. Automated metrics struggle with these subjective dimensions.
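One pragmatic workaround is to put your subjective criteria into an explicit rubric and have an LLM judge score against it; DeepEval's G-Eval, shown later in this article, formalizes exactly this pattern. Below is a minimal sketch using the OpenAI client, assuming an API key is configured; the rubric wording and function name are illustrative.

import json
from openai import OpenAI

client = OpenAI()

def judge_against_rubric(answer: str, rubric: str) -> dict:
    """Ask an LLM judge to score an answer against an explicit, subjective rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Score the answer from 1 to 5 against the rubric. "
                'Return JSON: {"score": number, "reason": "short explanation"}'
            )},
            {"role": "user", "content": f"Rubric: {rubric}\n\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: encode your brand voice as the rubric
rubric = "Responses should be professional, specific, and free of slang."
print(judge_against_rubric("You can return stuff within a month if you have the receipt.", rubric))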
A RAG system that performs brilliantly on your test queries may fail catastrophically on real user queries: users phrase questions in ways your test set never anticipated, the knowledge base keeps changing underneath you, and the overall query distribution drifts over time.
This is why RAG evaluation must be continuous, not just at deployment time.
Several frameworks have emerged to address RAG evaluation. Each takes a different approach to the problem.
RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for RAG evaluation. Developed by the Explodinggradients team, it provides a comprehensive suite of metrics that evaluate both retrieval and generation quality.
pip install ragas datasets
RAGAS defines four primary metrics:
1. Faithfulness - Does the answer stay true to the retrieved context?
2. Context Precision - Are the retrieved documents relevant?
3. Context Recall - Did we retrieve all relevant documents?
4. Answer Relevancy - Does the answer actually address the question?
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
# Prepare your evaluation dataset
# Each example needs: question, answer, contexts, ground_truth (optional)
eval_data = {
"question": [
"What is the company's return policy?",
"How do I contact customer support?",
"What payment methods are accepted?",
],
"answer": [
"Returns are accepted within 30 days with original receipt. Items must be unused.",
"Customer support can be reached at support@example.com or 1-800-555-0199.",
"We accept Visa, Mastercard, American Express, and PayPal.",
],
"contexts": [
[
"Return Policy: Items may be returned within 30 days of purchase. Original receipt required.",
"Refund Processing: Refunds are processed within 5-7 business days."
],
[
"Contact Us: For support inquiries, email support@example.com or call 1-800-555-0199.",
],
[
"Payment Options: We accept major credit cards (Visa, Mastercard, American Express), PayPal, and Apple Pay.",
],
],
"ground_truth": [
"30-day returns with receipt, items must be unused and in original packaging.",
"Email: support@example.com, Phone: 1-800-555-0199",
"Visa, Mastercard, American Express, PayPal, and Apple Pay",
],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# View aggregate results
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
# 'context_precision': 0.85, 'context_recall': 0.79}
# Get per-question scores for debugging
df = results.to_pandas()
print(df)
For production use, wrap RAGAS in a reusable evaluation pipeline:
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
import json
from datetime import datetime
import os
@dataclass
class RAGEvaluationResult:
"""Container for RAG evaluation results."""
timestamp: str
num_samples: int
metrics: Dict[str, float]
per_sample_scores: List[Dict[str, Any]]
config: Dict[str, Any]
class RAGASEvaluator:
"""Production-ready RAGAS evaluation wrapper."""
def __init__(
self,
metrics: Optional[List] = None,
save_results: bool = True,
results_dir: str = "./evaluation_results"
):
self.metrics = metrics or [
faithfulness,
answer_relevancy,
context_precision,
context_recall,
]
self.save_results = save_results
self.results_dir = results_dir
if save_results:
os.makedirs(results_dir, exist_ok=True)
def evaluate(
self,
questions: List[str],
answers: List[str],
contexts: List[List[str]],
ground_truths: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None
) -> RAGEvaluationResult:
"""Run RAGAS evaluation on a set of RAG outputs."""
# Prepare dataset
eval_dict = {
"question": questions,
"answer": answers,
"contexts": contexts,
}
if ground_truths:
eval_dict["ground_truth"] = ground_truths
dataset = Dataset.from_dict(eval_dict)
# Run evaluation
results = evaluate(dataset, metrics=self.metrics)
# Extract per-sample scores
df = results.to_pandas()
# Create result object
evaluation_result = RAGEvaluationResult(
timestamp=datetime.now().isoformat(),
num_samples=len(questions),
metrics={k: v for k, v in results.items() if isinstance(v, (int, float))},
per_sample_scores=df.to_dict(orient='records'),
config={
"metrics": [m.name for m in self.metrics],
**(metadata or {})
}
)
if self.save_results:
self._save_results(evaluation_result)
return evaluation_result
def _save_results(self, result: RAGEvaluationResult):
"""Save evaluation results to JSON file."""
filename = f"eval_{result.timestamp.replace(':', '-')}.json"
filepath = os.path.join(self.results_dir, filename)
with open(filepath, 'w') as f:
json.dump({
"timestamp": result.timestamp,
"num_samples": result.num_samples,
"metrics": result.metrics,
"per_sample_scores": result.per_sample_scores,
"config": result.config
}, f, indent=2)
def compare_runs(self, run_ids: List[str]) -> Dict[str, Any]:
"""Compare metrics across multiple evaluation runs."""
results = []
for run_id in run_ids:
filepath = os.path.join(self.results_dir, f"eval_{run_id}.json")
with open(filepath, 'r') as f:
results.append(json.load(f))
comparison = {"runs": run_ids, "metric_comparison": {}}
all_metrics = set()
for r in results:
all_metrics.update(r["metrics"].keys())
for metric in all_metrics:
comparison["metric_comparison"][metric] = [
r["metrics"].get(metric) for r in results
]
return comparison
# Usage example
evaluator = RAGASEvaluator(save_results=True, results_dir="./rag_evaluations")
result = evaluator.evaluate(
questions=["What is the refund policy?", "How do I reset my password?"],
answers=[
"Refunds are processed within 30 days.",
"Click 'Forgot Password' on the login page."
],
contexts=[
["Refund Policy: All refunds processed within 30 days of request."],
["Password Reset: Use the 'Forgot Password' link on the login page."]
],
metadata={"version": "v1.2.0", "experiment": "baseline"}
)
print(f"Faithfulness: {result.metrics.get('faithfulness', 'N/A'):.2f}")TruLens takes a different approach, focusing on "feedback functions" - composable evaluation criteria that can be applied to any part of your RAG pipeline.
pip install trulens-eval trulens-providers-openai
TruLens wraps your RAG application and instruments it to capture the inputs, outputs, and intermediate steps of each call, along with the feedback scores computed over them.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback import Groundedness, AnswerRelevance, ContextRelevance
from trulens_eval.feedback.provider import OpenAI
import numpy as np
# Initialize TruLens
tru = Tru()
# Initialize feedback provider
openai_provider = OpenAI()
# Define feedback functions
groundedness = Groundedness(groundedness_provider=openai_provider)
answer_relevance = AnswerRelevance(provider=openai_provider)
context_relevance = ContextRelevance(provider=openai_provider)
# Create feedback objects
f_groundedness = Feedback(
groundedness.groundedness_measure_with_cot_reasons,
name="Groundedness"
).on(
TruChain.select_context()
).on_output()
f_answer_relevance = Feedback(
answer_relevance.relevance,
name="Answer Relevance"
).on_input().on_output()
f_context_relevance = Feedback(
context_relevance.relevance,
name="Context Relevance"
).on_input().on(
TruChain.select_context()
).aggregate(np.mean)
# Launch the dashboard
tru.run_dashboard()
# Export data for external analysis
records_df = tru.get_records_and_feedback()[0]
print(records_df.head())
TruLens supports human feedback integration:
from trulens_eval.feedback import HumanFeedback
# Create a human feedback collector
human_feedback = HumanFeedback(
name="Human Quality Rating",
description="Rate the quality of this response (1-5)"
)
# When a human reviews a response, record it:
# tru.add_feedback(
# record_id="rec_123",
# feedback_name="Human Quality Rating",
# feedback_value=4.5,
# feedback_reason="Accurate and well-formatted"
# )
DeepEval provides both reference-free and reference-based metrics with a focus on developer experience and CI/CD integration.
pip install deepeval
from deepeval import evaluate
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
ContextualPrecisionMetric,
HallucinationMetric,
GEval,
)
from deepeval.test_case import LLMTestCase
# Create test cases
test_case = LLMTestCase(
input="What is the company's vacation policy?",
actual_output="Employees receive 15 days of PTO plus 10 holidays.",
expected_output="15 days PTO and 10 company holidays annually.",
retrieval_context=[
"Vacation Policy: Full-time employees receive 15 days of paid time off per year.",
"Company Holidays: The company observes 10 federal holidays annually."
]
)
# Initialize metrics
faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o-mini")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o-mini")
# Evaluate
evaluate(
test_cases=[test_case],
metrics=[faithfulness, relevancy, hallucination]
)
# Access scores
print(f"Faithfulness: {faithfulness.score}")
print(f"Relevancy: {relevancy.score}")
print(f"Hallucination: {hallucination.score}")G-Eval allows you to define custom evaluation criteria:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
# Define custom evaluation criteria
professionalism_metric = GEval(
name="Professionalism",
criteria="Determine if the response maintains a professional tone appropriate for business communication.",
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7
)
completeness_metric = GEval(
name="Completeness",
criteria="Evaluate whether the response fully addresses all aspects of the question.",
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7
)
evaluate(test_cases=[test_case], metrics=[professionalism_metric, completeness_metric])
DeepEval provides excellent pytest integration:
# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
def rag_query(question: str) -> dict:
# Your actual RAG implementation
return {"answer": "Example answer", "contexts": ["Example context"]}
@pytest.mark.parametrize("question,expected_answer", [
("What is the return policy?", "30 days with receipt"),
("How do I contact support?", "Email support@example.com"),
])
def test_rag_quality(question: str, expected_answer: str):
result = rag_query(question)
test_case = LLMTestCase(
input=question,
actual_output=result["answer"],
expected_output=expected_answer,
retrieval_context=result["contexts"]
)
faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [faithfulness, relevancy])
Run with the DeepEval CLI, which wraps pytest:
# Run evaluation tests
deepeval test run test_rag.py
# With verbose output
deepeval test run test_rag.py --verbose
# Generate report
deepeval test run test_rag.py --report
| Feature | RAGAS | TruLens | DeepEval |
|---|---|---|---|
| Primary Focus | Comprehensive RAG metrics | Observability + feedback | CI/CD integration |
| Metric Style | Reference-based + reference-free | Feedback functions | G-Eval + custom |
| Human Feedback | Limited | Strong support | Moderate |
| Dashboard | External tools | Built-in | Cloud platform |
| CI/CD Integration | Manual | Manual | Native pytest |
| LLM Provider Lock-in | Any (via LangChain) | OpenAI-focused | Any |
| Learning Curve | Moderate | Steep | Gentle |
| Best For | Batch evaluation | Production monitoring | Automated testing |
Understanding what each metric measures helps you interpret results and diagnose issues.
What it measures (Context Precision): Of the documents retrieved, how many are actually relevant?
Formula: Precision = Relevant Retrieved / Total Retrieved
Implementation:
from typing import List
def calculate_context_precision(
relevant_doc_ids: List[str],
retrieved_doc_ids: List[str]
) -> float:
if not retrieved_doc_ids:
return 0.0
relevant_retrieved = set(relevant_doc_ids) & set(retrieved_doc_ids)
return len(relevant_retrieved) / len(retrieved_doc_ids)
# Example
relevant_ids = ["doc_1", "doc_3", "doc_7"] # Ground truth
retrieved_ids = ["doc_1", "doc_2", "doc_3", "doc_5"] # What we retrieved
precision = calculate_context_precision(relevant_ids, retrieved_ids)
print(f"Context Precision: {precision:.2f}") # 0.50 (2 of 4 are relevant)Interpretation:
What it measures (Context Recall): Of the documents that should have been retrieved, how many did we get?
Formula: Recall = Relevant Retrieved / Total Relevant
def calculate_context_recall(
relevant_doc_ids: List[str],
retrieved_doc_ids: List[str]
) -> float:
if not relevant_doc_ids:
return 1.0 # If nothing is relevant, we have perfect recall
relevant_retrieved = set(relevant_doc_ids) & set(retrieved_doc_ids)
return len(relevant_retrieved) / len(relevant_doc_ids)
# Example
recall = calculate_context_recall(relevant_ids, retrieved_ids)
print(f"Context Recall: {recall:.2f}") # 0.67 (2 of 3 relevant docs retrieved)What it measures: How highly ranked is the first relevant document?
Formula: MRR = (1/N) * Sum(1/rank_i)
def mean_reciprocal_rank(
all_retrieved: List[List[str]],
all_relevant: List[List[str]]
) -> float:
rr_scores = []
for retrieved, relevant in zip(all_retrieved, all_relevant):
relevant_set = set(relevant)
for i, doc_id in enumerate(retrieved):
if doc_id in relevant_set:
rr_scores.append(1.0 / (i + 1))
break
else:
rr_scores.append(0.0)
return sum(rr_scores) / len(rr_scores) if rr_scores else 0.0
# MRR of 1.0: First relevant doc is always rank 1
# MRR of 0.5: First relevant doc averages rank 2
# MRR of 0.33: First relevant doc averages rank 3
What it measures (Hit Rate@K): Did we retrieve at least one relevant document in the top K?
def hit_rate_at_k(
all_retrieved: List[List[str]],
all_relevant: List[List[str]],
k: int = 5
) -> float:
hits = 0
for retrieved, relevant in zip(all_retrieved, all_relevant):
top_k = set(retrieved[:k])
relevant_set = set(relevant)
if top_k & relevant_set:
hits += 1
return hits / len(all_retrieved) if all_retrieved else 0.0
What it measures (Faithfulness): Is every claim in the answer supported by the retrieved context?
This is the most critical RAG metric: a faithfulness score below roughly 0.8 usually means your system is adding claims that the retrieved context does not support.
How it works: an LLM extracts the individual factual claims from the answer, checks each claim against the retrieved context, and the score is the fraction of claims that are supported.
from openai import OpenAI
import json
from typing import Any, Dict, List
client = OpenAI()
def calculate_faithfulness(answer: str, contexts: List[str]) -> Dict[str, Any]:
full_context = "\n\n".join(contexts)
# Extract claims
claims_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """Extract all factual claims from the text.
Return JSON: {"claims": ["claim1", "claim2", ...]}"""
}, {
"role": "user",
"content": f"Text: {answer}"
}],
response_format={"type": "json_object"}
)
claims = json.loads(claims_response.choices[0].message.content).get("claims", [])
if not claims:
return {"score": 1.0, "claims": [], "message": "No claims extracted"}
# Verify each claim
supported_count = 0
verifications = []
for claim in claims:
verify_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """Is this claim supported by the context?
Return JSON: {"supported": boolean, "evidence": "explanation"}"""
}, {
"role": "user",
"content": f"Context: {full_context}\n\nClaim: {claim}"
}],
response_format={"type": "json_object"}
)
result = json.loads(verify_response.choices[0].message.content)
verifications.append({"claim": claim, **result})
if result.get("supported"):
supported_count += 1
return {
"score": supported_count / len(claims),
"total_claims": len(claims),
"supported_claims": supported_count,
"verifications": verifications
}
# Example
answer = "The company was founded in 2010 by John Smith. It has over 500 employees."
contexts = [
"Company History: Founded in 2010 by entrepreneur John Smith.",
"Company Size: As of 2024, the company employs approximately 450 staff."
]
result = calculate_faithfulness(answer, contexts)
print(f"Faithfulness: {result['score']:.2f}") # ~0.50 (1 of 2 claims supported)What it measures: Does the answer actually address the question asked?
How it works: an LLM generates several questions the answer could plausibly be responding to, each generated question is embedded, and the score is the average cosine similarity between those embeddings and the embedding of the original question.
import numpy as np
def get_embedding(text: str) -> List[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
def cosine_similarity(a: List[float], b: List[float]) -> float:
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def calculate_answer_relevancy(question: str, answer: str, n_generated: int = 3) -> float:
# Generate questions from answer
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": f"Generate {n_generated} questions this answer could respond to. Return JSON array."
}, {
"role": "user",
"content": f"Answer: {answer}"
}],
response_format={"type": "json_object"}
)
generated_questions = json.loads(response.choices[0].message.content).get("questions", [])
if not generated_questions:
return 0.0
# Compare embeddings
original_emb = get_embedding(question)
similarities = [
cosine_similarity(original_emb, get_embedding(q))
for q in generated_questions
]
return np.mean(similarities)
Evaluation tells you if your RAG works during testing. Observability tells you if it keeps working in production.
If you are using LangChain (covered in Part 2), LangSmith provides seamless observability.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_your_api_key"
os.environ["LANGCHAIN_PROJECT"] = "rag-production"from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm | StrOutputParser()
# This call is automatically traced in LangSmith
response = chain.invoke({"question": "What is RAG?"})
from langsmith import traceable, Client
client = Client()
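# retrieve_documents and generate_answer below are placeholders for your own retrieval and generation functions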
@traceable(name="RAG Query", tags=["production"])
def rag_query(question: str, user_id: str) -> dict:
contexts = retrieve_documents(question)
response = generate_answer(question, contexts)
return {"answer": response, "contexts": contexts, "user_id": user_id}
def record_user_feedback(run_id: str, score: float, comment: str = None):
"""Record user feedback for a traced run."""
client.create_feedback(
run_id=run_id,
key="user_rating",
score=score,
comment=comment
)
Langfuse provides similar functionality with the advantage of self-hosting options.
from langfuse.decorators import observe, langfuse_context
from langfuse import Langfuse
langfuse = Langfuse(
public_key="pk-lf-...",
secret_key="sk-lf-...",
host="https://cloud.langfuse.com" # or self-hosted
)
@observe(as_type="generation")
def generate_answer(prompt: str, context: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": f"Context: {context}"},
{"role": "user", "content": prompt}
]
)
return response.choices[0].message.content
@observe()
def rag_pipeline(question: str) -> dict:
langfuse_context.update_current_observation(
metadata={"question_length": len(question)}
)
contexts = retrieve_documents(question)
langfuse_context.update_current_trace(
metadata={"num_contexts": len(contexts)}
)
answer = generate_answer(question, "\n".join([c.text for c in contexts]))
return {"answer": answer, "contexts": contexts}
# Log evaluation scores
def log_evaluation_scores(trace_id: str, faithfulness: float, relevancy: float):
langfuse.score(trace_id=trace_id, name="faithfulness", value=faithfulness)
langfuse.score(trace_id=trace_id, name="answer_relevancy", value=relevancy)Regardless of platform, log these essential metrics:
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Dict, Any
@dataclass
class RAGTrace:
"""Complete trace for a RAG request."""
# Request identification
trace_id: str
timestamp: datetime
user_id: Optional[str]
session_id: Optional[str]
# Input
question: str
# Retrieval metrics
retrieval_latency_ms: float
num_docs_retrieved: int
retrieval_scores: List[float]
doc_ids: List[str]
# Generation metrics
generation_latency_ms: float
input_tokens: int
output_tokens: int
model_name: str
# Output
answer: str
# Quality scores (if available)
faithfulness_score: Optional[float] = None
relevancy_score: Optional[float] = None
user_feedback: Optional[float] = None
# Metadata
metadata: Optional[Dict[str, Any]] = None
from dataclasses import dataclass
from typing import Callable, List, Dict, Any
from enum import Enum
class AlertSeverity(Enum):
INFO = "info"
WARNING = "warning"
CRITICAL = "critical"
@dataclass
class AlertRule:
name: str
metric: str
condition: Callable[[float], bool]
severity: AlertSeverity
message_template: str
class RAGAlertManager:
def __init__(self, notify: Callable[[str, AlertSeverity], None]):
self.notify = notify
self.rules = [
AlertRule("high_latency", "avg_total_latency_ms",
lambda x: x > 5000, AlertSeverity.WARNING,
"Average latency {value:.0f}ms exceeds 5s"),
AlertRule("low_faithfulness", "avg_faithfulness",
lambda x: x is not None and x < 0.7, AlertSeverity.WARNING,
"Faithfulness {value:.2f} below 0.7"),
AlertRule("critical_faithfulness", "avg_faithfulness",
lambda x: x is not None and x < 0.5, AlertSeverity.CRITICAL,
"Faithfulness {value:.2f} indicates severe hallucination"),
]
def check_metrics(self, metrics: Dict[str, Any]):
for rule in self.rules:
value = metrics.get(rule.metric)
if value is not None and rule.condition(value):
message = rule.message_template.format(value=value)
self.notify(f"[{rule.name}] {message}", rule.severity)from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import json
import random
@dataclass
class EvaluationExample:
question: str
ground_truth_answer: str
relevant_doc_ids: List[str]
category: str # "policy", "technical", "general"
difficulty: str # "easy", "medium", "hard"
metadata: Optional[Dict[str, Any]] = None
class EvaluationDatasetBuilder:
def __init__(self):
self.examples: List[EvaluationExample] = []
def add_from_production(self, traces: List[RAGTrace], human_labels: Dict[str, dict]):
for trace in traces:
if trace.trace_id in human_labels:
label = human_labels[trace.trace_id]
self.examples.append(EvaluationExample(
question=trace.question,
ground_truth_answer=label["ground_truth"],
relevant_doc_ids=label.get("relevant_docs", []),
category=label.get("category", "general"),
difficulty=label.get("difficulty", "medium"),
metadata={"source": "production", "trace_id": trace.trace_id}
))
def get_stratified_sample(self, n: int, by: str = "category") -> List[EvaluationExample]:
groups = {}
for ex in self.examples:
key = getattr(ex, by)
groups.setdefault(key, []).append(ex)
sample = []
per_group = n // len(groups)
for examples in groups.values():
sample.extend(random.sample(examples, min(per_group, len(examples))))
return sample[:n]
def save(self, filepath: str):
with open(filepath, 'w') as f:
json.dump([e.__dict__ for e in self.examples], f, indent=2)
You can also bootstrap an evaluation set by generating synthetic question-answer pairs from your documents:
def generate_qa_pairs(document: str, doc_id: str, num_pairs: int = 5) -> List[EvaluationExample]:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": f"""Generate {num_pairs} question-answer pairs from the document.
Include difficulty levels (easy/medium/hard).
Return JSON: {{"pairs": [{{"question": "...", "answer": "...", "difficulty": "..."}}]}}"""
}, {
"role": "user",
"content": f"Document:\n\n{document}"
}],
response_format={"type": "json_object"}
)
pairs = json.loads(response.choices[0].message.content).get("pairs", [])
return [
EvaluationExample(
question=p["question"],
ground_truth_answer=p["answer"],
relevant_doc_ids=[doc_id],
category="general",
difficulty=p.get("difficulty", "medium"),
metadata={"source": "synthetic"}
)
for p in pairs
]
Compare different retrieval configurations to find the best approach:
from dataclasses import dataclass
from typing import List, Dict, Any, Callable
import random
from datetime import datetime
@dataclass
class RetrievalConfig:
"""Configuration for a retrieval strategy."""
name: str
k: int # Number of documents to retrieve
search_type: str # "semantic", "hybrid", "keyword"
reranker: bool
chunk_size: int
class RetrievalABTest:
"""A/B testing framework for retrieval strategies."""
def __init__(
self,
configs: List[RetrievalConfig],
traffic_split: Dict[str, float] = None
):
self.configs = {c.name: c for c in configs}
# Default: even split
if traffic_split is None:
n = len(configs)
self.traffic_split = {c.name: 1.0/n for c in configs}
else:
self.traffic_split = traffic_split
self.results: Dict[str, List[Dict[str, Any]]] = {
c.name: [] for c in configs
}
def get_config(self, user_id: str = None) -> RetrievalConfig:
"""Get configuration for a request using consistent hashing."""
if user_id:
hash_val = hash(user_id) % 100 / 100.0
else:
hash_val = random.random()
cumulative = 0.0
for name, weight in self.traffic_split.items():
cumulative += weight
if hash_val < cumulative:
return self.configs[name]
return list(self.configs.values())[0]
def record_result(self, config_name: str, metrics: Dict[str, float]):
"""Record results for a configuration."""
self.results[config_name].append({
"timestamp": datetime.now().isoformat(),
"metrics": metrics
})
def analyze_results(self) -> Dict[str, Any]:
"""Analyze A/B test results with statistical significance."""
from scipy import stats
import numpy as np
analysis = {}
for metric in ["faithfulness", "relevancy", "latency_ms"]:
metric_results = {}
for config_name, results in self.results.items():
values = [r["metrics"].get(metric) for r in results if r["metrics"].get(metric)]
if values:
metric_results[config_name] = {
"mean": np.mean(values),
"std": np.std(values),
"n": len(values)
}
# Statistical comparison (if 2 configs)
if len(metric_results) == 2:
configs = list(metric_results.keys())
t_stat, p_value = stats.ttest_ind(
[r["metrics"].get(metric) for r in self.results[configs[0]] if r["metrics"].get(metric)],
[r["metrics"].get(metric) for r in self.results[configs[1]] if r["metrics"].get(metric)]
)
metric_results["significant"] = p_value < 0.05
metric_results["p_value"] = p_value
analysis[metric] = metric_results
return analysis
# Usage
ab_test = RetrievalABTest(
configs=[
RetrievalConfig("baseline", k=5, search_type="semantic", reranker=False, chunk_size=512),
RetrievalConfig("hybrid_rerank", k=10, search_type="hybrid", reranker=True, chunk_size=512),
],
traffic_split={"baseline": 0.5, "hybrid_rerank": 0.5}
)
# During query processing
config = ab_test.get_config(user_id="user_123")
# result = run_rag_with_config(query, config)
# ab_test.record_result(config.name, result.metrics)
# After sufficient data
analysis = ab_test.analyze_results()
print(f"Analysis: {analysis}")# .github/workflows/rag-quality.yml
name: RAG Quality Check
on:
pull_request:
paths: ['src/rag/**', 'prompts/**']
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- run: pip install -r requirements.txt pytest ragas deepeval
- name: Run RAG quality tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
RAG_MIN_FAITHFULNESS: 0.75
RAG_MIN_RELEVANCY: 0.70
run: deepeval test run tests/test_rag_quality.py -v
- name: Upload evaluation results
uses: actions/upload-artifact@v4
with:
name: rag-evaluation-results
path: ./rag_evaluations/
When using LLMs to evaluate LLM outputs, several biases emerge:
Self-enhancement bias: Models rate their own outputs higher.
# Mitigation: Use a different model for evaluation
# If generating with GPT-4o, evaluate with Claude
evaluation_model = "claude-3-sonnet-20240229"Verbosity bias: Longer answers get higher scores.
evaluation_prompt = """
Note: Longer answers are NOT inherently better.
A concise, accurate answer should score higher than a verbose, partially correct one.
"""Position bias: In pairwise comparisons, models favor the first option.
# Mitigation: Randomize order and average
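# evaluate_pair is assumed to be your own pairwise LLM judge returning {"a": score_a, "b": score_b}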
def compare_unbiased(a: str, b: str) -> dict:
score_ab = evaluate_pair(a, b)
score_ba = evaluate_pair(b, a)
return {
"a": (score_ab["a"] + score_ba["b"]) / 2,
"b": (score_ab["b"] + score_ba["a"]) / 2
}
Your evaluation dataset becomes stale as the knowledge base grows, user query patterns shift, and new features change what people ask about.
Solution: Periodically check coverage using embedding similarity between production queries and eval set.
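A minimal sketch of that coverage check, reusing the get_embedding and cosine_similarity helpers defined earlier in this article; the 0.75 similarity floor and the function name are illustrative, not recommendations.

from typing import List

def eval_set_coverage(
    production_queries: List[str],
    eval_questions: List[str],
    similarity_floor: float = 0.75,  # illustrative threshold, tune on your data
) -> float:
    """Fraction of production queries with at least one sufficiently similar eval question."""
    if not production_queries:
        return 1.0
    if not eval_questions:
        return 0.0
    eval_embeddings = [get_embedding(q) for q in eval_questions]
    covered = 0
    for query in production_queries:
        query_embedding = get_embedding(query)
        best_match = max(cosine_similarity(query_embedding, e) for e in eval_embeddings)
        if best_match >= similarity_floor:
            covered += 1
    return covered / len(production_queries)

# Queries that fall below the floor are candidates to add to the eval set
# coverage = eval_set_coverage(recent_production_queries, [ex.question for ex in eval_examples])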
# RAGAS with 4 metrics: ~4 LLM calls per example
# At $0.002/1K tokens, ~500 tokens/call:
# Cost per example: ~$0.004
# 10,000 queries/day: ~$40/day
import random
class CostAwareEvaluator:
def __init__(self, daily_budget_usd: float = 10.0):
self.daily_budget = daily_budget_usd
self.cost_per_eval = 0.004
def should_evaluate(self, metadata: dict) -> bool:
# Always evaluate high-priority
if metadata.get("is_escalation") or metadata.get("user_tier") == "enterprise":
return True
# Sample 10% of remaining
return random.random() < 0.1
Finally, beware of optimizing the metric instead of the user experience:
# Anti-pattern: Prompt engineered for high faithfulness
bad_prompt = """
Start every sentence with "According to the context..."
Never add any information not explicitly stated.
"""
# Scores 0.99 faithfulness but produces robotic answers
# Better: Balance automated metrics with human evaluation
composite_score = 0.6 * automated_score + 0.4 * human_rating_normalized
You now have the tools to measure whether your RAG system works, not just whether it runs.
In Part 11, we will cover Production RAG at Scale: deployment patterns, caching strategies, and handling thousands of queries per second without breaking the bank.
The evaluation foundations from this article will be essential. You cannot optimize what you cannot measure.
Evaluation Frameworks: RAGAS, TruLens, DeepEval
Observability Platforms: LangSmith, Langfuse
This is Part 10 of the "Building RAG Systems: A Platform-by-Platform Guide" series. Navigate to Part 1: Foundations to start from the beginning, or continue to Part 11: Production RAG at Scale when it becomes available.