Part 10 of 12
Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
The hardest part of RAG is not building it. It is knowing whether it works.
You have built a RAG system. It retrieves documents, generates answers, and the demo went great. Then it goes to production, and three months later you discover the problems the demo never surfaced: confident answers that are not grounded in your documents, retrievals that miss the obvious source, and no way to tell which stage is at fault.
If you have followed this series through LangChain (Part 2), LlamaIndex (Part 3), Haystack (Part 4), and the platform-specific implementations, you know how to build RAG systems. This article is about how to know if they work.
RAG evaluation is fundamentally harder than evaluating traditional software. A unit test can verify that add(2, 2) returns 4. But how do you verify that "Explain our refund policy" retrieves the right documents and generates an accurate, helpful response? The answer involves a combination of automated metrics, human evaluation, and continuous monitoring.
Before diving into solutions, we need to understand the unique challenges of evaluating RAG systems.
RAG has two distinct stages that can fail independently:
Retrieval failures:
- Wrong documents retrieved
- Right documents ranked poorly
- Missing relevant context
- Retrieved but not used
- Semantic drift in the query

Generation failures:
- Correct context, wrong answer
- Hallucination despite context
- Over-reliance on parametric knowledge
- Misinterpretation of context
- Incomplete synthesis

The same wrong answer can result from:
1. Retrieval found irrelevant docs → generation did its best with them
2. Retrieval found perfect docs → generation ignored or misused them
3. Partial retrieval failure → partial generation failure

When a RAG system gives a wrong answer, you need to diagnose where it failed before you can fix it. This requires evaluating each stage independently.
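To make that diagnosis concrete, here is a minimal triage sketch. It assumes you already have per-stage scores (for example, context recall for retrieval and faithfulness for generation, both covered later in this article); the threshold values are illustrative, not prescriptive.

from dataclasses import dataclass

@dataclass
class StageScores:
    context_recall: float   # did retrieval surface the documents the answer needed?
    faithfulness: float     # is the answer grounded in what was retrieved?

def diagnose_failure(
    scores: StageScores,
    recall_floor: float = 0.7,        # illustrative threshold
    faithfulness_floor: float = 0.8,  # illustrative threshold
) -> str:
    """Route a bad answer to the stage most likely responsible."""
    retrieval_ok = scores.context_recall >= recall_floor
    generation_ok = scores.faithfulness >= faithfulness_floor
    if not retrieval_ok and not generation_ok:
        return "both stages failed: fix retrieval first, then re-check generation"
    if not retrieval_ok:
        return "retrieval failure: the right documents never reached the prompt"
    if not generation_ok:
        return "generation failure: context was retrieved but the answer is not grounded in it"
    return "scores look healthy: re-examine the question or the evaluation itself"

# Example: retrieval did its job, but the answer is not grounded
print(diagnose_failure(StageScores(context_recall=0.9, faithfulness=0.4)))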
Traditional ML evaluation relies on labeled datasets: input X should produce output Y. RAG breaks this assumption in several ways.
For retrieval: which documents count as "relevant" to a query is often a judgment call, and several documents may each cover part of the answer.
For generation: there is rarely a single correct answer; many phrasings, levels of detail, and tones can all be acceptable.
Creating ground truth datasets for RAG is expensive, subjective, and quickly outdated as your knowledge base evolves.
Consider two responses to "What is our return policy?":
Response A: "Our return policy allows returns within 30 days of purchase with original receipt. Items must be unused and in original packaging. Electronics have a 15-day return window."
Response B: "You can return stuff within a month if you have the receipt. Keep the packaging. Tech stuff is two weeks."
Both are technically correct. Response A is more professional. Response B is more conversational. Which is "better" depends on your use case, brand voice, and user expectations. Automated metrics struggle with these subjective dimensions.
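One pragmatic workaround is to put your subjective criteria into an explicit rubric and have an LLM judge score against it; DeepEval's G-Eval, shown later in this article, formalizes exactly this pattern. Below is a minimal sketch using the OpenAI client, assuming an API key is configured; the rubric wording and function name are illustrative.

import json
from openai import OpenAI

client = OpenAI()

def judge_against_rubric(answer: str, rubric: str) -> dict:
    """Ask an LLM judge to score an answer against an explicit, subjective rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Score the answer from 1 to 5 against the rubric. "
                'Return JSON: {"score": number, "reason": "short explanation"}'
            )},
            {"role": "user", "content": f"Rubric: {rubric}\n\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: encode your brand voice as the rubric
rubric = "Responses should be professional, specific, and free of slang."
print(judge_against_rubric("You can return stuff within a month if you have the receipt.", rubric))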
A RAG system that performs brilliantly on your test queries may fail catastrophically on real user queries: users phrase questions in ways your test set never anticipated, the knowledge base keeps changing underneath you, and the overall query distribution drifts over time.
This is why RAG evaluation must be continuous, not just at deployment time.
Several frameworks have emerged to address RAG evaluation. Each takes a different approach to the problem.
RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for RAG evaluation. Developed by the Explodinggradients team, it provides a comprehensive suite of metrics that evaluate both retrieval and generation quality.
pip install ragas datasets
RAGAS defines four primary metrics:
1. Faithfulness - Does the answer stay true to the retrieved context?
2. Context Precision - Are the retrieved documents relevant?
3. Context Recall - Did we retrieve all relevant documents?
4. Answer Relevancy - Does the answer actually address the question?
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
# Prepare your evaluation dataset
# Each example needs: question, answer, contexts, ground_truth (optional)
eval_data = {
"question": [
"What is the company's return policy?",
"How do I contact customer support?",
"What payment methods are accepted?",
],
"answer": [
"Returns are accepted within 30 days with original receipt. Items must be unused.",
"Customer support can be reached at support@example.com or 1-800-555-0199.",
"We accept Visa, Mastercard, American Express, and PayPal.",
],
"contexts": [
[
"Return Policy: Items may be returned within 30 days of purchase. Original receipt required.",
"Refund Processing: Refunds are processed within 5-7 business days."
],
[
"Contact Us: For support inquiries, email support@example.com or call 1-800-555-0199.",
],
[
"Payment Options: We accept major credit cards (Visa, Mastercard, American Express), PayPal, and Apple Pay.",
],
],
"ground_truth": [
"30-day returns with receipt, items must be unused and in original packaging.",
"Email: support@example.com, Phone: 1-800-555-0199",
"Visa, Mastercard, American Express, PayPal, and Apple Pay",
],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# View aggregate results
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
# 'context_precision': 0.85, 'context_recall': 0.79}
# Get per-question scores for debugging
df = results.to_pandas()
print(df)
For production use, wrap RAGAS in a reusable evaluation pipeline:
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
import json
from datetime import datetime
import os
@dataclass
class RAGEvaluationResult:
"""Container for RAG evaluation results."""
timestamp: str
num_samples: int
metrics: Dict[str, float]
per_sample_scores: List[Dict[str, Any]]
config: Dict[str, Any]
class RAGASEvaluator:
"""Production-ready RAGAS evaluation wrapper."""
def __init__(
self,
metrics: Optional[List] = None,
save_results: bool = True,
results_dir: str = "./evaluation_results"
):
self.metrics = metrics or [
faithfulness,
answer_relevancy,
context_precision,
context_recall,
]
self.save_results = save_results
self.results_dir = results_dir
if save_results:
os.makedirs(results_dir, exist_ok=True)
def evaluate(
self,
questions: List[str],
answers: List[str],
contexts: List[List[str]],
ground_truths: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None
) -> RAGEvaluationResult:
"""Run RAGAS evaluation on a set of RAG outputs."""
# Prepare dataset
eval_dict = {
"question": questions,
"answer": answers,
"contexts": contexts,
}
if ground_truths:
eval_dict["ground_truth"] = ground_truths
dataset = Dataset.from_dict(eval_dict)
# Run evaluation
results = evaluate(dataset, metrics=self.metrics)
# Extract per-sample scores
df = results.to_pandas()
# Create result object
evaluation_result = RAGEvaluationResult(
timestamp=datetime.now().isoformat(),
num_samples=len(questions),
metrics={k: v for k, v in results.items() if isinstance(v, (int, float))},
per_sample_scores=df.to_dict(orient='records'),
config={
"metrics": [m.name for m in self.metrics],
**(metadata or {})
}
)
if self.save_results:
self._save_results(evaluation_result)
return evaluation_result
def _save_results(self, result: RAGEvaluationResult):
"""Save evaluation results to JSON file."""
filename = f"eval_{result.timestamp.replace(':', '-')}.json"
filepath = os.path.join(self.results_dir, filename)
with open(filepath, 'w') as f:
json.dump({
"timestamp": result.timestamp,
"num_samples": result.num_samples,
"metrics": result.metrics,
"per_sample_scores": result.per_sample_scores,
"config": result.config
}, f, indent=2)
def compare_runs(self, run_ids: List[str]) -> Dict[str, Any]:
"""Compare metrics across multiple evaluation runs."""
results = []
for run_id in run_ids:
filepath = os.path.join(self.results_dir, f"eval_{run_id}.json")
with open(filepath, 'r') as f:
results.append(json.load(f))
comparison = {"runs": run_ids, "metric_comparison": {}}
all_metrics = set()
for r in results:
all_metrics.update(r["metrics"].keys())
for metric in all_metrics:
comparison["metric_comparison"][metric] = [
r["metrics"].get(metric) for r in results
]
return comparison
# Usage example
evaluator = RAGASEvaluator(save_results=True, results_dir="./rag_evaluations")
result = evaluator.evaluate(
questions=["What is the refund policy?", "How do I reset my password?"],
answers=[
"Refunds are processed within 30 days.",
"Click 'Forgot Password' on the login page."
],
contexts=[
["Refund Policy: All refunds processed within 30 days of request."],
["Password Reset: Use the 'Forgot Password' link on the login page."]
],
metadata={"version": "v1.2.0", "experiment": "baseline"}
)
print(f"Faithfulness: {result.metrics.get('faithfulness', 'N/A'):.2f}")TruLens takes a different approach, focusing on "feedback functions" - composable evaluation criteria that can be applied to any part of your RAG pipeline.
pip install trulens-eval trulens-providers-openai
TruLens wraps your RAG application and instruments it to capture the inputs, outputs, and intermediate steps of each call, along with the feedback scores computed over them.
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback import Groundedness, AnswerRelevance, ContextRelevance
from trulens_eval.feedback.provider import OpenAI
import numpy as np
# Initialize TruLens
tru = Tru()
# Initialize feedback provider
openai_provider = OpenAI()
# Define feedback functions
groundedness = Groundedness(groundedness_provider=openai_provider)
answer_relevance = AnswerRelevance(provider=openai_provider)
context_relevance = ContextRelevance(provider=openai_provider)
# Create feedback objects
f_groundedness = Feedback(
groundedness.groundedness_measure_with_cot_reasons,
name="Groundedness"
).on(
TruChain.select_context()
).on_output()
f_answer_relevance = Feedback(
answer_relevance.relevance,
name="Answer Relevance"
).on_input().on_output()
f_context_relevance = Feedback(
context_relevance.relevance,
name="Context Relevance"
).on_input().on(
TruChain.select_context()
).aggregate(np.mean)
# Launch the dashboard
tru.run_dashboard()
# Export data for external analysis
records_df = tru.get_records_and_feedback()[0]
print(records_df.head())
TruLens supports human feedback integration:
from trulens_eval.feedback import HumanFeedback
# Create a human feedback collector
human_feedback = HumanFeedback(
name="Human Quality Rating",
description="Rate the quality of this response (1-5)"
)
# When a human reviews a response, record it:
# tru.add_feedback(
# record_id="rec_123",
# feedback_name="Human Quality Rating",
# feedback_value=4.5,
# feedback_reason="Accurate and well-formatted"
# )
DeepEval provides both reference-free and reference-based metrics with a focus on developer experience and CI/CD integration.
pip install deepeval
from deepeval import evaluate
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
ContextualPrecisionMetric,
HallucinationMetric,
GEval,
)
from deepeval.test_case import LLMTestCase
# Create test cases
test_case = LLMTestCase(
input="What is the company's vacation policy?",
actual_output="Employees receive 15 days of PTO plus 10 holidays.",
expected_output="15 days PTO and 10 company holidays annually.",
retrieval_context=[
"Vacation Policy: Full-time employees receive 15 days of paid time off per year.",
"Company Holidays: The company observes 10 federal holidays annually."
]
)
# Initialize metrics
faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o-mini")
relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o-mini")
# Evaluate
evaluate(
test_cases=[test_case],
metrics=[faithfulness, relevancy, hallucination]
)
# Access scores
print(f"Faithfulness: {faithfulness.score}")
print(f"Relevancy: {relevancy.score}")
print(f"Hallucination: {hallucination.score}")G-Eval allows you to define custom evaluation criteria:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
# Define custom evaluation criteria
professionalism_metric = GEval(
name="Professionalism",
criteria="Determine if the response maintains a professional tone appropriate for business communication.",
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7
)
completeness_metric = GEval(
name="Completeness",
criteria="Evaluate whether the response fully addresses all aspects of the question.",
evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7
)
evaluate(test_cases=[test_case], metrics=[professionalism_metric, completeness_metric])
DeepEval provides excellent pytest integration:
# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
def rag_query(question: str) -> dict:
# Your actual RAG implementation
return {"answer": "Example answer", "contexts": ["Example context"]}
@pytest.mark.parametrize("question,expected_answer", [
("What is the return policy?", "30 days with receipt"),
("How do I contact support?", "Email support@example.com"),
])
def test_rag_quality(question: str, expected_answer: str):
result = rag_query(question)
test_case = LLMTestCase(
input=question,
actual_output=result["answer"],
expected_output=expected_answer,
retrieval_context=result["contexts"]
)
faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)
assert_test(test_case, [faithfulness, relevancy])
Run with the DeepEval CLI, which wraps pytest:
# Run evaluation tests
deepeval test run test_rag.py
# With verbose output
deepeval test run test_rag.py --verbose
# Generate report
deepeval test run test_rag.py --report
| Feature | RAGAS | TruLens | DeepEval |
|---|---|---|---|
| Primary Focus | Comprehensive RAG metrics | Observability + feedback | CI/CD integration |
| Metric Style | Reference-based + reference-free | Feedback functions | G-Eval + custom |
| Human Feedback | Limited | Strong support | Moderate |
| Dashboard | External tools | Built-in | Cloud platform |
| CI/CD Integration | Manual | Manual | Native pytest |
| LLM Provider Lock-in | Any (via LangChain) | OpenAI-focused | Any |
| Learning Curve | Moderate | Steep | Gentle |
| Best For | Batch evaluation | Production monitoring | Automated testing |
Understanding what each metric measures helps you interpret results and diagnose issues.
What it measures (Context Precision): Of the documents retrieved, how many are actually relevant?
Formula: Precision = Relevant Retrieved / Total Retrieved
Implementation:
from typing import List
def calculate_context_precision(
relevant_doc_ids: List[str],
retrieved_doc_ids: List[str]
) -> float:
if not retrieved_doc_ids:
return 0.0
relevant_retrieved = set(relevant_doc_ids) & set(retrieved_doc_ids)
return len(relevant_retrieved) / len(retrieved_doc_ids)
# Example
relevant_ids = ["doc_1", "doc_3", "doc_7"] # Ground truth
retrieved_ids = ["doc_1", "doc_2", "doc_3", "doc_5"] # What we retrieved
precision = calculate_context_precision(relevant_ids, retrieved_ids)
print(f"Context Precision: {precision:.2f}") # 0.50 (2 of 4 are relevant)Interpretation:
What it measures (Context Recall): Of the documents that should have been retrieved, how many did we get?
Formula: Recall = Relevant Retrieved / Total Relevant
def calculate_context_recall(
relevant_doc_ids: List[str],
retrieved_doc_ids: List[str]
) -> float:
if not relevant_doc_ids:
return 1.0 # If nothing is relevant, we have perfect recall
relevant_retrieved = set(relevant_doc_ids) & set(retrieved_doc_ids)
return len(relevant_retrieved) / len(relevant_doc_ids)
# Example
recall = calculate_context_recall(relevant_ids, retrieved_ids)
print(f"Context Recall: {recall:.2f}") # 0.67 (2 of 3 relevant docs retrieved)What it measures: How highly ranked is the first relevant document?
Formula: MRR = (1/N) * Sum(1/rank_i)
def mean_reciprocal_rank(
all_retrieved: List[List[str]],
all_relevant: List[List[str]]
) -> float:
rr_scores = []
for retrieved, relevant in zip(all_retrieved, all_relevant):
relevant_set = set(relevant)
for i, doc_id in enumerate(retrieved):
if doc_id in relevant_set:
rr_scores.append(1.0 / (i + 1))
break
else:
rr_scores.append(0.0)
return sum(rr_scores) / len(rr_scores) if rr_scores else 0.0
# MRR of 1.0: First relevant doc is always rank 1
# MRR of 0.5: First relevant doc averages rank 2
# MRR of 0.33: First relevant doc averages rank 3
What it measures (Hit Rate@K): Did we retrieve at least one relevant document in the top K?
def hit_rate_at_k(
all_retrieved: List[List[str]],
all_relevant: List[List[str]],
k: int = 5
) -> float:
hits = 0
for retrieved, relevant in zip(all_retrieved, all_relevant):
top_k = set(retrieved[:k])
relevant_set = set(relevant)
if top_k & relevant_set:
hits += 1
return hits / len(all_retrieved) if all_retrieved else 0.0
What it measures (Faithfulness): Is every claim in the answer supported by the retrieved context?
This is the most critical RAG metric: a faithfulness score below roughly 0.8 usually means your system is adding claims that the retrieved context does not support.
How it works: an LLM extracts the individual factual claims from the answer, checks each claim against the retrieved context, and the score is the fraction of claims that are supported.
from openai import OpenAI
import json
from typing import Any, Dict, List
client = OpenAI()
def calculate_faithfulness(answer: str, contexts: List[str]) -> Dict[str, Any]:
full_context = "\n\n".join(contexts)
# Extract claims
claims_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """Extract all factual claims from the text.
Return JSON: {"claims": ["claim1", "claim2", ...]}"""
}, {
"role": "user",
"content": f"Text: {answer}"
}],
response_format={"type": "json_object"}
)
claims = json.loads(claims_response.choices[0].message.content).get("claims", [])
if not claims:
return {"score": 1.0, "claims": [], "message": "No claims extracted"}
# Verify each claim
supported_count = 0
verifications = []
for claim in claims:
verify_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """Is this claim supported by the context?
Return JSON: {"supported": boolean, "evidence": "explanation"}"""
}, {
"role": "user",
"content": f"Context: {full_context}\n\nClaim: {claim}"
}],
response_format={"type": "json_object"}
)
result = json.loads(verify_response.choices[0].message.content)
verifications.append({"claim": claim, **result})
if result.get("supported"):
supported_count += 1
return {
"score": supported_count / len(claims),
"total_claims": len(claims),
"supported_claims": supported_count,
"verifications": verifications
}
# Example
answer = "The company was founded in 2010 by John Smith. It has over 500 employees."
contexts = [
"Company History: Founded in 2010 by entrepreneur John Smith.",
"Company Size: As of 2024, the company employs approximately 450 staff."
]
result = calculate_faithfulness(answer, contexts)
print(f"Faithfulness: {result['score']:.2f}") # ~0.50 (1 of 2 claims supported)What it measures: Does the answer actually address the question asked?
How it works: an LLM generates several questions the answer could plausibly be responding to, each generated question is embedded, and the score is the average cosine similarity between those embeddings and the embedding of the original question.
import numpy as np
def get_embedding(text: str) -> List[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
def cosine_similarity(a: List[float], b: List[float]) -> float:
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def calculate_answer_relevancy(question: str, answer: str, n_generated: int = 3) -> float:
# Generate questions from answer
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": f"Generate {n_generated} questions this answer could respond to. Return JSON array."
}, {
"role": "user",
"content": f"Answer: {answer}"
}],
response_format={"type": "json_object"}
)
generated_questions = json.loads(response.choices[0].message.content).get("questions", [])
if not generated_questions:
return 0.0
# Compare embeddings
original_emb = get_embedding(question)
similarities = [
cosine_similarity(original_emb, get_embedding(q))
for q in generated_questions
]
return np.mean(similarities)
Evaluation tells you if your RAG works during testing. Observability tells you if it keeps working in production.
If you are using LangChain (covered in Part 2), LangSmith provides seamless observability.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_your_api_key"
os.environ["LANGCHAIN_PROJECT"] = "rag-production"from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm | StrOutputParser()
# This call is automatically traced in LangSmith
response = chain.invoke({"question": "What is RAG?"})
from langsmith import traceable, Client
client = Client()
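# retrieve_documents and generate_answer below are placeholders for your own retrieval and generation functions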
@traceable(name="RAG Query", tags=["production"])
def rag_query(question: str, user_id: str) -> dict:
contexts = retrieve_documents(question)
response = generate_answer(question, contexts)
return {"answer": response, "contexts": contexts, "user_id": user_id}
def record_user_feedback(run_id: str, score: float, comment: str = None):
"""Record user feedback for a traced run."""
client.create_feedback(
run_id=run_id,
key="user_rating",
score=score,
comment=comment
)
Langfuse provides similar functionality with the advantage of self-hosting options.
from langfuse.decorators import observe, langfuse_context
from langfuse import Langfuse
langfuse = Langfuse(
public_key="pk-lf-...",
secret_key="sk-lf-...",
host="https://cloud.langfuse.com" # or self-hosted
)
@observe(as_type="generation")
def generate_answer(prompt: str, context: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": f"Context: {context}"},
{"role": "user", "content": prompt}
]
)
return response.choices[0].message.content
@observe()
def rag_pipeline(question: str) -> dict:
langfuse_context.update_current_observation(
metadata={"question_length": len(question)}
)
contexts = retrieve_documents(question)
langfuse_context.update_current_trace(
metadata={"num_contexts": len(contexts)}
)
answer = generate_answer(question, "\n".join([c.text for c in contexts]))
return {"answer": answer, "contexts": contexts}
# Log evaluation scores
def log_evaluation_scores(trace_id: str, faithfulness: float, relevancy: float):
langfuse.score(trace_id=trace_id, name="faithfulness", value=faithfulness)
langfuse.score(trace_id=trace_id, name="answer_relevancy", value=relevancy)Regardless of platform, log these essential metrics:
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Dict, Any
@dataclass
class RAGTrace:
"""Complete trace for a RAG request."""
# Request identification
trace_id: str
timestamp: datetime
user_id: Optional[str]
session_id: Optional[str]
# Input
question: str
# Retrieval metrics
retrieval_latency_ms: float
num_docs_retrieved: int
retrieval_scores: List[float]
doc_ids: List[str]
# Generation metrics
generation_latency_ms: float
input_tokens: int
output_tokens: int
model_name: str
# Output
answer: str
# Quality scores (if available)
faithfulness_score: Optional[float] = None
relevancy_score: Optional[float] = None
user_feedback: Optional[float] = None
# Metadata
metadata: Optional[Dict[str, Any]] = None
from dataclasses import dataclass
from typing import Callable, List, Dict, Any
from enum import Enum
class AlertSeverity(Enum):
INFO = "info"
WARNING = "warning"
CRITICAL = "critical"
@dataclass
class AlertRule:
name: str
metric: str
condition: Callable[[float], bool]
severity: AlertSeverity
message_template: str
class RAGAlertManager:
def __init__(self, notify: Callable[[str, AlertSeverity], None]):
self.notify = notify
self.rules = [
AlertRule("high_latency", "avg_total_latency_ms",
lambda x: x > 5000, AlertSeverity.WARNING,
"Average latency {value:.0f}ms exceeds 5s"),
AlertRule("low_faithfulness", "avg_faithfulness",
lambda x: x is not None and x < 0.7, AlertSeverity.WARNING,
"Faithfulness {value:.2f} below 0.7"),
AlertRule("critical_faithfulness", "avg_faithfulness",
lambda x: x is not None and x < 0.5, AlertSeverity.CRITICAL,
"Faithfulness {value:.2f} indicates severe hallucination"),
]
def check_metrics(self, metrics: Dict[str, Any]):
for rule in self.rules:
value = metrics.get(rule.metric)
if value is not None and rule.condition(value):
message = rule.message_template.format(value=value)
self.notify(f"[{rule.name}] {message}", rule.severity)from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import json
import random
@dataclass
class EvaluationExample:
question: str
ground_truth_answer: str
relevant_doc_ids: List[str]
category: str # "policy", "technical", "general"
difficulty: str # "easy", "medium", "hard"
metadata: Optional[Dict[str, Any]] = None
class EvaluationDatasetBuilder:
def __init__(self):
self.examples: List[EvaluationExample] = []
def add_from_production(self, traces: List[RAGTrace], human_labels: Dict[str, dict]):
for trace in traces:
if trace.trace_id in human_labels:
label = human_labels[trace.trace_id]
self.examples.append(EvaluationExample(
question=trace.question,
ground_truth_answer=label["ground_truth"],
relevant_doc_ids=label.get("relevant_docs", []),
category=label.get("category", "general"),
difficulty=label.get("difficulty", "medium"),
metadata={"source": "production", "trace_id": trace.trace_id}
))
def get_stratified_sample(self, n: int, by: str = "category") -> List[EvaluationExample]:
groups = {}
for ex in self.examples:
key = getattr(ex, by)
groups.setdefault(key, []).append(ex)
sample = []
per_group = n // len(groups)
for examples in groups.values():
sample.extend(random.sample(examples, min(per_group, len(examples))))
return sample[:n]
def save(self, filepath: str):
with open(filepath, 'w') as f:
json.dump([e.__dict__ for e in self.examples], f, indent=2)
You can also bootstrap an evaluation set by generating synthetic question-answer pairs from your documents:
def generate_qa_pairs(document: str, doc_id: str, num_pairs: int = 5) -> List[EvaluationExample]:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": f"""Generate {num_pairs} question-answer pairs from the document.
Include difficulty levels (easy/medium/hard).
Return JSON: {{"pairs": [{{"question": "...", "answer": "...", "difficulty": "..."}}]}}"""
}, {
"role": "user",
"content": f"Document:\n\n{document}"
}],
response_format={"type": "json_object"}
)
pairs = json.loads(response.choices[0].message.content).get("pairs", [])
return [
EvaluationExample(
question=p["question"],
ground_truth_answer=p["answer"],
relevant_doc_ids=[doc_id],
category="general",
difficulty=p.get("difficulty", "medium"),
metadata={"source": "synthetic"}
)
for p in pairs
]
Compare different retrieval configurations to find the best approach:
from dataclasses import dataclass
from typing import List, Dict, Any, Callable
import random
from datetime import datetime
@dataclass
class RetrievalConfig:
"""Configuration for a retrieval strategy."""
name: str
k: int # Number of documents to retrieve
search_type: str # "semantic", "hybrid", "keyword"
reranker: bool
chunk_size: int
class RetrievalABTest:
"""A/B testing framework for retrieval strategies."""
def __init__(
self,
configs: List[RetrievalConfig],
traffic_split: Dict[str, float] = None
):
self.configs = {c.name: c for c in configs}
# Default: even split
if traffic_split is None:
n = len(configs)
self.traffic_split = {c.name: 1.0/n for c in configs}
else:
self.traffic_split = traffic_split
self.results: Dict[str, List[Dict[str, Any]]] = {
c.name: [] for c in configs
}
def get_config(self, user_id: str = None) -> RetrievalConfig:
"""Get configuration for a request using consistent hashing."""
if user_id:
hash_val = hash(user_id) % 100 / 100.0
else:
hash_val = random.random()
cumulative = 0.0
for name, weight in self.traffic_split.items():
cumulative += weight
if hash_val < cumulative:
return self.configs[name]
return list(self.configs.values())[0]
def record_result(self, config_name: str, metrics: Dict[str, float]):
"""Record results for a configuration."""
self.results[config_name].append({
"timestamp": datetime.now().isoformat(),
"metrics": metrics
})
def analyze_results(self) -> Dict[str, Any]:
"""Analyze A/B test results with statistical significance."""
from scipy import stats
import numpy as np
analysis = {}
for metric in ["faithfulness", "relevancy", "latency_ms"]:
metric_results = {}
for config_name, results in self.results.items():
values = [r["metrics"].get(metric) for r in results if r["metrics"].get(metric)]
if values:
metric_results[config_name] = {
"mean": np.mean(values),
"std": np.std(values),
"n": len(values)
}
# Statistical comparison (if 2 configs)
if len(metric_results) == 2:
configs = list(metric_results.keys())
t_stat, p_value = stats.ttest_ind(
[r["metrics"].get(metric) for r in self.results[configs[0]] if r["metrics"].get(metric)],
[r["metrics"].get(metric) for r in self.results[configs[1]] if r["metrics"].get(metric)]
)
metric_results["significant"] = p_value < 0.05
metric_results["p_value"] = p_value
analysis[metric] = metric_results
return analysis
# Usage
ab_test = RetrievalABTest(
configs=[
RetrievalConfig("baseline", k=5, search_type="semantic", reranker=False, chunk_size=512),
RetrievalConfig("hybrid_rerank", k=10, search_type="hybrid", reranker=True, chunk_size=512),
],
traffic_split={"baseline": 0.5, "hybrid_rerank": 0.5}
)
# During query processing
config = ab_test.get_config(user_id="user_123")
# result = run_rag_with_config(query, config)
# ab_test.record_result(config.name, result.metrics)
# After sufficient data
analysis = ab_test.analyze_results()
print(f"Analysis: {analysis}")# .github/workflows/rag-quality.yml
name: RAG Quality Check
on:
pull_request:
paths: ['src/rag/**', 'prompts/**']
jobs:
quality-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- run: pip install -r requirements.txt pytest ragas deepeval
- name: Run RAG quality tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
RAG_MIN_FAITHFULNESS: 0.75
RAG_MIN_RELEVANCY: 0.70
run: deepeval test run tests/test_rag_quality.py -v
- name: Upload evaluation results
uses: actions/upload-artifact@v4
with:
name: rag-evaluation-results
path: ./rag_evaluations/
When using LLMs to evaluate LLM outputs, several biases emerge:
Self-enhancement bias: Models rate their own outputs higher.
# Mitigation: Use a different model for evaluation
# If generating with GPT-4o, evaluate with Claude
evaluation_model = "claude-3-sonnet-20240229"Verbosity bias: Longer answers get higher scores.
evaluation_prompt = """
Note: Longer answers are NOT inherently better.
A concise, accurate answer should score higher than a verbose, partially correct one.
"""Position bias: In pairwise comparisons, models favor the first option.
# Mitigation: Randomize order and average
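# evaluate_pair is assumed to be your own pairwise LLM judge returning {"a": score_a, "b": score_b}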
def compare_unbiased(a: str, b: str) -> dict:
score_ab = evaluate_pair(a, b)
score_ba = evaluate_pair(b, a)
return {
"a": (score_ab["a"] + score_ba["b"]) / 2,
"b": (score_ab["b"] + score_ba["a"]) / 2
}
Your evaluation dataset becomes stale as the knowledge base grows, user query patterns shift, and new features change what people ask about.
Solution: Periodically check coverage using embedding similarity between production queries and eval set.
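A minimal sketch of that coverage check, reusing the get_embedding and cosine_similarity helpers defined earlier in this article; the 0.75 similarity floor and the function name are illustrative, not recommendations.

from typing import List

def eval_set_coverage(
    production_queries: List[str],
    eval_questions: List[str],
    similarity_floor: float = 0.75,  # illustrative threshold, tune on your data
) -> float:
    """Fraction of production queries with at least one sufficiently similar eval question."""
    if not production_queries:
        return 1.0
    if not eval_questions:
        return 0.0
    eval_embeddings = [get_embedding(q) for q in eval_questions]
    covered = 0
    for query in production_queries:
        query_embedding = get_embedding(query)
        best_match = max(cosine_similarity(query_embedding, e) for e in eval_embeddings)
        if best_match >= similarity_floor:
            covered += 1
    return covered / len(production_queries)

# Queries that fall below the floor are candidates to add to the eval set
# coverage = eval_set_coverage(recent_production_queries, [ex.question for ex in eval_examples])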
# RAGAS with 4 metrics: ~4 LLM calls per example
# At $0.002/1K tokens, ~500 tokens/call:
# Cost per example: ~$0.004
# 10,000 queries/day: ~$40/day
import random
class CostAwareEvaluator:
def __init__(self, daily_budget_usd: float = 10.0):
self.daily_budget = daily_budget_usd
self.cost_per_eval = 0.004
def should_evaluate(self, metadata: dict) -> bool:
# Always evaluate high-priority
if metadata.get("is_escalation") or metadata.get("user_tier") == "enterprise":
return True
# Sample 10% of remaining
return random.random() < 0.1
Finally, beware of optimizing the metric instead of the user experience:
# Anti-pattern: Prompt engineered for high faithfulness
bad_prompt = """
Start every sentence with "According to the context..."
Never add any information not explicitly stated.
"""
# Scores 0.99 faithfulness but produces robotic answers
# Better: Balance automated metrics with human evaluation
composite_score = 0.6 * automated_score + 0.4 * human_rating_normalized
You now have the tools to measure whether your RAG system works, not just whether it runs.
In Part 11, we will cover Production RAG at Scale: deployment patterns, caching strategies, and handling thousands of queries per second without breaking the bank.
The evaluation foundations from this article will be essential. You cannot optimize what you cannot measure.
Evaluation Frameworks: RAGAS, TruLens, DeepEval
Observability Platforms: LangSmith, Langfuse
This is Part 10 of the "Building RAG Systems: A Platform-by-Platform Guide" series. Navigate to Part 1: Foundations to start from the beginning, or continue to Part 11: Production RAG at Scale when it becomes available.