
RAG and Application Evaluation

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Can an LLM Grade Another LLM?

In Part 01, we built evaluation Datasets with deterministic evaluators — EqualsExpected, Contains, custom ratio checks. These are fast, free, and perfectly reliable for what they measure. But we hit a wall: they can’t assess subjective quality.

Consider evaluating a chatbot response. We could check that it contains certain keywords (Contains), or that it’s the right type (IsInstance). But can we automatically check whether the response is helpful? Whether it addresses the user’s concern empathetically? Whether it’s faithful to the source documents?

These are judgment calls that require reading comprehension, reasoning, and contextual understanding — exactly the kind of thing LLMs are good at. So here’s the idea that has reshaped evaluation in the past two years: use an LLM as the evaluator. This is the LLM-as-judge pattern.

In pydantic-evals, this pattern is implemented by the LLMJudge evaluator, and it’s the centerpiece of today’s lecture.

The LLMJudge Evaluator

Setup

We’ll use our course LiteLLM proxy, just as in Weeks 8-10.

import os
from dotenv import load_dotenv
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"


def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )

Basic Assertion Mode

At its simplest, LLMJudge takes a rubric — a plain-English description of what “good” looks like — and returns a pass/fail verdict with reasoning.

To demonstrate how LLMJudge discriminates between good and bad outputs, we’ll package each Case’s input as a dict containing both the original article and the summary we want to judge. A simple identity function passes them through — the evaluation happens entirely in the evaluators.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge
# Each Case's "input" is the output we want to judge
# We use metadata to carry the original article for context
quality_dataset = Dataset(
    name="summary_quality_check",
    cases=[
        Case(
            name="good_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday, citing persistent inflation. Markets fell sharply in response, with the S&P 500 dropping 2.1%.",
                "summary": "The Fed raised rates by 0.25%, citing inflation. Markets dropped sharply in response.",
            },
            evaluators=[
                LLMJudge(
                    rubric="The summary accurately captures the key facts from the article and is written in clear, professional language. It should mention the specific rate change and market reaction.",
                    include_input=True,
                    model=get_model("claude-haiku-4-5"),
                ),
            ],
        ),
        Case(
            name="vague_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday, citing persistent inflation. Markets fell sharply in response, with the S&P 500 dropping 2.1%.",
                "summary": "Something happened with the economy. Stocks moved.",
            },
            evaluators=[
                LLMJudge(
                    rubric="The summary accurately captures the key facts from the article and is written in clear, professional language. It should mention the specific rate change and market reaction.",
                    include_input=True,
                    model=get_model("claude-haiku-4-5"),
                ),
            ],
        ),
    ],
)


def passthrough(inputs: dict) -> dict:
    """Identity function — we're evaluating pre-existing outputs."""
    return inputs


report = await quality_dataset.evaluate(passthrough)
report.print(include_input=False, include_output=False, include_durations=False)

The LLM judge reads the rubric, examines the input (which contains both the article and the summary), and decides whether the rubric is satisfied. The good summary should pass; the vague one should fail — and the reasoning will explain why.

Figure 1: The LLMJudge workflow: the evaluator sends the output (and optionally the input and expected output) along with the rubric to a judge LLM, which returns a verdict with reasoning.

Score Mode vs. Assertion Mode

The default assertion mode gives pass/fail. But sometimes you want a continuous score — for example, when comparing multiple RAG configurations and you need to rank them. You configure the mode via the score and assertion parameters:

# Score mode: returns 0.0-1.0 instead of pass/fail
score_judge = LLMJudge(
    rubric="Rate the summary quality: specificity of facts, clarity of language, and completeness of key information.",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "quality_score"},
    assertion=False,  # Disable pass/fail, only return score
)

# Combined mode: get both a score AND a pass/fail assertion
combined_judge = LLMJudge(
    rubric="The summary must be factually accurate and mention specific numbers from the source.",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": False, "evaluation_name": "accuracy"},
    assertion={"include_reason": True, "evaluation_name": "factual_check"},
)

The three modes serve different purposes:

| Mode | Returns | Best for |
| --- | --- | --- |
| Assertion (default) | pass/fail + reason | CI/CD gates, regression testing |
| Score | 0.0–1.0 + reason | Comparing configurations, tracking quality over time |
| Combined | Both | Production monitoring with alerting thresholds |

Combining Deterministic and LLM Evaluators

Here’s the key design pattern: use deterministic evaluators as fast, cheap pre-filters, and LLMJudge for what they can’t catch. Deterministic checks run in microseconds and cost nothing. LLMJudge requires an API call. Layer them wisely.

from pydantic_evals.evaluators import Contains, IsInstance

# Layered evaluation: cheap checks first, LLM judge for subjective quality
layered_dataset = Dataset(
    name="layered_summary_eval",
    cases=[
        Case(
            name="fed_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday. Chair Powell cited persistent inflation. The S&P 500 fell 2.1%.",
                "summary": "The Fed raised rates by 0.25%, citing inflation pressures. Markets responded with a sharp decline.",
            },
        ),
    ],
    evaluators=[
        # Layer 1: Type check (instant, free)
        IsInstance(type_name="dict"),
        # Layer 2: Subjective quality (requires LLM call)
        LLMJudge(
            rubric="The summary is factually faithful to the article and written in clear, professional language.",
            include_input=True,
            model=get_model("claude-haiku-4-5"),
        ),
    ],
)

report = await layered_dataset.evaluate(passthrough)
report.print(include_durations=False)

Evaluating RAG Pipelines

Now let’s apply LLMJudge to the problem we’ve been building toward: evaluating RAG systems. In Week 10, we built RAG pipelines and evaluated them manually with keyword checks and 1–3 scale scores. Now we can automate that with principled metrics.

Retrieval Metrics: A Brief Sidebar

How do we know if switching from fixed-size chunks to sentence-aware chunks actually improved our retrieval? Or whether adding reranking helped? Before we evaluate the generated answer, we need to be able to evaluate the retrieved context. Classical information retrieval (IR) has well-established metrics for this:

Precision@k — Of the top-k retrieved documents, what fraction is relevant?

$$\text{Precision@k} = \frac{|\text{relevant docs in top-}k|}{k}$$

Recall@k — Of all relevant documents in the corpus, what fraction did we retrieve in the top-k?

$$\text{Recall@k} = \frac{|\text{relevant docs in top-}k|}{|\text{all relevant docs}|}$$

MRR (Mean Reciprocal Rank) — On average, what position is the first relevant result?

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
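These formulas translate directly into code. A minimal sketch, assuming documents are identified by string IDs and relevance judgments are given as sets (the function names and example IDs are ours, for illustration):

```python
# Sketch of the three retrieval metrics above, assuming docs have string IDs
# and human relevance judgments are given as sets of relevant IDs.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def mrr(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant doc, averaged over queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

retrieved = ["d7", "d2", "d9", "d1"]  # ranked retrieval results for one query
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))  # 1 relevant in top-3 -> 0.333...
print(recall_at_k(retrieved, relevant, k=3))     # 1 of 2 relevant found -> 0.5
print(mrr([(retrieved, relevant)]))              # first relevant at rank 2 -> 0.5
```

Note how the same ranked list scores differently under each metric: precision penalizes noise in the top-k, recall penalizes missing relevant docs, and MRR cares only about where the first relevant doc lands.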

These are useful for tuning your retrieval pipeline (chunking strategy, embedding model, hybrid search weights), but they don’t tell you whether the final answer is any good. For that, we need generation-quality metrics.

The RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) introduced four metrics that together cover the full RAG pipeline. We won’t install the RAGAS library — instead, we’ll implement these as LLMJudge rubrics, which is more instructive and stays within our stack.

Figure 2: The four RAGAS metrics form a 2x2 grid: retrieval vs. generation quality, each measured by precision-like and recall-like metrics. Together they diagnose where a RAG pipeline is failing.

The four metrics and what they catch:

| Metric | What it measures | Failure it detects |
| --- | --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? | Hallucination — answer invents facts not in context |
| Answer Relevance | Does the answer address the original question? | Tangential — answer discusses retrieved content but doesn’t answer the question |
| Context Precision | Are the retrieved chunks relevant to the question? | Noise — retriever returns irrelevant documents |
| Context Recall | Did we retrieve everything needed to answer? | Gaps — answer requires information not in the retrieved context |

Building RAGAS-Style Evaluators

Let’s build a mini RAG evaluation. We’ll define a structured input that captures the full RAG triple: question, retrieved context, and generated answer.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

# Define the RAGAS-style judges
faithfulness_judge = LLMJudge(
    rubric="""Evaluate FAITHFULNESS: Is the generated answer fully supported by the
retrieved context? Every claim in the answer must be traceable to the context.
Score 0.0 if the answer contains fabricated facts not in the context.
Score 0.5 if some claims are supported but others are not.
Score 1.0 if every claim is grounded in the context.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "faithfulness"},
    assertion=False,
)

relevance_judge = LLMJudge(
    rubric="""Evaluate ANSWER RELEVANCE: Does the generated answer directly address
the original question? The answer should be on-topic and useful.
Score 0.0 if the answer is completely off-topic.
Score 0.5 if the answer is partially relevant but misses the main point.
Score 1.0 if the answer fully addresses the question.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "relevance"},
    assertion=False,
)

context_precision_judge = LLMJudge(
    rubric="""Evaluate CONTEXT PRECISION: Are the retrieved context chunks relevant
to the question? Relevant chunks contain information needed to answer the question.
Score 0.0 if none of the context is relevant.
Score 0.5 if some chunks are relevant but others are noise.
Score 1.0 if all retrieved chunks are relevant to the question.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "context_precision"},
    assertion=False,
)
# A well-functioning RAG example
good_rag_case = Case(
    name="good_rag",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "The transformer architecture was introduced in 2017, in the landmark paper 'Attention Is All You Need' by Vaswani et al.",
    },
)

# A hallucinating RAG example — answer adds facts not in the context
hallucinating_case = Case(
    name="hallucination",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "The transformer was introduced in 2017 by Google Brain. It was trained on 8 TPU v3 pods and took approximately 3.5 days to train. The paper received over 100,000 citations.",
    },
)

# An off-topic RAG example — answer doesn't address the question
off_topic_case = Case(
    name="off_topic",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "Self-attention mechanisms allow each token in a sequence to attend to every other token. Multi-head attention runs several attention operations in parallel, capturing different relationship types.",
    },
)
rag_eval_dataset = Dataset(
    name="rag_quality_eval",
    cases=[good_rag_case, hallucinating_case, off_topic_case],
    evaluators=[faithfulness_judge, relevance_judge, context_precision_judge],
)

report = await rag_eval_dataset.evaluate(passthrough)
report.print(include_input=False, include_output=False, include_durations=False)

Look at the scores. Your exact numbers will vary from run to run (LLM judges are non-deterministic), but well-calibrated rubrics should separate the cases clearly: the good case scores high on all three metrics, the hallucination case scores low on faithfulness (while remaining relevant to the question), and the off-topic case scores high on faithfulness but low on relevance.

This is the power of separating faithfulness from relevance. A response can be faithful but irrelevant (reciting context without answering the question), or relevant but unfaithful (answering the question with hallucinated facts). You need both metrics to diagnose what’s going wrong.

LLM-as-Judge: Pitfalls and Best Practices

LLM-as-judge is powerful, but it comes with systematic biases that you need to be aware of. Treating LLMJudge scores as ground truth without understanding these pitfalls is a recipe for misleading evaluations.

Figure 3: Three common biases in LLM-as-judge evaluation. Awareness of these biases is the first step to mitigating them.

Position Bias

When an LLM is asked to compare two responses (A vs. B), it systematically prefers whichever is presented first. Studies have shown that simply swapping the order of responses can flip the verdict. This matters less for our single-output rubric evaluations, but it’s critical if you build comparative evaluators.
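A standard mitigation for comparative evaluators is to run the comparison twice with the order swapped and only accept a verdict that survives the swap. A minimal sketch — `judge_fn` is a hypothetical callable wrapping your judge prompt, not a pydantic-evals API:

```python
# Sketch: order-swap mitigation for position bias in pairwise comparison.
# `judge_fn` is a hypothetical callable that shows a judge LLM two responses
# and returns "first" or "second" for whichever it prefers.

def debiased_verdict(judge_fn, response_a: str, response_b: str) -> str:
    v1 = judge_fn(response_a, response_b)  # A shown first
    v2 = judge_fn(response_b, response_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"  # A preferred regardless of position
    if v1 == "second" and v2 == "first":
        return "B"  # B preferred regardless of position
    return "tie"  # verdict flipped with order: position bias, treat as a tie

# A fake judge that always prefers whatever is shown first (pure position bias):
biased_judge = lambda first, second: "first"
print(debiased_verdict(biased_judge, "response A", "response B"))  # -> tie
```

The key property: a judge driven purely by position produces contradictory verdicts across the two orderings and gets downgraded to a tie, while a judge with a genuine preference is consistent in both orders.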

Self-Preference Bias

Models tend to rate outputs from their own model family higher. Claude judges might prefer Claude-generated text; GPT judges might prefer GPT-generated text. The mitigation: use a different model as the judge than the one that generated the output. In our stack, if you’re generating with claude-sonnet-4-6, consider judging with gpt-5.4 (or vice versa).

Verbosity Bias

Longer, more detailed responses tend to receive higher scores regardless of whether the extra detail adds value. A one-sentence correct answer often scores lower than a three-paragraph answer that says the same thing with padding. The mitigation: include explicit length guidance in your rubric (e.g., “penalize unnecessary verbosity”).
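Concretely, that guidance can live right in the rubric text. An illustrative rubric string — the wording below is our own, not taken from RAGAS or pydantic-evals:

```python
# Illustrative rubric with explicit length guidance (wording is our own).
LENGTH_AWARE_RUBRIC = (
    "The answer correctly and completely addresses the question. "
    "Penalize unnecessary verbosity: a concise correct answer should score "
    "at least as high as a longer answer that adds no new information."
)
# Pass this as `rubric=LENGTH_AWARE_RUBRIC` to LLMJudge, as in earlier examples.
```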

Best Practices Checklist

When using LLMJudge in production, follow these guidelines:

  1. Use a stronger model as judge — If your app uses Haiku, judge with Sonnet. The judge should be at least as capable as the model being judged.

  2. Write specific rubrics — “Is this response good?” is too vague. “The response must cite at least two specific facts from the context and directly answer the user’s question” is actionable.

  3. Always layer with deterministic checks — Use Contains, IsInstance, and custom evaluators for everything that can be checked mechanically. Only use LLMJudge for what remains.

  4. Test the judge itself — Run your evaluation on known-good and known-bad examples. If the judge can’t distinguish them, your rubric needs work.

  5. Don’t compare scores across rubrics or models — A 0.8 from one rubric is not equivalent to a 0.8 from another. Scores are ordinal within a single evaluator configuration, not cardinal.

Wrap-Up

Key Takeaways

What’s Next

In Part 03, we’ll put everything together in the Red Team Challenge lab. You’ll red-team an LLM application by finding its failure modes, build adversarial test Datasets with YAML serialization, and create a multi-evaluator pipeline that combines the deterministic evaluators from Part 01 with the LLMJudge patterns from today. We’ll also preview span-based evaluation — checking not just what an agent outputs, but how it got there — setting up Week 12’s work on building agents.