
RAG and Application Evaluation

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Can an LLM Grade Another LLM?

In Part 01, we built evaluation Datasets with deterministic evaluators — EqualsExpected, Contains, custom ratio checks. These are fast, free, and perfectly reliable for what they measure. But we hit a wall: they can’t assess subjective quality.

Consider evaluating a chatbot response. We could check that it contains certain keywords (Contains), or that it’s the right type (IsInstance). But can we automatically check whether the response is helpful? Whether it addresses the user’s concern empathetically? Whether it’s faithful to the source documents?

These are judgment calls that require reading comprehension, reasoning, and contextual understanding — exactly the kind of thing LLMs are good at. So here’s the idea that has reshaped evaluation in the past two years: use an LLM as the evaluator. This is the LLM-as-judge pattern.

In pydantic-evals, this pattern is implemented by the LLMJudge evaluator, and it’s the centerpiece of today’s lecture.

The LLMJudge Evaluator

Setup

We’ll use our course LiteLLM proxy, just as in Weeks 8-10.

import os
from dotenv import load_dotenv
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"


def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )

Basic Assertion Mode

At its simplest, LLMJudge takes a rubric — a plain-English description of what “good” looks like — and returns a pass/fail verdict with reasoning.

To demonstrate how LLMJudge discriminates between good and bad outputs, we’ll package each Case’s input as a dict containing both the original article and the summary we want to judge. A simple identity function passes them through — the evaluation happens entirely in the evaluators.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge
# Each Case's "input" is the output we want to judge
# We use metadata to carry the original article for context
quality_dataset = Dataset(
    name="summary_quality_check",
    cases=[
        Case(
            name="good_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday, citing persistent inflation. Markets fell sharply in response, with the S&P 500 dropping 2.1%.",
                "summary": "The Fed raised rates by 0.25%, citing inflation. Markets dropped sharply in response.",
            },
            evaluators=[
                LLMJudge(
                    rubric="The summary accurately captures the key facts from the article and is written in clear, professional language. It should mention the specific rate change and market reaction.",
                    include_input=True,
                    model=get_model("claude-haiku-4-5"),
                ),
            ],
        ),
        Case(
            name="vague_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday, citing persistent inflation. Markets fell sharply in response, with the S&P 500 dropping 2.1%.",
                "summary": "Something happened with the economy. Stocks moved.",
            },
            evaluators=[
                LLMJudge(
                    rubric="The summary accurately captures the key facts from the article and is written in clear, professional language. It should mention the specific rate change and market reaction.",
                    include_input=True,
                    model=get_model("claude-haiku-4-5"),
                ),
            ],
        ),
    ],
)


def passthrough(inputs: dict) -> dict:
    """Identity function — we're evaluating pre-existing outputs."""
    return inputs


report = await quality_dataset.evaluate(passthrough)
report.print(include_input=False, include_output=False, include_durations=False)

The LLM judge reads the rubric, examines the input (which contains both the article and the summary), and decides whether the rubric is satisfied. The good summary should pass; the vague one should fail — and the reasoning will explain why.

Figure 1: The LLMJudge workflow: the evaluator sends the output (and optionally the input and expected output) along with the rubric to a judge LLM, which returns a verdict with reasoning.

Score Mode vs. Assertion Mode

The default assertion mode gives pass/fail. But sometimes you want a continuous score — for example, when comparing multiple RAG configurations and you need to rank them. You configure the mode via the score and assertion parameters:

# Score mode: returns 0.0-1.0 instead of pass/fail
score_judge = LLMJudge(
    rubric="Rate the summary quality: specificity of facts, clarity of language, and completeness of key information.",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "quality_score"},
    assertion=False,  # Disable pass/fail, only return score
)

# Combined mode: get both a score AND a pass/fail assertion
combined_judge = LLMJudge(
    rubric="The summary must be factually accurate and mention specific numbers from the source.",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": False, "evaluation_name": "accuracy"},
    assertion={"include_reason": True, "evaluation_name": "factual_check"},
)

The three modes serve different purposes:

| Mode | Returns | Best for |
| --- | --- | --- |
| Assertion (default) | pass/fail + reason | CI/CD gates, regression testing |
| Score | 0.0–1.0 + reason | Comparing configurations, tracking quality over time |
| Combined | Both | Production monitoring with alerting thresholds |

Combining Deterministic and LLM Evaluators

Here’s the key design pattern: use deterministic evaluators as fast, cheap pre-filters, and LLMJudge for what they can’t catch. Deterministic checks run in microseconds and cost nothing. LLMJudge requires an API call. Layer them wisely.

from pydantic_evals.evaluators import Contains, IsInstance

# Layered evaluation: cheap checks first, LLM judge for subjective quality
layered_dataset = Dataset(
    name="layered_summary_eval",
    cases=[
        Case(
            name="fed_summary",
            inputs={
                "article": "The Federal Reserve raised interest rates by 0.25% on Wednesday. Chair Powell cited persistent inflation. The S&P 500 fell 2.1%.",
                "summary": "The Fed raised rates by 0.25%, citing inflation pressures. Markets responded with a sharp decline.",
            },
        ),
    ],
    evaluators=[
        # Layer 1: Type check (instant, free)
        IsInstance(type_name="dict"),
        # Layer 2: Subjective quality (requires LLM call)
        LLMJudge(
            rubric="The summary is factually faithful to the article and written in clear, professional language.",
            include_input=True,
            model=get_model("claude-haiku-4-5"),
        ),
    ],
)

report = await layered_dataset.evaluate(passthrough)
report.print(include_durations=False)

Evaluating RAG Pipelines

Now let’s apply LLMJudge to the problem we’ve been building toward: evaluating RAG systems. In Week 10, we built RAG pipelines and evaluated them manually with keyword checks and 1–3 scale scores. Now we can automate that with principled metrics.

Retrieval Metrics: A Brief Sidebar

How do we know if switching from fixed-size chunks to sentence-aware chunks actually improved our retrieval? Or whether adding reranking helped? Before we evaluate the generated answer, we need to be able to evaluate the retrieved context. Classical information retrieval (IR) has well-established metrics for this:

Precision@k — Of the top-k retrieved documents, what fraction is relevant?

$$\text{Precision@k} = \frac{|\text{relevant docs in top-}k|}{k}$$

Recall@k — Of all relevant documents in the corpus, what fraction did we retrieve in the top-k?

$$\text{Recall@k} = \frac{|\text{relevant docs in top-}k|}{|\text{all relevant docs}|}$$

MRR (Mean Reciprocal Rank) — On average, what position is the first relevant result?

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
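These formulas translate directly into code. A minimal sketch, assuming documents are identified by string IDs and relevance judgments are given as sets (the function names and example IDs are ours, for illustration):

```python
# Sketch of the three retrieval metrics above, assuming docs have string IDs
# and human relevance judgments are given as sets of relevant IDs.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def mrr(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant doc, averaged over queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

retrieved = ["d7", "d2", "d9", "d1"]  # ranked retrieval results for one query
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))  # 1 relevant in top-3 -> 0.333...
print(recall_at_k(retrieved, relevant, k=3))     # 1 of 2 relevant found -> 0.5
print(mrr([(retrieved, relevant)]))              # first relevant at rank 2 -> 0.5
```

Note how the same ranked list scores differently under each metric: precision penalizes noise in the top-k, recall penalizes missing relevant docs, and MRR cares only about where the first relevant doc lands.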

These are useful for tuning your retrieval pipeline (chunking strategy, embedding model, hybrid search weights), but they don’t tell you whether the final answer is any good. For that, we need generation-quality metrics.

The RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) introduced four metrics that together cover the full RAG pipeline. We won’t install the RAGAS library — instead, we’ll implement these as LLMJudge rubrics, which is more instructive and stays within our stack.

Figure 2: The four RAGAS metrics form a 2x2 grid: retrieval vs. generation quality, each measured by precision-like and recall-like metrics. Together they diagnose where a RAG pipeline is failing.

The four metrics and what they catch:

| Metric | What it measures | Failure it detects |
| --- | --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? | Hallucination — answer invents facts not in context |
| Answer Relevance | Does the answer address the original question? | Tangential — answer discusses retrieved content but doesn’t answer the question |
| Context Precision | Are the retrieved chunks relevant to the question? | Noise — retriever returns irrelevant documents |
| Context Recall | Did we retrieve everything needed to answer? | Gaps — answer requires information not in the retrieved context |

Building RAGAS-Style Evaluators

Let’s build a mini RAG evaluation. We’ll define a structured input that captures the full RAG triple: question, retrieved context, and generated answer.

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

# Define the RAGAS-style judges
faithfulness_judge = LLMJudge(
    rubric="""Evaluate FAITHFULNESS: Is the generated answer fully supported by the
retrieved context? Every claim in the answer must be traceable to the context.
Score 0.0 if the answer contains fabricated facts not in the context.
Score 0.5 if some claims are supported but others are not.
Score 1.0 if every claim is grounded in the context.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "faithfulness"},
    assertion=False,
)

relevance_judge = LLMJudge(
    rubric="""Evaluate ANSWER RELEVANCE: Does the generated answer directly address
the original question? The answer should be on-topic and useful.
Score 0.0 if the answer is completely off-topic.
Score 0.5 if the answer is partially relevant but misses the main point.
Score 1.0 if the answer fully addresses the question.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "relevance"},
    assertion=False,
)

context_precision_judge = LLMJudge(
    rubric="""Evaluate CONTEXT PRECISION: Are the retrieved context chunks relevant
to the question? Relevant chunks contain information needed to answer the question.
Score 0.0 if none of the context is relevant.
Score 0.5 if some chunks are relevant but others are noise.
Score 1.0 if all retrieved chunks are relevant to the question.""",
    include_input=True,
    model=get_model("claude-haiku-4-5"),
    score={"include_reason": True, "evaluation_name": "context_precision"},
    assertion=False,
)
# A well-functioning RAG example
good_rag_case = Case(
    name="good_rag",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "The transformer architecture was introduced in 2017, in the landmark paper 'Attention Is All You Need' by Vaswani et al.",
    },
)

# A hallucinating RAG example — answer adds facts not in the context
hallucinating_case = Case(
    name="hallucination",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "The transformer was introduced in 2017 by Google Brain. It was trained on 8 TPU v3 pods and took approximately 3.5 days to train. The paper received over 100,000 citations.",
    },
)

# An off-topic RAG example — answer doesn't address the question
off_topic_case = Case(
    name="off_topic",
    inputs={
        "question": "What year was the transformer architecture introduced?",
        "retrieved_context": [
            "The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al.",
            "Transformers replaced recurrent architectures by using self-attention to process all positions in parallel.",
        ],
        "generated_answer": "Self-attention mechanisms allow each token in a sequence to attend to every other token. Multi-head attention runs several attention operations in parallel, capturing different relationship types.",
    },
)
rag_eval_dataset = Dataset(
    name="rag_quality_eval",
    cases=[good_rag_case, hallucinating_case, off_topic_case],
    evaluators=[faithfulness_judge, relevance_judge, context_precision_judge],
)

report = await rag_eval_dataset.evaluate(passthrough)
report.print(include_input=False, include_output=False, include_durations=False)

Look at the scores. Your exact numbers will vary from run to run (LLM judges are non-deterministic), but well-calibrated rubrics should separate the cases clearly: the good case scores high on all three metrics, the hallucination case scores low on faithfulness (while remaining relevant to the question), and the off-topic case scores high on faithfulness but low on relevance.

This is the power of separating faithfulness from relevance. A response can be faithful but irrelevant (reciting context without answering the question), or relevant but unfaithful (answering the question with hallucinated facts). You need both metrics to diagnose what’s going wrong.

LLM-as-Judge: Pitfalls and Best Practices

LLM-as-judge is powerful, but it comes with systematic biases that you need to be aware of. Treating LLMJudge scores as ground truth without understanding these pitfalls is a recipe for misleading evaluations.

Figure 3: Three common biases in LLM-as-judge evaluation. Awareness of these biases is the first step to mitigating them.

Position Bias

When an LLM is asked to compare two responses (A vs. B), it systematically prefers whichever is presented first. Studies have shown that simply swapping the order of responses can flip the verdict. This matters less for our single-output rubric evaluations, but it’s critical if you build comparative evaluators.
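A standard mitigation for comparative evaluators is to run the comparison twice with the order swapped and only accept a verdict that survives the swap. A minimal sketch — `judge_fn` is a hypothetical callable wrapping your judge prompt, not a pydantic-evals API:

```python
# Sketch: order-swap mitigation for position bias in pairwise comparison.
# `judge_fn` is a hypothetical callable that shows a judge LLM two responses
# and returns "first" or "second" for whichever it prefers.

def debiased_verdict(judge_fn, response_a: str, response_b: str) -> str:
    v1 = judge_fn(response_a, response_b)  # A shown first
    v2 = judge_fn(response_b, response_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"  # A preferred regardless of position
    if v1 == "second" and v2 == "first":
        return "B"  # B preferred regardless of position
    return "tie"  # verdict flipped with order: position bias, treat as a tie

# A fake judge that always prefers whatever is shown first (pure position bias):
biased_judge = lambda first, second: "first"
print(debiased_verdict(biased_judge, "response A", "response B"))  # -> tie
```

The key property: a judge driven purely by position produces contradictory verdicts across the two orderings and gets downgraded to a tie, while a judge with a genuine preference is consistent in both orders.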

Self-Preference Bias

Models tend to rate outputs from their own model family higher. Claude judges might prefer Claude-generated text; GPT judges might prefer GPT-generated text. The mitigation: use a different model as the judge than the one that generated the output. In our stack, if you’re generating with claude-sonnet-4-6, consider judging with gpt-5.4 (or vice versa).

Verbosity Bias

Longer, more detailed responses tend to receive higher scores regardless of whether the extra detail adds value. A one-sentence correct answer often scores lower than a three-paragraph answer that says the same thing with padding. The mitigation: include explicit length guidance in your rubric (e.g., “penalize unnecessary verbosity”).
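Concretely, that guidance can live right in the rubric text. An illustrative rubric string — the wording below is our own, not taken from RAGAS or pydantic-evals:

```python
# Illustrative rubric with explicit length guidance (wording is our own).
LENGTH_AWARE_RUBRIC = (
    "The answer correctly and completely addresses the question. "
    "Penalize unnecessary verbosity: a concise correct answer should score "
    "at least as high as a longer answer that adds no new information."
)
# Pass this as `rubric=LENGTH_AWARE_RUBRIC` to LLMJudge, as in earlier examples.
```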

Best Practices Checklist

When using LLMJudge in production, follow these guidelines:

  1. Use a stronger model as judge — If your app uses Haiku, judge with Sonnet. The judge should be at least as capable as the model being judged.

  2. Write specific rubrics — “Is this response good?” is too vague. “The response must cite at least two specific facts from the context and directly answer the user’s question” is actionable.

  3. Always layer with deterministic checks — Use Contains, IsInstance, and custom evaluators for everything that can be checked mechanically. Only use LLMJudge for what remains.

  4. Test the judge itself — Run your evaluation on known-good and known-bad examples. If the judge can’t distinguish them, your rubric needs work.

  5. Don’t compare scores across rubrics or models — A 0.8 from one rubric is not equivalent to a 0.8 from another. Scores are ordinal within a single evaluator configuration, not cardinal.

Wrap-Up

Key Takeaways

What’s Next

In Part 03, we’ll put everything together in the Red Team Challenge lab. You’ll red-team an LLM application by finding its failure modes, build adversarial test Datasets with YAML serialization, and create a multi-evaluator pipeline that combines the deterministic evaluators from Part 01 with the LLMJudge patterns from today. We’ll also preview span-based evaluation — checking not just what an agent outputs, but how it got there — setting up Week 12’s work on building agents.