Lab — Build Your Agent

CAP-6640: Computational Understanding of Natural Language
Spencer Lyon

Prerequisites

Agent fundamentals, RunContext, agent loop (L12.01)
Tools, ModelRetry, memory strategies (L12.02)
pydantic-evals evaluation framework (L11.01–L11.03)

Outcomes

Build a complete PydanticAI agent with 3+ tools, dependency injection, and ModelRetry error handling
Add multi-turn conversation memory and verify context retention across turns
Evaluate agent behavior using pydantic-evals — testing tool selection, output quality, and edge cases

Lab Overview

Today we put together everything from L12.01 and L12.02. You’ll build a complete data analysis agent from scratch — designing tools, wiring up dependency injection, adding memory, and evaluating the result with the pydantic-evals framework from Week 11.

The exercises are open-ended: we provide the dataset and infrastructure, you design the agent. There are many valid approaches — the goal is to practice the patterns, not to arrive at one “correct” implementation.

Setup

The shared infrastructure below gives you everything you need to get started. Run these cells first.

Model and Proxy

import os
import statistics
from dataclasses import dataclass, field
from datetime import date

from dotenv import load_dotenv
from pydantic_ai import Agent, ModelRetry, RunContext
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"


def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )

Dataset

We’ll use a richer dataset than L12.01 — revenue broken out by product and region, giving your agent more to work with:

# Revenue data: quarter -> product -> region -> monthly revenues (in $M)
COMPANY_DATA = {
    "Q1": {
        "Widget Pro": {"North": [1.2, 1.3, 1.1], "South": [0.8, 0.9, 0.7]},
        "Widget Lite": {"North": [0.5, 0.6, 0.5], "South": [0.3, 0.3, 0.4]},
        "Enterprise Suite": {"North": [2.0, 2.1, 2.2], "South": [1.5, 1.4, 1.6]},
    },
    "Q2": {
        "Widget Pro": {"North": [1.4, 1.5, 1.3], "South": [0.9, 1.0, 0.8]},
        "Widget Lite": {"North": [0.6, 0.7, 0.6], "South": [0.4, 0.4, 0.5]},
        "Enterprise Suite": {"North": [2.3, 2.4, 2.5], "South": [1.7, 1.6, 1.8]},
    },
    "Q3": {
        "Widget Pro": {"North": [1.5, 1.6, 1.7], "South": [1.0, 1.1, 0.9]},
        "Widget Lite": {"North": [0.7, 0.8, 0.7], "South": [0.5, 0.5, 0.6]},
        "Enterprise Suite": {"North": [2.5, 2.6, 2.8], "South": [1.8, 1.9, 2.0]},
    },
    "Q4": {
        "Widget Pro": {"North": [1.8, 1.9, 2.0], "South": [1.2, 1.3, 1.1]},
        "Widget Lite": {"North": [0.8, 0.9, 0.8], "South": [0.6, 0.6, 0.7]},
        "Enterprise Suite": {"North": [3.0, 3.1, 3.3], "South": [2.1, 2.2, 2.4]},
    },
}

PRODUCTS = list(COMPANY_DATA["Q1"].keys())
REGIONS = ["North", "South"]
QUARTERS = list(COMPANY_DATA.keys())

print(f"Products: {PRODUCTS}")
print(f"Regions: {REGIONS}")
print(f"Quarters: {QUARTERS}")

Products: ['Widget Pro', 'Widget Lite', 'Enterprise Suite']
Regions: ['North', 'South']
Quarters: ['Q1', 'Q2', 'Q3', 'Q4']

Dependencies

A deps dataclass is provided. Feel free to extend it if your agent design needs additional fields.

@dataclass
class SalesDeps:
    """External state for the sales analysis agent."""
    db: dict                          # the COMPANY_DATA dict
    user_name: str = "Analyst"
    available_quarters: list[str] = field(default_factory=lambda: list(QUARTERS))
    available_products: list[str] = field(default_factory=lambda: list(PRODUCTS))
    available_regions: list[str] = field(default_factory=lambda: list(REGIONS))


deps = SalesDeps(db=COMPANY_DATA)
print(f"Deps created for {deps.user_name}")
print(f"  Quarters: {deps.available_quarters}")
print(f"  Products: {deps.available_products}")
print(f"  Regions: {deps.available_regions}")

Deps created for Analyst
  Quarters: ['Q1', 'Q2', 'Q3', 'Q4']
  Products: ['Widget Pro', 'Widget Lite', 'Enterprise Suite']
  Regions: ['North', 'South']

Helper Functions

A few utility functions to make working with the nested data easier. Your tools can use these internally:

def get_revenue(db: dict, quarter: str, product: str | None = None, region: str | None = None) -> list[float]:
    """Extract monthly revenue from the nested data structure.

    Returns a flat list of monthly revenue values, filtered by product/region if specified.
    """
    if quarter not in db:
        return []

    revenues = []
    for prod, regions in db[quarter].items():
        if product and prod != product:
            continue
        for reg, monthly in regions.items():
            if region and reg != region:
                continue
            revenues.extend(monthly)
    return revenues


def summarize_revenue(values: list[float]) -> str:
    """Format a revenue summary string."""
    if not values:
        return "No data found"
    total = sum(values)
    avg = statistics.mean(values)
    return f"total=${total:.1f}M, avg=${avg:.2f}M, min=${min(values):.1f}M, max=${max(values):.1f}M"


# Quick test
q1_all = get_revenue(COMPANY_DATA, "Q1")
print(f"Q1 all revenue: {summarize_revenue(q1_all)}")

q1_widget_north = get_revenue(COMPANY_DATA, "Q1", product="Widget Pro", region="North")
print(f"Q1 Widget Pro (North): {summarize_revenue(q1_widget_north)}")

Q1 all revenue: total=$19.4M, avg=$1.08M, min=$0.3M, max=$2.2M
Q1 Widget Pro (North): total=$3.6M, avg=$1.20M, min=$1.1M, max=$1.3M
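
Several of the tool ideas in the next exercise (comparing quarters, computing trends) reduce to simple arithmetic on the lists these helpers return. As a minimal sketch, assuming a hypothetical helper named `pct_growth` that is not part of the lab infrastructure, percentage growth between two quarterly totals could be computed like this. The Widget Pro figures are copied from COMPANY_DATA above.

```python
def pct_growth(old_total: float, new_total: float) -> float:
    """Percentage change from old_total to new_total (hypothetical helper)."""
    if old_total == 0:
        raise ValueError("growth is undefined when the starting total is zero")
    return (new_total - old_total) / old_total * 100


# Widget Pro monthly revenue for both regions, copied from COMPANY_DATA
widget_pro_q1 = [1.2, 1.3, 1.1, 0.8, 0.9, 0.7]
widget_pro_q4 = [1.8, 1.9, 2.0, 1.2, 1.3, 1.1]

growth = pct_growth(sum(widget_pro_q1), sum(widget_pro_q4))
print(f"Widget Pro Q1 -> Q4 growth: {growth:.1f}%")  # prints 55.0%
```

Keeping the arithmetic in a plain function like this makes it easy to check by hand before you wrap it in a tool.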

Exercise 12.6: Build Your Multi-Tool Agent

Exercise 12.6: Build a Multi-Tool Sales Agent

Build a PydanticAI agent with at least 3 tools for analyzing the company sales data. Your agent should use RunContext[SalesDeps] for dependency injection and ModelRetry for error handling on at least one tool.

Requirements

- At least 3 tools, each with a clear docstring that guides the LLM. Some ideas:
  - query_revenue(quarter, product?, region?): look up revenue with optional filters
  - compare_quarters(q1, q2, product?, region?): compute growth between quarters
  - find_top_performer(quarter, by="product"|"region"): find the best-selling product or region
  - compute_trend(product?, region?): compute the quarter-over-quarter trend across all quarters
  - Or design your own!
- ModelRetry on at least one tool: validate inputs (quarter names, product names, region names) and raise ModelRetry with helpful feedback when inputs are invalid.
- Dynamic instructions: use @agent.instructions to inject the current date, user name, and available filters.
- Test with 3+ queries: include at least one that requires multiple tool calls (e.g., “Compare Widget Pro to Enterprise Suite in Q4”). Inspect result.new_messages() on at least one query to verify the tools were called correctly.
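
The input-validation requirement can be factored into a small reusable check. The sketch below is one possible shape, not the required design: `validate_choice` is a hypothetical helper name, and the `try`/`except ImportError` fallback exists only so the snippet runs where pydantic-ai is not installed. The key idea is that the error message lists the valid options, so the model has what it needs to correct its next call.

```python
try:
    from pydantic_ai import ModelRetry
except ImportError:
    # Stand-in so this sketch runs without pydantic-ai installed
    class ModelRetry(Exception):
        pass


def validate_choice(value: str, allowed: list[str], label: str) -> str:
    """Return value if it is allowed; otherwise raise ModelRetry with the valid options.

    Hypothetical helper, not part of the lab infrastructure.
    """
    if value not in allowed:
        raise ModelRetry(
            f"Unknown {label} {value!r}. Valid {label}s: {', '.join(allowed)}."
        )
    return value


# Inside a tool you might call, for example:
#     validate_choice(quarter, ctx.deps.available_quarters, "quarter")
try:
    validate_choice("Q5", ["Q1", "Q2", "Q3", "Q4"], "quarter")
except ModelRetry as exc:
    print(f"Feedback sent to the model: {exc}")
```

Because the allowed values come from the deps object rather than being hard-coded, the same check keeps working if you extend SalesDeps with new products or regions.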

Starter Code

sales_agent = Agent(
    get_model("claude-haiku-4-5"),
    deps_type=SalesDeps,
    system_prompt=(
        "You are a sales analysis assistant for a product company. "
        "Always use your tools to look up data — never guess or make up numbers. "
        "Be concise, cite specific figures, and highlight key insights."
    ),
)


@sales_agent.instructions
def inject_context(ctx: RunContext[SalesDeps]) -> str:
    # TODO: return a string with current date, user name, and available filters
    ...


# TODO: Define at least 3 tools with @sales_agent.tool
# TODO: Use ModelRetry for input validation on at least one tool
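
If you are unsure what the dynamic instructions should look like, it can help to build the string in a plain function first and test it outside the agent. The following is a rough sketch under that assumption; `build_context` and its parameters are hypothetical names chosen for illustration, and the exact wording of the instructions is up to you.

```python
from datetime import date
from typing import Optional


def build_context(
    user_name: str,
    quarters: list[str],
    products: list[str],
    regions: list[str],
    today: Optional[date] = None,
) -> str:
    """Assemble the dynamic part of the instructions (hypothetical helper)."""
    today = today or date.today()
    return (
        f"Today's date is {today.isoformat()}. "
        f"You are assisting {user_name}. "
        f"Available quarters: {', '.join(quarters)}. "
        f"Available products: {', '.join(products)}. "
        f"Available regions: {', '.join(regions)}."
    )


print(build_context("Analyst", ["Q1", "Q2"], ["Widget Pro"], ["North", "South"]))
```

Inside inject_context, the arguments would come from ctx.deps (for example ctx.deps.user_name and ctx.deps.available_quarters).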

Test Your Agent

# Test query 1: Simple lookup
# (labels say "Query N" to avoid confusion with the quarter names Q1-Q4)
result1 = await sales_agent.run("What was total Q4 revenue?", deps=deps)
print("Query 1:", result1.output)

# Test query 2: Comparison (may need multiple tool calls)
result2 = await sales_agent.run(
    "Compare North vs South region revenue in Q3",
    deps=deps,
)
print("Query 2:", result2.output)

# Test query 3: Multi-step analysis
result3 = await sales_agent.run(
    "Which product grew the most from Q1 to Q4?",
    deps=deps,
)
print("Query 3:", result3.output)

# Inspect messages on one query to see tool calls
print("\n--- Message trace for query 3 ---")
for msg in result3.new_messages():
    for part in msg.parts:
        print(f"  {type(part).__name__}: {str(part)[:120]}")
    print()
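
When a trace is long, a compact list of which tools were called, and in what order, is often easier to scan than the full printout. The sketch below is a hypothetical helper (`tool_calls` is not part of the lab code) that relies on duck typing: it assumes tool-call parts expose a `part_kind` of `"tool-call"` and a `tool_name` attribute. The stand-in objects only mimic that shape so the snippet runs without calling a model.

```python
from types import SimpleNamespace


def tool_calls(messages) -> list[str]:
    """Return the names of the tools called, in order, from a message list.

    Assumes each message has .parts, and that tool-call parts carry
    part_kind == "tool-call" and a tool_name attribute.
    """
    names = []
    for msg in messages:
        for part in msg.parts:
            if getattr(part, "part_kind", None) == "tool-call":
                names.append(part.tool_name)
    return names


# Stand-in objects that mimic the shape of a message trace
fake_trace = [
    SimpleNamespace(parts=[SimpleNamespace(part_kind="user-prompt", content="hi")]),
    SimpleNamespace(parts=[SimpleNamespace(part_kind="tool-call", tool_name="query_revenue")]),
    SimpleNamespace(parts=[SimpleNamespace(part_kind="tool-return", tool_name="query_revenue")]),
]
print(tool_calls(fake_trace))  # ['query_revenue']
```

With a real run you would pass result3.new_messages() instead of fake_trace.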

Exercise 12.7: Add Memory and Multi-Turn Conversation

Exercise 12.7: Multi-Turn Analysis Session

Add conversation memory to your agent so a user can have a multi-turn analysis session.

Requirements

- Build a 4+ turn conversation where later turns reference earlier answers. For example:
  - Turn 1: “What was Q4 revenue for Enterprise Suite?”
  - Turn 2: “How does that compare to Q3?” (references “that” from turn 1)
  - Turn 3: “Which region contributed more?”
  - Turn 4: “Summarize our findings so far”
- Chain conversations using message_history=result.all_messages().
- Implement a memory strategy: use a sliding window to keep the history bounded. Write a sliding_window helper function:

def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent messages."""
    # TODO: implement
    ...

- Verify context retention: confirm the agent correctly resolves references to earlier turns (e.g., “that”, “the same product”, “compare to what we just discussed”).
- Bonus: Serialize the conversation to disk after the final turn, then reload it and ask one more question to prove persistence works.
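
For comparison with your own attempt, here is one possible minimal implementation of the sliding window, shown on a plain list of integers standing in for messages. It is a sketch, not the reference solution. Note the explicit guard for a non-positive window: `messages[-0:]` would return the whole list rather than an empty one.

```python
def sliding_window(messages: list, max_messages: int = 20) -> list:
    """Keep only the most recent max_messages items (one possible implementation)."""
    if max_messages <= 0:
        return []
    # Negative slicing already handles lists shorter than the window
    return list(messages[-max_messages:])


history = list(range(1, 26))  # stand-in for 25 messages
window = sliding_window(history, max_messages=20)
print(len(window), window[0], window[-1])  # 20 6 25
```

One caveat to keep in mind: a purely count-based cut can separate a tool call from its tool return. Some model providers may reject a history that starts with an orphaned tool return, so you may want to extend this sketch to trim at a user-prompt boundary instead.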

Starter Code

async def analysis_session(agent, deps, questions: list[str], max_history: int = 20):
    """Run a multi-turn analysis session with bounded memory."""
    history = []
    for i, question in enumerate(questions, 1):
        result = await agent.run(
            question,
            deps=deps,
            message_history=sliding_window(history, max_history),
        )
        history = result.all_messages()
        print(f"Turn {i}: {question}")
        print(f"  → {result.output}\n")
    return history


questions = [
    "What was Q4 revenue for Enterprise Suite?",
    "How does that compare to Q3?",
    "Which region contributed more to Enterprise Suite in Q4?",
    "Summarize everything we've discussed about Enterprise Suite.",
]

final_history = await analysis_session(sales_agent, deps, questions)
print(f"Final history: {len(final_history)} messages")

Exercise 12.8: Evaluate Your Agent

Exercise 12.8: Agent Evaluation Suite

Write a pydantic-evals evaluation suite to test your agent’s behavior — connecting back to the evaluation patterns from Week 11.

Requirements

- Create a Dataset with at least 4 test Cases covering:
  - A straightforward query the agent should handle correctly
  - An invalid input that should trigger ModelRetry (e.g., “Q5” or “Widget Ultra”)
  - An ambiguous question that requires the agent to make a reasonable interpretation
  - A multi-step question that needs multiple tool calls
- Mix evaluator types:
  - At least one deterministic evaluator (e.g., Contains to check for specific numbers, IsInstance for type checking)
  - At least one LLMJudge evaluator with a rubric (e.g., “The response should cite specific revenue figures and not make up data”)
- Run the evaluation and print the report. Analyze: which cases pass? Which fail? Why?

Starter Code

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains, IsInstance, LLMJudge


async def run_agent(query: str) -> str:
    """Wrapper that runs the sales agent and returns the output string."""
    result = await sales_agent.run(query, deps=deps)
    return result.output


dataset = Dataset(
    evaluators=[
        IsInstance(type_name="str"),  # all outputs should be strings
    ],
    cases=[
        Case(
            name="simple_lookup",
            inputs="What was total Q4 revenue?",
            evaluators=[
                # TODO: add a Contains evaluator for an expected figure
                # TODO: add an LLMJudge evaluator
            ],
        ),
        Case(
            name="invalid_quarter",
            inputs="What was Q5 revenue?",
            evaluators=[
                # The agent should handle this gracefully (ModelRetry internally)
                # and respond helpfully — not crash or hallucinate
                # TODO: add an LLMJudge evaluator checking for graceful handling
            ],
        ),
        # TODO: Add at least 2 more cases
    ],
)

report = await dataset.evaluate(run_agent)
report.print(include_input=True, include_output=True, include_durations=False)
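
For the Contains TODO in the simple_lookup case, it is safer to derive the expected figure from the data than to type it in by hand. The sketch below does that for total Q4 revenue. The Q4 values are copied from COMPANY_DATA, and the variable names are illustrative only.

```python
# Q4 monthly revenue by product and region, copied from COMPANY_DATA
q4 = {
    "Widget Pro": {"North": [1.8, 1.9, 2.0], "South": [1.2, 1.3, 1.1]},
    "Widget Lite": {"North": [0.8, 0.9, 0.8], "South": [0.6, 0.6, 0.7]},
    "Enterprise Suite": {"North": [3.0, 3.1, 3.3], "South": [2.1, 2.2, 2.4]},
}

q4_total = sum(
    value
    for regions in q4.values()
    for monthly in regions.values()
    for value in monthly
)
expected_q4 = f"{q4_total:.1f}"  # same one-decimal format as summarize_revenue
print(expected_q4)  # 29.8
```

The resulting string could then be given to a Contains evaluator for that case. Keep in mind that a substring check is sensitive to formatting: an answer that rounds to "$30M" would fail even though it is arguably correct, which is one reason to pair it with an LLMJudge rubric.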

Analysis Questions

After running the evaluation, answer in a markdown cell:

Which cases passed and which failed? Were any results surprising?
Did the LLMJudge evaluator agree with your intuitive assessment of the agent’s responses?
What additional test cases would you add to improve coverage?
How would you test the agent’s multi-turn behavior (not just single queries)? What challenges does that present for evaluation?

Wrap-Up

Key Takeaways

Building an agent is composing the pieces: Agent + RunContext + tools + ModelRetry + message_history. Each piece is simple; the power comes from combining them.
Tool design matters — clear docstrings, validated inputs with ModelRetry, and appropriate granularity (not too broad, not too narrow) determine how well the agent performs.
Memory is explicit in PydanticAI — you choose what history to carry forward, and you can filter, window, or transform it however you need.
Evaluation connects back to Week 11 — the same pydantic-evals patterns (Cases, Datasets, Evaluators, LLMJudge) work for testing agents, not just LLM outputs.
Inspect new_messages() after every run during development — seeing the actual tool calls and responses is the fastest way to debug agent behavior.
Start simple, iterate — get a basic agent working with one tool, then add more tools, then add memory, then add evaluation. Don’t try to build everything at once.

What’s Next

In Week 13, we’ll scale up from single agents to multi-agent systems:

Agent-as-tool delegation — one agent calling another agent inside a tool, sharing dependencies and usage tracking
Multi-agent communication patterns — hub-and-spoke, pipeline, debate/consensus
The agentic framework landscape — how PydanticAI compares to LangGraph, AutoGen, CrewAI (conceptual survey)
Orchestration patterns — building specialist agents that hand off to each other