Lab — Orchestration Workshop
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Orchestrator-workers and evaluator-optimizer patterns (L13.01)
Pydantic Logfire tracing and human-in-the-loop safety controls (L13.02)
PydanticAI agent-as-tool delegation and ModelRetry (L12.02)
pydantic-evals — Cases, Datasets, and LLMJudge evaluators (L11.02)
Outcomes
Build a multi-agent workflow in PydanticAI that combines orchestrator-workers and evaluator-optimizer patterns in one system
Instrument the workflow with Pydantic Logfire from the start and use the resulting span tree to verify correct orchestration
Implement a human-in-the-loop checkpoint on a write-capable tool and configure least-privilege tool access
Write span-based evaluators with pydantic-evals that verify how the agents worked, not just what they returned
References
Lab Overview¶
In this lab we assemble everything from Week 13 — the Anthropic patterns from L13.01 and the infrastructure from L13.02 — into one production-style multi-agent workflow. You’ll build a Research Briefing Generator that takes a topic and produces a short, evidence-grounded briefing ready for human review.
The workflow combines two named workflow patterns from Anthropic’s taxonomy:
Orchestrator-workers — a lead agent decomposes the topic into sub-questions, spawns parallel research subagents (via the agent-as-tool pattern), and synthesizes the findings.
Evaluator-optimizer — an evaluator agent scores the draft briefing; if it fails, the synthesizer gets feedback and revises. Loop until it passes (or we hit a budget).
On top of that we add two technical safety controls from L13.02: a HITL checkpoint on the publish step, and least-privilege destination gating.
Figure 1: The full Research Briefing Generator. Topic flows in; a planner decomposes it; parallel research subagents answer sub-questions; the synthesizer drafts a briefing; an evaluator-optimizer loop refines it; a HITL-gated publish step persists the result. Each dashed box is one PydanticAI agent.
Part A below is fully built for you — read it, run it, and verify your trace looks sensible. Parts B–D are exercises in which you'll extend the system. Part E is a short written reflection that ties the workflow back to the two taxonomies from L13.01.
Setup¶
Standard model setup — same LiteLLM proxy pattern we’ve used since Week 8.
import asyncio
import os
from dataclasses import dataclass, field
from typing import Literal
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, ModelRetry, RunContext
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
load_dotenv()
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"
def get_model(model_name: str) -> OpenAIChatModel:
"""Create a model connection through our LiteLLM proxy."""
return OpenAIChatModel(
model_name,
provider=OpenAIProvider(
base_url=PROXY_URL,
api_key=os.environ["CAP6640_API_KEY"],
),
)
Logfire from the start¶
The single most impactful thing you will do in this lab is instrument it with Logfire before you write any agent code. Once this is done, every agent.run() produces a span tree automatically.
We’ll use send_to_logfire=False so nothing goes to the cloud — console output is enough for the lab. If you want the web UI (recommended for the written assignment), see the stretch exercise at the end.
import logfire
logfire.configure(send_to_logfire=False)
logfire.instrument_pydantic_ai()
A seeded “research database”¶
To keep this lab deterministic and reproducible, we’ll use a tiny in-memory fact base instead of real web search. Real research agents would call an external search tool; the orchestration patterns we’re building are the same either way.
# Seeded fact base — topic keyword → list of bullet-point facts with "source".
FACT_DB = {
"rag": [
("Lewis et al. 2020 introduced RAG at NeurIPS, combining DPR retrieval with BART generation.", "Lewis 2020"),
("Modern RAG uses hybrid search (BM25 + dense) and a cross-encoder reranker — 'retrieve wide, rerank narrow'.", "HF course Ch. 5"),
("Chunking strategy accounts for ~80% of RAG retrieval failures in production.", "industry survey 2025"),
("RAGAS-style metrics score faithfulness, answer relevance, and context precision/recall.", "RAGAS docs"),
],
"agents": [
("An agent is an LLM with access to tools, running in a loop.", "Anthropic 2024"),
("Orchestrator-workers is 90% faster than single-agent for parallelizable research.", "Anthropic 2025"),
("Claude Code uses a single main thread with subagents depth ≤ 1 for tightly-coupled coding tasks.", "MinusX 2025"),
("Cemri et al. 2025 found multi-agent systems correct only ~25% of the time across 5 frameworks.", "Cemri 2025"),
],
"evaluation": [
("BLEU measures n-gram precision; ROUGE measures n-gram recall; BERTScore measures semantic similarity.", "J&M Ch. 12"),
("pydantic-evals ships LLMJudge for rubric-based evaluation with score vs. assertion modes.", "PydanticAI docs"),
("Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.", "Strathern 1997"),
],
}
def lookup_facts(topic_keyword: str) -> list[tuple[str, str]]:
"""Return seeded facts for a topic keyword, or [] if unknown."""
return FACT_DB.get(topic_keyword.lower().strip(), [])
# Quick sanity check
print(f"'rag' facts: {len(lookup_facts('rag'))}")
print(f"'agents' facts: {len(lookup_facts('agents'))}")
print(f"'unicorns' facts: {len(lookup_facts('unicorns'))}")
'rag' facts: 4
'agents' facts: 4
'unicorns' facts: 0
Dependencies¶
Everything external the agents need flows through a single BriefingDeps dataclass — same dependency injection pattern as Week 12.
@dataclass
class BriefingDeps:
"""External state for the briefing workflow."""
fact_db: dict = field(default_factory=lambda: FACT_DB)
allowed_destinations: tuple[str, ...] = ("team-channel", "archive")
max_revisions: int = 3
quality_threshold: float = 0.75
Part A: Built-In Scaffolding — Orchestrator + Parallel Research Subagents¶
Part A is provided — read it carefully, run each cell, and make sure the trace output makes sense. Everything in this section is your starting point for Parts B–D.
The research subagent¶
One PydanticAI agent. One tool: search_facts. It answers one focused sub-question. Because the subagent is an agent, it has its own loop — but the loop is small.
class ResearchFinding(BaseModel):
"""One evidence-grounded answer to a sub-question."""
sub_question: str
answer: str = Field(description="2-3 sentence answer grounded in the retrieved facts.")
citations: list[str] = Field(default_factory=list)
research_subagent = Agent(
get_model("claude-haiku-4-5"),
deps_type=BriefingDeps,
output_type=ResearchFinding,
system_prompt=(
"You are a research specialist. Answer ONE focused sub-question by calling "
"search_facts for the relevant topic keyword. Ground every claim in a cited fact. "
"If no facts are returned, say so honestly."
),
)
@research_subagent.tool
def search_facts(ctx: RunContext[BriefingDeps], topic_keyword: str) -> str:
"""Look up evidence facts for a topic keyword.
topic_keyword should be a single lowercase word like 'rag', 'agents', 'evaluation'.
"""
facts = ctx.deps.fact_db.get(topic_keyword.lower().strip(), [])
if not facts:
raise ModelRetry(
f"No facts found for '{topic_keyword}'. "
f"Available keywords: {list(ctx.deps.fact_db.keys())}"
)
return "\n".join(f"- {fact} (source: {src})" for fact, src in facts)
The planner¶
The planner is a lightweight agent whose job is to split the user’s topic into focused sub-questions. It doesn’t do research itself — it just decomposes.
class ResearchPlan(BaseModel):
"""Decomposition of a topic into sub-questions."""
topic: str
sub_questions: list[str] = Field(
min_length=3,
max_length=5,
description="3-5 focused questions that together cover the topic.",
)
planner = Agent(
get_model("claude-haiku-4-5"),
output_type=ResearchPlan,
system_prompt=(
"You plan research on a topic by decomposing it into 3-5 focused sub-questions. "
"Each sub-question should be answerable by looking up one topic keyword from this list: "
"rag, agents, evaluation. Keep sub-questions specific and non-overlapping."
),
)
The synthesizer¶
The synthesizer agent combines multiple ResearchFindings into a draft briefing. For Part A it takes no feedback; we’ll add that in Part B.
class Briefing(BaseModel):
"""The output artifact of the workflow."""
topic: str
summary: str = Field(description="3-5 sentence executive summary.")
key_findings: list[str] = Field(description="Bullet-point key findings with citations inline.")
open_questions: list[str] = Field(description="What the briefing does NOT answer.")
synthesizer = Agent(
get_model("claude-haiku-4-5"),
output_type=Briefing,
system_prompt=(
"You are a technical writer. Combine a set of research findings into a clean briefing "
"with a short summary, 3-5 key findings (each citing a source), and a list of open questions. "
"Do NOT introduce claims not present in the findings."
),
)
Putting it together: the orchestrator¶
The orchestrator is regular Python, not an agent. It calls the planner, runs the research subagents in parallel via asyncio.gather, and hands the findings to the synthesizer.
This is the orchestrator-workers pattern from Anthropic’s taxonomy: code orchestrates the flow; the agents do the work at each step.
async def research_one(sub_q: str, deps: BriefingDeps) -> ResearchFinding:
"""Run the research subagent on one sub-question."""
result = await research_subagent.run(sub_q, deps=deps)
return result.output
async def generate_briefing_v1(topic: str, deps: BriefingDeps) -> Briefing:
"""Version 1: plan → research in parallel → synthesize. No evaluator yet."""
# Step 1: plan
plan_result = await planner.run(f"Topic: {topic}")
plan = plan_result.output
print(f"Plan: {len(plan.sub_questions)} sub-questions")
# Step 2: parallel research (orchestrator-workers pattern)
findings = await asyncio.gather(
*[research_one(sq, deps) for sq in plan.sub_questions]
)
# Step 3: synthesize
findings_text = "\n\n".join(
f"Q: {f.sub_question}\nA: {f.answer}\nCitations: {', '.join(f.citations)}"
for f in findings
)
synth_result = await synthesizer.run(
f"Topic: {topic}\n\nFindings:\n{findings_text}"
)
return synth_result.output
Let’s run it. Open your Logfire console output as this runs — you should see a tree with the planner call, three to five parallel research subagent runs, and then the synthesizer.
deps = BriefingDeps()
briefing = await generate_briefing_v1("the current state of RAG and agents", deps)
print(f"\nTopic: {briefing.topic}")
print(f"\nSummary:\n {briefing.summary}")
print(f"\nKey findings:")
for f in briefing.key_findings:
print(f" - {f}")
print(f"\nOpen questions:")
for q in briefing.open_questions:
print(f" - {q}")
13:10:06.458 planner run
13:10:06.465 chat claude-haiku-4-5
Plan: 3 sub-questions
13:10:10.217 research_subagent run
13:10:10.218 research_subagent run
13:10:10.219 research_subagent run
research_subagent run
13:10:10.220 chat claude-haiku-4-5
research_subagent run
13:10:10.222 chat claude-haiku-4-5
research_subagent run
13:10:10.223 chat claude-haiku-4-5
research_subagent run
13:10:11.581 running 1 tool
13:10:11.582 running tool: search_facts
13:10:11.584 chat claude-haiku-4-5
research_subagent run
13:10:11.901 running 2 tools
13:10:11.901 running tool: search_facts
13:10:11.902 running tool: search_facts
13:10:11.906 chat claude-haiku-4-5
research_subagent run
13:10:12.130 running 3 tools
13:10:12.130 running tool: search_facts
13:10:12.130 running tool: search_facts
13:10:12.130 running tool: search_facts
13:10:12.133 chat claude-haiku-4-5
13:10:15.620 synthesizer run
13:10:15.622 chat claude-haiku-4-5
Topic: The Current State of RAG and Agents
Summary:
Retrieval-Augmented Generation (RAG) has advanced toward hybrid retrieval strategies combining lexical and dense vector search with reranking, while chunking strategy has emerged as critical—accounting for ~80% of retrieval failures in production systems. Agents are being implemented as "LLMs with access to tools, running in a loop," with two dominant patterns: orchestrator-worker designs (90% faster for parallelizable tasks) and single-thread approaches with limited subagent depth. Evaluation remains a key challenge, with standardized RAG metrics (RAGAS) gaining adoption, but agent reliability assessment still fragmented—multi-agent systems correct only ~25% of errors across frameworks. Both domains emphasize the importance of thoughtful metric selection to avoid optimizing the wrong targets.
Key findings:
- Modern RAG systems employ hybrid retrieval combining BM25 lexical matching with dense vector retrieval, followed by cross-encoder reranking in a 'retrieve wide, rerank narrow' approach that improves accuracy (HF course Ch. 5, industry survey 2025).
- Chunking strategy is critical for production RAG systems, accounting for approximately 80% of retrieval failures; standardized evaluation metrics like RAGAS have emerged to measure faithfulness, answer relevance, and context precision/recall (HF course Ch. 5, RAGAS docs).
- Agents are implemented as 'LLMs with access to tools, running in a loop' using two primary patterns: orchestrator-worker designs (achieving 90% faster performance for parallelizable tasks) and single-thread approaches with subagent depth ≤1 for tightly-coupled tasks (Anthropic 2024, Anthropic 2025, MinusX 2025).
- Multi-agent system reliability is a significant challenge, with recent findings showing these systems correct only ~25% of the time across five frameworks, indicating that error detection and correction remain unsolved problems (Cemri 2025).
- Evaluation approaches for RAG and agents employ semantic similarity metrics (BERTScore), rubric-based evaluation with LLMJudge, and RAGAS-style metrics, but metric selection must account for Goodhart's Law—when a measure becomes a target, it ceases to be a good measure (RAGAS docs, pydantic-evals docs, Strathern 1997).
Open questions:
- How do hybrid retrieval strategies perform across different domain types (e.g., biomedical, financial, legal) compared to domain-specific optimizations?
- What specific chunking strategies are most effective for different data modalities (structured, unstructured, semi-structured) and document types?
- Beyond the ~25% error correction rate, what causes multi-agent systems to fail at correction, and are there architectural changes that could improve reliability?
- How do orchestrator-worker and single-thread agent designs compare in terms of accuracy, latency, and cost across different task complexity levels?
- What standardized evaluation benchmarks exist (if any) for agents, and how do evaluation metrics differ between agentic reasoning and traditional language model tasks?
- How should organizations balance pursuit of quantifiable metrics (RAGAS, BERTScore) against qualitative evaluation to avoid Goodhart's Law pitfalls in production RAG/agent systems?
Scroll back through the console output. Your span tree will look something like this (real timestamps from one run; Logfire’s continuation-lines trimmed for clarity):
00:15:09.318 planner run
00:15:09.322 chat claude-haiku-4-5
Plan: 4 sub-questions
00:15:11.534 research_subagent run ┐
00:15:11.535 research_subagent run │ four subagents
00:15:11.535 research_subagent run │ start within ~2 ms
00:15:11.535 research_subagent run ┘ of each other
00:15:11.536 chat claude-haiku-4-5
00:15:11.537 chat claude-haiku-4-5
00:15:11.538 chat claude-haiku-4-5
00:15:11.539 chat claude-haiku-4-5
00:15:12.441 running 1 tool
00:15:12.442 running tool: search_facts
00:15:12.446 chat claude-haiku-4-5
00:15:12.641 running 2 tools
00:15:12.642 running tool: search_facts
00:15:12.643 running tool: search_facts
00:15:12.648 chat claude-haiku-4-5
00:15:12.745 running 3 tools
00:15:12.745 running tool: search_facts
00:15:12.745 running tool: search_facts
00:15:12.745 running tool: search_facts
00:15:12.747 chat claude-haiku-4-5
00:15:12.777 running 1 tool
00:15:12.777 running tool: search_facts
00:15:12.778 chat claude-haiku-4-5
00:15:15.278 synthesizer run
00:15:15.280 chat claude-haiku-4-5
Two things worth calling out in this trace:
The four research_subagent run spans all start within ~2 ms of each other. Each then independently does an LLM call and some number of search_facts tool calls. That’s orchestrator-workers parallelism showing up in a production trace: the entire research phase finishes in ~4 seconds (09 → 15) instead of the ~16 it would take if we ran the subagents serially.
Each subagent decided for itself how many search_facts calls it needed — 1, 2, 3, and 1, respectively, in this run. That’s the “agent” behavior inside each worker: the LLM looked at its sub-question and picked its own search strategy. So we have agents inside the workers and a workflow (plain Python) across them — exactly the composition pattern from L13.01.
Part B (Exercise 13.7): Add the Evaluator-Optimizer Loop¶
The v1 workflow always returns whatever the synthesizer produces on the first pass — even if the briefing is thin, unfaithful to the findings, or misses parts of the topic. We’ll fix that with Anthropic’s evaluator-optimizer pattern: a second agent critiques the draft and the synthesizer revises if needed.
Figure 2: The evaluator-optimizer loop. Synthesizer produces a draft, evaluator scores it on three dimensions, and either accepts it or sends feedback for a revision. Loop exits on pass or budget exhaustion.
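Before you wire up the agents, it can help to see the control flow in isolation. Here is a minimal sketch of the loop in plain Python: `evaluate`, `revise`, and `EvalResult` are hypothetical names standing in for the evaluator agent and a feedback-aware synthesizer call, not the required interface for the exercise.

```python
# Sketch of the evaluator-optimizer control loop (hypothetical names).
# `evaluate` stands in for the evaluator agent; `revise` for a synthesizer
# call that receives the evaluator's feedback.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    score: float    # aggregate quality in [0, 1]
    feedback: str   # what to fix if the draft fails


def refine(
    draft: str,
    evaluate: Callable[[str], EvalResult],
    revise: Callable[[str, str], str],
    threshold: float = 0.75,
    max_revisions: int = 3,
) -> tuple[str, int]:
    """Revise the draft until it clears the threshold or the budget is spent."""
    revisions = 0
    result = evaluate(draft)
    while result.score < threshold and revisions < max_revisions:
        draft = revise(draft, result.feedback)  # feedback-conditioned rewrite
        revisions += 1
        result = evaluate(draft)
    return draft, revisions
```

In your actual solution, `evaluate` and `revise` would be awaited agent runs, with the threshold and budget coming from `deps.quality_threshold` and `deps.max_revisions` on `BriefingDeps`.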
Part C (Exercise 13.8): HITL + Least-Privilege Publish¶
The briefing now exists as a Python object. In real deployment you’d persist or publish it somewhere — a Slack channel, an internal wiki, a notifications feed. That’s a write operation, and writes deserve approval. We add a publish_briefing tool with two safety controls from L13.02: a HITL checkpoint and least-privilege destination validation.
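One possible shape for composing the two controls, as a hedged sketch: `gated_publish` and `approve_fn` are hypothetical names (e.g. `approve_fn` could wrap `input()`), not the required tool signature. Inside a real PydanticAI tool you would likely raise ModelRetry rather than ValueError on a bad destination, so the model can correct itself.

```python
# Hypothetical publish gate (Exercise 13.8 sketch, not the required interface).
from typing import Callable


def gated_publish(
    topic: str,
    destination: str,
    allowed_destinations: tuple[str, ...],
    approve_fn: Callable[[str], bool],
) -> str:
    # Least privilege: writes only go to destinations on the allow-list.
    if destination not in allowed_destinations:
        raise ValueError(
            f"Destination '{destination}' not allowed; "
            f"choose from {allowed_destinations}"
        )
    # HITL checkpoint: a human approves before anything is persisted.
    if not approve_fn(f"Publish '{topic}' to '{destination}'?"):
        return "Publish cancelled by reviewer."
    return f"Published '{topic}' to {destination}."
```

Note the ordering: the destination check runs before the human is ever prompted, so a disallowed write can never reach the approval step at all.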
Part D (Exercise 13.9): Span-Based Evaluation¶
In L11.02 you used pydantic-evals to score LLM outputs with LLMJudge. That’s about what the agent returned. For multi-agent systems, you often care equally about how the agent worked: did it actually decompose the topic? Did it actually call the research subagents? Did the approval step really fire before the publish?
That’s what span-based evaluation is for. You inspect the trace (or the messages / tool-call history) of the run and assert that the expected pattern of delegation happened.
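To make the idea concrete, here is a hedged sketch of the kind of structural assertion you might write. It operates on a plain ordered list of tool names so it stays self-contained; in your solution you would recover those names from the run's message history or its spans. The tool name request_approval is an assumption for the HITL step, not something defined above.

```python
# Hypothetical structural check: did research happen before publish, and was
# every publish preceded by its own approval?
def delegation_ok(tool_calls: list[str]) -> bool:
    searched = False
    pending_approvals = 0
    for name in tool_calls:
        if name == "search_facts":
            searched = True
        elif name == "request_approval":
            pending_approvals += 1
        elif name == "publish_briefing":
            # Publishing without prior research or a fresh approval fails.
            if not searched or pending_approvals == 0:
                return False
            pending_approvals -= 1  # each approval covers one publish
    return searched  # at least one research call must have happened
```

A check like this catches failure modes an output-only LLMJudge never sees: a briefing that reads well but was written without ever consulting the fact base, or a publish that slipped past the approval gate.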
Part E (Exercise 13.10): Reflection Writeup¶
Stretch Exercise: Logfire Cloud UI¶
Wrap-Up¶
Key Takeaways¶
What’s Next¶
Week 13 wraps here. Next week we turn from the engineering safety questions we’ve touched on today (prompt injection, least-privilege, HITL) to the broader societal questions — bias, privacy, memorization, responsible deployment — and close with project presentations. The orchestration skills you’ve built this week are what make those ethical questions actionable: you can’t responsibly deploy what you can’t observe, evaluate, or gate.