Lab — Orchestration Workshop
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Orchestrator-workers and evaluator-optimizer patterns (L13.01)
Pydantic Logfire tracing and human-in-the-loop safety controls (L13.02)
PydanticAI agent-as-tool delegation and ModelRetry (L12.02)
pydantic-evals — Cases, Datasets, and LLMJudge evaluators (L11.02)
Outcomes
Build a multi-agent workflow in PydanticAI that combines orchestrator-workers and evaluator-optimizer patterns in one system
Instrument the workflow with Pydantic Logfire from the start and use the resulting span tree to verify correct orchestration
Implement a human-in-the-loop checkpoint on a write-capable tool and configure least-privilege tool access
Write span-based evaluators with pydantic-evals that verify how the agents worked, not just what they returned
References
Lab Overview¶
In this lab we assemble everything from Week 13 — the Anthropic patterns from L13.01 and the infrastructure from L13.02 — into one production-style multi-agent workflow. You’ll build a Research Briefing Generator that takes a topic and produces a short, evidence-grounded briefing ready for human review.
The workflow combines two named workflow patterns from Anthropic’s taxonomy:
Orchestrator-workers — a lead agent decomposes the topic into sub-questions, spawns parallel research subagents (via the agent-as-tool pattern), and synthesizes the findings.
Evaluator-optimizer — an evaluator agent scores the draft briefing; if it fails, the synthesizer gets feedback and revises. Loop until it passes (or we hit a budget).
On top of that we add two technical safety controls from L13.02: a HITL checkpoint on the publish step, and least-privilege destination gating.
Figure 1: The full Research Briefing Generator. Topic flows in; a planner decomposes it; parallel research subagents answer sub-questions; the synthesizer drafts a briefing; an evaluator-optimizer loop refines it; a HITL-gated publish step persists the result. Each dashed box is one PydanticAI agent.
Part A below is fully built for you — read it, run it, and verify your trace looks sensible. Parts B–D are exercises in which you'll extend the system. Part E is a short written reflection that ties the workflow back to the two taxonomies from L13.01.
Setup¶
Standard model setup — same LiteLLM proxy pattern we’ve used since Week 8.
import asyncio
import os
from dataclasses import dataclass, field
from typing import Literal
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, ModelRetry, RunContext
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
load_dotenv()
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"
def get_model(model_name: str) -> OpenAIChatModel:
"""Create a model connection through our LiteLLM proxy."""
return OpenAIChatModel(
model_name,
provider=OpenAIProvider(
base_url=PROXY_URL,
api_key=os.environ["CAP6640_API_KEY"],
),
)
Logfire from the start¶
The single most impactful thing you will do in this lab is instrument it with Logfire before you write any agent code. Once this is done, every agent.run() produces a span tree automatically.
We’ll use send_to_logfire=False so nothing goes to the cloud — console output is enough for the lab. If you want the web UI (recommended for the written assignment), see the stretch exercise at the end.
import logfire
logfire.configure(send_to_logfire=False)
logfire.instrument_pydantic_ai()
A seeded “research database”¶
To keep this lab deterministic and reproducible, we’ll use a tiny in-memory fact base instead of real web search. Real research agents would call an external search tool; the orchestration patterns we’re building are the same either way.
# Seeded fact base — topic keyword → list of bullet-point facts with "source".
FACT_DB = {
"rag": [
("Lewis et al. 2020 introduced RAG at NeurIPS, combining DPR retrieval with BART generation.", "Lewis 2020"),
("Modern RAG uses hybrid search (BM25 + dense) and a cross-encoder reranker — 'retrieve wide, rerank narrow'.", "HF course Ch. 5"),
("Chunking strategy accounts for ~80% of RAG retrieval failures in production.", "industry survey 2025"),
("RAGAS-style metrics score faithfulness, answer relevance, and context precision/recall.", "RAGAS docs"),
],
"agents": [
("An agent is an LLM with access to tools, running in a loop.", "Anthropic 2024"),
("Orchestrator-workers is 90% faster than single-agent for parallelizable research.", "Anthropic 2025"),
("Claude Code uses a single main thread with subagents depth ≤ 1 for tightly-coupled coding tasks.", "MinusX 2025"),
("Cemri et al. 2025 found multi-agent systems correct only ~25% of the time across 5 frameworks.", "Cemri 2025"),
],
"evaluation": [
("BLEU measures n-gram precision; ROUGE measures n-gram recall; BERTScore measures semantic similarity.", "J&M Ch. 12"),
("pydantic-evals ships LLMJudge for rubric-based evaluation with score vs. assertion modes.", "PydanticAI docs"),
("Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.", "Strathern 1997"),
],
}
def lookup_facts(topic_keyword: str) -> list[tuple[str, str]]:
"""Return seeded facts for a topic keyword, or [] if unknown."""
return FACT_DB.get(topic_keyword.lower().strip(), [])
# Quick sanity check
print(f"'rag' facts: {len(lookup_facts('rag'))}")
print(f"'agents' facts: {len(lookup_facts('agents'))}")
print(f"'unicorns' facts: {len(lookup_facts('unicorns'))}")
'rag' facts: 4
'agents' facts: 4
'unicorns' facts: 0
Dependencies¶
Everything external the agents need flows through a single BriefingDeps dataclass — same dependency injection pattern as Week 12.
@dataclass
class BriefingDeps:
"""External state for the briefing workflow."""
fact_db: dict = field(default_factory=lambda: FACT_DB)
allowed_destinations: tuple[str, ...] = ("team-channel", "archive")
max_revisions: int = 3
quality_threshold: float = 0.75
Part A: Built-In Scaffolding — Orchestrator + Parallel Research Subagents¶
Part A is provided — read it carefully, run each cell, and make sure the trace output makes sense. Everything in this section is your starting point for Parts B–D.
The research subagent¶
One PydanticAI agent. One tool: search_facts. It answers one focused sub-question. Because the subagent is an agent, it has its own loop — but the loop is small.
class ResearchFinding(BaseModel):
"""One evidence-grounded answer to a sub-question."""
sub_question: str
answer: str = Field(description="2-3 sentence answer grounded in the retrieved facts.")
citations: list[str] = Field(default_factory=list)
research_subagent = Agent(
get_model("claude-haiku-4-5"),
deps_type=BriefingDeps,
output_type=ResearchFinding,
system_prompt=(
"You are a research specialist. Answer ONE focused sub-question by calling "
"search_facts for the relevant topic keyword. Ground every claim in a cited fact. "
"If no facts are returned, say so honestly."
),
)
@research_subagent.tool
def search_facts(ctx: RunContext[BriefingDeps], topic_keyword: str) -> str:
"""Look up evidence facts for a topic keyword.
topic_keyword should be a single lowercase word like 'rag', 'agents', 'evaluation'.
"""
facts = ctx.deps.fact_db.get(topic_keyword.lower().strip(), [])
if not facts:
raise ModelRetry(
f"No facts found for '{topic_keyword}'. "
f"Available keywords: {list(ctx.deps.fact_db.keys())}"
)
return "\n".join(f"- {fact} (source: {src})" for fact, src in facts)
The planner¶
The planner is a lightweight agent whose job is to split the user’s topic into focused sub-questions. It doesn’t do research itself — it just decomposes.
class ResearchPlan(BaseModel):
"""Decomposition of a topic into sub-questions."""
topic: str
sub_questions: list[str] = Field(
min_length=3,
max_length=5,
description="3-5 focused questions that together cover the topic.",
)
planner = Agent(
get_model("claude-haiku-4-5"),
output_type=ResearchPlan,
system_prompt=(
"You plan research on a topic by decomposing it into 3-5 focused sub-questions. "
"Each sub-question should be answerable by looking up one topic keyword from this list: "
"rag, agents, evaluation. Keep sub-questions specific and non-overlapping."
),
)
The synthesizer¶
The synthesizer agent combines multiple ResearchFindings into a draft briefing. For Part A it takes no feedback; we’ll add that in Part B.
class Briefing(BaseModel):
"""The output artifact of the workflow."""
topic: str
summary: str = Field(description="3-5 sentence executive summary.")
key_findings: list[str] = Field(description="Bullet-point key findings with citations inline.")
open_questions: list[str] = Field(description="What the briefing does NOT answer.")
synthesizer = Agent(
get_model("claude-haiku-4-5"),
output_type=Briefing,
system_prompt=(
"You are a technical writer. Combine a set of research findings into a clean briefing "
"with a short summary, 3-5 key findings (each citing a source), and a list of open questions. "
"Do NOT introduce claims not present in the findings."
),
)
Putting it together: the orchestrator¶
The orchestrator is regular Python, not an agent. It calls the planner, runs the research subagents in parallel via asyncio.gather, and hands the findings to the synthesizer.
This is the orchestrator-workers pattern from Anthropic’s taxonomy: code orchestrates the flow; the agents do the work at each step.
async def research_one(sub_q: str, deps: BriefingDeps) -> ResearchFinding:
"""Run the research subagent on one sub-question."""
result = await research_subagent.run(sub_q, deps=deps)
return result.output
async def generate_briefing_v1(topic: str, deps: BriefingDeps) -> Briefing:
"""Version 1: plan → research in parallel → synthesize. No evaluator yet."""
# Step 1: plan
plan_result = await planner.run(f"Topic: {topic}")
plan = plan_result.output
print(f"Plan: {len(plan.sub_questions)} sub-questions")
# Step 2: parallel research (orchestrator-workers pattern)
findings = await asyncio.gather(
*[research_one(sq, deps) for sq in plan.sub_questions]
)
# Step 3: synthesize
findings_text = "\n\n".join(
f"Q: {f.sub_question}\nA: {f.answer}\nCitations: {', '.join(f.citations)}"
for f in findings
)
synth_result = await synthesizer.run(
f"Topic: {topic}\n\nFindings:\n{findings_text}"
)
return synth_result.output
Let’s run it. Open your Logfire console output as this runs — you should see a tree with the planner call, three to five parallel research subagent runs, and then the synthesizer.
deps = BriefingDeps()
briefing = await generate_briefing_v1("the current state of RAG and agents", deps)
print(f"\nTopic: {briefing.topic}")
print(f"\nSummary:\n {briefing.summary}")
print(f"\nKey findings:")
for f in briefing.key_findings:
print(f" - {f}")
print(f"\nOpen questions:")
for q in briefing.open_questions:
print(f" - {q}")
13:10:06.458 planner run
13:10:06.465 chat claude-haiku-4-5
Plan: 3 sub-questions
13:10:10.217 research_subagent run
13:10:10.218 research_subagent run
13:10:10.219 research_subagent run
research_subagent run
13:10:10.220 chat claude-haiku-4-5
research_subagent run
13:10:10.222 chat claude-haiku-4-5
research_subagent run
13:10:10.223 chat claude-haiku-4-5
research_subagent run
13:10:11.581 running 1 tool
13:10:11.582 running tool: search_facts
13:10:11.584 chat claude-haiku-4-5
research_subagent run
13:10:11.901 running 2 tools
13:10:11.901 running tool: search_facts
13:10:11.902 running tool: search_facts
13:10:11.906 chat claude-haiku-4-5
research_subagent run
13:10:12.130 running 3 tools
13:10:12.130 running tool: search_facts
13:10:12.130 running tool: search_facts
13:10:12.130 running tool: search_facts
13:10:12.133 chat claude-haiku-4-5
13:10:15.620 synthesizer run
13:10:15.622 chat claude-haiku-4-5
Topic: The Current State of RAG and Agents
Summary:
Retrieval-Augmented Generation (RAG) has advanced toward hybrid retrieval strategies combining lexical and dense vector search with reranking, while chunking strategy has emerged as critical—accounting for ~80% of retrieval failures in production systems. Agents are being implemented as "LLMs with access to tools, running in a loop," with two dominant patterns: orchestrator-worker designs (90% faster for parallelizable tasks) and single-thread approaches with limited subagent depth. Evaluation remains a key challenge, with standardized RAG metrics (RAGAS) gaining adoption, but agent reliability assessment still fragmented—multi-agent systems correct only ~25% of errors across frameworks. Both domains emphasize the importance of thoughtful metric selection to avoid optimizing the wrong targets.
Key findings:
- Modern RAG systems employ hybrid retrieval combining BM25 lexical matching with dense vector retrieval, followed by cross-encoder reranking in a 'retrieve wide, rerank narrow' approach that improves accuracy (HF course Ch. 5, industry survey 2025).
- Chunking strategy is critical for production RAG systems, accounting for approximately 80% of retrieval failures; standardized evaluation metrics like RAGAS have emerged to measure faithfulness, answer relevance, and context precision/recall (HF course Ch. 5, RAGAS docs).
- Agents are implemented as 'LLMs with access to tools, running in a loop' using two primary patterns: orchestrator-worker designs (achieving 90% faster performance for parallelizable tasks) and single-thread approaches with subagent depth ≤1 for tightly-coupled tasks (Anthropic 2024, Anthropic 2025, MinusX 2025).
- Multi-agent system reliability is a significant challenge, with recent findings showing these systems correct only ~25% of the time across five frameworks, indicating that error detection and correction remain unsolved problems (Cemri 2025).
- Evaluation approaches for RAG and agents employ semantic similarity metrics (BERTScore), rubric-based evaluation with LLMJudge, and RAGAS-style metrics, but metric selection must account for Goodhart's Law—when a measure becomes a target, it ceases to be a good measure (RAGAS docs, pydantic-evals docs, Strathern 1997).
Open questions:
- How do hybrid retrieval strategies perform across different domain types (e.g., biomedical, financial, legal) compared to domain-specific optimizations?
- What specific chunking strategies are most effective for different data modalities (structured, unstructured, semi-structured) and document types?
- Beyond the ~25% error correction rate, what causes multi-agent systems to fail at correction, and are there architectural changes that could improve reliability?
- How do orchestrator-worker and single-thread agent designs compare in terms of accuracy, latency, and cost across different task complexity levels?
- What standardized evaluation benchmarks exist (if any) for agents, and how do evaluation metrics differ between agentic reasoning and traditional language model tasks?
- How should organizations balance pursuit of quantifiable metrics (RAGAS, BERTScore) against qualitative evaluation to avoid Goodhart's Law pitfalls in production RAG/agent systems?
Scroll back through the console output. Your span tree will look something like this (real timestamps from one run; Logfire’s continuation-lines trimmed for clarity):
00:15:09.318 planner run
00:15:09.322 chat claude-haiku-4-5
Plan: 4 sub-questions
00:15:11.534 research_subagent run ┐
00:15:11.535 research_subagent run │ four subagents
00:15:11.535 research_subagent run │ start within ~2 ms
00:15:11.535 research_subagent run ┘ of each other
00:15:11.536 chat claude-haiku-4-5
00:15:11.537 chat claude-haiku-4-5
00:15:11.538 chat claude-haiku-4-5
00:15:11.539 chat claude-haiku-4-5
00:15:12.441 running 1 tool
00:15:12.442 running tool: search_facts
00:15:12.446 chat claude-haiku-4-5
00:15:12.641 running 2 tools
00:15:12.642 running tool: search_facts
00:15:12.643 running tool: search_facts
00:15:12.648 chat claude-haiku-4-5
00:15:12.745 running 3 tools
00:15:12.745 running tool: search_facts
00:15:12.745 running tool: search_facts
00:15:12.745 running tool: search_facts
00:15:12.747 chat claude-haiku-4-5
00:15:12.777 running 1 tool
00:15:12.777 running tool: search_facts
00:15:12.778 chat claude-haiku-4-5
00:15:15.278 synthesizer run
00:15:15.280 chat claude-haiku-4-5
Two things worth calling out in this trace:
The four research_subagent run spans all start within ~2 ms of each other. Each then independently does an LLM call and some number of search_facts tool calls. That’s orchestrator-workers parallelism showing up in a production trace: the entire research phase finishes in ~4 seconds (09 → 15) instead of the ~16 it would take if we ran the subagents serially.
Each subagent decided for itself how many search_facts calls it needed — 1, 2, 3, and 1, respectively, in this run. That’s the “agent” behavior inside each worker: the LLM looked at its sub-question and picked its own search strategy. So we have agents inside the workers and a workflow (plain Python) across them — exactly the composition pattern from L13.01.
Part B (Exercise 13.7): Add the Evaluator-Optimizer Loop¶
The v1 workflow always returns whatever the synthesizer produces on the first pass — even if the briefing is thin, unfaithful to the findings, or misses parts of the topic. We’ll fix that with Anthropic’s evaluator-optimizer pattern: a second agent critiques the draft and the synthesizer revises if needed.
Figure 2: The evaluator-optimizer loop. Synthesizer produces a draft, evaluator scores it on three dimensions, and either accepts it or sends feedback for a revision. Loop exits on pass or budget exhaustion.
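Before you wire up the agents, it can help to see the control flow in isolation. Here is a minimal sketch of the loop in plain Python: `evaluate`, `revise`, and `EvalResult` are hypothetical names standing in for the evaluator agent and a feedback-aware synthesizer call, not the required interface for the exercise.

```python
# Sketch of the evaluator-optimizer control loop (hypothetical names).
# `evaluate` stands in for the evaluator agent; `revise` for a synthesizer
# call that receives the evaluator's feedback.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    score: float    # aggregate quality in [0, 1]
    feedback: str   # what to fix if the draft fails


def refine(
    draft: str,
    evaluate: Callable[[str], EvalResult],
    revise: Callable[[str, str], str],
    threshold: float = 0.75,
    max_revisions: int = 3,
) -> tuple[str, int]:
    """Revise the draft until it clears the threshold or the budget is spent."""
    revisions = 0
    result = evaluate(draft)
    while result.score < threshold and revisions < max_revisions:
        draft = revise(draft, result.feedback)  # feedback-conditioned rewrite
        revisions += 1
        result = evaluate(draft)
    return draft, revisions
```

In your actual solution, `evaluate` and `revise` would be awaited agent runs, with the threshold and budget coming from `deps.quality_threshold` and `deps.max_revisions` on `BriefingDeps`.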
Part C (Exercise 13.8): HITL + Least-Privilege Publish¶
The briefing now exists as a Python object. In real deployment you’d persist or publish it somewhere — a Slack channel, an internal wiki, a notifications feed. That’s a write operation, and writes deserve approval. We add a publish_briefing tool with two safety controls from L13.02: a HITL checkpoint and least-privilege destination validation.
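One possible shape for composing the two controls, as a hedged sketch: `gated_publish` and `approve_fn` are hypothetical names (e.g. `approve_fn` could wrap `input()`), not the required tool signature. Inside a real PydanticAI tool you would likely raise ModelRetry rather than ValueError on a bad destination, so the model can correct itself.

```python
# Hypothetical publish gate (Exercise 13.8 sketch, not the required interface).
from typing import Callable


def gated_publish(
    topic: str,
    destination: str,
    allowed_destinations: tuple[str, ...],
    approve_fn: Callable[[str], bool],
) -> str:
    # Least privilege: writes only go to destinations on the allow-list.
    if destination not in allowed_destinations:
        raise ValueError(
            f"Destination '{destination}' not allowed; "
            f"choose from {allowed_destinations}"
        )
    # HITL checkpoint: a human approves before anything is persisted.
    if not approve_fn(f"Publish '{topic}' to '{destination}'?"):
        return "Publish cancelled by reviewer."
    return f"Published '{topic}' to {destination}."
```

Note the ordering: the destination check runs before the human is ever prompted, so a disallowed write can never reach the approval step at all.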
Part D (Exercise 13.9): Span-Based Evaluation¶
In L11.02 you used pydantic-evals to score LLM outputs with LLMJudge. That’s about what the agent returned. For multi-agent systems, you often care equally about how the agent worked: did it actually decompose the topic? Did it actually call the research subagents? Did the approval step really fire before the publish?
That’s what span-based evaluation is for. You inspect the trace (or the messages / tool-call history) of the run and assert that the expected pattern of delegation happened.
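To make the idea concrete, here is a hedged sketch of the kind of structural assertion you might write. It operates on a plain ordered list of tool names so it stays self-contained; in your solution you would recover those names from the run's message history or its spans. The tool name request_approval is an assumption for the HITL step, not something defined above.

```python
# Hypothetical structural check: did research happen before publish, and was
# every publish preceded by its own approval?
def delegation_ok(tool_calls: list[str]) -> bool:
    searched = False
    pending_approvals = 0
    for name in tool_calls:
        if name == "search_facts":
            searched = True
        elif name == "request_approval":
            pending_approvals += 1
        elif name == "publish_briefing":
            # Publishing without prior research or a fresh approval fails.
            if not searched or pending_approvals == 0:
                return False
            pending_approvals -= 1  # each approval covers one publish
    return searched  # at least one research call must have happened
```

A check like this catches failure modes an output-only LLMJudge never sees: a briefing that reads well but was written without ever consulting the fact base, or a publish that slipped past the approval gate.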
Part E (Exercise 13.10): Reflection Writeup¶
Stretch Exercise: Logfire Cloud UI¶
Wrap-Up¶
Key Takeaways¶
What’s Next¶
Week 13 wraps here. Next week we turn from the engineering safety questions we’ve touched on today (prompt injection, least-privilege, HITL) to the broader societal questions — bias, privacy, memorization, responsible deployment — and close with project presentations. The orchestration skills you’ve built this week are what make those ethical questions actionable: you can’t responsibly deploy what you can’t observe, evaluate, or gate.