
Lab — API Power Workshop

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Setup

import os
from typing import Literal
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"

def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )

Part A: End-to-End Extraction Pipeline

In Parts 01 and 02, we learned prompt engineering and structured outputs as separate skills. Now we’ll combine them into a complete extraction pipeline — the kind of thing you’d actually build in a data analytics role.

The pattern is:

  1. Choose a domain — what kind of unstructured text are you processing?

  2. Define a schema — what structured data do you want to extract?

  3. Write a system prompt — use the prompt section library from Part 01

  4. Extract and validate — process multiple inputs and check the results

Choose Your Domain

Pick one of the following domains for your pipeline (or propose your own):

| Domain | Input | What to Extract |
| --- | --- | --- |
| News articles | Article text or headline + body | Headline, source, date, entities, topic, sentiment, key claims |
| Restaurant menus | Menu text (copied from a website or photo transcript) | Restaurant name, cuisine, categories, items with prices and dietary tags |
| Product reviews | Customer reviews | Product, rating, pros, cons, verdict, reviewer sentiment |
| Job postings | Job listing text | Title, company, location, salary range, skills, experience level |
| Your choice | Any unstructured text you find interesting | You decide the schema |

Guided Walkthrough: Restaurant Menu Parser

We’ll walk through the restaurant menu domain together. You’ll build your own pipeline for your chosen domain in Exercise 9.6.

Step 1: Define the schema

A restaurant menu has natural nesting — a menu has categories (Appetizers, Mains, Desserts), and each category has items with names, descriptions, prices, and dietary info.

class MenuItem(BaseModel):
    """A single item on the menu."""
    name: str = Field(description="Name of the dish")
    description: str | None = Field(description="Description of the dish, if provided")
    price: float | None = Field(description="Price in dollars, or null if not listed")
    dietary_tags: list[Literal[
        "vegetarian", "vegan", "gluten-free", "spicy", "contains-nuts"
    ]] = Field(
        default_factory=list,
        description="Dietary tags that apply to this item, empty list if none"
    )


class MenuCategory(BaseModel):
    """A section of the menu (e.g., Appetizers, Mains)."""
    category_name: str = Field(description="Name of the menu section")
    items: list[MenuItem] = Field(description="Items in this section")


class ParsedMenu(BaseModel):
    """Structured representation of a restaurant menu."""
    restaurant_name: str = Field(description="Name of the restaurant")
    cuisine_type: str = Field(description="Type of cuisine (e.g., Italian, Japanese, American)")
    categories: list[MenuCategory] = Field(description="Menu sections with their items")
    price_range: Literal["$", "$$", "$$$", "$$$$"] = Field(
        description="Overall price range: $ (under 15), $$ (15-30), $$$ (30-60), $$$$ (60+)"
    )

Step 2: Write the system prompt

Using the prompt section library from Part 01, we’ll include: Role, Task, Constraints, and Output Format (via the Pydantic schema).

menu_agent = Agent(
    get_model("claude-sonnet-4-6"),
    output_type=ParsedMenu,
    instructions="""## Role
You are a restaurant menu parser that extracts structured data from menu text.

## Task
Parse the provided menu text and extract all items with their details.

## Constraints
- If a price is not listed, set it to null
- Infer dietary tags only when explicitly stated or obvious from ingredients
  (e.g., "tofu stir-fry" is vegetarian, but don't guess)
- If the restaurant name isn't in the text, use "Unknown Restaurant"
- Categorize items into their menu sections; if no sections exist, use "General"
""",
)

Step 3: Extract from a single input

menu_text_1 = """
SAKURA JAPANESE KITCHEN

🥢 STARTERS
Edamame (v) - Steamed soybeans with sea salt — $6
Gyoza - Pan-fried pork dumplings (6 pcs) — $9
Miso Soup (v, gf) - Traditional soybean soup with tofu and seaweed — $5

🍣 SUSHI ROLLS
California Roll - Crab, avocado, cucumber (8 pcs) — $12
Spicy Tuna Roll - Fresh tuna with spicy mayo, jalapeño (8 pcs) — $14
Dragon Roll - Shrimp tempura, eel, avocado (8 pcs) — $18

🍜 ENTREES
Chicken Teriyaki - Grilled chicken with teriyaki glaze, steamed rice — $16
Pad Thai - Rice noodles, shrimp, peanuts, bean sprouts (contains nuts) — $15
Vegetable Tempura (v) - Assorted vegetables in light batter — $13

🍡 DESSERTS
Mochi Ice Cream (gf) - Green tea, mango, or strawberry (3 pcs) — $7
Matcha Cheesecake — $9
"""

result = await menu_agent.run(menu_text_1)
menu = result.output

print(f"Restaurant: {menu.restaurant_name}")
print(f"Cuisine:    {menu.cuisine_type}")
print(f"Price Range: {menu.price_range}")
print(f"Categories: {len(menu.categories)}")
print()

for category in menu.categories:
    print(f"--- {category.category_name} ---")
    for item in category.items:
        tags = f" [{', '.join(item.dietary_tags)}]" if item.dietary_tags else ""
        price = f"${item.price:.2f}" if item.price is not None else "no price"
        print(f"  {item.name}: {price}{tags}")
    print()
Restaurant: Sakura Japanese Kitchen
Cuisine:    Japanese
Price Range: $$
Categories: 4

--- Starters ---
  Edamame: $6.00 [vegetarian, vegan]
  Gyoza: $9.00
  Miso Soup: $5.00 [vegetarian, gluten-free]

--- Sushi Rolls ---
  California Roll: $12.00
  Spicy Tuna Roll: $14.00 [spicy]
  Dragon Roll: $18.00

--- Entrees ---
  Chicken Teriyaki: $16.00
  Pad Thai: $15.00 [contains-nuts]
  Vegetable Tempura: $13.00 [vegetarian]

--- Desserts ---
  Mochi Ice Cream: $7.00 [gluten-free]
  Matcha Cheesecake: $9.00

Step 4: Batch extraction

A real pipeline processes many inputs. Let’s extract from a second menu and compare:

menu_text_2 = """
The Rustic Table - Farm to Fork American Cuisine

Appetizers
  Crispy Brussels Sprouts with balsamic glaze (v, gf) .... 11
  Loaded Nachos with pulled pork, cheddar, jalapeños .... 14
  Soup of the Day — ask your server .... 8

Burgers & Sandwiches
  Classic Smash Burger - double patty, American cheese, pickles .... 17
  BBQ Pulled Pork Sandwich - house-smoked pork, coleslaw .... 16
  Impossible Burger (v) - plant-based patty, all the fixings .... 18

Sides
  Truffle Fries .... 7
  Mac & Cheese .... 6
  Garden Salad (v, gf) .... 5

Desserts
  Bourbon Pecan Pie (contains nuts) .... 10
  Chocolate Lava Cake .... 12
"""

result_2 = await menu_agent.run(menu_text_2)
menu_2 = result_2.output

# Compare the two menus
print(f"{'':>20} {'Menu 1':>15} {'Menu 2':>15}")
print(f"{'Restaurant':>20} {menu.restaurant_name:>15} {menu_2.restaurant_name:>15}")
print(f"{'Cuisine':>20} {menu.cuisine_type:>15} {menu_2.cuisine_type:>15}")
print(f"{'Price Range':>20} {menu.price_range:>15} {menu_2.price_range:>15}")
print(f"{'Categories':>20} {len(menu.categories):>15} {len(menu_2.categories):>15}")

total_items_1 = sum(len(c.items) for c in menu.categories)
total_items_2 = sum(len(c.items) for c in menu_2.categories)
print(f"{'Total Items':>20} {total_items_1:>15} {total_items_2:>15}")
                              Menu 1          Menu 2
          Restaurant Sakura Japanese Kitchen The Rustic Table
             Cuisine        Japanese        American
         Price Range              $$              $$
          Categories               4               4
         Total Items              11              11

Notice how the same agent handles two completely different menu formats — different price notation ($14 vs .... 14), different dietary tag styles ((v, gf) vs (v)), different section headers. The LLM’s language understanding handles format variation; the Pydantic schema ensures consistent output.
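That guarantee is enforceable at the validation layer: if a model reply ever drifts from the schema, Pydantic raises instead of letting malformed data flow downstream. A minimal sketch with a hypothetical stand-in schema (same idea as ParsedMenu, fabricated payload):

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class MiniMenu(BaseModel):
    """Tiny stand-in for ParsedMenu, just enough to show validation."""
    restaurant_name: str
    price_range: Literal["$", "$$", "$$$", "$$$$"]


# "$$$$$" is not an allowed Literal value, so validation fails loudly
bad_payload = {"restaurant_name": "Test Cafe", "price_range": "$$$$$"}

try:
    MiniMenu.model_validate(bad_payload)
except ValidationError as exc:
    print(f"Rejected: {exc.error_count()} validation error(s)")
```

pydantic_ai performs this same validation on each agent run, which is why schema violations surface as errors (or retries) rather than as silently inconsistent output.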


Part B: Adding Tools for Data Enrichment

Your extraction pipeline from Part A produces clean structured data. But what if you want to enrich that data with information the LLM can’t provide on its own?

This is where tool use shines. By giving the agent access to tools, it can augment its extractions with deterministic data — SpaCy NER for precise entity recognition, text statistics, or external lookups.

Walkthrough: Enriching Menu Data with SpaCy

Let’s add a SpaCy tool to our menu parser that identifies named entities in dish descriptions — useful for detecting specific ingredients, place names (e.g., “Thai basil”), or brand references:

import spacy

nlp = spacy.load("en_core_web_sm")


def analyze_menu_text(text: str) -> str:
    """Analyze menu text with SpaCy to extract named entities and key noun phrases.

    Args:
        text: The full menu text to analyze.
    """
    doc = nlp(text)

    entities = [
        {"text": ent.text, "label": ent.label_}
        for ent in doc.ents
    ]

    # Extract noun chunks that might be ingredient or dish references
    noun_phrases = [chunk.text for chunk in doc.noun_chunks][:20]  # limit to top 20

    return (
        f"Named entities: {entities}\n"
        f"Key noun phrases: {noun_phrases}\n"
        f"Token count: {len(doc)}, Sentence count: {len(list(doc.sents))}"
    )


class EnrichedMenu(BaseModel):
    """Menu data enriched with NLP analysis."""
    restaurant_name: str = Field(description="Name of the restaurant")
    cuisine_type: str = Field(description="Type of cuisine")
    total_items: int = Field(description="Total number of menu items")
    avg_price: float | None = Field(description="Average price across all items, null if no prices listed")
    dietary_options: list[str] = Field(description="List of dietary accommodations available (e.g., vegetarian, gluten-free)")
    signature_dishes: list[str] = Field(description="2-3 dishes that seem most distinctive or premium")
    spacy_entities_summary: str = Field(description="Brief summary of what SpaCy found in the text (entities, key ingredients)")


enriched_agent = Agent(
    get_model("claude-sonnet-4-6"),
    output_type=EnrichedMenu,
    tools=[analyze_menu_text],
    instructions="""## Role
You are a restaurant data analyst that combines LLM understanding with NLP tool analysis.

## Task
Analyze the menu text to produce an enriched summary. Use the SpaCy tool to get
precise entity and noun phrase extraction, then combine that with your own
understanding of the menu content.

## Steps
1. First, use the analyze_menu_text tool on the full menu text
2. Then, synthesize the tool results with your own analysis
3. Return the enriched menu summary
""",
)

result = await enriched_agent.run(menu_text_1)
enriched = result.output

print(f"Restaurant:       {enriched.restaurant_name}")
print(f"Cuisine:          {enriched.cuisine_type}")
print(f"Total Items:      {enriched.total_items}")
print(f"Avg Price:        ${enriched.avg_price:.2f}" if enriched.avg_price is not None else "Avg Price:        N/A")

print(f"Dietary Options:  {', '.join(enriched.dietary_options)}")
print(f"Signature Dishes: {', '.join(enriched.signature_dishes)}")
print(f"\nSpaCy Analysis:   {enriched.spacy_entities_summary}")
Restaurant:       Sakura Japanese Kitchen
Cuisine:          Japanese
Total Items:      11
Avg Price:        $11.27
Dietary Options:  vegetarian, gluten-free, nut allergen warning
Signature Dishes: Dragon Roll - Shrimp tempura, eel, avocado, Spicy Tuna Roll - Fresh tuna with spicy mayo, jalapeño, Matcha Cheesecake

SpaCy Analysis:   SpaCy detected 'JAPANESE' as a nationality/group entity, confirming cuisine type. Dish names like Gyoza, Miso Soup, Chicken Teriyaki, Pad Thai, Mochi Ice Cream, and Matcha Cheesecake were flagged as PERSON/ORG entities — a common NLP misclassification for proper nouns. Key noun phrases extracted include core ingredients: steamed soybeans, sea salt, tofu, avocado, cucumber, spicy mayo, jalapeño, and fresh tuna. Monetary values ($5–$18) were consistently identified across all 11 items.

The combination is powerful: the LLM understands menu structure and cuisine semantics (it knows “Gyoza” is a Japanese dumpling), while SpaCy provides precise, reproducible entity extraction.
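Because the misclassifications are systematic (dish names tagged PERSON/ORG), a deterministic post-filter on SpaCy's labels is a cheap defensive step before handing results to the agent. A sketch over hypothetical entity data shaped like the tool's output; the KEEP_LABELS set is an assumption you would tune per domain, not part of the lab code:

```python
# Entities shaped like the analyze_menu_text tool's output (fabricated sample)
entities = [
    {"text": "JAPANESE", "label": "NORP"},   # nationality — trustworthy
    {"text": "Gyoza", "label": "PERSON"},    # dish name misclassified as a person
    {"text": "Pad Thai", "label": "ORG"},    # dish name misclassified as an org
    {"text": "$18", "label": "MONEY"},       # prices are reliably tagged
]

# Labels that tend to be reliable on menu text; tune this per domain
KEEP_LABELS = {"NORP", "MONEY", "GPE", "DATE", "CARDINAL"}

filtered = [e for e in entities if e["label"] in KEEP_LABELS]
print([e["text"] for e in filtered])  # ['JAPANESE', '$18']
```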


Part C: Designing a Prompt Library

Throughout this week, we’ve been writing prompts one at a time. In practice, you’ll want a prompt library — a collection of reusable, tested prompt templates that your team can share and iterate on.

A good prompt library has three properties:

  1. Parameterized — templates accept variables (domain, output format, constraints) so they adapt to different use cases

  2. Tested — each template has been validated on representative inputs

  3. Composable — templates can be chained together in multi-step workflows
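Since each template is just a function returning a string, composition can start as plain concatenation of sections. A minimal sketch with hypothetical helper names (these are not part of the Part 01 library):

```python
def role(description: str) -> str:
    return f"## Role\nYou are {description}."


def constraints(rules: list[str]) -> str:
    bullets = "\n".join(f"- {r}" for r in rules)
    return f"## Constraints\n{bullets}"


def compose(*sections: str) -> str:
    """Join prompt sections with blank lines between them."""
    return "\n\n".join(sections)


prompt = compose(
    role("a restaurant menu parser"),
    constraints(["Set missing prices to null", "Do not guess dietary tags"]),
)
print(prompt)
```

The composed string can then be passed as `instructions` to an Agent, exactly like a hand-written prompt.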

Building a Mini Library

Let’s build three templates and test them:

def summarizer(
    max_sentences: int = 3,
    audience: str = "general",
    focus: str = "key information",
) -> str:
    """Summarization prompt template."""
    return (
        f"Summarize the text in {max_sentences} sentences or fewer. "
        f"Write for a {audience} audience. Focus on {focus}. "
        f"Do not start with 'This text discusses' or similar phrasing."
    )


def entity_extractor(
    entity_types: list[str],
    output_format: str = "list",
) -> str:
    """Entity extraction prompt template."""
    types_str = ", ".join(entity_types)
    return (
        f"Extract all {types_str} entities from the text. "
        f"Return them as a {output_format}. "
        f"Only include entities explicitly mentioned — do not infer or guess. "
        f"If no entities of a type are found, indicate that explicitly."
    )


def classifier(
    categories: list[str],
    description: str = "",
) -> str:
    """Text classification prompt template."""
    cats_str = ", ".join(categories)
    prompt = (
        f"Classify the text into exactly one of these categories: {cats_str}. "
        f"Respond with only the category name."
    )
    if description:
        prompt += f" Context: {description}"
    return prompt


# Test the library on a sample text
sample = (
    "Tesla CEO Elon Musk announced a new Gigafactory in Austin, Texas, "
    "expected to create 10,000 jobs and produce next-generation battery cells. "
    "The $5 billion investment reflects Tesla's push to dominate the EV market "
    "amid growing competition from BYD and Rivian."
)

# Template 1: Summarize
agent = Agent(get_model("claude-haiku-4-5"), instructions=summarizer(max_sentences=2, audience="investor"))
result = await agent.run(sample)
print(f"Summary: {result.output}\n")

# Template 2: Extract entities
agent = Agent(get_model("claude-haiku-4-5"), instructions=entity_extractor(["person", "organization", "location", "money"]))
result = await agent.run(sample)
print(f"Entities: {result.output}\n")

# Template 3: Classify
agent = Agent(get_model("claude-haiku-4-5"), instructions=classifier(
    ["technology", "finance", "politics", "sports", "science"],
    description="News article headlines and excerpts"
))
result = await agent.run(sample)
print(f"Category: {result.output}")
Summary: Tesla is investing $5 billion in a new Austin Gigafactory projected to generate 10,000 jobs and manufacture advanced battery cells, strengthening its competitive position against rivals BYD and Rivian. This expansion signals management's commitment to scaling domestic production capacity and securing supply chain control in the high-growth EV sector.

Entities: # Extracted Entities

**Person:**
- Elon Musk

**Organization:**
- Tesla
- BYD
- Rivian

**Location:**
- Austin, Texas

**Money:**
- $5 billion

Category: technology

Composing Templates: Multi-Step Workflows

The real power comes from chaining templates — using the output of one as the input to the next:

async def analyze_article(text: str) -> dict:
    """Multi-step analysis pipeline: classify → extract → summarize."""
    # Step 1: Classify the article
    classify_agent = Agent(
        get_model("claude-haiku-4-5"),
        instructions=classifier(
            ["technology", "finance", "politics", "sports", "science", "health"]
        ),
    )
    category_result = await classify_agent.run(text)
    category = category_result.output

    # Step 2: Extract entities (customize by category)
    if category.lower() in ["technology", "science"]:
        entity_types = ["person", "organization", "technology", "product"]
    elif category.lower() == "finance":
        entity_types = ["person", "organization", "money", "location"]
    else:
        entity_types = ["person", "organization", "location"]

    extract_agent = Agent(
        get_model("claude-haiku-4-5"),
        instructions=entity_extractor(entity_types),
    )
    entities_result = await extract_agent.run(text)

    # Step 3: Summarize with context
    summary_agent = Agent(
        get_model("claude-haiku-4-5"),
        instructions=summarizer(
            max_sentences=2,
            audience="analyst",
            focus=f"the key {category.lower()} implications",
        ),
    )
    summary_result = await summary_agent.run(text)

    return {
        "category": category,
        "entities": entities_result.output,
        "summary": summary_result.output,
    }


# Run the pipeline
analysis = await analyze_article(sample)

print(f"Category: {analysis['category']}")
print(f"\nEntities:\n{analysis['entities']}")
print(f"\nSummary: {analysis['summary']}")
Category: technology

Entities:
# Extracted Entities

## Person
- Elon Musk

## Organization
- Tesla
- BYD
- Rivian

## Technology
- Battery cells (next-generation)

## Product
- Gigafactory (Austin, Texas location)
- EV (Electric Vehicle)

Summary: Tesla's Austin Gigafactory represents a vertical integration strategy to secure domestic battery supply and reduce manufacturing costs, critical as BYD's cost advantages threaten Tesla's EV market dominance. The 10,000-job facility signals accelerating localization of advanced battery production, likely influencing competitors and U.S. supply chain policy for critical EV components.

Notice how the classifier’s output drives the entity extractor’s configuration — technology articles get “technology” and “product” entity types, finance articles get “money.” This is adaptive behavior you can’t get from a single static prompt.
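As a design note, the if/elif routing inside analyze_article grows awkward as categories multiply; a lookup table keeps the category-to-entity-types mapping declarative. A sketch with a hypothetical helper implementing the same routing in a different shape:

```python
# Same mapping as analyze_article's Step 2, expressed as data
ENTITY_TYPES_BY_CATEGORY = {
    "technology": ["person", "organization", "technology", "product"],
    "science": ["person", "organization", "technology", "product"],
    "finance": ["person", "organization", "money", "location"],
}
DEFAULT_ENTITY_TYPES = ["person", "organization", "location"]


def select_entity_types(category: str) -> list[str]:
    """Route entity types from the classifier's output, table-driven."""
    return ENTITY_TYPES_BY_CATEGORY.get(category.lower(), DEFAULT_ENTITY_TYPES)


print(select_entity_types("Finance"))  # ['person', 'organization', 'money', 'location']
print(select_entity_types("sports"))   # ['person', 'organization', 'location']
```

Adding a new category then means adding one dict entry rather than another branch.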


Wrap-Up

Key Takeaways

What’s Next

In Week 10, we turn to a question that emerged naturally from our work this week: what happens when the information you need isn’t in the model’s training data? The answer is Retrieval-Augmented Generation (RAG) — storing documents in vector databases, retrieving relevant chunks at query time, and grounding LLM responses in specific sources. You’ll see how the structured outputs and tool use patterns from this week become essential building blocks of RAG pipelines.