Introduction to LLM APIs Lab
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L08.03: Working with LLM APIs — PydanticAI, the get_model() helper, LiteLLM proxy connection, basic API calls and structured output
Outcomes
Verify API connectivity and make calls to all three course models
Explore how generation parameters (temperature, max_tokens) affect model output
Build a mini text analysis tool that compares models on a real NLP task
Reason about model selection trade-offs based on empirical observation
References
Setup & Warm-Up¶
Let’s make sure everyone is connected to the course proxy and can reach all three models. We’ll reuse the get_model helper from Part 03.
import os
from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
# Load API key from .env file
load_dotenv()
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"
def get_model(model_name: str) -> OpenAIChatModel:
"""Create a model connection through our LiteLLM proxy."""
return OpenAIChatModel(
model_name,
provider=OpenAIProvider(
base_url=PROXY_URL,
api_key=os.environ["CAP6640_API_KEY"],
),
)
Generation Parameters¶
In Part 03, we used models with their default settings. But LLMs have several generation parameters that control how they produce text. The two most important are temperature and max_tokens.
Temperature: The Creativity Dial¶
Temperature controls the randomness of the model’s output. It adjusts the probability distribution over possible next tokens:
Temperature 0.0: The model always picks the most likely token. Output is nearly deterministic — run the same prompt twice and you will usually get the same answer, though provider-side nondeterminism can still produce occasional small variations. Best for factual tasks where you want consistency.
Temperature 0.5–0.7: A moderate amount of randomness. Good default for most tasks.
Temperature 1.0: High randomness (the upper limit for Anthropic models; some providers, such as OpenAI, accept values up to 2.0). More creative and varied output. Useful for brainstorming or creative writing.
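Mechanically, temperature divides the model's logits before the softmax, so lower values sharpen the next-token distribution and higher values flatten it. A minimal sketch with hypothetical logit values:

```python
import math

def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Scale logits by 1/temp, then apply softmax. Lower temp sharpens the distribution."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
for temp in [0.1, 0.7, 1.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: {[round(p, 3) for p in probs]}")
```

At T=0.1 nearly all probability mass lands on the top token; at T=1.0 the lower-ranked tokens keep a real chance of being sampled, which is where the output variety comes from.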
Let’s see this in action:
from pydantic_ai.settings import ModelSettings
prompt = "Write a one-sentence description of a haunted house."
for temp in [0.0, 0.7, 1.0]:
agent = Agent(
get_model("claude-haiku-4-5"),
instructions="You are a creative writer.",
model_settings=ModelSettings(temperature=temp),
)
print(f"--- Temperature {temp} ---")
# Run the same prompt 3 times to see variation (or lack thereof)
for i in range(3):
result = await agent.run(prompt)
print(f" Run {i+1}: {result.output}")
print()
--- Temperature 0.0 ---
Run 1: A decaying Victorian mansion shrouded in perpetual fog harbors the restless spirits of its tragic past, their anguished whispers echoing through empty halls where shadows move of their own accord.
Run 2: A decaying Victorian mansion shrouded in perpetual fog harbors the restless spirits of its tragic past, their anguished whispers echoing through empty halls where shadows move of their own accord.
Run 3: A decaying Victorian mansion stands shrouded in perpetual fog, its windows glowing with spectral light as the anguished whispers of its former inhabitants echo through corridors where time itself seems to have stopped.
--- Temperature 0.7 ---
Run 1: A decrepit Victorian mansion looms against the storm clouds, its broken windows like hollow eyes watching the living, while the anguished whispers of its tormented past echo through halls where time itself seems to have frozen in eternal darkness.
Run 2: A decrepit Victorian mansion shrouded in perpetual fog harbors the restless spirits of its tragic past, their anguished whispers echoing through empty halls where shadows move of their own accord.
Run 3: A crumbling Victorian mansion stands shrouded in perpetual fog, its windows glowing with an eerie light as the anguished whispers of its former inhabitants echo through corridors where time itself seems to have stopped.
--- Temperature 1.0 ---
Run 1: A Victorian mansion shrouded in perpetual mist harbors the restless spirits of its tragic past, their anguished whispers echoing through shadowed halls where time itself seems to have frozen in the moment of their doom.
Run 2: A crumbling Victorian mansion stands shrouded in perpetual fog, its broken windows watching like hollow eyes as the tormented spirits of its former residents wander endless halls, forever trapped between the living world and whatever darkness claims them.
Run 3: A decrepit Victorian mansion shrouded in perpetual fog harbors the restless spirits of its tragic past, their anguished wails echoing through halls where time itself seems to have frozen in despair.
At temperature 0.0, all three runs should produce nearly identical output. At 1.0, each run will be noticeably different.
Max Tokens: The Length Cap¶
The max_tokens parameter sets an upper limit on how many tokens the model can generate in its response. This is useful for:
Controlling costs: Fewer output tokens = lower cost
Enforcing brevity: Force the model to be concise
Preventing runaway generation: Stop the model from producing a novel when you wanted a sentence
prompt = "Explain the transformer architecture."
for max_tok in [20, 50, 200]:
agent = Agent(
get_model("claude-haiku-4-5"),
instructions="You are an NLP instructor.",
model_settings=ModelSettings(max_tokens=max_tok),
)
result = await agent.run(prompt)
usage = result.usage()
print(f"--- max_tokens={max_tok} ---")
print(f" Output ({usage.output_tokens} tokens): {result.output}")
print()
--- max_tokens=20 ---
Output (20 tokens): # The Transformer Architecture
The Transformer is a deep learning model introduced in "Attention is
--- max_tokens=50 ---
Output (50 tokens): # The Transformer Architecture
The Transformer is a neural network architecture designed for sequence-to-sequence tasks. Here's a comprehensive breakdown:
## Core Components
### 1. **Self-Attention Mechanism**
The heart of
--- max_tokens=200 ---
Output (200 tokens): # The Transformer Architecture
## Overview
The Transformer is a deep learning model that revolutionized NLP by replacing recurrence with **self-attention**, enabling parallel processing and capturing long-range dependencies efficiently.
---
## Core Components
### 1. **Self-Attention Mechanism**
The heart of transformers—allows each token to attend to every other token in a sequence.
**How it works:**
- Input embeddings are projected into three matrices: **Query (Q)**, **Key (K)**, **Value (V)**
- Attention scores are calculated: `Attention(Q,K,V) = softmax(QK^T/√d_k)V`
- This computes how much each position should "focus on" other positions
**Key insight:** Tokens can directly access any other token, unlike RNNs that process sequentially.
### 2. **Multi-Head
Notice that max_tokens=20 cuts the response off mid-sentence — the model doesn’t know in advance how many tokens it gets. It’s a hard ceiling, not a soft suggestion.
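One simple way to flag this programmatically (a heuristic of our own, not a built-in PydanticAI feature) is to compare the reported output token count against the cap you set — a response that used its entire budget was probably cut off:

```python
def probably_truncated(output_tokens: int, max_tokens: int) -> bool:
    """Heuristic: a response that used its full token budget was likely cut off."""
    return output_tokens >= max_tokens

print(probably_truncated(20, 20))    # hit the cap: likely truncated
print(probably_truncated(142, 200))  # finished with budget to spare
```

In the loop above you could call this with `usage.output_tokens` and the `max_tok` value to warn when an answer is incomplete.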
The Model Showdown¶
Now let’s put everything together. In this mini-project, you’ll pick a piece of text and run three different NLP tasks across all three course models — building a comparison table of results.
Here’s a sample text to work with (or bring your own):
sample_text = """
OpenAI announced GPT-5.4 at their Spring 2026 developer conference in San Francisco.
The model achieves state-of-the-art performance on reasoning benchmarks, surpassing
its predecessor by 15% on the MMLU-Pro evaluation suite. Critics argue that the
environmental cost of training such large models remains a significant concern, while
supporters point to breakthroughs in scientific research enabled by the technology.
CEO Sam Altman described it as "the most capable model we've ever built."
"""Let’s define three NLP tasks and run them across all models:
from pydantic import BaseModel, Field
# --- Task 1: Summarization ---
summarize_agent = {
name: Agent(
get_model(model_id),
instructions="Summarize the given text in exactly one sentence.",
)
for name, model_id in [
("GPT-5.4", "gpt-5.4"),
("Sonnet 4.6", "claude-sonnet-4-6"),
("Haiku 4.5", "claude-haiku-4-5"),
]
}
print("=== Task 1: One-Sentence Summary ===\n")
for name, agent in summarize_agent.items():
result = await agent.run(sample_text)
print(f" {name}: {result.output}\n")
=== Task 1: One-Sentence Summary ===
GPT-5.4: OpenAI announced GPT-5.4 at its Spring 2026 developer conference in San Francisco, touting state-of-the-art reasoning performance and a 15% MMLU-Pro improvement over its predecessor, while critics raised environmental concerns and supporters highlighted scientific breakthroughs.
Sonnet 4.6: OpenAI unveiled GPT-5.4 at its Spring 2026 developer conference, touting it as their most capable model yet with a 15% improvement on reasoning benchmarks, though environmental concerns over large-scale AI training persist.
Haiku 4.5: OpenAI announced GPT-5.4 at their Spring 2026 conference, achieving state-of-the-art reasoning performance with a 15% improvement over its predecessor, though concerns about environmental costs persist alongside recognition of its scientific research benefits.
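Since the instruction demands exactly one sentence, you can spot-check compliance with a rough heuristic. This is a sketch — the regex below approximates sentence boundaries and will miscount text with abbreviations like "Dr." or "e.g.":

```python
import re

def sentence_count(text: str) -> int:
    """Rough sentence count: split on ., !, or ? followed by whitespace or end-of-string."""
    parts = re.split(r"[.!?](?:\s|$)", text.strip())
    return len([p for p in parts if p.strip()])

# Note: the period inside "GPT-5.4" is not followed by whitespace, so it doesn't split
summary = ("OpenAI unveiled GPT-5.4 at its Spring 2026 developer conference, "
           "touting it as their most capable model yet.")
print(sentence_count(summary))  # 1
```

Running this over each model's summary gives a quick, automatable check that all three actually followed the one-sentence constraint.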
# --- Task 2: Entity Extraction (structured output) ---
class ExtractedEntities(BaseModel):
people: list[str] = Field(description="Names of people mentioned")
organizations: list[str] = Field(description="Names of organizations mentioned")
locations: list[str] = Field(description="Names of locations mentioned")
print("=== Task 2: Entity Extraction ===\n")
for name, model_id in [("GPT-5.4", "gpt-5.4"), ("Sonnet 4.6", "claude-sonnet-4-6"), ("Haiku 4.5", "claude-haiku-4-5")]:
agent = Agent(
get_model(model_id),
output_type=ExtractedEntities,
instructions="Extract named entities from the given text.",
)
result = await agent.run(sample_text)
ents = result.output
print(f" {name}:")
print(f" People: {ents.people}")
print(f" Organizations: {ents.organizations}")
print(f" Locations: {ents.locations}\n")
=== Task 2: Entity Extraction ===
GPT-5.4:
People: ['Sam Altman']
Organizations: ['OpenAI']
Locations: ['San Francisco']
Sonnet 4.6:
People: ['Sam Altman']
Organizations: ['OpenAI']
Locations: ['San Francisco']
Haiku 4.5:
People: ['Sam Altman']
Organizations: ['OpenAI']
Locations: ['San Francisco']
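All three models agree here, but on messier texts they often won't. A small helper can report which entities every model found (the example values are copied from the run above):

```python
def common_entities(results: dict[str, list[str]]) -> set[str]:
    """Return the entities that every model extracted (set intersection across models)."""
    sets = [set(entities) for entities in results.values()]
    return set.intersection(*sets)

# People extracted by each model in the run above
people = {
    "GPT-5.4": ["Sam Altman"],
    "Sonnet 4.6": ["Sam Altman"],
    "Haiku 4.5": ["Sam Altman"],
}
print(common_entities(people))  # {'Sam Altman'}
```

Entities outside the intersection are the interesting cases — they show where models disagree and are good candidates for manual review.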
# --- Task 3: Sentiment Classification (structured output) ---
class SentimentResult(BaseModel):
sentiment: str = Field(description="positive, negative, or mixed")
confidence: float = Field(ge=0, le=1, description="Confidence score")
reasoning: str = Field(description="One-sentence explanation")
print("=== Task 3: Sentiment Classification ===\n")
for name, model_id in [("GPT-5.4", "gpt-5.4"), ("Sonnet 4.6", "claude-sonnet-4-6"), ("Haiku 4.5", "claude-haiku-4-5")]:
agent = Agent(
get_model(model_id),
output_type=SentimentResult,
instructions="Classify the overall sentiment of the given text.",
)
result = await agent.run(sample_text)
s = result.output
usage = result.usage()
print(f" {name}: {s.sentiment} (confidence: {s.confidence})")
print(f" Reasoning: {s.reasoning}")
print(f" [tokens: {usage.input_tokens} in / {usage.output_tokens} out]\n")
=== Task 3: Sentiment Classification ===
GPT-5.4: mixed (confidence: 0.94)
Reasoning: The passage presents strong positive achievements and praise for the model alongside notable criticism about environmental costs, resulting in an overall mixed sentiment.
[tokens: 281 in / 54 out]
Sonnet 4.6: mixed (confidence: 0.92)
Reasoning: The text presents both positive elements (state-of-the-art performance, scientific breakthroughs, CEO praise) and negative concerns (environmental cost of training large models), resulting in a balanced mixed sentiment.
[tokens: 852 in / 115 out]
Haiku 4.5: mixed (confidence: 0.75)
Reasoning: The text presents both positive elements (performance achievements, scientific breakthroughs, CEO praise) and negative concerns (environmental cost criticisms), creating a balanced mixed sentiment overall.
[tokens: 851 in / 107 out]
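To close out the showdown, you can tabulate the usage numbers observed above (the values below are copied from this particular run — yours will differ):

```python
# (input_tokens, output_tokens) per model, from the Task 3 run above
usage_by_model = {
    "GPT-5.4": (281, 54),
    "Sonnet 4.6": (852, 115),
    "Haiku 4.5": (851, 107),
}

print(f"{'Model':<12} {'input':>7} {'output':>7}")
for name, (tokens_in, tokens_out) in usage_by_model.items():
    print(f"{name:<12} {tokens_in:>7} {tokens_out:>7}")
```

Collecting these numbers across tasks is the raw material for the model-selection trade-off listed in the outcomes: token usage is a direct proxy for cost, so a model that answers just as well with fewer tokens may be the better pick.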
Reflection¶
Wrap-Up¶
Key Takeaways¶
What’s Next¶
In Week 9, we move from basic API calls to mastering them. Part 01 covers prompt engineering — systematic techniques like few-shot prompting and chain-of-thought reasoning that dramatically improve output quality. Part 02 dives deep into structured outputs and function calling, building full data extraction pipelines. And the lab puts it all together in an API power workshop.