Working with LLM APIs
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L08.01: Foundation models, open vs. closed models
L08.02: Fine-tuning and alignment (RLHF, DPO), the post-training pipeline
Outcomes
Explain the role of an API gateway in unifying access to multiple LLM providers
Use PydanticAI to make API calls to LLMs through a shared proxy
Compare model responses across providers and understand token usage and pricing trade-offs
Extract structured data from LLM responses using Pydantic output types
References
Why APIs?
In Part 02, we explored how foundation models are customized through fine-tuning and alignment. But here’s a practical question: how do you actually use these models?
You can’t download GPT-5.4 — it’s a closed model with hundreds of billions of parameters running on OpenAI’s infrastructure. Same for Claude Opus 4.6 and Gemini 3.1 Pro. The only way to interact with these frontier models is through an API (Application Programming Interface): you send a request over the internet, the provider runs inference on their hardware, and you get a response back.
This turns out to be incredibly powerful. Instead of needing a GPU cluster to run a model, you need a few lines of Python and an API key. The trade-off is clear: you give up control over the model in exchange for instant access to the most capable systems in the world.
But there’s a catch — every provider has its own SDK, its own authentication scheme, its own request format. OpenAI uses one library, Anthropic another, Google yet another. In this lecture, we’ll solve that problem with two tools: LiteLLM as a unified API gateway and PydanticAI as our type-safe Python framework for talking to any model through a single interface.
The API Gateway Pattern
The Problem: SDK Sprawl
If you wanted to compare responses from GPT-5.4, Claude Sonnet 4.6, and Claude Haiku 4.5, you’d normally need to:
Install three different Python packages (openai, anthropic, google-genai)
Manage three different API keys
Learn three different request/response formats
Handle three different error types
That’s a lot of friction just to ask a question.
The Solution: LiteLLM Proxy
LiteLLM is an API gateway that sits between your code and the LLM providers. It exposes a single, OpenAI-compatible endpoint — meaning any tool that can talk to OpenAI can automatically talk to Claude, Gemini, or dozens of other providers. The proxy handles the translation.
Figure 1: The API gateway pattern: your code talks to one endpoint, and the gateway routes requests to the appropriate provider. Students authenticate with personal API keys; the gateway manages the actual provider credentials.
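The "OpenAI-compatible" idea is worth making concrete. Every provider behind the gateway accepts the same chat-completions request body, so switching models means changing a single string. The sketch below builds that JSON body with the standard library; it is illustrative only and does not send anything over the network.

```python
import json

def build_chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body for an OpenAI-style chat-completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# The same payload shape works for every model behind the gateway;
# only the "model" field changes.
for model in ["gpt-5.4", "claude-sonnet-4-6", "claude-haiku-4-5"]:
    print(json.dumps(build_chat_request(model, "What is tokenization?")))
```

This uniformity is exactly what lets one client library (or one proxy) front many providers: the gateway translates this shape into each provider's native API behind the scenes.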
For this course, we’ve set up a shared LiteLLM proxy with three models available:
| Model Name | Provider | Capability Tier |
|---|---|---|
| gpt-5.4 | OpenAI | Frontier |
| claude-sonnet-4-6 | Anthropic | Frontier |
| claude-haiku-4-5 | Anthropic | Fast & affordable |
Each of you has a personal API key that gives you access to all three models through a single URL. The proxy tracks your usage and enforces per-student budgets — so experiment freely, but be mindful of cost.
PydanticAI: Your LLM Framework
What Is PydanticAI?
PydanticAI is a Python framework for building LLM-powered applications, created by the team behind Pydantic (the data validation library you may know from FastAPI). Its key selling points:
Model-agnostic: works with OpenAI, Anthropic, Google, and any OpenAI-compatible endpoint
Type-safe: leverages Pydantic models for structured inputs and outputs
Simple API: the core abstraction is an Agent — configure it once, then call it
Think of it as the “FastAPI of LLM development” — it handles the boilerplate so you can focus on what matters.
Connecting to Our Proxy
Let’s set up our connection. You’ll need your API key — create a file called .env in the project root (or the same directory as your notebook) with:
CAP6640_API_KEY="sk-your-personal-key-here"

The setup code below uses python-dotenv to load this file automatically, so you don't need to set environment variables manually.
import os

from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

# Load API key from .env file
load_dotenv()

# Course LiteLLM proxy — one URL for all models
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"

def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )

That's the entire setup. The get_model function creates a connection to any model available on our proxy. Let's use it.
Your First API Call
# Create an agent with GPT-5.4
agent = Agent(
    get_model("gpt-5.4"),
    instructions="You are a concise NLP tutor. Answer in 2-3 sentences.",
)

result = await agent.run("What is tokenization in NLP?")
print(result.output)

Tokenization is the process of splitting text into smaller units called tokens, such as words, subwords, or characters. It helps NLP models turn raw text into pieces they can analyze, count, or convert into numerical representations.
Let’s unpack what happened:
Agent is the core PydanticAI abstraction — it wraps a model with configuration (like a system prompt)
instructions provides directives that shape the model's behavior (PydanticAI's recommended alternative to system_prompt — instructions are excluded from message history between runs, while system_prompt is preserved)
await agent.run(...) sends the user message to the model and waits for a response. We use await because PydanticAI's API is asynchronous; in a plain script, agent.run_sync(...) is the synchronous alternative
result.output contains the model's text response
Understanding Token Usage
Every API call consumes tokens — and tokens cost money. Let’s inspect the usage:
print(f"Input tokens: {result.usage().input_tokens}")
print(f"Output tokens: {result.usage().output_tokens}")
print(f"Total tokens: {result.usage().total_tokens}")

Input tokens: 32
Output tokens: 49
Total tokens: 81
Why does this matter? LLM providers charge per token, with output tokens typically costing 3-5x more than input tokens. A rough mental model:
~1 token ≈ ¾ of a word (in English)
A typical prompt + response might use 500–2,000 tokens
Frontier models (GPT-5.4, Claude Sonnet 4.6): ~$3–15 per million input tokens
Fast models (Claude Haiku 4.5): ~$0.25–1 per million input tokens
The cost difference between model tiers is significant — which is why choosing the right model for your task matters.
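To make the tiers concrete, here is a back-of-envelope cost calculator. The per-million-token prices in PRICES are assumed placeholder values for illustration (check your provider's current pricing page), not quoted rates; only the arithmetic is the point.

```python
# (input $/1M tokens, output $/1M tokens) -- assumed illustrative values
PRICES = {
    "gpt-5.4": (10.0, 30.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-haiku-4-5": (0.80, 4.0),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one call from its token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 500-token prompt with a 300-token response on each tier:
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 500, 300):.6f}")
```

A single call costs fractions of a cent on any tier; the difference only bites at scale, which is why the cheap tier matters for high-volume processing.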
Comparing Models
One of the most valuable things you can do with API access is compare models side by side. Different models have different strengths: some are more concise, some more creative, some faster, some cheaper.
Let’s ask all three models the same question and compare:
models = {
    "GPT-5.4": get_model("gpt-5.4"),
    "Claude Sonnet 4.6": get_model("claude-sonnet-4-6"),
    "Claude Haiku 4.5": get_model("claude-haiku-4-5"),
}

prompt = "Explain the difference between stemming and lemmatization in exactly 3 sentences."

for name, model in models.items():
    agent = Agent(model, instructions="You are a concise NLP instructor.")
    result = await agent.run(prompt)
    usage = result.usage()
    print(f"--- {name} ---")
    print(result.output)
    print(f"  [tokens: {usage.input_tokens} in / {usage.output_tokens} out]\n")

--- GPT-5.4 ---
Stemming reduces words to a crude base form by chopping off prefixes or suffixes, often without ensuring the result is a real word.
Lemmatization reduces words to their dictionary base form, using vocabulary and often part-of-speech information to return valid words.
For example, stemming might turn “studies” into “studi,” while lemmatization turns it into “study.”
[tokens: 32 in / 84 out]
--- Claude Sonnet 4.6 ---
Stemming is a rule-based process that strips suffixes from words to reduce them to a root form, often producing non-real words (e.g., "running" → "runn"). Lemmatization, by contrast, uses vocabulary and morphological analysis to return a word to its true dictionary base form, called a lemma (e.g., "running" → "run"). While stemming is faster and simpler, lemmatization is more accurate and linguistically meaningful, making it preferable when precision matters.
[tokens: 35 in / 114 out]
--- Claude Haiku 4.5 ---
**Stemming** removes suffixes from words using rule-based algorithms to reduce them to a root form, which may not be a valid word (e.g., "running" → "runn"). **Lemmatization** uses linguistic knowledge and vocabulary to convert words to their canonical dictionary form, ensuring the result is a real word (e.g., "running" → "run"). Lemmatization is more accurate but computationally expensive, while stemming is faster but produces less precise results.
[tokens: 34 in / 109 out]
What to Look For
When comparing models, pay attention to:
Instruction adherence: Did it follow the “exactly 3 sentences” constraint?
Accuracy: Are the definitions correct?
Style: Which response is clearest for a student audience?
Token efficiency: Which model used fewer output tokens?
Latency: Which responded fastest? (harder to measure here, but noticeable in practice)
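Checks like instruction adherence can be automated. Below is a deliberately naive sentence counter (it would be fooled by abbreviations like "e.g.") that could be used to score the "exactly 3 sentences" constraint across model responses; the helper name and the sample response are illustrative, not part of PydanticAI.

```python
import re

def count_sentences(text: str) -> int:
    """Naive sentence count: runs of text ending in ., !, or ?"""
    return len(re.findall(r"[^.!?]+[.!?]", text))

response = (
    "Stemming chops off suffixes. "
    "Lemmatization maps words to dictionary forms. "
    "Lemmatization is slower but more accurate."
)
print(count_sentences(response))  # → 3
print(count_sentences(response) == 3)  # constraint satisfied → True
```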
Choosing the Right Model
There’s no single “best” model — it depends on your task:
| Use Case | Recommended Model | Why |
|---|---|---|
| Complex reasoning, analysis | GPT-5.4 or Claude Sonnet 4.6 | Maximum capability |
| Simple classification, extraction | Claude Haiku 4.5 | Fast and cheap |
| Creative writing | Experiment! | Style varies by model |
| High-volume processing | Claude Haiku 4.5 | Cost-effective at scale |
The general principle: use the cheapest model that meets your quality bar. Start with a fast model, evaluate its output, and only upgrade if needed.
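One way to operationalize "cheapest model that meets your quality bar" is an escalation loop. In this sketch, call_model and meets_quality_bar are placeholder stubs (in practice the first would be an Agent.run call against the proxy and the second a real evaluation); only the control flow is the point.

```python
# Cheapest first, escalating only when the output fails the quality check
ESCALATION = ["claude-haiku-4-5", "claude-sonnet-4-6", "gpt-5.4"]

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real API call."""
    return f"[{model}] answer to: {prompt}"

def meets_quality_bar(response: str) -> bool:
    """Placeholder check; substitute a real evaluation in practice."""
    return len(response) > 20

def answer_with_escalation(prompt: str) -> tuple[str, str]:
    """Return (model_used, response), upgrading models only when needed."""
    for model in ESCALATION:
        response = call_model(model, prompt)
        if meets_quality_bar(response):
            return model, response
    return ESCALATION[-1], response  # fall back to the strongest model

model, response = answer_with_escalation("Classify this review's sentiment.")
print(model)  # the cheapest model that passed the check
```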
A Taste of Structured Output
So far, our models have returned free-form text. That’s fine for conversation, but what if you need the output in a specific format — say, a Python dictionary or a JSON object?
Parsing free text is fragile. What if the model adds extra words? What if the format changes slightly between calls? This is where PydanticAI’s structured output shines.
The Idea
Instead of getting a string back, you define a Pydantic model describing the shape of the output you want. PydanticAI sends the schema to the LLM, validates the response, and returns a proper Python object — with type checking and all.
from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    """Structured output for sentiment analysis."""

    text: str = Field(description="The original text that was analyzed")
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(ge=0, le=1, description="Confidence score between 0 and 1")
    reasoning: str = Field(description="Brief explanation of the sentiment judgment")

agent = Agent(
    get_model("claude-sonnet-4-6"),
    output_type=SentimentResult,
    instructions="Analyze the sentiment of the given text.",
)

result = await agent.run("The new spaCy update is incredibly fast but the documentation is lacking.")
print(f"Sentiment: {result.output.sentiment}")
print(f"Confidence: {result.output.confidence}")
print(f"Reasoning: {result.output.reasoning}")

Sentiment: neutral
Confidence: 0.85
Reasoning: The text contains both a strong positive sentiment ("incredibly fast") and a negative sentiment ("documentation is lacking"). These opposing sentiments balance each other out, resulting in an overall neutral sentiment. The use of "but" explicitly signals a contrast between the praise and the criticism.
Notice what happened: result.output is not a string — it’s a SentimentResult object with typed fields. PydanticAI handled the schema conversion, the API call, the response parsing, and the validation automatically.
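You can see the validation half of this machinery without any API call. The simplified Sentiment model below is a stand-in (not PydanticAI's internals), but the parse-and-validate step uses Pydantic exactly as PydanticAI does: well-formed JSON becomes a typed object, and out-of-range values are rejected before your code ever sees them.

```python
from pydantic import BaseModel, Field, ValidationError

class Sentiment(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(ge=0, le=1)

# Well-formed model output parses into a typed object...
parsed = Sentiment.model_validate_json('{"sentiment": "neutral", "confidence": 0.85}')
print(parsed.confidence)  # → 0.85

# ...while a confidence outside [0, 1] fails validation.
try:
    Sentiment.model_validate_json('{"sentiment": "positive", "confidence": 1.7}')
except ValidationError:
    print("rejected: confidence out of range")
```

When validation fails, PydanticAI can feed the error back to the model and retry, which is part of what makes structured output robust in practice.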
This is a preview of what we’ll explore much more deeply in Week 9, where we’ll cover structured output extraction, function calling, and building full data pipelines with LLMs.
Wrap-Up
Key Takeaways
What's Next
In Week 9, we’ll go much deeper into the art and science of working with LLMs. Part 01 covers prompt engineering — techniques like zero-shot and few-shot prompting, chain-of-thought reasoning, and system prompt design. Part 02 explores structured outputs and function calling in depth, building full data extraction pipelines. And in the lab, you’ll put it all together in an API power workshop.