Lab — Model Safari

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon



Overview

In Parts 01 and 02, we learned the theory behind transformer model variants and the Hugging Face tools to use them. Now it’s time to go on safari.

In this lab, we’ll visit five “stops” — each one a different NLP task powered by a Hugging Face pipeline. At each stop, you’ll run the models, inspect the outputs, and push the models to their limits. The final stop pits encoder-only against decoder-only models head-to-head on the same inputs.

By the end, you’ll have hands-on intuition for which model does what well — and why.


Setup

from transformers import pipeline, set_seed

Stop 1: Text Classification

Our first stop is sentiment analysis — the classic text classification task. In Part 02, we saw this in three lines. Now let’s explore it more deeply.

Batch Classification

Pipelines can process a list of inputs in one call:

classifier = pipeline("sentiment-analysis")

reviews = [
    "This movie was absolutely fantastic! The acting was superb.",
    "Terrible film. Boring plot and awful dialogue.",
    "It was okay, nothing special but not bad either.",
    "The cinematography was beautiful but the story fell flat.",
    "I can't believe how good this was. A masterpiece!",
]

results = classifier(reviews)

for review, result in zip(reviews, results):
    label = result["label"]
    score = result["score"]
    print(f"  {label:10s} ({score:.4f})  {review[:55]}")
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f.
Using a pipeline without specifying a model name and revision in production is not recommended.
  POSITIVE   (0.9999)  This movie was absolutely fantastic! The acting was sup
  NEGATIVE   (0.9998)  Terrible film. Boring plot and awful dialogue.
  POSITIVE   (0.9891)  It was okay, nothing special but not bad either.
  NEGATIVE   (0.9996)  The cinematography was beautiful but the story fell fla
  POSITIVE   (0.9998)  I can't believe how good this was. A masterpiece!

Interpreting Confidence

Look at the scores — most are above 0.99. The model is very confident. But review #3 (“It was okay...”) and #4 (“The cinematography was beautiful but...”) are interesting cases. Mixed-sentiment reviews are harder — the model must weigh positive and negative signals against each other.

Also notice: review #3 is labeled POSITIVE even though it’s arguably neutral. The model was trained on binary sentiment data (SST-2), so it has no “neutral” option. This is a limitation of the training data, not the architecture.
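One way to see why a binary model can't abstain: its final softmax runs over exactly two logits, and two probabilities must sum to 1, so even a weak preference comes out looking confident. A minimal sketch (the logit values below are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: a clearly positive review vs. a mixed/neutral one.
clear_positive = softmax([4.0, -4.0])   # strong preference for class 0
mixed_review   = softmax([1.0, -1.0])   # weak preference for class 0

print(clear_positive)  # ~[0.9997, 0.0003]
print(mixed_review)    # ~[0.88, 0.12]
```

Even the weak preference becomes an 88% "confident" POSITIVE — there is simply no third bucket for neutral.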


Stop 2: Named Entity Recognition

NER is a sequence labeling task — instead of one label per document, the model assigns a label to each token. In Week 4, we built NER systems with SpaCy. Now let’s see how transformer-based NER compares.

ner = pipeline("ner", aggregation_strategy="simple")

text = "Apple Inc. reported record revenue of $123 billion in Cupertino, California."
entities = ner(text)

print(f"Text: {text}\n")
for entity in entities:
    print(f"  {entity['word']:30s} → {entity['entity_group']:5s} (score: {entity['score']:.3f})")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496.
Text: Apple Inc. reported record revenue of $123 billion in Cupertino, California.

  Apple Inc                      → ORG   (score: 0.999)
  Cupertino                      → LOC   (score: 0.970)
  California                     → LOC   (score: 1.000)

The aggregation_strategy="simple" parameter merges subword tokens back into complete entities. Without it, you’d see individual pieces like “Cup” and “##ertino” as separate entities.
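The merging step itself is easy to sketch by hand. This is a simplified version of what the aggregation does, assuming WordPiece-style "##" continuation markers (a real pipeline also averages the scores and groups pieces by entity label):

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation: append without the '##' marker
        else:
            words.append(tok)      # start of a new word
    return words

pieces = ["Cup", "##ertino", ",", "California"]
print(merge_wordpieces(pieces))  # ['Cupertino', ',', 'California']
```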

Testing Across Domains

texts = [
    "Dr. Sarah Chen at MIT published a breakthrough paper on protein folding.",
    "Taylor Swift performed at the Eras Tour in London's Wembley Stadium last August.",
    "The Amazon River flows through Brazil, Peru, and Colombia.",
]

for text in texts:
    entities = ner(text)
    print(f"Text: {text}")
    for e in entities:
        print(f"  {e['word']:30s} → {e['entity_group']:5s} ({e['score']:.3f})")
    print()
Text: Dr. Sarah Chen at MIT published a breakthrough paper on protein folding.
  Sarah Chen                     → PER   (1.000)
  MIT                            → ORG   (0.994)

Text: Taylor Swift performed at the Eras Tour in London's Wembley Stadium last August.
  Taylor                         → PER   (0.674)
  Swift                          → ORG   (0.527)
  Eras Tour                      → MISC  (0.996)
  London                         → LOC   (1.000)
  Wembley Stadium                → LOC   (0.988)

Text: The Amazon River flows through Brazil, Peru, and Colombia.
  Amazon River                   → LOC   (0.766)
  Brazil                         → LOC   (0.999)
  Peru                           → LOC   (1.000)
  Colombia                       → LOC   (1.000)

Notice that the model correctly identifies “Amazon” as a location (the river) rather than an organization — the surrounding context (“River,” “flows through”) disambiguates it. This is the power of contextual embeddings: the same word gets different representations depending on context.


Stop 3: Question Answering

Extractive question answering is a task where the model finds the answer within a given passage of text. It doesn’t generate new text — it highlights a span. This is a natural fit for encoder-only models, which excel at understanding text bidirectionally.

qa = pipeline("question-answering")

context = """
Amazon was founded by Jeff Bezos in 1994 in his garage in Bellevue, Washington.
It started as an online bookstore before expanding into virtually every product
category. Today, Amazon is one of the most valuable companies in the world.
"""

questions = [
    "Who founded Amazon?",
    "When was Amazon founded?",
    "What did Amazon start as?",
    "Where was Amazon founded?",
]

for question in questions:
    result = qa(question=question, context=context)
    print(f"  Q: {question}")
    print(f"  A: {result['answer']} (score: {result['score']:.4f})\n")
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5.
  Q: Who founded Amazon?
  A: Jeff Bezos (score: 0.9983)

  Q: When was Amazon founded?
  A: 1994 (score: 0.9960)

  Q: What did Amazon start as?
  A: online bookstore (score: 0.6314)

  Q: Where was Amazon founded?
  A: Bellevue, Washington (score: 0.8680)

Context Dependence

The same question can produce different answers with different contexts — the model reads the passage, not its “memory”:

question = "When was the company founded?"

contexts = {
    "Amazon": "Amazon was founded by Jeff Bezos in 1994 in Bellevue, Washington.",
    "Google": "Google was founded in 1998 by Larry Page and Sergey Brin at Stanford University.",
}

for company, ctx in contexts.items():
    result = qa(question=question, context=ctx)
    print(f"  Context: {company:10s}  →  Answer: {result['answer']} ({result['score']:.4f})")
  Context: Amazon      →  Answer: 1994 (0.9896)
  Context: Google      →  Answer: 1998 (0.9776)

This is important: the QA model doesn’t “know” facts — it extracts answers from the context you provide. If the answer isn’t in the context, the model will either guess (with low confidence) or return a nonsensical span.
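Under the hood, an extractive QA model scores every token in the context as a possible answer start and as a possible answer end, then picks the best valid (start, end) pair. A toy sketch with invented scores (real models produce these from the encoder's final hidden states):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            total = s_score + end_scores[e]
            if total > best[2]:
                best = (s, e, total)
    return best[:2]

tokens = ["Amazon", "was", "founded", "by", "Jeff", "Bezos", "in", "1994"]
# Hypothetical scores peaking at "Jeff" (start) and "Bezos" (end).
start_scores = [0.1, 0.0, 0.0, 0.2, 5.0, 0.5, 0.0, 1.0]
end_scores   = [0.0, 0.0, 0.1, 0.0, 0.3, 4.8, 0.0, 1.2]

s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # Jeff Bezos
```

Because the answer must be a span of the given context, a question whose answer isn't present can only yield some low-scoring span — which is exactly the failure mode described above.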


Stop 4: Text Generation

Text generation is the signature task of decoder-only models. Given a prompt, the model continues the text one token at a time using causal language modeling — each token can only attend to tokens that came before it.
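The "one token at a time" loop can be sketched with a toy lookup table standing in for the neural network. The table below is invented purely for illustration; a real model produces a probability distribution over its whole vocabulary (~50,000 tokens for GPT-2) at each step:

```python
# Toy "language model": maps the last token to its most likely successor.
NEXT_TOKEN = {
    "the": "future",
    "future": "of",
    "of": "artificial",
    "artificial": "intelligence",
    "intelligence": "<eos>",
}

def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Greedy causal decoding: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if nxt == "<eos>":
            break
        tokens.append(nxt)  # each step conditions only on what came before
    return tokens

print(greedy_generate(["the"]))
# ['the', 'future', 'of', 'artificial', 'intelligence']
```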

generator = pipeline("text-generation", model="gpt2")

prompt = "The future of artificial intelligence is"
result = generator(prompt, max_new_tokens=30, do_sample=False, pad_token_id=50256)
print(result[0]["generated_text"])
The future of artificial intelligence is uncertain.

"We're not sure what the future will look like," said Dr. Michael S. Schoenfeld, a professor of computer

Decoding Parameters

The do_sample=False setting above uses greedy decoding — always picking the most probable next token. This is deterministic but can produce repetitive text. Let’s explore other strategies:

prompt = "Once upon a time in a small village, there lived"

# Greedy: deterministic, can be repetitive
result = generator(prompt, max_new_tokens=30, do_sample=False, pad_token_id=50256)
print(f"Greedy:     ...{result[0]['generated_text'][len(prompt):]}")

# Low temperature: more focused, still some randomness
set_seed(42)
result = generator(prompt, max_new_tokens=30, do_sample=True, temperature=0.7, pad_token_id=50256)
print(f"Temp=0.7:   ...{result[0]['generated_text'][len(prompt):]}")

# High temperature: more creative/chaotic
set_seed(42)
result = generator(prompt, max_new_tokens=30, do_sample=True, temperature=1.5, pad_token_id=50256)
print(f"Temp=1.5:   ...{result[0]['generated_text'][len(prompt):]}")
Greedy:     ... a man who was a great man. He was a man of great wealth and great power. He was a man of great wealth and great power.
Temp=0.7:   ... a daughter of a merchant. She was the daughter of a merchant, and once again she was the daughter of a merchant. She was the sister of
Temp=1.5:   ... three men -- a boy named Miho, who, to escape the dark and dreadful place, refused an inn's money deposit for years. After

Temperature controls the “sharpness” of the next-token probability distribution: values below 1 concentrate probability mass on the most likely tokens (more focused, more repetitive), while values above 1 flatten the distribution so unlikely tokens get sampled more often (more creative, more chaotic).

This is a knob you’ll use constantly when working with language models. The right setting depends on your application: factual Q&A wants low temperature; creative writing wants higher.
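The effect is easy to see numerically: temperature divides the logits before the softmax, sharpening (T < 1) or flattening (T > 1) the resulting distribution. A sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. T < 1 sharpens; T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

At T=0.7 the top token dominates more than at T=1.5, which is exactly why the low-temperature continuation above stays conventional while the high-temperature one wanders.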


Stop 5: The Architecture Showdown

This is the main event. We’ve used encoder-only models (classification, NER, QA) and a decoder-only model (text generation) separately. Now let’s put them side by side on the same inputs and see how their different architectures lead to fundamentally different behaviors.

Round 1: Fill-in-the-Blank vs. Continue-the-Text

Recall from Part 01: encoder-only models see all tokens (bidirectional attention), while decoder-only models see only preceding tokens (causal attention). What does this mean in practice?
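The two attention patterns can be made concrete as masks, where entry (i, j) says whether token i may attend to token j. A sketch for a 4-token sequence:

```python
n = 4  # sequence length

# Bidirectional (encoder-only): every token attends to every token.
bidirectional = [[1] * n for _ in range(n)]

# Causal (decoder-only): token i attends only to positions j <= i.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular causal mask is why a decoder predicting the word after “She went to the” cannot peek at “to buy groceries,” while the encoder filling in [MASK] can.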

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
generator = pipeline("text-generation", model="gpt2")

# Pairs: (masked sentence for encoder, prompt for decoder)
examples = [
    ("She went to the [MASK] to buy groceries.", "She went to the"),
    ("The movie was [MASK] and I loved it.", "The movie was"),
    ("Python is a popular [MASK] language.", "Python is a popular"),
]

print("=" * 70)
print("ENCODER-ONLY: Fill in the blank (sees ALL context)")
print("=" * 70)
for masked, _ in examples:
    results = fill_mask(masked)
    top3 = ", ".join(f"{r['token_str']} ({r['score']:.2f})" for r in results[:3])
    print(f"\n  {masked}")
    print(f"  Top predictions: {top3}")

print(f"\n{'=' * 70}")
print("DECODER-ONLY: Continue the text (sees only LEFT context)")
print("=" * 70)
for _, prompt in examples:
    result = generator(prompt, max_new_tokens=8, do_sample=False, pad_token_id=50256)
    continuation = result[0]["generated_text"][len(prompt):]
    print(f"\n  {prompt}...")
    print(f"  Continuation: {continuation.strip()}")
======================================================================
ENCODER-ONLY: Fill in the blank (sees ALL context)
======================================================================

  She went to the [MASK] to buy groceries.
  Top predictions: supermarket (0.21), store (0.20), mall (0.17)

  The movie was [MASK] and I loved it.
  Top predictions: awesome (0.09), great (0.07), fantastic (0.06)

  Python is a popular [MASK] language.
  Top predictions: programming (0.90), python (0.06), modeling (0.01)

======================================================================
DECODER-ONLY: Continue the text (sees only LEFT context)
======================================================================

  She went to the...
  Continuation: hospital and was treated for a broken nose

  The movie was...
  Continuation: released in Japan on May 7, 2016

  Python is a popular...
  Continuation: programming language, and it's easy to

The difference is dramatic: the encoder uses context on both sides of the blank (“to buy groceries”) to pick a word that fits the whole sentence, while the decoder sees only the words to its left and commits to any plausible continuation (“hospital,” “released in Japan”) with no way to know what should come after.

This is the key insight from Part 01 in action: bidirectional context helps you understand; left-to-right context helps you generate.

Figure 1: Bidirectional attention resolves ambiguity by seeing the full sentence. Causal attention must commit to an interpretation before seeing right context.

Round 2: Sentiment Classification

Now let’s compare the two architectures on a task that encoder-only models are specifically fine-tuned for — sentiment classification:

sentiment = pipeline("sentiment-analysis")  # Encoder-only (fine-tuned DistilBERT)
generator = pipeline("text-generation", model="gpt2")  # Decoder-only (no fine-tuning)

test_reviews = [
    "This restaurant has the best pasta I've ever tasted!",
    "The service was slow and the food was cold. Never coming back.",
    "Decent place, nothing extraordinary but gets the job done.",
]

print("=" * 70)
print("ENCODER-ONLY: Fine-tuned sentiment classifier")
print("=" * 70)
for review in test_reviews:
    result = sentiment(review)
    print(f"  {result[0]['label']:10s} ({result[0]['score']:.4f})  {review[:55]}")

print(f"\n{'=' * 70}")
print("DECODER-ONLY: GPT-2 zero-shot (prompting)")
print("=" * 70)
for review in test_reviews:
    prompt = f'Review: "{review}"\nSentiment (positive or negative):'
    result = generator(prompt, max_new_tokens=5, do_sample=False, pad_token_id=50256)
    generated = result[0]["generated_text"][len(prompt):]
    print(f"  Generated: {generated.strip():30s}  {review[:40]}...")
======================================================================
ENCODER-ONLY: Fine-tuned sentiment classifier
======================================================================
  POSITIVE   (0.9998)  This restaurant has the best pasta I've ever tasted!
  NEGATIVE   (0.9997)  The service was slow and the food was cold. Never comin
  POSITIVE   (0.9989)  Decent place, nothing extraordinary but gets the job do

======================================================================
DECODER-ONLY: GPT-2 zero-shot (prompting)
======================================================================
  Generated: "I'm so happy                   This restaurant has the best pasta I've ...
  Generated: "I was disappointed with        The service was slow and the food was co...
  Generated: "I'm not sure                   Decent place, nothing extraordinary but ...

The encoder-only classifier gives clean, confident predictions with a single label and probability score. GPT-2, on the other hand, wasn’t trained for classification — it just continues the text. Without instruction tuning (which we’ll see in Week 8), small decoder-only models struggle with structured tasks like classification.

The Takeaway

This comparison illustrates a theme from Part 01: at small scale, a model fine-tuned for a task beats a general-purpose model prompted to do it.

The right choice depends on your use case: if you need a fast, accurate classifier for production, fine-tune an encoder. If you need flexibility and general-purpose capability, reach for a larger decoder-only model.

Figure 2: A small fine-tuned encoder outperforms a small zero-shot decoder on structured tasks like sentiment classification.


Wrap-Up

Key Takeaways

- Pipelines make classification, NER, QA, and generation a few lines each — but pin a model name and revision for production use.
- Encoder-only models (bidirectional attention) excel at understanding tasks: classification, sequence labeling, and extractive QA.
- Decoder-only models (causal attention) excel at generation, and decoding parameters like temperature shape what they produce.
- On structured tasks, a small fine-tuned encoder beats a small zero-shot decoder; instruction tuning (Week 8) is what closes that gap for decoders.

What’s Next

In Week 8, we’ll zoom out from individual models to the broader landscape of foundation models and modern LLMs. We’ll explore scaling laws (why bigger models are better), the spectrum of open vs. closed models, and how techniques like LoRA let you fine-tune massive models on a laptop. The hands-on experience you’ve built today — using pipelines, comparing architectures, and probing model behavior — sets the stage for working with the full-scale models that power today’s AI applications.