Lab — Model Safari

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon



Overview

In Parts 01 and 02, we learned the theory behind transformer model variants and the Hugging Face tools to use them. Now it’s time to go on safari.

In this lab, we’ll visit five “stops” — each one a different NLP task powered by a Hugging Face pipeline. At each stop, you’ll run the models, inspect the outputs, and push the models to their limits. The final stop pits encoder-only against decoder-only models head-to-head on the same inputs.

By the end, you’ll have hands-on intuition for which model does what well — and why.


Setup

from transformers import pipeline, set_seed

Stop 1: Text Classification

Our first stop is sentiment analysis — the classic text classification task. In Part 02, we saw this in three lines. Now let’s explore it more deeply.

Batch Classification

Pipelines can process a list of inputs in one call:

classifier = pipeline("sentiment-analysis")

reviews = [
    "This movie was absolutely fantastic! The acting was superb.",
    "Terrible film. Boring plot and awful dialogue.",
    "It was okay, nothing special but not bad either.",
    "The cinematography was beautiful but the story fell flat.",
    "I can't believe how good this was. A masterpiece!",
]

results = classifier(reviews)

for review, result in zip(reviews, results):
    label = result["label"]
    score = result["score"]
    print(f"  {label:10s} ({score:.4f})  {review[:55]}")
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f.
Using a pipeline without specifying a model name and revision in production is not recommended.
  POSITIVE   (0.9999)  This movie was absolutely fantastic! The acting was sup
  NEGATIVE   (0.9998)  Terrible film. Boring plot and awful dialogue.
  POSITIVE   (0.9891)  It was okay, nothing special but not bad either.
  NEGATIVE   (0.9996)  The cinematography was beautiful but the story fell fla
  POSITIVE   (0.9998)  I can't believe how good this was. A masterpiece!

Interpreting Confidence

Look at the scores — most are above 0.99. The model is very confident. But review #3 (“It was okay...”) and #4 (“The cinematography was beautiful but...”) are interesting cases. Mixed-sentiment reviews are harder — the model must weigh positive and negative signals against each other.

Also notice: review #3 is labeled POSITIVE even though it’s arguably neutral. The model was trained on binary sentiment data (SST-2), so it has no “neutral” option. This is a limitation of the training data, not the architecture.
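One way to see why a binary model can't abstain: its final softmax runs over exactly two logits, and two probabilities must sum to 1, so even a weak preference comes out looking confident. A minimal sketch (the logit values below are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: a clearly positive review vs. a mixed/neutral one.
clear_positive = softmax([4.0, -4.0])   # strong preference for class 0
mixed_review   = softmax([1.0, -1.0])   # weak preference for class 0

print(clear_positive)  # ~[0.9997, 0.0003]
print(mixed_review)    # ~[0.88, 0.12]
```

Even the weak preference becomes an 88% "confident" POSITIVE — there is simply no third bucket for neutral.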


Stop 2: Named Entity Recognition

NER is a sequence labeling task — instead of one label per document, the model assigns a label to each token. In Week 4, we built NER systems with SpaCy. Now let’s see how transformer-based NER compares.

ner = pipeline("ner", aggregation_strategy="simple")

text = "Apple Inc. reported record revenue of $123 billion in Cupertino, California."
entities = ner(text)

print(f"Text: {text}\n")
for entity in entities:
    print(f"  {entity['word']:30s} → {entity['entity_group']:5s} (score: {entity['score']:.3f})")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496.
Text: Apple Inc. reported record revenue of $123 billion in Cupertino, California.

  Apple Inc                      → ORG   (score: 0.999)
  Cupertino                      → LOC   (score: 0.970)
  California                     → LOC   (score: 1.000)

The aggregation_strategy="simple" parameter merges subword tokens back into complete entities. Without it, you’d see individual pieces like “Cup” and “##ertino” as separate entities.
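The merging step itself is easy to sketch by hand. This is a simplified version of what the aggregation does, assuming WordPiece-style "##" continuation markers (a real pipeline also averages the scores and groups pieces by entity label):

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation: append without the '##' marker
        else:
            words.append(tok)      # start of a new word
    return words

pieces = ["Cup", "##ertino", ",", "California"]
print(merge_wordpieces(pieces))  # ['Cupertino', ',', 'California']
```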

Testing Across Domains

texts = [
    "Dr. Sarah Chen at MIT published a breakthrough paper on protein folding.",
    "Taylor Swift performed at the Eras Tour in London's Wembley Stadium last August.",
    "The Amazon River flows through Brazil, Peru, and Colombia.",
]

for text in texts:
    entities = ner(text)
    print(f"Text: {text}")
    for e in entities:
        print(f"  {e['word']:30s} → {e['entity_group']:5s} ({e['score']:.3f})")
    print()
Text: Dr. Sarah Chen at MIT published a breakthrough paper on protein folding.
  Sarah Chen                     → PER   (1.000)
  MIT                            → ORG   (0.994)

Text: Taylor Swift performed at the Eras Tour in London's Wembley Stadium last August.
  Taylor                         → PER   (0.674)
  Swift                          → ORG   (0.527)
  Eras Tour                      → MISC  (0.996)
  London                         → LOC   (1.000)
  Wembley Stadium                → LOC   (0.988)

Text: The Amazon River flows through Brazil, Peru, and Colombia.
  Amazon River                   → LOC   (0.766)
  Brazil                         → LOC   (0.999)
  Peru                           → LOC   (1.000)
  Colombia                       → LOC   (1.000)

Notice that the model correctly identifies “Amazon” as a location (the river) rather than an organization — the surrounding context (“River,” “flows through”) disambiguates it. This is the power of contextual embeddings: the same word gets different representations depending on context.


Stop 3: Question Answering

Extractive question answering is a task where the model finds the answer within a given passage of text. It doesn’t generate new text — it highlights a span. This is a natural fit for encoder-only models, which excel at understanding text bidirectionally.

qa = pipeline("question-answering")

context = """
Amazon was founded by Jeff Bezos in 1994 in his garage in Bellevue, Washington.
It started as an online bookstore before expanding into virtually every product
category. Today, Amazon is one of the most valuable companies in the world.
"""

questions = [
    "Who founded Amazon?",
    "When was Amazon founded?",
    "What did Amazon start as?",
    "Where was Amazon founded?",
]

for question in questions:
    result = qa(question=question, context=context)
    print(f"  Q: {question}")
    print(f"  A: {result['answer']} (score: {result['score']:.4f})\n")
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5.
  Q: Who founded Amazon?
  A: Jeff Bezos (score: 0.9983)

  Q: When was Amazon founded?
  A: 1994 (score: 0.9960)

  Q: What did Amazon start as?
  A: online bookstore (score: 0.6314)

  Q: Where was Amazon founded?
  A: Bellevue, Washington (score: 0.8680)

Context Dependence

The same question can produce different answers with different contexts — the model reads the passage, not its “memory”:

question = "When was the company founded?"

contexts = {
    "Amazon": "Amazon was founded by Jeff Bezos in 1994 in Bellevue, Washington.",
    "Google": "Google was founded in 1998 by Larry Page and Sergey Brin at Stanford University.",
}

for company, ctx in contexts.items():
    result = qa(question=question, context=ctx)
    print(f"  Context: {company:10s}  →  Answer: {result['answer']} ({result['score']:.4f})")
  Context: Amazon      →  Answer: 1994 (0.9896)
  Context: Google      →  Answer: 1998 (0.9776)

This is important: the QA model doesn’t “know” facts — it extracts answers from the context you provide. If the answer isn’t in the context, the model will either guess (with low confidence) or return a nonsensical span.
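Under the hood, an extractive QA model scores every token in the context as a possible answer start and as a possible answer end, then picks the best valid (start, end) pair. A toy sketch with invented scores (real models produce these from the encoder's final hidden states):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            total = s_score + end_scores[e]
            if total > best[2]:
                best = (s, e, total)
    return best[:2]

tokens = ["Amazon", "was", "founded", "by", "Jeff", "Bezos", "in", "1994"]
# Hypothetical scores peaking at "Jeff" (start) and "Bezos" (end).
start_scores = [0.1, 0.0, 0.0, 0.2, 5.0, 0.5, 0.0, 1.0]
end_scores   = [0.0, 0.0, 0.1, 0.0, 0.3, 4.8, 0.0, 1.2]

s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # Jeff Bezos
```

Because the answer must be a span of the given context, a question whose answer isn't present can only yield some low-scoring span — which is exactly the failure mode described above.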


Stop 4: Text Generation

Text generation is the signature task of decoder-only models. Given a prompt, the model continues the text one token at a time using causal language modeling — each token can only attend to tokens that came before it.
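The "one token at a time" loop can be sketched with a toy lookup table standing in for the neural network. The table below is invented purely for illustration; a real model produces a probability distribution over its whole vocabulary (~50,000 tokens for GPT-2) at each step:

```python
# Toy "language model": maps the last token to its most likely successor.
NEXT_TOKEN = {
    "the": "future",
    "future": "of",
    "of": "artificial",
    "artificial": "intelligence",
    "intelligence": "<eos>",
}

def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Greedy causal decoding: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if nxt == "<eos>":
            break
        tokens.append(nxt)  # each step conditions only on what came before
    return tokens

print(greedy_generate(["the"]))
# ['the', 'future', 'of', 'artificial', 'intelligence']
```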

generator = pipeline("text-generation", model="gpt2")

prompt = "The future of artificial intelligence is"
result = generator(prompt, max_new_tokens=30, do_sample=False, pad_token_id=50256)
print(result[0]["generated_text"])
The future of artificial intelligence is uncertain.

"We're not sure what the future will look like," said Dr. Michael S. Schoenfeld, a professor of computer

Decoding Parameters

The do_sample=False setting above uses greedy decoding — always picking the most probable next token. This is deterministic but can produce repetitive text. Let’s explore other strategies:

prompt = "Once upon a time in a small village, there lived"

# Greedy: deterministic, can be repetitive
result = generator(prompt, max_new_tokens=30, do_sample=False, pad_token_id=50256)
print(f"Greedy:     ...{result[0]['generated_text'][len(prompt):]}")

# Low temperature: more focused, still some randomness
set_seed(42)
result = generator(prompt, max_new_tokens=30, do_sample=True, temperature=0.7, pad_token_id=50256)
print(f"Temp=0.7:   ...{result[0]['generated_text'][len(prompt):]}")

# High temperature: more creative/chaotic
set_seed(42)
result = generator(prompt, max_new_tokens=30, do_sample=True, temperature=1.5, pad_token_id=50256)
print(f"Temp=1.5:   ...{result[0]['generated_text'][len(prompt):]}")
Greedy:     ... a man who was a great man. He was a man of great wealth and great power. He was a man of great wealth and great power.
Temp=0.7:   ... a daughter of a merchant. She was the daughter of a merchant, and once again she was the daughter of a merchant. She was the sister of
Temp=1.5:   ... three men -- a boy named Miho, who, to escape the dark and dreadful place, refused an inn's money deposit for years. After

Temperature controls the “sharpness” of the next-token probability distribution: values below 1 concentrate probability mass on the most likely tokens (more focused, more repetitive), while values above 1 flatten the distribution so unlikely tokens get sampled more often (more creative, more chaotic).

This is a knob you’ll use constantly when working with language models. The right setting depends on your application: factual Q&A wants low temperature; creative writing wants higher.
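The effect is easy to see numerically: temperature divides the logits before the softmax, sharpening (T < 1) or flattening (T > 1) the resulting distribution. A sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax. T < 1 sharpens; T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

At T=0.7 the top token dominates more than at T=1.5, which is exactly why the low-temperature continuation above stays conventional while the high-temperature one wanders.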


Stop 5: The Architecture Showdown

This is the main event. We’ve used encoder-only models (classification, NER, QA) and a decoder-only model (text generation) separately. Now let’s put them side by side on the same inputs and see how their different architectures lead to fundamentally different behaviors.

Round 1: Fill-in-the-Blank vs. Continue-the-Text

Recall from Part 01: encoder-only models see all tokens (bidirectional attention), while decoder-only models see only preceding tokens (causal attention). What does this mean in practice?
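The two attention patterns can be made concrete as masks, where entry (i, j) says whether token i may attend to token j. A sketch for a 4-token sequence:

```python
n = 4  # sequence length

# Bidirectional (encoder-only): every token attends to every token.
bidirectional = [[1] * n for _ in range(n)]

# Causal (decoder-only): token i attends only to positions j <= i.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular causal mask is why a decoder predicting the word after “She went to the” cannot peek at “to buy groceries,” while the encoder filling in [MASK] can.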

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
generator = pipeline("text-generation", model="gpt2")

# Pairs: (masked sentence for encoder, prompt for decoder)
examples = [
    ("She went to the [MASK] to buy groceries.", "She went to the"),
    ("The movie was [MASK] and I loved it.", "The movie was"),
    ("Python is a popular [MASK] language.", "Python is a popular"),
]

print("=" * 70)
print("ENCODER-ONLY: Fill in the blank (sees ALL context)")
print("=" * 70)
for masked, _ in examples:
    results = fill_mask(masked)
    top3 = ", ".join(f"{r['token_str']} ({r['score']:.2f})" for r in results[:3])
    print(f"\n  {masked}")
    print(f"  Top predictions: {top3}")

print(f"\n{'=' * 70}")
print("DECODER-ONLY: Continue the text (sees only LEFT context)")
print("=" * 70)
for _, prompt in examples:
    result = generator(prompt, max_new_tokens=8, do_sample=False, pad_token_id=50256)
    continuation = result[0]["generated_text"][len(prompt):]
    print(f"\n  {prompt}...")
    print(f"  Continuation: {continuation.strip()}")
======================================================================
ENCODER-ONLY: Fill in the blank (sees ALL context)
======================================================================

  She went to the [MASK] to buy groceries.
  Top predictions: supermarket (0.21), store (0.20), mall (0.17)

  The movie was [MASK] and I loved it.
  Top predictions: awesome (0.09), great (0.07), fantastic (0.06)

  Python is a popular [MASK] language.
  Top predictions: programming (0.90), python (0.06), modeling (0.01)

======================================================================
DECODER-ONLY: Continue the text (sees only LEFT context)
======================================================================

  She went to the...
  Continuation: hospital and was treated for a broken nose

  The movie was...
  Continuation: released in Japan on May 7, 2016

  Python is a popular...
  Continuation: programming language, and it's easy to

The difference is dramatic: the encoder uses context on both sides of the blank (“to buy groceries”) to pick a word that fits the whole sentence, while the decoder sees only the words to its left and commits to any plausible continuation (“hospital,” “released in Japan”) with no way to know what should come after.

This is the key insight from Part 01 in action: bidirectional context helps you understand; left-to-right context helps you generate.

Figure 1: Bidirectional attention resolves ambiguity by seeing the full sentence. Causal attention must commit to an interpretation before seeing right context.

Round 2: Sentiment Classification

Now let’s compare the two architectures on a task that encoder-only models are specifically fine-tuned for — sentiment classification:

sentiment = pipeline("sentiment-analysis")  # Encoder-only (fine-tuned DistilBERT)
generator = pipeline("text-generation", model="gpt2")  # Decoder-only (no fine-tuning)

test_reviews = [
    "This restaurant has the best pasta I've ever tasted!",
    "The service was slow and the food was cold. Never coming back.",
    "Decent place, nothing extraordinary but gets the job done.",
]

print("=" * 70)
print("ENCODER-ONLY: Fine-tuned sentiment classifier")
print("=" * 70)
for review in test_reviews:
    result = sentiment(review)
    print(f"  {result[0]['label']:10s} ({result[0]['score']:.4f})  {review[:55]}")

print(f"\n{'=' * 70}")
print("DECODER-ONLY: GPT-2 zero-shot (prompting)")
print("=" * 70)
for review in test_reviews:
    prompt = f'Review: "{review}"\nSentiment (positive or negative):'
    result = generator(prompt, max_new_tokens=5, do_sample=False, pad_token_id=50256)
    generated = result[0]["generated_text"][len(prompt):]
    print(f"  Generated: {generated.strip():30s}  {review[:40]}...")
======================================================================
ENCODER-ONLY: Fine-tuned sentiment classifier
======================================================================
  POSITIVE   (0.9998)  This restaurant has the best pasta I've ever tasted!
  NEGATIVE   (0.9997)  The service was slow and the food was cold. Never comin
  POSITIVE   (0.9989)  Decent place, nothing extraordinary but gets the job do

======================================================================
DECODER-ONLY: GPT-2 zero-shot (prompting)
======================================================================
  Generated: "I'm so happy                   This restaurant has the best pasta I've ...
  Generated: "I was disappointed with        The service was slow and the food was co...
  Generated: "I'm not sure                   Decent place, nothing extraordinary but ...

The encoder-only classifier gives clean, confident predictions with a single label and probability score. GPT-2, on the other hand, wasn’t trained for classification — it just continues the text. Without instruction tuning (which we’ll see in Week 8), small decoder-only models struggle with structured tasks like classification.

The Takeaway

This comparison illustrates a theme from Part 01: at small scale, a model fine-tuned for a task beats a general-purpose model prompted to do it.

The right choice depends on your use case: if you need a fast, accurate classifier for production, fine-tune an encoder. If you need flexibility and general-purpose capability, reach for a larger decoder-only model.

Figure 2: A small fine-tuned encoder outperforms a small zero-shot decoder on structured tasks like sentiment classification.


Wrap-Up

Key Takeaways

- Pipelines make classification, NER, QA, and generation a few lines each — but pin a model name and revision for production use.
- Encoder-only models (bidirectional attention) excel at understanding tasks: classification, sequence labeling, and extractive QA.
- Decoder-only models (causal attention) excel at generation, and decoding parameters like temperature shape what they produce.
- On structured tasks, a small fine-tuned encoder beats a small zero-shot decoder; instruction tuning (Week 8) is what closes that gap for decoders.

What’s Next

In Week 8, we’ll zoom out from individual models to the broader landscape of foundation models and modern LLMs. We’ll explore scaling laws (why bigger models are better), the spectrum of open vs. closed models, and how techniques like LoRA let you fine-tune massive models on a laptop. The hands-on experience you’ve built today — using pipelines, comparing architectures, and probing model behavior — sets the stage for working with the full-scale models that power today’s AI applications.