
The Foundation Model Revolution

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


From Specialists to Generalists

Last week, we studied the three Transformer variants — encoder-only, decoder-only, and encoder-decoder — and explored when each one shines. We fine-tuned BERT for classification, used GPT for generation, and saw how T5 frames everything as text-to-text.

But here’s a question worth sitting with: BERT and GPT-5 are both Transformers. So why can GPT-5 write essays, solve math problems, and build entire applications, while BERT can only fill in blanks?

The answer isn’t the architecture — it’s scale. What happened between 2018 and 2026 is one of the most remarkable stories in the history of AI: researchers discovered that if you take the same decoder-only Transformer architecture, make it dramatically larger, train it on dramatically more data, and throw dramatically more compute at it... qualitatively new capabilities emerge. Capabilities that nobody explicitly programmed.

This is the foundation model revolution, and understanding it is essential for everything we’ll do in the rest of this course — from working with LLM APIs to building agents.


The Foundation Model Timeline

Let’s trace the key milestones. The striking thing about this timeline is how fast things moved.

Figure 1: The evolution from task-specific pretrained models (2018) to today’s multi-provider foundation model ecosystem, all built on the same decoder-only Transformer architecture.

2018: The Starting Gun

BERT (340M parameters) and GPT (117M parameters) were published in the same year, representing competing bets on which half of the Transformer mattered more. Both were relatively small by today’s standards. The workflow was clear: pretrain a model on a large corpus, then fine-tune it on your specific task. You needed a separate fine-tuned model for sentiment analysis, a different one for NER, and yet another for question answering.

2019-2020: Scale Changes Everything

GPT-2 (1.5B parameters) showed that a larger decoder-only model could generate surprisingly coherent long-form text. But the real inflection point was GPT-3 (175B parameters) in 2020. GPT-3 demonstrated something no one had seen before: in-context learning. You didn’t need to fine-tune the model at all — you could just describe the task in the prompt, provide a few examples, and the model would perform it. This was a roughly 1,500x increase in parameters over the original GPT, and it unlocked qualitatively different behavior.
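
To make in-context learning concrete, here is a minimal sketch of a few-shot prompt for sentiment classification. The `Review:`/`Sentiment:` scaffolding and the example labels are an illustrative convention, not a requirement of any particular model — the point is that the "training" lives entirely in the prompt:

```python
# Sketch of few-shot in-context learning: no gradient updates, no
# fine-tuning. The model infers the task from demonstrations in the prompt.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
    ("A masterpiece of modern cinema.", "positive"),
]

def build_few_shot_prompt(demos, query):
    """Format (text, label) demonstrations plus a new input as one prompt string."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {query}\nSentiment:")  # model completes the label
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(examples, "The plot dragged and the acting was wooden.")
print(prompt)
```

Sending this string to a sufficiently large model typically yields the missing label as the next token — the behavior GPT-3 first made reliable.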

2022-2023: Mainstream Adoption

ChatGPT (late 2022) wrapped GPT-3.5 in a conversational interface and became the fastest-growing consumer application in history. GPT-4 (2023) added multimodal capabilities and stronger reasoning.

Meanwhile, a pivotal moment for the open model ecosystem was unfolding. In February 2023, Meta released LLaMA — a family of models from 7B to 65B parameters — under a restricted, case-by-case research license. Within a week, the model weights were leaked via a torrent on 4chan and spread rapidly through online AI communities. Meta filed DMCA takedowns, but the cat was out of the bag.

What happened next was remarkable. The research community seized on LLaMA as a foundation and built an explosion of derivatives: Stanford’s Alpaca (instruction-tuned for $600), Vicuna, Koala, Dolly, and thousands more. Researchers got these models running on consumer laptops, demonstrated that small fine-tuned models could rival much larger ones on specific tasks, and collectively proved that open models could be a viable alternative to closed APIs. Over 7,000 LLaMA derivatives appeared on Hugging Face within months.

Meta read the room and leaned in. In July 2023, they officially released Llama 2 under a far more permissive license, explicitly endorsing commercial use. This decision — turning an accidental leak into a deliberate open-model strategy — catalyzed the entire open model ecosystem we see today. Without LLaMA’s leak and the community’s response, the competitive open model landscape of Llama 4, DeepSeek, Qwen, and Mistral might look very different.

2025-2026: The Current Landscape

Today we have a mature multi-provider ecosystem. On the closed side: GPT-5.4 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). On the open side: Qwen 3.5 and MiniMax M2.5 lead leaderboards, with Llama 4 (Meta), DeepSeek V3.2, and GLM-5 close behind. Many open models now use Mixture-of-Experts (MoE) architectures — Llama 4 Maverick has 400B total parameters but only activates 17B per token, making inference far more efficient. The gap between open and closed models has narrowed dramatically, and the choice between them is now driven by practical trade-offs rather than raw capability alone.
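
A quick back-of-envelope calculation shows why MoE inference is cheaper. Per-token forward-pass compute scales with the parameters actually activated (roughly 2 FLOPs per active parameter), not with the total stored in memory. This is a simplified accounting that ignores routing and attention overhead:

```python
# Back-of-envelope MoE inference cost, using the Llama 4 Maverick numbers
# from the text. Simplified: ~2 FLOPs per active parameter per token;
# routing and attention overhead are ignored.
total_params = 400e9   # parameters stored (all experts)
active_params = 17e9   # parameters activated per token

flops_dense = 2 * total_params   # a hypothetical dense 400B model
flops_moe = 2 * active_params    # the MoE model's actual per-token cost
print(f"Per-token compute vs. a dense 400B model: {flops_moe / flops_dense:.1%}")
```

The model still needs memory for all 400B parameters, but each token pays only the compute bill of a ~17B model.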

The key insight across this entire timeline? Same architecture. Same basic idea (predict the next token). Just... more. More parameters, more data, more compute. And that “more” turned out to be transformative.


Scaling Laws

Why Does Bigger Mean Better?

This observation — that scaling up produces better models — isn’t just an empirical accident. In 2020, Kaplan et al. at OpenAI published a landmark paper showing that language model performance follows remarkably predictable scaling laws. When you plot model loss against the number of parameters, the amount of training data, or the total compute on a log-log scale, you get approximately straight lines.

Figure 2: Language model loss decreases predictably as you scale up three levers: model size (N), dataset size (D), and compute (C). On a log-log plot, the relationship is approximately linear — a power law.

What this means in practice: performance became predictable before training, and that predictability was transformative for the field. Labs could invest billions of dollars in training larger models with reasonable confidence about what the result would be. You didn’t have to train a 175B model to know it would be better than a 13B model — the scaling curves told you in advance.
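
The power-law form is easy to play with numerically. Below is a minimal sketch using the approximate parameter-count fit reported by Kaplan et al. (alpha_N ≈ 0.076, N_c ≈ 8.8e13); the constants are illustrative only, since real losses also depend on data, tokenizer, and training details:

```python
# Sketch of the Kaplan et al. parameter-count scaling law:
#   L(N) ~ (N_c / N) ** alpha_N
# Constants are the approximate published fits, used here for illustration.
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted cross-entropy loss as a function of parameter count N."""
    return (n_c / n_params) ** alpha_n

# Parameter counts from the timeline above: GPT, GPT-2, GPT-3
for label, n in [("GPT", 1.17e8), ("GPT-2", 1.5e9), ("GPT-3", 1.75e11)]:
    print(f"{label:<6} N = {n:.2e}  ->  predicted loss ~ {loss_from_params(n):.2f}")
```

Note how each ~10x jump in N shaves off a predictable slice of loss — a straight line on a log-log plot.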

The Chinchilla Insight

There’s a subtlety, though. You have three levers to pull — model size (N), data (D), and compute (C) — and they’re related by the approximate relationship C ≈ 6 × N × D (where C is in FLOPs). Given a fixed compute budget, how should you allocate between a bigger model and more training data?

Early LLMs like GPT-3 went big on parameters but were undertrained — they didn’t see enough data relative to their size. In 2022, Hoffmann et al. at DeepMind published the “Chinchilla” paper, which showed that for a fixed compute budget, you should scale model size and data roughly equally — the principle of compute-optimal training. Their 70B parameter Chinchilla model, trained on more data, outperformed the much larger 280B Gopher model trained on less data.

This insight reshaped how models are trained. Llama 4, for example, uses Mixture-of-Experts with 400B+ total parameters trained on 15+ trillion tokens — far more data relative to model size than GPT-3 used.
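
The allocation question can be sketched with the widely cited rule of thumb of roughly 20 training tokens per parameter, derived from the Chinchilla fits. Substituting D ≈ 20N into C ≈ 6 × N × D gives C ≈ 120N², so N ≈ sqrt(C / 120). The GPT-3 compute figure below is approximate:

```python
import math

def compute_optimal(c_flops, tokens_per_param=20):
    """Chinchilla-style rule of thumb: split a FLOP budget between N and D.

    From C ~= 6 * N * D and D ~= tokens_per_param * N:
        C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    """
    n = math.sqrt(c_flops / (6 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# GPT-3's training run used roughly 3.14e23 FLOPs (approximate figure)
n, d = compute_optimal(3.14e23)
print(f"Compute-optimal for that budget: N ~ {n / 1e9:.0f}B params, D ~ {d / 1e12:.1f}T tokens")
```

For GPT-3's budget this works out to roughly a 50B-parameter model trained on about a trillion tokens — far fewer parameters and far more data than the actual 175B-parameter, 300B-token split, which is exactly the sense in which GPT-3 was undertrained.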

Emergent Capabilities

Perhaps the most fascinating consequence of scaling is the appearance of emergent capabilities — abilities that seem to switch on suddenly at certain scales. Small models can’t do chain-of-thought reasoning. Medium models do it poorly. Then, past a certain threshold, the ability appears as if switched on.

Examples of emergent capabilities include in-context learning (GPT-3’s headline ability), chain-of-thought reasoning, and following natural-language instructions — each absent or unreliable below a certain scale, then suddenly present above it.

These emergent behaviors are why foundation models feel qualitatively different from their smaller predecessors. It’s not just “a little better at everything” — it’s “can do things the smaller model fundamentally couldn’t.”


Open vs. Closed Models

The foundation model landscape today is split into two camps, and understanding the trade-offs between them is critical for anyone building NLP applications.

Figure 3: Closed models offer the highest capability with minimal setup; open models offer control, privacy, and customizability. The right choice depends on your constraints.

Closed Models (API-Only)

Closed models — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro — are accessible only through an API. You send your text, the model processes it on the provider’s servers, and you get a response back. You never see the model’s weights.

When to choose closed: you want the highest raw capability with minimal setup, you’re comfortable sending your data to a third-party provider, and you’d rather pay per request than manage your own inference infrastructure.

Open Models (Weights Available)

Open models — Qwen 3.5, MiniMax M2.5, Llama 4, DeepSeek V3.2 — release their weights publicly (with varying licenses). You download the model and run it on your own hardware, or use tools like Ollama to run them locally with a single command.

When to choose open: you need data privacy or on-premise deployment, you want to fine-tune or otherwise customize the weights, or you want to control costs by running inference on your own hardware.

A Note on “Open”

The term “open” in the model world is nuanced. Some models (like Llama 4) release weights but have restrictive licenses for commercial use. Others (like DeepSeek V3.2, released under the MIT license) are fully permissive. Very few release training data and full training code — the way traditional open-source software does. When evaluating an “open” model, always check: What exactly is open? Weights? Code? Data? And under what license?

# Let's look at some open foundation models on the Hugging Face Hub
from huggingface_hub import model_info

# A selection of leading open models (as of early 2026)
model_names = [
    "Qwen/Qwen3.5-397B-A17B",
    "Qwen/Qwen3-32B",
    "deepseek-ai/DeepSeek-R1-0528",
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "microsoft/phi-4",
]

print(f"{'Model':<50} {'Released':>12} {'Downloads':>12} {'Likes':>8}")
print("-" * 87)

for name in model_names:
    try:
        m = model_info(name)
        released = m.created_at.strftime("%Y-%m-%d") if m.created_at else "N/A"
        downloads = f"{m.downloads:,}" if m.downloads else "N/A"
        likes = f"{m.likes:,}" if m.likes else "N/A"
        print(f"{m.id:<50} {released:>12} {downloads:>12} {likes:>8}")
    except Exception as e:
        print(f"{name:<50} (error: {e})")
Model                                                  Released    Downloads    Likes
---------------------------------------------------------------------------------------
Qwen/Qwen3.5-397B-A17B                               2026-02-16    1,807,452    1,372
Qwen/Qwen3-32B                                       2025-04-27    4,583,432      670
deepseek-ai/DeepSeek-R1-0528                         2025-05-28      899,271    2,408
meta-llama/Llama-4-Scout-17B-16E-Instruct            2025-04-02      239,588    1,248
microsoft/phi-4                                      2024-12-11      948,265    2,220

These are all open models you can download and run locally (or via Ollama). The closed models (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) don’t appear on the Hub because their weights aren’t available — you interact with them only through APIs, which we’ll explore in Parts 03 and 04 of this week.


Wrap-Up

Key Takeaways

What’s Next

In Part 02, we’ll explore fine-tuning — the techniques that let you take a foundation model and customize it for your specific domain or task. We’ll cover full fine-tuning vs. parameter-efficient methods like LoRA and QLoRA, instruction tuning, and alignment with RLHF. This is where foundation models go from general-purpose to your-purpose.