The Foundation Model Revolution
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L07.01: Transformer model variants — encoder-only, decoder-only, encoder-decoder architectures and their pretraining objectives
L07.02: The Hugging Face ecosystem — pipelines, model hub, and datasets library
Outcomes
Trace the evolution from task-specific pretrained models (BERT, GPT-2) to general-purpose foundation models
Explain the core intuition behind scaling laws and why larger models exhibit emergent capabilities
Compare open vs. closed models along dimensions of capability, cost, privacy, and licensing
Make informed decisions about when to use open vs. closed models for a given use case
References
HF Chapter 1: Transformer Models (LLM sections)
Kaplan et al. (2020) — Scaling Laws for Neural Language Models
Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla)
From Specialists to Generalists
Last week, we studied the three Transformer variants — encoder-only, decoder-only, and encoder-decoder — and explored when each one shines. We fine-tuned BERT for classification, used GPT for generation, and saw how T5 frames everything as text-to-text.
But here’s a question worth sitting with: BERT and GPT-5 are both Transformers. So why can GPT-5 write essays, solve math problems, and build entire applications, while BERT can only fill in blanks?
The answer isn’t the architecture — it’s scale. What happened between 2018 and 2026 is one of the most remarkable stories in the history of AI: researchers discovered that if you take the same decoder-only Transformer architecture, make it dramatically larger, train it on dramatically more data, and throw dramatically more compute at it... qualitatively new capabilities emerge. Capabilities that nobody explicitly programmed.
This is the foundation model revolution, and understanding it is essential for everything we’ll do in the rest of this course — from working with LLM APIs to building agents.
The Foundation Model Timeline
Let’s trace the key milestones. The striking thing about this timeline is how fast things moved.
Figure 1: The evolution from task-specific pretrained models (2018) to today’s multi-provider foundation model ecosystem, all built on the same decoder-only Transformer architecture.
2018: The Starting Gun
BERT (340M parameters) and GPT (117M parameters) were published in the same year, representing competing bets on which half of the Transformer mattered more. Both were relatively small by today’s standards. The workflow was clear: pretrain a model on a large corpus, then fine-tune it on your specific task. You needed a separate fine-tuned model for sentiment analysis, a different one for NER, and yet another for question answering.
2019-2020: Scale Changes Everything
GPT-2 (1.5B parameters) showed that a larger decoder-only model could generate surprisingly coherent long-form text. But the real inflection point was GPT-3 (175B parameters) in 2020. GPT-3 demonstrated something no one had seen before: in-context learning. You didn’t need to fine-tune the model at all — you could just describe the task in the prompt, provide a few examples, and the model would perform it. This was nearly a 1,500x increase in parameters over the original GPT, and it unlocked qualitatively different behavior.
2022-2023: Mainstream Adoption
ChatGPT (late 2022) wrapped GPT-3.5 in a conversational interface and became the fastest-growing consumer application in history. GPT-4 (2023) added multimodal capabilities and stronger reasoning.
Meanwhile, a pivotal moment for the open model ecosystem was unfolding. In February 2023, Meta released LLaMA — a family of models from 7B to 65B parameters — under a restricted, case-by-case research license. Within a week, the model weights were leaked via a torrent on 4chan and spread rapidly through online AI communities. Meta filed DMCA takedowns, but the cat was out of the bag.
What happened next was remarkable. The research community seized on LLaMA as a foundation and built an explosion of derivatives: Stanford’s Alpaca (instruction-tuned for $600), Vicuna, Koala, Dolly, and thousands more. Researchers got these models running on consumer laptops, demonstrated that small fine-tuned models could rival much larger ones on specific tasks, and collectively proved that open models could be a viable alternative to closed APIs. Over 7,000 LLaMA derivatives appeared on Hugging Face within months.
Meta read the room and leaned in. In July 2023, they officially released Llama 2 under a far more permissive license, explicitly endorsing commercial use. This decision — turning an accidental leak into a deliberate open-model strategy — catalyzed the entire open model ecosystem we see today. Without LLaMA’s leak and the community’s response, the competitive open model landscape of Llama 4, DeepSeek, Qwen, and Mistral might look very different.
2025-2026: The Current Landscape
Today we have a mature multi-provider ecosystem. On the closed side: GPT-5.4 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). On the open side: Qwen 3.5 and MiniMax M2.5 lead leaderboards, with Llama 4 (Meta), DeepSeek V3.2, and GLM-5 close behind. Many open models now use Mixture-of-Experts (MoE) architectures — Llama 4 Maverick has 400B total parameters but only activates 17B per token, making inference far more efficient. The gap between open and closed models has narrowed dramatically, and the choice between them is now driven by practical trade-offs rather than raw capability alone.
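To see why activating only a subset of parameters matters, here is a back-of-the-envelope sketch. It uses the standard approximation that a forward pass costs roughly 2 FLOPs per active weight per token; the function name and exact figures are ours, for illustration only.

```python
# Forward-pass compute per token scales with the number of *active*
# parameters (~2 FLOPs per active weight), so an MoE model only pays
# for the experts a token is actually routed to.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(400e9)   # hypothetical dense 400B model
moe = flops_per_token(17e9)      # MoE with 17B active params per token
print(f"dense 400B:      {dense:.1e} FLOPs/token")
print(f"MoE, 17B active: {moe:.1e} FLOPs/token")
print(f"ratio: {dense / moe:.1f}x")
```

At these numbers the MoE model does about 23.5x less compute per token than a dense model of the same total size, which is the whole point of the architecture.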
The key insight across this entire timeline? Same architecture. Same basic idea (predict the next token). Just... more. More parameters, more data, more compute. And that “more” turned out to be transformative.
Scaling Laws
Why Does Bigger Mean Better?
This observation — that scaling up produces better models — isn’t just an empirical accident. In 2020, Kaplan et al. at OpenAI published a landmark paper showing that language model performance follows remarkably predictable scaling laws. When you plot model loss against the number of parameters, the amount of training data, or the total compute on a log-log scale, you get approximately straight lines.
Figure 2: Language model loss decreases predictably as you scale up three levers: model size (N), dataset size (D), and compute (C). On a log-log plot, the relationship is approximately linear — a power law.
What this means in practice:
Double the parameters → loss decreases by a predictable amount
Double the training data → loss decreases by a predictable amount
Double the compute budget → loss decreases by a predictable amount
This predictability was transformative for the field. It meant that labs could invest billions of dollars in training larger models with reasonable confidence about what the result would be. You didn’t have to train a 175B model to know it would be better than a 13B model — the scaling curves told you in advance.
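The power-law form is simple enough to sketch directly. The snippet below uses the approximate parameter-scaling coefficients reported by Kaplan et al. (alpha_N ≈ 0.076, N_c ≈ 8.8e13); treat the exact numbers as illustrative, not authoritative.

```python
# Kaplan-style parameter scaling law: L(N) = (N_c / N) ** alpha_N.
# Coefficients are the approximate fitted values from Kaplan et al. (2020).
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at model size n_params."""
    return (N_C / n_params) ** ALPHA_N

# Every doubling of N multiplies the loss by the same constant factor,
# which is exactly the "predictable decrease" described above.
for n in [117e6, 1.5e9, 175e9]:
    print(f"N = {n:10.3g} -> predicted loss {predicted_loss(n):.3f}")
```

Because the law is a power law, the ratio predicted_loss(2N) / predicted_loss(N) is the constant 2^(-alpha_N), regardless of N: that constancy is what made billion-dollar training runs plannable.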
The Chinchilla Insight
There’s a subtlety, though. You have three levers to pull — model size (N), data (D), and compute (C) — and they’re related by the approximate relationship C ≈ 6 × N × D (where C is in FLOPs). Given a fixed compute budget, how should you allocate between a bigger model and more training data?
Early LLMs like GPT-3 went big on parameters but were undertrained — they didn’t see enough data relative to their size. In 2022, Hoffmann et al. at DeepMind published the “Chinchilla” paper, which showed that for a fixed compute budget, you should scale model size and data roughly equally — the principle of compute-optimal training. Their 70B parameter Chinchilla model, trained on more data, outperformed the much larger 280B Gopher model trained on less data.
This insight reshaped how models are trained. Llama 4, for example, uses Mixture-of-Experts with 400B+ total parameters trained on 15+ trillion tokens — far more data relative to model size than GPT-3 used.
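The Chinchilla allocation can be sketched in a few lines using the C ≈ 6 × N × D relationship from above, together with the widely quoted "~20 tokens per parameter" reading of the Hoffmann et al. result. The 20:1 ratio is a rule of thumb, not a universal constant.

```python
import math

# Compute-optimal allocation under C ≈ 6 * N * D, using the common
# "~20 tokens per parameter" rule of thumb from the Chinchilla paper.
TOKENS_PER_PARAM = 20

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into model size N (params) and data D (tokens)."""
    n = math.sqrt(c_flops / (6 * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# Chinchilla's own budget: roughly 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs
n, d = compute_optimal(5.9e23)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Plugging in Chinchilla’s own compute budget recovers roughly 70B parameters and 1.4T tokens, the configuration the paper actually trained.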
Emergent Capabilities
Perhaps the most fascinating consequence of scaling is the appearance of emergent capabilities — abilities that seem to switch on suddenly at certain scales. Small models can’t do chain-of-thought reasoning. Medium models do it poorly. Then, past a certain threshold, the ability is simply there.
Examples of emergent capabilities include:
In-context learning — performing new tasks from examples in the prompt (few-shot prompting)
Chain-of-thought reasoning — solving multi-step problems by “thinking out loud”
Code generation — writing working programs from natural language descriptions
Instruction following — understanding and executing complex, multi-part instructions
These emergent behaviors are why foundation models feel qualitatively different from their smaller predecessors. It’s not just “a little better at everything” — it’s “can do things the smaller model fundamentally couldn’t.”
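In-context learning, the first capability above, is easy to make concrete: the entire task specification lives in the prompt. A minimal few-shot prompt might be assembled like this (the labels, wording, and layout are illustrative, not a requirement of any particular model):

```python
# In-context learning in miniature: the "task" is defined entirely by
# the prompt text, with no fine-tuning or weight updates anywhere.
examples = [
    ("The movie was a masterpiece.", "positive"),
    ("I want my two hours back.", "negative"),
]
query = "An absolute joy from start to finish."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this string is what you would send to the model
```

A sufficiently large model completes the final "Sentiment:" with the right label; a small model trained on the same objective typically cannot, which is what makes this behavior emergent.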
Open vs. Closed Models
The foundation model landscape today is split into two camps, and understanding the trade-offs between them is critical for anyone building NLP applications.
Figure 3: Closed models offer the highest capability with minimal setup; open models offer control, privacy, and customizability. The right choice depends on your constraints.
Closed Models (API-Only)
Closed models — GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro — are accessible only through an API. You send your text, the model processes it on the provider’s servers, and you get a response back. You never see the model’s weights.
When to choose closed:
You need frontier capabilities (the best reasoning, the most nuanced generation)
You want to get started quickly without infrastructure investment
Your volume is low enough that per-token pricing is manageable
You don’t have strict data residency requirements
Open Models (Weights Available)
Open models — Qwen 3.5, MiniMax M2.5, Llama 4, DeepSeek V3.2 — release their weights publicly (with varying licenses). You download the model and run it on your own hardware, or use tools like Ollama to run them locally with a single command.
When to choose open:
Privacy is critical — data can’t leave your environment (healthcare, finance, legal)
You need to fine-tune the model on domain-specific data
You have high volume where per-token API pricing would be prohibitive
You need predictable latency and don’t want to depend on an external service
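The volume argument can be made concrete with a break-even sketch. Every number below is a hypothetical placeholder, not a real price or benchmark; plug in your own provider quotes and measured throughput.

```python
# Rough break-even sketch: per-token API pricing vs. self-hosted GPUs.
# All constants are assumed placeholder values, not real quotes.
api_price_per_mtok = 10.0        # $/million tokens via API (assumed)
gpu_cost_per_hour = 4.0          # $/GPU-hour, self-hosted (assumed)
throughput_tok_per_sec = 500.0   # sustained self-hosted throughput (assumed)

tokens_per_month = 5e9           # monthly workload (assumed)

api_cost = tokens_per_month / 1e6 * api_price_per_mtok
gpu_hours = tokens_per_month / throughput_tok_per_sec / 3600
self_hosted_cost = gpu_hours * gpu_cost_per_hour

print(f"API:         ${api_cost:,.0f}/month")
print(f"Self-hosted: ${self_hosted_cost:,.0f}/month")
```

At these assumed numbers, self-hosting is several times cheaper at 5B tokens/month, and the gap widens with volume. The sketch ignores fixed costs that matter in practice: engineering time, idle GPU capacity, and the capability gap between the models you can run and the frontier APIs.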
A Note on “Open”
The term “open” in the model world is nuanced. Some models (like Llama 4) release weights but have restrictive licenses for commercial use. Others (like DeepSeek V3.2, released under the MIT license) are fully permissive. Very few release training data and full training code — the way traditional open-source software does. When evaluating an “open” model, always check: What exactly is open? Weights? Code? Data? And under what license?
# Let's look at some open foundation models on the Hugging Face Hub
from huggingface_hub import model_info

# A selection of leading open models (as of early 2026)
model_names = [
    "Qwen/Qwen3.5-397B-A17B",
    "Qwen/Qwen3-32B",
    "deepseek-ai/DeepSeek-R1-0528",
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "microsoft/phi-4",
]

print(f"{'Model':<50} {'Released':>12} {'Downloads':>12} {'Likes':>8}")
print("-" * 87)
for name in model_names:
    try:
        m = model_info(name)
        released = m.created_at.strftime("%Y-%m-%d") if m.created_at else "N/A"
        downloads = f"{m.downloads:,}" if m.downloads else "N/A"
        likes = f"{m.likes:,}" if m.likes else "N/A"
        print(f"{m.id:<50} {released:>12} {downloads:>12} {likes:>8}")
    except Exception:
        print(f"{name:<50} {'(error)':>12}")

Model                                                  Released    Downloads    Likes
---------------------------------------------------------------------------------------
Qwen/Qwen3.5-397B-A17B                               2026-02-16    1,807,452    1,372
Qwen/Qwen3-32B                                       2025-04-27    4,583,432      670
deepseek-ai/DeepSeek-R1-0528                         2025-05-28      899,271    2,408
meta-llama/Llama-4-Scout-17B-16E-Instruct            2025-04-02      239,588    1,248
microsoft/phi-4                                      2024-12-11      948,265    2,220
These are all open models you can download and run locally (or via Ollama). The closed models (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) don’t appear on the Hub because their weights aren’t available — you interact with them only through APIs, which we’ll explore in Parts 03 and 04 of this week.
Wrap-Up
Key Takeaways
What’s Next
In Part 02, we’ll explore fine-tuning — the techniques that let you take a foundation model and customize it for your specific domain or task. We’ll cover full fine-tuning vs. parameter-efficient methods like LoRA and QLoRA, instruction tuning, and alignment with RLHF. This is where foundation models go from general-purpose to your-purpose.