Transformer Model Variants

CAP-6640: Computational Understanding of Natural Language
Spencer Lyon

Prerequisites

L06.01: The Transformer architecture — self-attention, multi-head attention, positional encoding, masking, and transformer blocks
L06.02: Attention from Scratch — hands-on implementation in PyTorch

Outcomes

Compare the three major transformer variants (encoder-only, decoder-only, encoder-decoder) in terms of attention patterns, pretraining objectives, and task alignment
Explain masked language modeling and how bidirectional context enables understanding tasks
Explain causal language modeling and how autoregressive generation works in decoder-only models
Describe how encoder-decoder models combine both mechanisms for sequence-to-sequence tasks
Select the appropriate architecture variant for a given NLP task and justify the choice

References

J&M Chapter 9: Masked Language Models
J&M Chapter 7: Large Language Models
HF Chapter 1: Transformer Models
Devlin et al. (2018) — BERT: Pre-training of Deep Bidirectional Transformers
Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
Raffel et al. (2020) — T5: Exploring the Limits of Transfer Learning
Kaplan et al. (2020) — Scaling Laws for Neural Language Models

One Architecture, Three Philosophies

Last week, we built the Transformer from the ground up — self-attention, multi-head attention, residual connections, the works. The original Transformer was designed for machine translation, and it used both an encoder and a decoder working together.

But here’s the thing: BERT doesn’t have a decoder. GPT doesn’t have an encoder. T5 uses both. Yet all three are called “Transformers.” How can that be?

The answer is surprisingly simple. The original Transformer has two halves, and researchers discovered that you can take either half (or both) and pretrain it on a massive corpus to create a powerful language model. The choice of which half you use determines what your model is naturally good at:

Use only the encoder → a model that excels at understanding text (classification, NER, similarity)
Use only the decoder → a model that excels at generating text (completions, conversations, creative writing)
Use both → a model that excels at transforming one text into another (translation, summarization)

This lecture explores each of these three families. We’ll see how a single architectural choice — which half of the Transformer you keep, and what you mask — leads to fundamentally different capabilities.

Figure 1: The original Transformer splits into three model families, each defined by which components are kept and what pretraining objective is used.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

Encoder-Only Models

The Key Idea: Bidirectional Context

Recall from Week 6 that the encoder side of the Transformer uses bidirectional attention — every token can attend to every other token in the sequence. There is no causal mask. When processing the word “bank” in “I sat by the river bank,” the model sees both the left context (“I sat by the river”) and the right context simultaneously. This is incredibly powerful for tasks that require understanding the meaning of text.

Encoder-only models take just this encoder stack and throw away the decoder entirely. The result is a model that builds rich, contextual representations of input text — representations that capture meaning in both directions.

But here’s a problem: if the model can see everything at once, how do we train it? We can’t use next-token prediction because the model would simply look ahead and copy the answer. We need a different pretraining objective.

Masked Language Modeling

The solution is Masked Language Modeling — a clever “fill-in-the-blank” game. During training, we randomly mask out about 15% of the tokens in each input and ask the model to predict them from the surrounding context.

For example, given the input:

“The cat [MASK] on the [MASK] because it was tired”

The model must predict that the first blank is “sat” and the second is “mat” — using all the surrounding context from both directions. This forces the model to build deep, bidirectional representations that capture the relationships between words.

Figure 2: Each [MASK] token is predicted using context from both directions, the hallmark of encoder-only pretraining.
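To make the masking step concrete, here is a minimal sketch in plain Python. The helper name `mask_tokens`, the fixed seed, and the choice to always substitute `[MASK]` are assumptions made for illustration; BERT's actual recipe also sometimes leaves a selected token unchanged or swaps in a random token.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for a fill-in-the-blank objective.

    labels[i] holds the original token at masked positions and None elsewhere,
    so only masked positions would contribute to the loss.
    """
    rng = random.Random(seed)
    # Mask at least one position so short inputs still yield a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = "The cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(sentence)
print(masked)
print(labels)
```

The `labels` list is mostly `None`: the model reads every token, but only the masked positions are scored.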

Let’s visualize the difference between the attention patterns. In the encoder, every token sees every other token. Compare this with the causal mask we saw last week, where each token can only see tokens that came before it:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

seq_len = 5
tokens = ["The", "cat", "sat", "on", "mat"]

# Encoder-only: bidirectional (full attention)
bidir_mask = torch.ones(seq_len, seq_len)
axes[0].imshow(bidir_mask, cmap="Blues", vmin=0, vmax=1)
axes[0].set_title("Encoder-Only\n(Bidirectional)", fontsize=13)
axes[0].set_xticks(range(seq_len))
axes[0].set_yticks(range(seq_len))
axes[0].set_xticklabels(tokens, fontsize=10)
axes[0].set_yticklabels(tokens, fontsize=10)
axes[0].set_xlabel("Attends to")
axes[0].set_ylabel("Token")

# Decoder-only: causal (lower triangular)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
axes[1].imshow(causal_mask, cmap="Oranges", vmin=0, vmax=1)
axes[1].set_title("Decoder-Only\n(Causal)", fontsize=13)
axes[1].set_xticks(range(seq_len))
axes[1].set_yticks(range(seq_len))
axes[1].set_xticklabels(tokens, fontsize=10)
axes[1].set_yticklabels(tokens, fontsize=10)
axes[1].set_xlabel("Attends to")
axes[1].set_ylabel("Token")

plt.tight_layout()
plt.show()

The blue grid on the left is all ones: every token can attend to every other token. The orange triangular matrix on the right enforces causal ordering: token $i$ can only attend to tokens $1, 2, \ldots, i$.

This single difference — the shape of the attention mask — is what separates the two model families.
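The plots show the masks themselves; the sketch below shows one common way a mask is applied inside attention. The random `scores` tensor is a stand-in for the scaled query-key products from last week, and all variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 5

# Stand-in for the raw attention scores Q @ K.T / sqrt(d_k)
scores = torch.randn(seq_len, seq_len)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Bidirectional: softmax over every position in the row
bidir_weights = F.softmax(scores, dim=-1)

# Causal: blocked positions are set to -inf so softmax gives them zero weight
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
causal_weights = F.softmax(masked_scores, dim=-1)

print(bidir_weights)
print(causal_weights)
```

Each row still sums to one in both cases; the causal version simply redistributes all of the weight onto the current and earlier positions.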

From BERT to RoBERTa

The first major encoder-only model was BERT (Bidirectional Encoder Representations from Transformers), published by Devlin et al. at Google in 2018. BERT was a landmark result: it achieved state-of-the-art on 11 NLP benchmarks simultaneously, and it demonstrated the power of bidirectional pretraining.

BERT came in two sizes:

Model	Layers	Hidden Size	Attention Heads	Parameters
BERT-base	12	768	12	110M
BERT-large	24	1024	16	340M

BERT’s pretraining used two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The NSP task trained the model to predict whether two sentences appeared consecutively in the original text. The idea was to help BERT understand relationships between sentences.

However, subsequent research found that BERT’s recipe could be significantly improved. In 2019, Liu et al. published RoBERTa (Robustly Optimized BERT Pretraining Approach), which kept BERT’s architecture but made several training improvements:

Dropped the NSP objective — it turned out to be unnecessary, and even slightly harmful
Trained on more data — 160GB vs. BERT’s 16GB of text
Trained for longer — more steps with larger batches
Dynamic masking — instead of masking the same tokens every epoch, the masking pattern changes each time the model sees a sequence

The result? RoBERTa matched or exceeded BERT’s performance on every benchmark, demonstrating that BERT was significantly undertrained. This makes RoBERTa the canonical encoder-only model we reference today.
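To illustrate the dynamic masking item from the list above, here is a simplified sketch. It assumes static masking means "choose positions once and reuse them", which glosses over preprocessing details; the helper name `sample_mask_positions` is hypothetical.

```python
import random

tokens = "The cat sat on the mat because it was tired".split()

def sample_mask_positions(n_tokens, rng, mask_prob=0.15):
    """Choose which positions to mask (at least one)."""
    k = max(1, round(n_tokens * mask_prob))
    return sorted(rng.sample(range(n_tokens), k))

rng = random.Random(0)
n_epochs = 4

# Static masking: positions are chosen once and reused every epoch.
static_positions = sample_mask_positions(len(tokens), rng)
static_epochs = [static_positions for _ in range(n_epochs)]

# Dynamic masking: positions are re-sampled each time the sequence is served.
dynamic_epochs = [sample_mask_positions(len(tokens), rng) for _ in range(n_epochs)]

print("static :", static_epochs)
print("dynamic:", dynamic_epochs)
```

With re-sampling, the model is asked about different blanks in the same sentence over the course of training, rather than memorizing one fixed set.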

What Encoder-Only Models Excel At

Because encoder-only models build bidirectional representations, they are naturally suited for tasks that require understanding the full context of a piece of text:

Text classification — sentiment analysis, spam detection, topic labeling (use the [CLS] token representation as input to a classification head)
Named Entity Recognition — label each token as person, organization, location, etc. (use per-token representations)
Extractive Question Answering — given a question and a passage, identify the span of text that answers the question (predict start and end positions)
Semantic similarity — determine how similar two sentences are (compare their [CLS] representations)

What they cannot do well is generate text. Because the model sees all positions simultaneously during training, it has no natural mechanism for producing tokens one at a time. You can’t ask RoBERTa to “continue this sentence” — it was trained to fill in blanks, not to write left to right.
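The task list above mentions two kinds of heads: one that reads the [CLS] representation and one that reads every token's representation. A minimal sketch of both follows, using a random tensor as a stand-in for encoder output. The sizes and label counts are arbitrary assumptions, not values from any particular model configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 2, 6, 768
num_labels, num_tags = 3, 5  # hypothetical label counts

# Stand-in for encoder output: one vector per token; position 0 is [CLS].
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Sequence-level task (e.g. sentiment): classify from the [CLS] vector only.
cls_head = nn.Linear(hidden_size, num_labels)
sequence_logits = cls_head(hidden_states[:, 0, :])

# Token-level task (e.g. NER): apply the head to every token's vector.
token_head = nn.Linear(hidden_size, num_tags)
token_logits = token_head(hidden_states)

print(sequence_logits.shape)
print(token_logits.shape)
```

In both cases the output is a fixed set of logits per sequence or per token, which is why these heads classify well but have no way to emit free-form text.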

Decoder-Only Models

The Key Idea: Causal Language Modeling

Decoder-only models take the opposite approach: they use only the decoder stack with causal masking, where each token can only attend to itself and all preceding tokens. This is the same lower-triangular mask we studied in Week 6.

The pretraining objective is causal language modeling: predicting the next token given all previous tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_n$, the model learns to predict:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1}) \tag{1}$$

for every position $t$ in the sequence. This is the classic language modeling task, the same one that N-gram models attempted decades ago — but now powered by the full expressiveness of the Transformer architecture.

Figure 3: Each token predicts the next token using only left context, the hallmark of decoder-only pretraining.
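In code, the next-token objective usually comes down to shifting the sequence by one position and applying cross-entropy. The sketch below uses random logits as a stand-in for a model's output; the vocabulary size and sequence length are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 50, 8

token_ids = torch.randint(0, vocab_size, (1, seq_len))
# Stand-in for model output: one distribution over the vocabulary per position
logits = torch.randn(1, seq_len, vocab_size)

# The logits at position t are scored against the token at position t + 1,
# so the last position has no target and the first token is never a target.
pred_logits = logits[:, :-1, :]
targets = token_ids[:, 1:]

loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```

Every position except the last contributes a term to the loss, which is the point picked up again in the discussion of training signal density.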

The GPT Lineage

The decoder-only story begins with GPT (Generative Pre-trained Transformer), published by Radford et al. at OpenAI in 2018 — the same year as BERT. The two papers represented competing bets on which half of the Transformer would prove more useful.

The GPT lineage tracks the rapid scaling of decoder-only models:

Model	Year	Parameters	Key Innovation
GPT	2018	117M	Showed that pretraining a decoder works for downstream tasks
GPT-2	2019	1.5B	Demonstrated coherent long-form text generation
GPT-3	2020	175B	Demonstrated in-context learning and few-shot prompting at scale
GPT-4	2023	Undisclosed	Multimodal input, advanced reasoning

Each generation grew dramatically in scale, and with that scale came qualitatively new capabilities. GPT-3 was particularly significant because it showed that a sufficiently large decoder-only model could perform tasks it was never explicitly trained for — simply by providing a few examples in the prompt. This in-context learning ability changed the field overnight.

Figure 4: The rapid scaling of decoder-only models from GPT (117M parameters) to GPT-4, with qualitatively new capabilities emerging at each scale.

Why Causal Masking Enables Generation

The beauty of causal masking is that it makes generation straightforward. During training, the model learns to predict each next token from its predecessors. At inference time, we generate text with the following loop:

Feed in a prompt: $x_1, x_2, \ldots, x_k$
The model predicts a distribution over the next token: $P(x_{k+1} \mid x_1, \ldots, x_k)$
Sample or pick the most likely token
Append it to the sequence and repeat

This autoregressive process generates text one token at a time, and it works seamlessly because the model was trained in exactly this left-to-right fashion.
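The loop above can be sketched in a few lines. To keep the example self-contained, the "model" here is a hypothetical stand-in: a random table of logits that depends only on the last token. A real decoder would condition on the entire prefix, but the generation loop around it looks the same.

```python
import torch

torch.manual_seed(0)
vocab_size = 10

# Hypothetical stand-in for a trained model: logits depend only on the last token.
bigram_logits = torch.randn(vocab_size, vocab_size)

def next_token_logits(token_ids):
    """Return logits over the next token given the sequence so far."""
    return bigram_logits[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=5, greedy=True):
    ids = list(prompt_ids)                       # start from the prompt
    for _ in range(max_new_tokens):
        probs = torch.softmax(next_token_logits(ids), dim=-1)
        if greedy:
            next_id = int(torch.argmax(probs))   # pick the most likely token
        else:
            next_id = int(torch.multinomial(probs, 1))  # or sample from the distribution
        ids.append(next_id)                      # append and repeat
    return ids

out = generate([3, 1, 4])
print(out)
```

Greedy decoding is deterministic; passing `greedy=False` samples instead, which is one simple way to get varied continuations from the same prompt.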

The Surprising Discovery: Understanding Through Generation

Here’s what caught the field by surprise: with enough scale, decoder-only models can also understand text. GPT-3 showed that you can do text classification, NER, translation, and many other “understanding” tasks simply by framing them as text generation problems.

Want sentiment analysis? Just prompt the model:

“Review: This movie was fantastic! Sentiment:”

And the model generates “Positive.” No fine-tuning needed — just a well-crafted prompt.

This prompt-based approach to NLP tasks is why decoder-only models have become so dominant. Instead of needing separate, fine-tuned models for each task (as with BERT/RoBERTa), a single large decoder-only model can handle almost anything through appropriate prompting. Models like GPT-4, Claude, and LLaMA are all decoder-only Transformers.
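The same prompt format extends naturally to few-shot use by prepending labeled examples. The helper below is a hypothetical illustration of how such a prompt string might be assembled; the example reviews are made up, and no model is called.

```python
def build_prompt(examples, query):
    """Format labeled examples plus one unlabeled query as a single prompt string."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    # The final block ends at "Sentiment:" so the model's next tokens are the label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

examples = [
    ("I want my money back.", "Negative"),
    ("An instant classic.", "Positive"),
]
prompt = build_prompt(examples, "This movie was fantastic!")
print(prompt)
```

Passing an empty `examples` list gives the zero-shot prompt shown earlier; the model's task is unchanged, only the amount of demonstration in the context differs.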

Why Decoder-Only Won (So Far)

The dominance of decoder-only models is one of the most important developments in modern NLP, and it wasn’t obvious in advance. BERT and GPT were both published in 2018, representing competing bets on which half of the Transformer mattered more. BERT won the initial benchmarks convincingly. But within five years, the field had decisively shifted to decoder-only. Why?

Several factors converged:

Training signal density. This is perhaps the most fundamental advantage. In causal language modeling, every token in the training data provides a learning signal: the model must predict token $t$ from tokens $1, \ldots, t-1$. With MLM (the encoder-only objective), only the ~15% of randomly masked tokens contribute to the loss. The remaining 85% provide context but are not themselves prediction targets. For the same training data, a decoder-only model therefore gets roughly 6–7x more prediction targets per pass through the corpus ($1 / 0.15 \approx 6.7$). When you’re training on trillions of tokens, that efficiency gap is enormous.
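The arithmetic behind that ratio is easy to check. The sequence length of 512 below is an assumption chosen for illustration; the ratio barely depends on it.

```python
seq_len = 512          # assumed sequence length, for illustration only
mlm_mask_rate = 0.15   # fraction of tokens masked under MLM

# Causal LM: every token except the first has a prefix to be predicted from.
clm_loss_positions = seq_len - 1

# MLM: only the masked positions are scored.
mlm_loss_positions = round(seq_len * mlm_mask_rate)

ratio = clm_loss_positions / mlm_loss_positions
print(clm_loss_positions, mlm_loss_positions, round(ratio, 2))
```

This counts prediction targets only; it says nothing about how informative each target is, which is a separate question.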

Predictable scaling. In 2020, Kaplan et al. at OpenAI discovered that decoder-only model performance follows remarkably clean power laws: as you increase model size, dataset size, or compute, the loss decreases along smooth, predictable curves. This was transformative — it meant labs could invest billions of dollars in larger models with reasonable confidence about the outcome. These scaling laws weren’t established for encoder-only or encoder-decoder models at comparable scales, so investment naturally concentrated on the architecture with the clearest roadmap.

The output bottleneck of encoders. Consider what happens when you scale an encoder-only model to 175B parameters. For a classification task, its output is still a single fixed-dimensional [CLS] vector (768 dimensions in BERT-base, 1024 in BERT-large, wider in a larger model but still one vector) that must be routed through a small, task-specific classification head. Even a massive encoder can only “answer” in the limited vocabulary of that head (e.g., “positive” vs. “negative”). A decoder-only model’s output is generated text, which can express anything: labels, translations, reasoning chains, code, poetry. This means decoder-only models can demonstrate new capabilities simply by generating new kinds of text, while encoder-only models are structurally limited to the tasks their heads were designed for.

Next-token prediction is a harder (and richer) objective. MLM asks: “given all the words around this blank, what word goes here?” The bidirectional context heavily constrains the answer — there’s usually only one or a few words that fit. Causal language modeling asks: “given only what came before, what comes next?” This is a fundamentally harder problem. With only left context, the model must build deeper internal models of language structure, world knowledge, and reasoning to make good predictions. It can’t “cheat” by looking at words to the right. At scale, this harder objective forces the model to develop richer internal representations.

Generation naturally supports in-context learning. When a user puts examples in a prompt, a decoder-only model processes them left-to-right, building up an internal representation of “what task is being demonstrated.” Each subsequent prediction is conditioned on all the examples that came before. This sequential conditioning is the mechanism behind in-context learning — and it simply doesn’t exist in encoder-only models, which process all tokens simultaneously and have no concept of “now generate an answer to what came before.”

Architectural simplicity. A decoder-only model is a single stack of Transformer blocks with a causal mask. No cross-attention layers, no separate encoder and decoder to coordinate. This simplicity has real engineering benefits: easier parallelization across GPUs, more straightforward inference optimization (particularly KV-caching, where previously computed key-value pairs are reused as each new token is generated), and simpler codebases to maintain and debug at scale.
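KV-caching is easiest to appreciate by checking that it changes nothing about the result. The sketch below uses a single attention head with random weights (all sizes and names are illustrative assumptions) and compares token-by-token attention with a cache against full causal attention recomputed from scratch.

```python
import math
import torch

torch.manual_seed(0)
d, n_tokens = 16, 6
W_q, W_k, W_v = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
x = torch.randn(n_tokens, d)  # token embeddings, arriving one at a time

def attend(q, K, V):
    """Attention output for one query over all cached keys and values."""
    weights = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
    return weights @ V

# With a cache: project only the newest token, append its key/value, reuse the rest.
k_cache, v_cache, cached_outputs = [], [], []
for t in range(n_tokens):
    k_cache.append(x[t] @ W_k)
    v_cache.append(x[t] @ W_v)
    q_t = x[t] @ W_q
    cached_outputs.append(attend(q_t, torch.stack(k_cache), torch.stack(v_cache)))
cached_outputs = torch.stack(cached_outputs)

# Without a cache: recompute full causal attention over the whole sequence.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / math.sqrt(d)
scores = scores.masked_fill(torch.tril(torch.ones(n_tokens, n_tokens)) == 0, float("-inf"))
full_outputs = torch.softmax(scores, dim=-1) @ V

print(torch.allclose(cached_outputs, full_outputs, atol=1e-5))
```

The cache works because of the causal mask: earlier tokens never attend to later ones, so their keys and values do not change as the sequence grows.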

These factors are mutually reinforcing. Training efficiency → more investment in scaling → scaling reveals emergent capabilities unique to the generation paradigm → more interest → more investment. By the time the field realized how powerful scaled decoder-only models could be, the gap in research attention (and funding) had become self-perpetuating.

Exercise 7.2: Attention Masks

Consider the sentence “I love NLP” (3 tokens).

(a) Write out the $3 \times 3$ attention mask matrix for an encoder-only model (bidirectional attention). Which positions can token 2 (“love”) attend to?

(b) Write out the $3 \times 3$ attention mask matrix for a decoder-only model (causal attention). Which positions can token 2 (“love”) attend to now?

(c) In the encoder-only model, token 2 sees the word “NLP” when building its representation. In the decoder-only model, it does not. How might this affect the model’s ability to understand that “love” in this sentence refers to enjoying something (rather than romantic love)? Give a scenario where the right context would change the interpretation of a word.

import torch

# Verify your answers:
# Bidirectional mask (encoder-only)
bidir = torch.ones(3, 3)
print("Encoder-only mask:\n", bidir)

# Causal mask (decoder-only)
causal = torch.tril(torch.ones(3, 3))
print("Decoder-only mask:\n", causal)

Encoder-Decoder Models

The Key Idea: Understand Then Generate

Encoder-decoder models keep the full original Transformer architecture: an encoder that processes the input with bidirectional attention, and a decoder that generates the output autoregressively with causal masking. The two halves are connected through cross-attention — the decoder attends to the encoder’s output representations when generating each token.

This architecture is a natural fit for tasks where the input and output are different sequences. The encoder builds a rich, bidirectional understanding of the input, and the decoder uses that understanding to generate a new sequence.

Figure 5: The encoder builds bidirectional representations of the input; the decoder generates the output autoregressively, attending to the encoder’s representations via cross-attention.
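Cross-attention is ordinary scaled dot-product attention in which the queries come from one sequence and the keys and values from another. Here is a single-head sketch with random tensors standing in for real encoder and decoder states; the dimensions are arbitrary assumptions.

```python
import math
import torch

torch.manual_seed(0)
d_model, src_len, tgt_len = 32, 7, 4

encoder_output = torch.randn(src_len, d_model)  # one vector per input token
decoder_hidden = torch.randn(tgt_len, d_model)  # one vector per output token so far

W_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
W_v = torch.randn(d_model, d_model) / math.sqrt(d_model)

Q = decoder_hidden @ W_q  # queries come from the decoder
K = encoder_output @ W_k  # keys and values come from the encoder
V = encoder_output @ W_v

# No causal mask here: every output position may look at every input position.
weights = torch.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)
context = weights @ V

print(weights.shape, context.shape)
```

Note that the attention matrix is rectangular (output length by input length), which is how the two sequences are allowed to differ in length.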

T5: Everything is Text-to-Text

The most influential encoder-decoder model is T5 (Text-to-Text Transfer Transformer), published by Raffel et al. at Google in 2020. T5’s key innovation was a unifying idea: every NLP task can be framed as transforming one text string into another.

Instead of adding task-specific heads (like BERT does for classification vs. NER vs. QA), T5 prepends a task prefix to the input and generates the output as text:

Task	Input	Output
Translation	“translate English to German: That is good”	“Das ist gut”
Summarization	“summarize: The lengthy article about...”	“Article discusses...”
Classification	“classify sentiment: This movie was great!”	“positive”
QA	“question: What is the capital? context: France’s capital is Paris.”	“Paris”

This text-to-text framing is elegant: one model, one format, many tasks. T5 was pretrained using a variation of masked language modeling called span corruption — instead of masking individual tokens, it masks contiguous spans of text, and the model must generate the missing spans.
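Span corruption is easier to see on a concrete sentence. In this sketch the spans are hand-picked rather than randomly sampled, the function name `span_corrupt` is hypothetical, and the sentinel strings follow the `<extra_id_N>` naming used by common T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # the span collapses to one sentinel
        target.append(sentinel)                 # the target names the sentinel...
        target.extend(tokens[start:end])        # ...followed by the missing tokens
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # final sentinel closes the target
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The encoder reads the corrupted input bidirectionally, and the decoder generates only the short target, not the whole sentence.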

BART: A Denoising Perspective

Another notable encoder-decoder model is BART (Bidirectional and Auto-Regressive Transformers), published by Lewis et al. at Facebook in 2019. BART takes a different angle on pretraining: it corrupts the input text in various ways (token masking, token deletion, text infilling, sentence permutation, document rotation) and trains the model to reconstruct the original. This denoising autoencoder approach makes BART particularly strong at text generation tasks like abstractive summarization.
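A few of these corruptions are simple enough to sketch directly. The functions below are toy versions written for illustration, with hypothetical names and an arbitrary deletion probability; they are not BART's actual preprocessing code.

```python
import random

def delete_tokens(tokens, rng, p=0.2):
    """Token deletion: drop each token independently with probability p."""
    return [tok for tok in tokens if rng.random() >= p]

def permute_sentences(sentences, rng):
    """Sentence permutation: shuffle sentence order, keeping each sentence intact."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def rotate_document(tokens, rng):
    """Document rotation: pick a random start token and wrap the prefix to the end."""
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
sentences = ["The cat sat on the mat .", "It was tired .", "Then it slept ."]
tokens = " ".join(sentences).split()

print(delete_tokens(tokens, rng))
print(permute_sentences(sentences, rng))
print(rotate_document(tokens, rng))
```

In each case the training target is the original, uncorrupted text, so the decoder always practices producing fluent output.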

When to Use Encoder-Decoder

Encoder-decoder models shine when:

The input and output are structurally different — different lengths, different languages, different formats
You need strong understanding of the input (bidirectional encoder) combined with fluent generation of the output (autoregressive decoder)
The task naturally decomposes into “read this” then “write that” — machine translation, summarization, generative QA

The trade-off is complexity: encoder-decoder models have roughly twice the parameters of a single-stack model at the same depth (because you’re maintaining two separate stacks plus cross-attention layers). As decoder-only models have grown powerful enough to handle many of these tasks through prompting alone, the practical advantages of encoder-decoder models have narrowed. But for tasks like translation, where the structural separation between input and output is clear, they remain a strong choice.
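The "roughly twice" figure can be sanity-checked by counting parameters in PyTorch's built-in layers. This sketch assumes the layer sizes of the original Transformer base configuration, ignores embeddings and output projections, and treats an encoder-style layer as a proxy for one block of a single-stack model.

```python
import torch.nn as nn

d_model, nhead, dim_ff, depth = 512, 8, 2048, 6

def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# One block of a single-stack model: self-attention + feed-forward
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_ff)
# One decoder block of an encoder-decoder model: self-attention + cross-attention + feed-forward
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_ff)

single_stack = depth * count_params(enc_layer)
two_stacks = depth * (count_params(enc_layer) + count_params(dec_layer))
ratio = two_stacks / single_stack

print(f"single stack: {single_stack:,}")
print(f"two stacks:   {two_stacks:,}")
print(f"ratio:        {ratio:.2f}")
```

Under these assumptions the ratio comes out somewhat above 2, with the excess coming from the cross-attention sublayer in each decoder block.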

The Big Picture

Let’s bring everything together. The three Transformer variants are defined by two choices: which components of the original architecture you keep, and what pretraining objective you use.

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

tokens = ["T₁", "T₂", "T₃", "T₄"]
n = len(tokens)

# Encoder-only: bidirectional
bidir = torch.ones(n, n)
axes[0].imshow(bidir, cmap="Blues", vmin=0, vmax=1.3)
axes[0].set_title("Encoder-Only\n(BERT, RoBERTa)", fontsize=12, fontweight="bold")
axes[0].set_xticks(range(n)); axes[0].set_yticks(range(n))
axes[0].set_xticklabels(tokens); axes[0].set_yticklabels(tokens)

# Decoder-only: causal
causal = torch.tril(torch.ones(n, n))
axes[1].imshow(causal, cmap="Oranges", vmin=0, vmax=1.3)
axes[1].set_title("Decoder-Only\n(GPT, LLaMA, Claude)", fontsize=12, fontweight="bold")
axes[1].set_xticks(range(n)); axes[1].set_yticks(range(n))
axes[1].set_xticklabels(tokens); axes[1].set_yticklabels(tokens)

# Encoder-decoder: schematic only, since one grid cannot show both attention patterns.
# The lower triangle (including the diagonal) is fully shaded like the causal panel;
# the upper triangle gets partial shading to indicate cross-attention.
ones = torch.ones(n, n)
enc_dec = torch.tril(ones) + 0.5 * torch.triu(ones, diagonal=1)
axes[2].imshow(enc_dec, cmap="Greens", vmin=0, vmax=1.3)
axes[2].set_title("Encoder-Decoder\n(T5, BART)", fontsize=12, fontweight="bold")
axes[2].set_xticks(range(n)); axes[2].set_yticks(range(n))
axes[2].set_xticklabels(tokens); axes[2].set_yticklabels(tokens)

for ax in axes:
    ax.set_xlabel("Attends to →")
    ax.set_ylabel("← Token")

plt.tight_layout()
plt.show()

Architecture Comparison

Table 1: Transformer Model Variants Compared

	Encoder-Only	Decoder-Only	Encoder-Decoder
Architecture	Encoder stack only	Decoder stack only	Both stacks + cross-attention
Attention	Bidirectional (full)	Causal (left-to-right)	Bidirectional encoder, causal decoder
Pretraining	Masked Language Modeling	Next-token prediction	Span corruption or denoising
Strengths	Understanding, classification, extraction	Generation, few-shot, in-context learning	Sequence-to-sequence transformation
Limitations	Not designed for text generation	Misses right context at each position	More parameters, higher complexity
Canonical Models	BERT, RoBERTa, DeBERTa	GPT-3/4, LLaMA, Claude, Gemini	T5, BART, mBART
Use When...	You need to classify, extract, or compare	You need to generate or converse	Input → different output (translate, summarize)

Figure 6: A practical decision tree for choosing the right transformer architecture based on your NLP task.
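As a rough companion to the table, the "Use When..." row can be written down as a lookup. The task names and the function `suggest_architecture` are hypothetical labels invented for this sketch, and the mapping is a coarse rule of thumb rather than a rule: real choices also depend on budget, latency, and data.

```python
def suggest_architecture(task):
    """Map a task name to the family suggested by Table 1 (a coarse rule of thumb)."""
    understanding = {"classification", "ner", "extractive_qa", "similarity"}
    seq2seq = {"translation", "summarization"}
    generation = {"open_ended_generation", "chat", "few_shot_prompting"}

    if task in understanding:
        return "encoder-only"      # classify, extract, or compare
    if task in seq2seq:
        return "encoder-decoder"   # input mapped to a different output sequence
    if task in generation:
        return "decoder-only"      # generate or converse
    raise ValueError(f"unknown task: {task!r}")

for task in ["ner", "translation", "chat"]:
    print(task, "->", suggest_architecture(task))
```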

The Trend: Decoder-Only Dominance

If you look at the most capable models today — GPT-4, Claude, LLaMA, Gemini — they are all decoder-only. This doesn’t mean encoder-only and encoder-decoder models are obsolete. Rather, the landscape has evolved:

Encoder-only models (RoBERTa, DeBERTa) are still the go-to for efficient, task-specific deployments. If you need a fast sentiment classifier or NER system, fine-tuning a relatively small encoder-only model is far more cost-effective than running every request through a massive LLM.
Encoder-decoder models (T5, BART) remain competitive for structured transformation tasks like translation and summarization, especially in research settings.
Decoder-only models dominate when you want a general-purpose model that handles many tasks through prompting, or when you need open-ended generation.

The key takeaway is that architecture choice is a design decision driven by your task, your constraints, and your deployment context. Understanding all three variants lets you make that decision wisely.

Wrap-Up

Key Takeaways

The original Transformer has two halves — an encoder and a decoder — and modern models use one, the other, or both, creating three distinct families with different strengths
Encoder-only models (RoBERTa, BERT) use bidirectional attention and are pretrained with Masked Language Modeling — they excel at understanding tasks like classification, NER, and extractive QA
RoBERTa improved on BERT by training longer, on more data, with dynamic masking, and by dropping the unnecessary Next Sentence Prediction objective
Decoder-only models (GPT, LLaMA, Claude) use causal attention and are pretrained with next-token prediction — they excel at text generation and, at sufficient scale, can handle virtually any task through prompting
Encoder-decoder models (T5, BART) combine bidirectional understanding with autoregressive generation — they are naturally suited for sequence-to-sequence tasks like translation and summarization
The attention mask is the key differentiator: bidirectional (see everything) vs. causal (see only the past) — this single design choice determines what tasks the model can perform
Architecture choice is a design decision — encoder-only for efficient task-specific deployment, decoder-only for general-purpose versatility, encoder-decoder for structured transformations

What’s Next

In Part 02, we’ll get hands-on with the Hugging Face ecosystem — the toolkit that makes all three of these model families accessible through a unified Python API. You’ll learn to load pretrained models and tokenizers from the Hugging Face Hub, use high-level pipelines for rapid prototyping, and work with the Datasets library for loading and processing data. The conceptual understanding of model variants you’ve built today will directly inform which models you reach for and why.