
Transformer Model Variants

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


One Architecture, Three Philosophies

Last week, we built the Transformer from the ground up: self-attention, multi-head attention, residual connections, the works. The original Transformer was designed for machine translation, and it used both an encoder and a decoder working together.

But here’s the thing: BERT doesn’t have a decoder. GPT doesn’t have an encoder. T5 uses both. Yet all three are called “Transformers.” How can that be?

The answer is surprisingly simple. The original Transformer has two halves, and researchers discovered that you can take either half (or both) and pretrain it on a massive corpus to create a powerful language model. The choice of which half you use determines what your model is naturally good at: keep the encoder and you get a model built for understanding; keep the decoder and you get a model built for generation; keep both and you get a model built for transforming one sequence into another.

This lecture explores each of these three families. We’ll see how a single architectural choice — which half of the Transformer you keep, and what you mask — leads to fundamentally different capabilities.

Figure 1: The original Transformer splits into three model families, each defined by which components are kept and what pretraining objective is used.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

Encoder-Only Models

The Key Idea: Bidirectional Context

Recall from Week 6 that the encoder side of the Transformer uses bidirectional attention — every token can attend to every other token in the sequence. There is no causal mask. When processing the word “bank” in “I sat by the river bank,” the model sees both the left context (“I sat by the river”) and the right context simultaneously. This is incredibly powerful for tasks that require understanding the meaning of text.

Encoder-only models take just this encoder stack and throw away the decoder entirely. The result is a model that builds rich, contextual representations of input text — representations that capture meaning in both directions.

But here’s a problem: if the model can see everything at once, how do we train it? We can’t use next-token prediction because the model would simply look ahead and copy the answer. We need a different pretraining objective.

Masked Language Modeling

The solution is Masked Language Modeling — a clever “fill-in-the-blank” game. During training, we randomly mask out about 15% of the tokens in each input and ask the model to predict them from the surrounding context.

For example, given the input:

“The cat [MASK] on the [MASK] because it was tired”

The model must predict that the first blank is “sat” and the second is “mat” — using all the surrounding context from both directions. This forces the model to build deep, bidirectional representations that capture the relationships between words.

Figure 2: Each [MASK] token is predicted using context from both directions — the hallmark of encoder-only pretraining.
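The masking recipe above can be sketched in a few lines. The sketch uses BERT's published proportions (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged), but the tensor shapes, vocabulary size, and mask token id below are illustrative assumptions, not tied to any real tokenizer:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15, generator=None):
    """Prepare (inputs, labels) for masked language modeling (simplified BERT recipe)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob), generator=generator).bool()
    labels[~selected] = -100  # unselected positions are ignored by the loss

    # 80% of selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8), generator=generator).bool() & selected
    input_ids[to_mask] = mask_token_id

    # 10% -> a random token (half of the remaining 20%); the final 10% stay unchanged
    to_random = (torch.bernoulli(torch.full(input_ids.shape, 0.5), generator=generator).bool()
                 & selected & ~to_mask)
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    return input_ids, labels

g = torch.Generator().manual_seed(0)
ids = torch.randint(5, 100, (2, 16), generator=g)
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=100, generator=g)
```

The 10% random / 10% unchanged trick keeps the model from relying on the literal [MASK] token, which never appears at fine-tuning or inference time.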

Let’s visualize the difference between the attention patterns. In the encoder, every token sees every other token. Compare this with the causal mask we saw last week, where each token can only see tokens that came before it:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

seq_len = 5
tokens = ["The", "cat", "sat", "on", "mat"]

# Encoder-only: bidirectional (full attention)
bidir_mask = torch.ones(seq_len, seq_len)
axes[0].imshow(bidir_mask, cmap="Blues", vmin=0, vmax=1)
axes[0].set_title("Encoder-Only\n(Bidirectional)", fontsize=13)
axes[0].set_xticks(range(seq_len))
axes[0].set_yticks(range(seq_len))
axes[0].set_xticklabels(tokens, fontsize=10)
axes[0].set_yticklabels(tokens, fontsize=10)
axes[0].set_xlabel("Attends to")
axes[0].set_ylabel("Token")

# Decoder-only: causal (lower triangular)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
axes[1].imshow(causal_mask, cmap="Oranges", vmin=0, vmax=1)
axes[1].set_title("Decoder-Only\n(Causal)", fontsize=13)
axes[1].set_xticks(range(seq_len))
axes[1].set_yticks(range(seq_len))
axes[1].set_xticklabels(tokens, fontsize=10)
axes[1].set_yticklabels(tokens, fontsize=10)
axes[1].set_xlabel("Attends to")
axes[1].set_ylabel("Token")

plt.tight_layout()
plt.show()

The blue grid on the left is all ones — every token can attend to every other token. The orange triangular matrix on the right enforces causal ordering — token $i$ can only attend to tokens $1, 2, \ldots, i$.

This single difference — the shape of the attention mask — is what separates the two model families.

From BERT to RoBERTa

The first major encoder-only model was BERT (Bidirectional Encoder Representations from Transformers), published by Devlin et al. at Google in 2018. BERT was a landmark result: it achieved state-of-the-art on 11 NLP benchmarks simultaneously, and it demonstrated the power of bidirectional pretraining.

BERT came in two sizes:

| Model | Layers | Hidden Size | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |

BERT’s pretraining used two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The NSP task trained the model to predict whether two sentences appeared consecutively in the original text. The idea was to help BERT understand relationships between sentences.

However, subsequent research found that BERT’s recipe could be significantly improved. In 2019, Liu et al. published RoBERTa (Robustly Optimized BERT Pretraining Approach), which kept BERT’s architecture but made several training improvements:

  1. Dropped the NSP objective — it turned out to be unnecessary, and even slightly harmful

  2. Trained on more data — 160GB vs. BERT’s 16GB of text

  3. Trained for longer — more steps with larger batches

  4. Dynamic masking — instead of masking the same tokens every epoch, the masking pattern changes each time the model sees a sequence

The result? RoBERTa matched or exceeded BERT’s performance on every benchmark, demonstrating that BERT was significantly undertrained. This makes RoBERTa the canonical encoder-only model we reference today.

What Encoder-Only Models Excel At

Because encoder-only models build bidirectional representations, they are naturally suited for tasks that require understanding the full context of a piece of text: classification, named entity recognition, extractive question answering, and comparing texts for semantic similarity.

What they cannot do well is generate text. Because the model sees all positions simultaneously during training, it has no natural mechanism for producing tokens one at a time. You can’t ask RoBERTa to “continue this sentence” — it was trained to fill in blanks, not to write left to right.


Decoder-Only Models

The Key Idea: Causal Language Modeling

Decoder-only models take the opposite approach: they use only the decoder stack with causal masking, where each token can only attend to itself and all preceding tokens. This is the same lower-triangular mask we studied in Week 6.

The pretraining objective is causal language modeling — predicting the next token given all previous tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_n$, the model learns to predict:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

for every position $t$ in the sequence. This is the classic language modeling task, the same one that N-gram models attempted decades ago — but now powered by the full expressiveness of the Transformer architecture.
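In code, this objective is ordinary cross-entropy with the targets shifted by one position. In the sketch below, random logits stand in for a real decoder's output, and the vocabulary and sequence sizes are toy values:

```python
import torch
import torch.nn.functional as F

# Toy setup: a batch of token ids and "model" logits (random here, standing
# in for a real decoder's output). Shapes are illustrative.
torch.manual_seed(0)
vocab_size, seq_len = 50, 8
tokens = torch.randint(vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)

# Causal LM loss: position t's logits predict token t+1, so shift by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)  # predictions at positions 1..n-1
target = tokens[:, 1:].reshape(-1)                # the "next tokens"
loss = F.cross_entropy(pred, target)
print(loss.item())
```

Note that every position (after the first) contributes a target here — a point that becomes important when we compare training signal density with MLM below.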

Figure 3: Each token predicts the next token using only left context — the hallmark of decoder-only pretraining.

The GPT Lineage

The decoder-only story begins with GPT (Generative Pre-trained Transformer), published by Radford et al. at OpenAI in 2018 — the same year as BERT. The two papers represented competing bets on which half of the Transformer would prove more useful.

The GPT lineage tracks the rapid scaling of decoder-only models:

| Model | Year | Parameters | Key Innovation |
|---|---|---|---|
| GPT | 2018 | 117M | Showed that pretraining a decoder works for downstream tasks |
| GPT-2 | 2019 | 1.5B | Demonstrated coherent long-form text generation |
| GPT-3 | 2020 | 175B | Discovered in-context learning and few-shot prompting |
| GPT-4 | 2023 | Undisclosed | Multimodal input, advanced reasoning |

Each generation grew dramatically in scale, and with that scale came qualitatively new capabilities. GPT-3 was particularly significant because it showed that a sufficiently large decoder-only model could perform tasks it was never explicitly trained for — simply by providing a few examples in the prompt. This in-context learning ability changed the field overnight.

Figure 4: The rapid scaling of decoder-only models from GPT (117M parameters) to GPT-4, with qualitatively new capabilities emerging at each scale.

Why Causal Masking Enables Generation

The beauty of causal masking is that it makes generation straightforward. During training, the model learns to predict each next token from its predecessors. At inference time, we generate text as follows:

  1. Feed in a prompt: $x_1, x_2, \ldots, x_k$

  2. Let the model predict a distribution over the next token: $P(x_{k+1} \mid x_1, \ldots, x_k)$

  3. Sample from that distribution, or pick the most likely token

  4. Append the chosen token to the sequence and repeat

This autoregressive process generates text one token at a time, and it works seamlessly because the model was trained in exactly this left-to-right fashion.
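The four-step loop above can be sketched with a stand-in model. The `toy_model` below is a hypothetical function (not a real LM) that deterministically scores token `(last + 1) % 10` highest, so greedy decoding simply counts upward — which makes the loop's behavior easy to verify:

```python
def toy_model(ids):
    """Stand-in for a decoder-only LM: returns scores over a 10-token
    vocabulary, deterministically favoring (last_token + 1) % 10."""
    scores = [0.0] * 10
    scores[(ids[-1] + 1) % 10] = 1.0
    return scores

def generate(prompt_ids, n_new, model=toy_model):
    ids = list(prompt_ids)                  # step 1: start from the prompt
    for _ in range(n_new):
        scores = model(ids)                 # step 2: next-token distribution
        next_id = max(range(len(scores)), key=scores.__getitem__)  # step 3: greedy pick
        ids.append(next_id)                 # step 4: append and repeat
    return ids

print(generate([3, 4], 4))  # -> [3, 4, 5, 6, 7, 8]
```

Swapping the greedy `max` for sampling from the score distribution gives the stochastic decoding that real systems use.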

The Surprising Discovery: Understanding Through Generation

Here’s what caught the field by surprise: with enough scale, decoder-only models can also understand text. GPT-3 showed that you can do text classification, NER, translation, and many other “understanding” tasks simply by framing them as text generation problems.

Want sentiment analysis? Just prompt the model:

“Review: This movie was fantastic! Sentiment:”

And the model generates “Positive.” No fine-tuning needed — just a well-crafted prompt.

This prompt-based approach to NLP tasks is why decoder-only models have become so dominant. Instead of needing separate, fine-tuned models for each task (as with BERT/RoBERTa), a single large decoder-only model can handle almost anything through appropriate prompting. Models like GPT-4, Claude, and LLaMA are all decoder-only Transformers.
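Mechanically, a few-shot prompt of the kind GPT-3 popularized is just string construction. The reviews below are invented examples for illustration:

```python
# Build a few-shot sentiment prompt: the task is demonstrated in-context,
# never trained on. Example reviews are made up.
examples = [
    ("This movie was fantastic!", "Positive"),
    ("I want my money back.", "Negative"),
]
query = "A beautiful, moving film."

prompt = "\n".join(f"Review: {review}\nSentiment: {label}" for review, label in examples)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)  # the model's generated continuation is the "classification"
```

The model's next generated token ("Positive" here, presumably) serves as the label — no task-specific head, no fine-tuning.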

Why Decoder-Only Won (So Far)

The dominance of decoder-only models is one of the most important developments in modern NLP — and it wasn’t obvious in advance. In 2018, BERT and GPT were published in the same year, representing competing bets on which half of the Transformer mattered more. BERT won the initial benchmarks convincingly. But within five years, the field had decisively shifted to decoder-only. Why?

Several factors converged:

Training signal density. This is perhaps the most fundamental advantage. In causal language modeling, every token in the training data provides a learning signal — the model must predict token $t$ from tokens $1, \ldots, t-1$. With MLM (the encoder-only objective), only the ~15% of randomly masked tokens contribute to the loss. The remaining 85% provide context but don’t directly drive weight updates. For the same training data, a decoder-only model extracts roughly 6–7x more gradient signal per pass through the corpus. When you’re training on trillions of tokens, that efficiency gap is enormous.
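The 6–7x figure is back-of-envelope arithmetic, sketched here for a hypothetical 1,000-token training sequence:

```python
# Supervised prediction targets per training sequence (illustrative numbers).
seq_len = 1000                      # tokens in one training sequence
mlm_targets = int(0.15 * seq_len)   # MLM: only the ~15% masked tokens are predicted
clm_targets = seq_len - 1           # CLM: every token after the first is a target

print(clm_targets / mlm_targets)    # ~6.7x more supervised predictions per pass
```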

Predictable scaling. In 2020, Kaplan et al. at OpenAI discovered that decoder-only model performance follows remarkably clean power laws: as you increase model size, dataset size, or compute, the loss decreases along smooth, predictable curves. This was transformative — it meant labs could invest billions of dollars in larger models with reasonable confidence about the outcome. These scaling laws weren’t established for encoder-only or encoder-decoder models at comparable scales, so investment naturally concentrated on the architecture with the clearest roadmap.

The output bottleneck of encoders. Consider what happens when you scale an encoder-only model to 175B parameters. Its output is still a fixed-dimensional vector — a 768- or 1024-dimensional [CLS] representation — that must be routed through a small, task-specific classification head. Even a massive encoder can only “answer” in the limited vocabulary of that head (e.g., “positive” vs. “negative”). A decoder-only model’s output is generated text, which can express anything: labels, translations, reasoning chains, code, poetry. This means decoder-only models can demonstrate new capabilities simply by generating new kinds of text, while encoder-only models are structurally limited to the tasks their heads were designed for.

Next-token prediction is a harder (and richer) objective. MLM asks: “given all the words around this blank, what word goes here?” The bidirectional context heavily constrains the answer — there’s usually only one or a few words that fit. Causal language modeling asks: “given only what came before, what comes next?” This is a fundamentally harder problem. With only left context, the model must build deeper internal models of language structure, world knowledge, and reasoning to make good predictions. It can’t “cheat” by looking at words to the right. At scale, this harder objective forces the model to develop richer internal representations.

Generation naturally supports in-context learning. When a user puts examples in a prompt, a decoder-only model processes them left-to-right, building up an internal representation of “what task is being demonstrated.” Each subsequent prediction is conditioned on all the examples that came before. This sequential conditioning is the mechanism behind in-context learning — and it simply doesn’t exist in encoder-only models, which process all tokens simultaneously and have no concept of “now generate an answer to what came before.”

Architectural simplicity. A decoder-only model is a single stack of Transformer blocks with a causal mask. No cross-attention layers, no separate encoder and decoder to coordinate. This simplicity has real engineering benefits: easier parallelization across GPUs, more straightforward inference optimization (particularly KV-caching, where previously computed key-value pairs are reused as each new token is generated), and simpler codebases to maintain and debug at scale.
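The KV-caching idea mentioned above can be sketched with single-head attention (toy dimensions, random tensors, projections omitted): when a new token arrives, only its key and value are computed and appended to the cache, and attention for the new token runs over the whole cache:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
# Keys/values already computed for the 5 tokens generated so far (the "cache")
k_cache = torch.randn(5, d)
v_cache = torch.randn(5, d)

# A new token arrives: compute only ITS query/key/value (random stand-ins here)
# and append to the cache, instead of recomputing K and V for the whole sequence.
q_new, k_new, v_new = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
k_cache = torch.cat([k_cache, k_new])
v_cache = torch.cat([v_cache, v_new])

# Attention for the new token over all cached positions
scores = q_new @ k_cache.T / d ** 0.5
out = F.softmax(scores, dim=-1) @ v_cache
print(out.shape)
```

Per generated token, the cost of attention grows with the cache length, but the expensive K/V computation for earlier tokens is never repeated.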

These factors are mutually reinforcing. Training efficiency → more investment in scaling → scaling reveals emergent capabilities unique to the generation paradigm → more interest → more investment. By the time the field realized how powerful scaled decoder-only models could be, the gap in research attention (and funding) had become self-perpetuating.


Encoder-Decoder Models

The Key Idea: Understand Then Generate

Encoder-decoder models keep the full original Transformer architecture: an encoder that processes the input with bidirectional attention, and a decoder that generates the output autoregressively with causal masking. The two halves are connected through cross-attention — the decoder attends to the encoder’s output representations when generating each token.

This architecture is a natural fit for tasks where the input and output are different sequences. The encoder builds a rich, bidirectional understanding of the input, and the decoder uses that understanding to generate a new sequence.
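Cross-attention can be sketched as single-head attention with one twist (toy dimensions and random weights below): queries come from the decoder's hidden states, while keys and values come from the encoder's output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
enc_out = torch.randn(6, d)     # encoder representations of a 6-token input
dec_hidden = torch.randn(3, d)  # decoder states for the 3 tokens generated so far

# Cross-attention: queries from the decoder, keys/values from the encoder.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q = dec_hidden @ Wq
K = enc_out @ Wk
V = enc_out @ Wv

attn = F.softmax(Q @ K.T / d ** 0.5, dim=-1)  # (3, 6): each decoder position
out = attn @ V                                #   attends over ALL input tokens
print(attn.shape, out.shape)
```

Note there is no causal mask here: the input is fully available, so every decoder position may look at every encoder position.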

Figure 5: The encoder builds bidirectional representations of the input; the decoder generates the output autoregressively, attending to the encoder’s representations via cross-attention.

T5: Everything is Text-to-Text

The most influential encoder-decoder model is T5 (Text-to-Text Transfer Transformer), published by Raffel et al. at Google in 2020. T5’s key innovation was a unifying idea: every NLP task can be framed as transforming one text string into another.

Instead of adding task-specific heads (like BERT does for classification vs. NER vs. QA), T5 prepends a task prefix to the input and generates the output as text:

| Task | Input | Output |
|---|---|---|
| Translation | “translate English to German: That is good” | “Das ist gut” |
| Summarization | “summarize: The lengthy article about...” | “Article discusses...” |
| Classification | “classify sentiment: This movie was great!” | “positive” |
| QA | “question: What is the capital? context: France’s capital is Paris.” | “Paris” |

This text-to-text framing is elegant: one model, one format, many tasks. T5 was pretrained using a variation of masked language modeling called span corruption — instead of masking individual tokens, it masks contiguous spans of text, and the model must generate the missing spans.
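Span corruption can be illustrated with a tiny hand-rolled function. Real T5 selects spans randomly during training; here the spans and the example sentence are fixed for clarity, and the `<extra_id_N>` sentinel naming follows T5's convention:

```python
def span_corrupt(tokens, spans):
    """Illustrative T5-style span corruption.

    `spans` is a list of (start, end) index ranges to drop. Each dropped span
    is replaced by a sentinel token in the input; the target is the sequence
    of sentinels followed by the text they replaced.
    """
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[cursor:start] + [sentinel]   # keep text up to the span, then sentinel
        tgt += [sentinel] + tokens[start:end]      # target recovers the dropped span
        cursor = end
    inp += tokens[cursor:]                         # keep any trailing text
    return inp, tgt

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(1, 3), (7, 8)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to your <extra_id_1> last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> party
```

Because whole spans disappear behind a single sentinel, the decoder must generate multi-token completions — a harder task than single-token MLM, and one that exercises the generation half of the model.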

BART: A Denoising Perspective

Another notable encoder-decoder model is BART (Bidirectional and Auto-Regressive Transformer), published by Lewis et al. at Facebook in 2019. BART takes a different angle on pretraining: it corrupts the input text in various ways (masking, deletion, permutation, rotation) and trains the model to reconstruct the original. This denoising autoencoder approach makes BART particularly strong at text generation tasks like abstractive summarization.
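Two of BART's corruptions (token deletion and document rotation) can be sketched as follows; the function and its details are an illustrative simplification, not BART's exact implementation, which also includes span masking and sentence permutation:

```python
import random

def corrupt(tokens, rng):
    """Illustrative BART-style noising: delete one token, then rotate the
    sequence to start from a random position. The model is trained to
    reconstruct the original sequence from this corrupted version."""
    toks = list(tokens)
    del toks[rng.randrange(len(toks))]     # token deletion
    pivot = rng.randrange(len(toks))
    return toks[pivot:] + toks[:pivot]     # document rotation

rng = random.Random(0)  # seeded for reproducibility
noised = corrupt("the cat sat on the mat".split(), rng)
print(noised)
```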

When to Use Encoder-Decoder

Encoder-decoder models shine when the input and output are genuinely different sequences: translating between languages, summarizing long documents into short ones, or otherwise transforming one text into another.

The trade-off is complexity: encoder-decoder models have roughly twice the parameters of a single-stack model at the same depth (because you’re maintaining two separate stacks plus cross-attention layers). As decoder-only models have grown powerful enough to handle many of these tasks through prompting alone, the practical advantages of encoder-decoder models have narrowed. But for tasks like translation, where the structural separation between input and output is clear, they remain a strong choice.


The Big Picture

Let’s bring everything together. The three Transformer variants are defined by two choices: which components of the original architecture you keep, and what pretraining objective you use.

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

tokens = ["T₁", "T₂", "T₃", "T₄"]
n = len(tokens)

# Encoder-only: bidirectional
bidir = torch.ones(n, n)
axes[0].imshow(bidir, cmap="Blues", vmin=0, vmax=1.3)
axes[0].set_title("Encoder-Only\n(BERT, RoBERTa)", fontsize=12, fontweight="bold")
axes[0].set_xticks(range(n)); axes[0].set_yticks(range(n))
axes[0].set_xticklabels(tokens); axes[0].set_yticklabels(tokens)

# Decoder-only: causal
causal = torch.tril(torch.ones(n, n))
im2 = axes[1].imshow(causal, cmap="Oranges", vmin=0, vmax=1.3)
axes[1].set_title("Decoder-Only\n(GPT, LLaMA, Claude)", fontsize=12, fontweight="bold")
axes[1].set_xticks(range(n)); axes[1].set_yticks(range(n))
axes[1].set_xticklabels(tokens); axes[1].set_yticklabels(tokens)

# Encoder-decoder: show both
# Left half bidirectional (encoder attending to input), right half causal (decoder)
enc_dec = torch.ones(n, n)
for i in range(n):
    for j in range(n):
        if j > i:
            # Upper triangle gets partial shading to indicate cross-attention
            enc_dec[i, j] = 0.5
axes[2].imshow(enc_dec, cmap="Greens", vmin=0, vmax=1.3)
axes[2].set_title("Encoder-Decoder\n(T5, BART)", fontsize=12, fontweight="bold")
axes[2].set_xticks(range(n)); axes[2].set_yticks(range(n))
axes[2].set_xticklabels(tokens); axes[2].set_yticklabels(tokens)

for ax in axes:
    ax.set_xlabel("Attends to →")
    ax.set_ylabel("← Token")

plt.tight_layout()
plt.show()

Architecture Comparison

Table 1: Transformer Model Variants Compared

| | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Architecture | Encoder stack only | Decoder stack only | Both stacks + cross-attention |
| Attention | Bidirectional (full) | Causal (left-to-right) | Bidirectional encoder, causal decoder |
| Pretraining | Masked Language Modeling | Next-token prediction | Span corruption or denoising |
| Strengths | Understanding, classification, extraction | Generation, few-shot, in-context learning | Sequence-to-sequence transformation |
| Limitations | Cannot generate text | Misses right context at each position | More parameters, higher complexity |
| Canonical Models | BERT, RoBERTa, DeBERTa | GPT-3/4, LLaMA, Claude, Gemini | T5, BART, mBART |
| Use When... | You need to classify, extract, or compare | You need to generate or converse | Input → different output (translate, summarize) |
Figure 6: A practical decision tree for choosing the right transformer architecture based on your NLP task.

The Trend: Decoder-Only Dominance

If you look at the most capable models today — GPT-4, Claude, LLaMA, Gemini — they are all decoder-only. This doesn’t mean encoder-only and encoder-decoder models are obsolete. Rather, the landscape has evolved: encoder-only models remain efficient choices for classification, extraction, and retrieval; encoder-decoder models are still strong for translation and summarization; and decoder-only models dominate general-purpose, prompt-driven use.

The key takeaway is that architecture choice is a design decision driven by your task, your constraints, and your deployment context. Understanding all three variants lets you make that decision wisely.


Wrap-Up

Key Takeaways

  - The original Transformer splits into three families: encoder-only (BERT, RoBERTa), decoder-only (GPT, LLaMA, Claude), and encoder-decoder (T5, BART).

  - Two choices define each family: which half of the architecture you keep, and what pretraining objective you use — MLM with bidirectional attention, next-token prediction with causal attention, or span corruption/denoising with both halves.

  - Encoder-only models excel at understanding; decoder-only models excel at generation and in-context learning; encoder-decoder models excel at sequence-to-sequence transformation.

  - Decoder-only models dominate today thanks to denser training signal, predictable scaling laws, and the flexibility of prompting.

What’s Next

In Part 02, we’ll get hands-on with the Hugging Face ecosystem — the toolkit that makes all three of these model families accessible through a unified Python API. You’ll learn to load pretrained models and tokenizers from the Hugging Face Hub, use high-level pipelines for rapid prototyping, and work with the Datasets library for loading and processing data. The conceptual understanding of model variants you’ve built today will directly inform which models you reach for and why.