Fine-Tuning and Alignment
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L08.01: Foundation models, scaling laws, open vs. closed models
L07.02: The Hugging Face ecosystem — pipelines, model hub, and datasets library
Outcomes
Distinguish full fine-tuning from parameter-efficient methods (LoRA, QLoRA) and explain when each is appropriate
Describe the post-training pipeline: pretraining → supervised fine-tuning → preference alignment
Explain how RLHF (and DPO as an alternative) aligns LLMs with human preferences
Identify domain adaptation strategies and when to apply them
References
Why Fine-Tune?
In Part 01, we saw that foundation models gain remarkable capabilities through scale alone. GPT-5.4 can write essays, Opus 4.6 can reason through complex problems, and Qwen 3.5 can generate working code — all without any task-specific training.
But here’s a question worth asking: if foundation models can already do so much, why would anyone bother training them further?
The answer is that general capability isn’t the same as specific capability. A foundation model knows a little about everything, but your application probably needs it to know a lot about one thing. A hospital needs a model that understands clinical terminology and discharge note conventions. A legal firm needs one that can parse case law with precision. A customer support team needs one that stays on-brand and follows company policies.
This is where post-training comes in — the set of techniques used to take a general-purpose foundation model and make it yours. Over the next few sections, we’ll explore the three main approaches: full fine-tuning, parameter-efficient fine-tuning, and alignment through human feedback. Each makes different trade-offs between cost, control, and capability.
The Fine-Tuning Spectrum
Not all fine-tuning is created equal. The techniques available to us form a spectrum from “change everything” to “change nothing,” with very different costs and use cases at each point.
Figure 1: The fine-tuning spectrum: from full fine-tuning (most expensive, most control) to prompting (cheapest, least control). Parameter-efficient methods like LoRA and QLoRA offer a practical middle ground.
Full Fine-Tuning
The most straightforward approach: take the pretrained model, unfreeze all its weights, and continue training on your task-specific data. This is exactly what we did when fine-tuning BERT for classification in Week 7 — we updated every parameter in the model.
Full fine-tuning gives you maximum control. You can reshape the model’s behavior completely for your domain. But it comes with serious drawbacks:
Memory: You need enough GPU memory to hold the full model, its gradients, and the optimizer states — roughly 4x the model’s size in memory
Cost: Training a 70B model end-to-end requires a cluster of high-end GPUs
Catastrophic forgetting: The model may lose its general capabilities as it overfits to your narrow dataset
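To make the memory claim concrete, here is a back-of-the-envelope estimate. This is a sketch, not an exact accounting: real usage also depends on precision, activations, and batch size.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for full fine-tuning with Adam:
    weights + gradients + two optimizer moment buffers ~= 4x the weights."""
    weights = num_params * bytes_per_value
    gradients = weights
    optimizer_states = 2 * weights  # Adam tracks first and second moments
    return (weights + gradients + optimizer_states) / 1e9

# A 7B-parameter model in 32-bit precision, before counting activations:
print(full_finetune_memory_gb(7e9))  # 112.0 (GB)
```

Even a modest 7B model needs on the order of a hundred gigabytes just for the training state, which is why full fine-tuning of large models requires multi-GPU clusters.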
For models under ~1B parameters (like BERT), full fine-tuning is still practical and often the best choice. For today’s foundation models with tens or hundreds of billions of parameters, we need something more efficient.
LoRA: Low-Rank Adaptation
LoRA is the most popular parameter-efficient fine-tuning method, and the idea behind it is elegant. Instead of updating all the weights in a layer, LoRA freezes the original weights and adds a small pair of low-rank matrices that learn the change to each layer.
Here’s the intuition: a weight matrix W in a Transformer has dimensions d × d (often 4096 × 4096 or larger). LoRA freezes W and decomposes the update ΔW into two much smaller matrices: B (of size d × r) and A (of size r × d), where r (the “rank”) is tiny, typically 8, 16, or 32. The effective weight becomes W + BA, but you only train the 2 × d × r parameters in B and A.
The savings are dramatic. For a layer with a 4096 × 4096 weight matrix (16.7M parameters), a rank-16 LoRA adds only 2 × 4096 × 16 = 131K trainable parameters — less than 1% of the original.
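The arithmetic above is easy to check directly. This helper is a minimal sketch (the square 4096 × 4096 shape is just the running example):

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare full-update parameters against a rank-r LoRA update."""
    full = d_in * d_out                   # training the whole weight matrix
    lora = rank * d_in + d_out * rank     # A (r x d_in) plus B (d_out x r)
    return full, lora, lora / full

full, lora, frac = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{frac:.2%}")  # 16777216 131072 0.78%
```

Note that the fraction shrinks as the layer grows: the full matrix scales with d², while the LoRA update scales only linearly with d.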
# Conceptual LoRA setup with Hugging Face PEFT
# (Not executed — requires GPU and a model download)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# 1. Load a pretrained model (get_peft_model will freeze these weights)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
# 2. Define LoRA configuration
lora_config = LoraConfig(
r=16, # Rank of the low-rank matrices
lora_alpha=32, # Scaling factor (higher = stronger adaptation)
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05, # Dropout for regularization
task_type="CAUSAL_LM", # Task type for the model
)
# 3. Wrap the model with LoRA adapters
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# → "trainable params: 1,572,864 || all params: 630,000,000 || trainable%: 0.25%"

The beauty of LoRA is that the adapters are tiny files (often just a few MB) that can be loaded on top of the base model at inference time. You can even train multiple LoRA adapters for different tasks and swap them in and out without reloading the base model.
QLoRA: Quantization Meets LoRA
QLoRA takes LoRA one step further by quantizing the frozen base model to 4-bit precision before attaching LoRA adapters. This reduces the memory footprint of the base model by roughly 4x relative to 16-bit weights, meaning you can fine-tune a 70B model on a single GPU with 48GB of VRAM.
The key insight is that the base weights don’t need full precision during fine-tuning — only the LoRA adapter weights are updated, and those stay in higher precision. In practice, QLoRA achieves results very close to full-precision LoRA with dramatically less hardware.
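The memory arithmetic behind that claim is straightforward. A rough sketch (weights only; LoRA adapters, activations, and quantization overhead add a bit more):

```python
def base_model_memory_gb(num_params, bits_per_weight):
    """Memory to hold just the frozen base weights at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9

fp16 = base_model_memory_gb(70e9, 16)  # 16-bit: too big for one GPU
nf4 = base_model_memory_gb(70e9, 4)    # 4-bit: fits alongside adapters on a 48GB card
print(fp16, nf4)  # 140.0 35.0 (GB)
```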
The Post-Training Pipeline
Fine-tuning techniques tell us how to update a model, but they don’t tell us what to train it on or why. The modern LLM pipeline has a clear three-stage structure, and understanding each stage is crucial for knowing what makes models like Claude Opus 4.6 or GPT-5.4 actually useful.
Figure 2: The three stages of building a modern LLM: pretraining learns language, supervised fine-tuning teaches task following, and preference alignment makes the model helpful, harmless, and honest.
Stage 1: Pretraining
We covered this in Part 01. The model trains on trillions of tokens of text from the internet, books, code, and other sources. The objective is simple: predict the next token. This stage produces a base model — one that can continue any text plausibly, but doesn’t know how to be helpful, follow instructions, or have a conversation.
If you prompt a base model with “What is the capital of France?”, it might respond with “What is the capital of Germany? What is the capital of Italy?” — because in its training data, questions are often followed by more questions, not answers.
Stage 2: Supervised Fine-Tuning (SFT)
Supervised fine-tuning, also called instruction tuning, transforms the base model into one that can follow instructions. The training data consists of (instruction, response) pairs — thousands to millions of examples of questions paired with high-quality answers.
After SFT, the model understands the format of being helpful: it responds to questions with answers, follows multi-step instructions, formats code correctly, and so on. This is a dramatic behavioral shift from the base model, even though the underlying knowledge was already there from pretraining.
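Concretely, each (instruction, response) pair is usually serialized into a single training string with a prompt template, and the loss is computed only on the response tokens. The template below is a hypothetical example for illustration; real chat templates vary by model family:

```python
def format_sft_example(instruction: str, response: str) -> str:
    """Serialize one (instruction, response) pair into a training string
    using a simple Alpaca-style template (illustrative, not model-specific)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

example = format_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(example)
```

Training on millions of strings shaped like this is what teaches the base model that a question should be followed by an answer rather than by more questions.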
Stage 3: Preference Alignment
SFT teaches the model what to say, but not always how well to say it. A model might generate several plausible responses to a question — some more helpful, more accurate, or safer than others. Preference alignment teaches the model to distinguish between better and worse responses.
RLHF (reinforcement learning from human feedback) is the original approach, used by OpenAI for ChatGPT and by Anthropic for Claude. It works in two steps:
Train a reward model: Human annotators compare pairs of model responses and indicate which is better. A separate model learns to predict these preferences, essentially learning a scoring function for “response quality.”
Optimize with RL: The LLM is then fine-tuned using reinforcement learning (specifically, PPO — Proximal Policy Optimization) to generate responses that score highly according to the reward model, while staying close to the SFT model to prevent reward hacking.
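The reward-model step in the list above trains on pairwise comparisons with a Bradley-Terry style loss: the probability that the preferred response wins is a sigmoid of the score difference. A minimal sketch, where the scores stand in for reward-model outputs:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Pushing the chosen score above the rejected one drives the loss toward zero."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

print(reward_model_loss(2.0, 1.0))  # small: model agrees with the annotator
print(reward_model_loss(1.0, 2.0))  # large: model disagrees
```

When the two scores are equal the loss is log 2, i.e. the model is maximally uncertain about which response the annotator preferred.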
DPO (Direct Preference Optimization) is a newer, simpler alternative. Instead of training a separate reward model and using RL, DPO directly optimizes the language model on preference pairs. It treats the LLM itself as an implicit reward model, which eliminates the complexity of the RL training loop. In practice, DPO achieves comparable alignment quality with less infrastructure.
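The DPO objective itself fits in a few lines: it compares how much the policy lifts the chosen response over the rejected one, relative to a frozen reference model. The log-probabilities below are placeholder inputs; in practice they are summed token log-probs from the two models:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair.
    beta controls how far the policy may drift from the reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # policy vs reference, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # policy vs reference, rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))

# A policy that raises the chosen response relative to the reference gets lower loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-12.0, -10.0, -11.0, -11.0))  # True
```

Because the "reward" is implicit in these log-probability ratios, no separate reward model or RL loop is needed, which is exactly the simplification DPO is known for.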
A related technique is RLAIF (reinforcement learning from AI feedback), where the preference labels come from another LLM rather than human annotators — trading annotation cost for scale.
Domain Adaptation Strategies
So far, we’ve discussed fine-tuning in the context of making a model more generally useful. But what about making it expert-level in a specific domain? This is domain adaptation, and there are several strategies depending on your needs and resources.
Figure 3: Choosing an adaptation strategy depends on your data, compute budget, and how different your domain is from the model’s pretraining distribution.
Continued Pretraining
If your domain has its own vocabulary and conventions (medical, legal, financial text), you can continue the base model’s pretraining on a large corpus of domain text. This teaches the model the “language” of your domain before any task-specific training. For example, Bloomberg trained BloombergGPT on financial text, and there are similar efforts for clinical, legal, and scientific domains.
Task-Specific Fine-Tuning
When you have labeled examples for a specific task (classify these radiology reports, extract clauses from contracts), you fine-tune on those examples — either fully or with LoRA. This is the most common approach and often the most practical.
RAG as an Alternative
Sometimes you don’t need to change the model at all. Retrieval-Augmented Generation (RAG) lets the model access external documents at inference time, grounding its responses in your data without any training. We’ll explore RAG in depth in Week 10, but it’s worth noting here as a key alternative to fine-tuning — especially when your knowledge base changes frequently.
When to Use What?
The choice depends on several factors:
| Factor | Fine-Tune | RAG | Prompting Only |
|---|---|---|---|
| Domain is very different from general text | Best | Good | Weak |
| Knowledge changes frequently | Retrain needed | Best | Best |
| Need consistent output format/style | Best | Good | Moderate |
| Limited labeled data | LoRA with few examples | Best | Best |
| Privacy: can’t send data to an API | Open model + FT | Open model + RAG | Open model only |
Wrap-Up
Key Takeaways
What’s Next
In Part 03, we’ll shift from customizing models to using them. We’ll explore the LLM API landscape — OpenAI, Anthropic, and Google — learning how to make API calls, manage authentication, and understand token pricing. This is where foundation models become practical tools you can integrate into applications.