Fine-Tuning and Alignment
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L08.01: Foundation models, scaling laws, open vs. closed models
L07.02: The Hugging Face ecosystem — pipelines, model hub, and datasets library
Outcomes
Distinguish full fine-tuning from parameter-efficient methods (LoRA, QLoRA) and explain when each is appropriate
Describe the post-training pipeline: pretraining → supervised fine-tuning → preference alignment
Explain how RLHF (and DPO as an alternative) aligns LLMs with human preferences
Identify domain adaptation strategies and when to apply them
References
Why Fine-Tune?
In Part 01, we saw that foundation models gain remarkable capabilities through scale alone. GPT-5.4 can write essays, Opus 4.6 can reason through complex problems, and Qwen 3.5 can generate working code — all without any task-specific training.
But here’s a question worth asking: if foundation models can already do so much, why would anyone bother training them further?
The answer is that general capability isn’t the same as specific capability. A foundation model knows a little about everything, but your application probably needs it to know a lot about one thing. A hospital needs a model that understands clinical terminology and discharge note conventions. A legal firm needs one that can parse case law with precision. A customer support team needs one that stays on-brand and follows company policies.
This is where post-training comes in — the set of techniques used to take a general-purpose foundation model and make it yours. Over the next few sections, we’ll explore the three main approaches: full fine-tuning, parameter-efficient fine-tuning, and alignment through human feedback. Each makes different trade-offs between cost, control, and capability.
The Fine-Tuning Spectrum
Not all fine-tuning is created equal. The techniques available to us form a spectrum from “change everything” to “change nothing,” with very different costs and use cases at each point.
Figure 1: The fine-tuning spectrum: from full fine-tuning (most expensive, most control) to prompting (cheapest, least control). Parameter-efficient methods like LoRA and QLoRA offer a practical middle ground.
Full Fine-Tuning
The most straightforward approach: take the pretrained model, unfreeze all its weights, and continue training on your task-specific data. This is exactly what we did when fine-tuning BERT for classification in Week 7 — we updated every parameter in the model.
Full fine-tuning gives you maximum control. You can reshape the model’s behavior completely for your domain. But it comes with serious drawbacks:
Memory: You need enough GPU memory to hold the full model, its gradients, and the optimizer states — roughly 4x the model’s size in memory
Cost: Training a 70B model end-to-end requires a cluster of high-end GPUs
Catastrophic forgetting: The model may lose its general capabilities as it overfits to your narrow dataset
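To make the memory claim concrete, here is a back-of-the-envelope estimate. This is a sketch, not an exact accounting: real usage also depends on precision, activations, and batch size.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for full fine-tuning with Adam:
    weights + gradients + two optimizer moment buffers ~= 4x the weights."""
    weights = num_params * bytes_per_value
    gradients = weights
    optimizer_states = 2 * weights  # Adam tracks first and second moments
    return (weights + gradients + optimizer_states) / 1e9

# A 7B-parameter model in 32-bit precision, before counting activations:
print(full_finetune_memory_gb(7e9))  # 112.0 (GB)
```

Even a modest 7B model needs on the order of a hundred gigabytes just for the training state, which is why full fine-tuning of large models requires multi-GPU clusters.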
For models under ~1B parameters (like BERT), full fine-tuning is still practical and often the best choice. For today’s foundation models with tens or hundreds of billions of parameters, we need something more efficient.
LoRA: Low-Rank Adaptation
LoRA is the most popular parameter-efficient fine-tuning method, and the idea behind it is elegant. Instead of updating all the weights in a layer, LoRA freezes the original weights and adds a small pair of low-rank matrices that learn the change to each layer.
Here’s the intuition: a weight matrix W in a Transformer has dimensions d × d (often 4096 × 4096 or larger). LoRA freezes W and decomposes the update ΔW into two much smaller matrices: B (of size d × r) and A (of size r × d), where r (the “rank”) is tiny, typically 8, 16, or 32. The effective weight becomes W + BA, but you only train the 2 × d × r parameters in B and A.
The savings are dramatic. For a layer with a 4096 × 4096 weight matrix (16.7M parameters), a rank-16 LoRA adds only 2 × 4096 × 16 = 131K trainable parameters — less than 1% of the original.
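The arithmetic above is easy to check directly. This helper is a minimal sketch (the square 4096 × 4096 shape is just the running example):

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare full-update parameters against a rank-r LoRA update."""
    full = d_in * d_out                   # training the whole weight matrix
    lora = rank * d_in + d_out * rank     # A (r x d_in) plus B (d_out x r)
    return full, lora, lora / full

full, lora, frac = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{frac:.2%}")  # 16777216 131072 0.78%
```

Note that the fraction shrinks as the layer grows: the full matrix scales with d², while the LoRA update scales only linearly with d.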
# Conceptual LoRA setup with Hugging Face PEFT
# (Not executed — requires GPU and a model download)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# 1. Load a pretrained model (get_peft_model will freeze these weights)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
# 2. Define LoRA configuration
lora_config = LoraConfig(
r=16, # Rank of the low-rank matrices
lora_alpha=32, # Scaling factor (higher = stronger adaptation)
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05, # Dropout for regularization
task_type="CAUSAL_LM", # Task type for the model
)
# 3. Wrap the model with LoRA adapters
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# → "trainable params: 1,572,864 || all params: 630,000,000 || trainable%: 0.25%"

The beauty of LoRA is that the adapters are tiny files (often just a few MB) that can be loaded on top of the base model at inference time. You can even train multiple LoRA adapters for different tasks and swap them in and out without reloading the base model.
QLoRA: Quantization Meets LoRA
QLoRA takes LoRA one step further by quantizing the frozen base model to 4-bit precision before attaching LoRA adapters. This reduces the memory footprint of the base model by roughly 4x relative to 16-bit weights, meaning you can fine-tune a 70B model on a single GPU with 48GB of VRAM.
The key insight is that the base weights don’t need full precision during fine-tuning — only the LoRA adapter weights are updated, and those stay in higher precision. In practice, QLoRA achieves results very close to full-precision LoRA with dramatically less hardware.
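The memory arithmetic behind that claim is straightforward. A rough sketch (weights only; LoRA adapters, activations, and quantization overhead add a bit more):

```python
def base_model_memory_gb(num_params, bits_per_weight):
    """Memory to hold just the frozen base weights at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9

fp16 = base_model_memory_gb(70e9, 16)  # 16-bit: too big for one GPU
nf4 = base_model_memory_gb(70e9, 4)    # 4-bit: fits alongside adapters on a 48GB card
print(fp16, nf4)  # 140.0 35.0 (GB)
```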
The Post-Training Pipeline
Fine-tuning techniques tell us how to update a model, but they don’t tell us what to train it on or why. The modern LLM pipeline has a clear three-stage structure, and understanding each stage is crucial for knowing what makes models like Claude Opus 4.6 or GPT-5.4 actually useful.
Figure 2: The three stages of building a modern LLM: pretraining learns language, supervised fine-tuning teaches task following, and preference alignment makes the model helpful, harmless, and honest.
Stage 1: Pretraining
We covered this in Part 01. The model trains on trillions of tokens of text from the internet, books, code, and other sources. The objective is simple: predict the next token. This stage produces a base model — one that can continue any text plausibly, but doesn’t know how to be helpful, follow instructions, or have a conversation.
If you prompt a base model with “What is the capital of France?”, it might respond with “What is the capital of Germany? What is the capital of Italy?” — because in its training data, questions are often followed by more questions, not answers.
Stage 2: Supervised Fine-Tuning (SFT)
Supervised fine-tuning, also called instruction tuning, transforms the base model into one that can follow instructions. The training data consists of (instruction, response) pairs — thousands to millions of examples of questions paired with high-quality answers.
After SFT, the model understands the format of being helpful: it responds to questions with answers, follows multi-step instructions, formats code correctly, and so on. This is a dramatic behavioral shift from the base model, even though the underlying knowledge was already there from pretraining.
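Concretely, each (instruction, response) pair is usually serialized into a single training string with a prompt template, and the loss is computed only on the response tokens. The template below is a hypothetical example for illustration; real chat templates vary by model family:

```python
def format_sft_example(instruction: str, response: str) -> str:
    """Serialize one (instruction, response) pair into a training string
    using a simple Alpaca-style template (illustrative, not model-specific)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

example = format_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(example)
```

Training on millions of strings shaped like this is what teaches the base model that a question should be followed by an answer rather than by more questions.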
Stage 3: Preference Alignment
SFT teaches the model what to say, but not always how well to say it. A model might generate several plausible responses to a question — some more helpful, more accurate, or safer than others. Preference alignment teaches the model to distinguish between better and worse responses.
RLHF (reinforcement learning from human feedback) is the original approach, used by OpenAI for ChatGPT and by Anthropic for Claude. It works in two steps:
Train a reward model: Human annotators compare pairs of model responses and indicate which is better. A separate model learns to predict these preferences, essentially learning a scoring function for “response quality.”
Optimize with RL: The LLM is then fine-tuned using reinforcement learning (specifically, PPO — Proximal Policy Optimization) to generate responses that score highly according to the reward model, while staying close to the SFT model to prevent reward hacking.
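The reward-model step in the list above trains on pairwise comparisons with a Bradley-Terry style loss: the probability that the preferred response wins is a sigmoid of the score difference. A minimal sketch, where the scores stand in for reward-model outputs:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Pushing the chosen score above the rejected one drives the loss toward zero."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

print(reward_model_loss(2.0, 1.0))  # small: model agrees with the annotator
print(reward_model_loss(1.0, 2.0))  # large: model disagrees
```

When the two scores are equal the loss is log 2, i.e. the model is maximally uncertain about which response the annotator preferred.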
DPO (Direct Preference Optimization) is a newer, simpler alternative. Instead of training a separate reward model and using RL, DPO directly optimizes the language model on preference pairs. It treats the LLM itself as an implicit reward model, which eliminates the complexity of the RL training loop. In practice, DPO achieves comparable alignment quality with less infrastructure.
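The DPO objective itself fits in a few lines: it compares how much the policy lifts the chosen response over the rejected one, relative to a frozen reference model. The log-probabilities below are placeholder inputs; in practice they are summed token log-probs from the two models:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair.
    beta controls how far the policy may drift from the reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # policy vs reference, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # policy vs reference, rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))

# A policy that raises the chosen response relative to the reference gets lower loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-12.0, -10.0, -11.0, -11.0))  # True
```

Because the "reward" is implicit in these log-probability ratios, no separate reward model or RL loop is needed, which is exactly the simplification DPO is known for.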
A related technique is RLAIF (reinforcement learning from AI feedback), where the preference labels come from another LLM rather than human annotators — trading annotation cost for scale.
Domain Adaptation Strategies
So far, we’ve discussed fine-tuning in the context of making a model more generally useful. But what about making it expert-level in a specific domain? This is domain adaptation, and there are several strategies depending on your needs and resources.
Figure 3: Choosing an adaptation strategy depends on your data, compute budget, and how different your domain is from the model’s pretraining distribution.
Continued Pretraining
If your domain has its own vocabulary and conventions (medical, legal, financial text), you can continue the base model’s pretraining on a large corpus of domain text. This teaches the model the “language” of your domain before any task-specific training. For example, Bloomberg trained BloombergGPT on financial text, and there are similar efforts for clinical, legal, and scientific domains.
Task-Specific Fine-Tuning
When you have labeled examples for a specific task (classify these radiology reports, extract clauses from contracts), you fine-tune on those examples — either fully or with LoRA. This is the most common approach and often the most practical.
RAG as an Alternative
Sometimes you don’t need to change the model at all. Retrieval-Augmented Generation (RAG) lets the model access external documents at inference time, grounding its responses in your data without any training. We’ll explore RAG in depth in Week 10, but it’s worth noting here as a key alternative to fine-tuning — especially when your knowledge base changes frequently.
When to Use What?
The choice depends on several factors:
| Factor | Fine-Tune | RAG | Prompting Only |
|---|---|---|---|
| Domain is very different from general text | Best | Good | Weak |
| Knowledge changes frequently | Retrain needed | Best | Best |
| Need consistent output format/style | Best | Good | Moderate |
| Limited labeled data | LoRA with few examples | Best | Best |
| Privacy: can’t send data to an API | Open model + FT | Open model + RAG | Open model only |
Wrap-Up
Key Takeaways
What’s Next
In Part 03, we’ll shift from customizing models to using them. We’ll explore the LLM API landscape — OpenAI, Anthropic, and Google — learning how to make API calls, manage authentication, and understand token pricing. This is where foundation models become practical tools you can integrate into applications.