
Fine-Tuning and Alignment

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon



Why Fine-Tune?

In Part 01, we saw that foundation models gain remarkable capabilities through scale alone. GPT-5.4 can write essays, Opus 4.6 can reason through complex problems, and Qwen 3.5 can generate working code — all without any task-specific training.

But here’s a question worth asking: if foundation models can already do so much, why would anyone bother training them further?

The answer is that general capability isn’t the same as specific capability. A foundation model knows a little about everything, but your application probably needs it to know a lot about one thing. A hospital needs a model that understands clinical terminology and discharge note conventions. A legal firm needs one that can parse case law with precision. A customer support team needs one that stays on-brand and follows company policies.

This is where post-training comes in — the set of techniques used to take a general-purpose foundation model and make it yours. Over the next few sections, we’ll explore the three main approaches: full fine-tuning, parameter-efficient fine-tuning, and alignment through human feedback. Each makes different trade-offs between cost, control, and capability.


The Fine-Tuning Spectrum

Not all fine-tuning is created equal. The techniques available to us form a spectrum from “change everything” to “change nothing,” with very different costs and use cases at each point.

Figure 1: The fine-tuning spectrum: from full fine-tuning (most expensive, most control) to prompting (cheapest, least control). Parameter-efficient methods like LoRA and QLoRA offer a practical middle ground.

Full Fine-Tuning

The most straightforward approach: take the pretrained model, unfreeze all its weights, and continue training on your task-specific data. This is exactly what we did when fine-tuning BERT for classification in Week 7 — we updated every parameter in the model.

Full fine-tuning gives you maximum control. You can reshape the model’s behavior completely for your domain. But it comes with serious drawbacks: every weight, gradient, and optimizer state must fit in GPU memory at once; each task you adapt to produces a full-size copy of the model’s weights; and aggressive updates risk catastrophic forgetting of the general capabilities you started with.

For models under ~1B parameters (like BERT), full fine-tuning is still practical and often the best choice. For today’s foundation models with tens or hundreds of billions of parameters, we need something more efficient.

LoRA: Low-Rank Adaptation

LoRA is the most popular parameter-efficient fine-tuning method, and the idea behind it is elegant. Instead of updating all the weights in a layer, LoRA freezes the original weights and adds a small pair of low-rank matrices that learn the change to each layer.

Here’s the intuition: a weight matrix W in a Transformer has dimensions d × d (often 4096 × 4096 or larger). LoRA decomposes the update into two much smaller matrices: B (of size d × r) and A (of size r × d), where r (the “rank”) is tiny, typically 8, 16, or 32. The effective weight becomes W + BA, but you only train the parameters in A and B.

The savings are dramatic. For a layer with a 4096 × 4096 weight matrix (16.7M parameters), a rank-16 LoRA adds only 2 × 4096 × 16 = 131K trainable parameters — less than 1% of the original.
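Under the hood, the decomposition is simple enough to sketch in a few lines of NumPy. This is a toy illustration rather than any library’s actual implementation; it follows the LoRA paper’s convention that the update is BA, with B of size d × r initialized to zero and the product scaled by alpha / r:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 32

# Frozen pretrained weight: never updated during LoRA training
W = rng.standard_normal((d, d)).astype(np.float32)

# Trainable low-rank factors. B starts at zero, so the adapter
# contributes nothing at initialization (delta_W = B @ A = 0).
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Equivalent to x @ (W + (alpha / r) * B @ A).T, but applying the
    # factors one at a time avoids ever forming a d x d update matrix.
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

n_trainable = A.size + B.size
print(f"trainable params: {n_trainable:,}")             # 131,072 = 2 * d * r
print(f"fraction of W:    {n_trainable / W.size:.2%}")  # 0.78%
```

Because B is zero at the start, the adapted model is exactly the pretrained model at step 0, and training only ever touches A and B.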

# Conceptual LoRA setup with Hugging Face PEFT
# (Not executed — requires GPU and a model download)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# 1. Load a pretrained model (its weights are frozen when wrapped below)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# 2. Define LoRA configuration
lora_config = LoraConfig(
    r=16,                        # Rank of the low-rank matrices
    lora_alpha=32,               # Scaling factor (higher = stronger adaptation)
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,           # Dropout for regularization
    task_type="CAUSAL_LM",       # Task type for the model
)

# 3. Wrap the model with LoRA adapters
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# → "trainable params: 1,572,864 || all params: 630,000,000 || trainable%: 0.25%"

The beauty of LoRA is that the adapters are tiny files (often just a few MB) that can be loaded on top of the base model at inference time. You can even train multiple LoRA adapters for different tasks and swap them in and out without reloading the base model.

QLoRA: Quantization Meets LoRA

QLoRA takes LoRA one step further by quantizing the frozen base model to 4-bit precision before attaching LoRA adapters. This shrinks the base model’s memory footprint by roughly 4x compared to 16-bit weights, meaning you can fine-tune a model on the order of 70B parameters on a single 48GB GPU.

The key insight is that the base weights don’t need full precision during fine-tuning — only the LoRA adapter weights are updated, and those stay in higher precision. In practice, QLoRA achieves results very close to full-precision LoRA with dramatically less hardware.
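The memory claim is easy to check with back-of-envelope arithmetic (this ignores activations, the adapter weights themselves, and quantization block overhead):

```python
params = 70e9  # a 70B-parameter base model

bytes_fp16 = params * 2    # 16-bit weights: 2 bytes per parameter
bytes_int4 = params * 0.5  # 4-bit weights: half a byte per parameter

print(bytes_fp16 / 1e9)  # 140.0 GB -> does not fit on a single GPU
print(bytes_int4 / 1e9)  # 35.0 GB  -> fits in 48 GB, with room for adapters
```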


The Post-Training Pipeline

Fine-tuning techniques tell us how to update a model, but they don’t tell us what to train it on or why. The modern LLM pipeline has a clear three-stage structure, and understanding each stage is crucial for knowing what makes models like Claude Opus 4.6 or GPT-5.4 actually useful.

Figure 2: The three stages of building a modern LLM: pretraining learns language, supervised fine-tuning teaches task following, and preference alignment makes the model helpful, harmless, and honest.

Stage 1: Pretraining

We covered this in Part 01. The model trains on trillions of tokens of text from the internet, books, code, and other sources. The objective is simple: predict the next token. This stage produces a base model — one that can continue any text plausibly, but doesn’t know how to be helpful, follow instructions, or have a conversation.

If you prompt a base model with “What is the capital of France?”, it might respond with “What is the capital of Germany? What is the capital of Italy?” — because in its training data, questions are often followed by more questions, not answers.

Stage 2: Supervised Fine-Tuning (SFT)

Supervised fine-tuning, also called instruction tuning, transforms the base model into one that can follow instructions. The training data consists of (instruction, response) pairs — thousands to millions of examples of questions paired with high-quality answers.

After SFT, the model understands the format of being helpful: it responds to questions with answers, follows multi-step instructions, formats code correctly, and so on. This is a dramatic behavioral shift from the base model, even though the underlying knowledge was already there from pretraining.
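To make the data format concrete, here is a minimal sketch of serializing one (instruction, response) pair into a training string. The <|user|>, <|assistant|>, and <|end|> markers are made up for illustration; each model family defines its own chat template:

```python
def format_sft_example(instruction: str, response: str) -> str:
    """Serialize one (instruction, response) pair into a training string."""
    return (
        f"<|user|>\n{instruction.strip()}\n"
        f"<|assistant|>\n{response.strip()}<|end|>"
    )

example = format_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(example)
```

During SFT, the loss is typically computed only on the response tokens, so the model learns to produce answers rather than to echo instructions.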

Stage 3: Preference Alignment

SFT teaches the model what to say, but not always how well to say it. A model might generate several plausible responses to a question — some more helpful, more accurate, or safer than others. Preference alignment teaches the model to distinguish between better and worse responses.

RLHF (reinforcement learning from human feedback) is the original approach, used by OpenAI for ChatGPT and by Anthropic for Claude. It works in two steps:

  1. Train a reward model: Human annotators compare pairs of model responses and indicate which is better. A separate model learns to predict these preferences, essentially learning a scoring function for “response quality.”

  2. Optimize with RL: The LLM is then fine-tuned using reinforcement learning (specifically, PPO — Proximal Policy Optimization) to generate responses that score highly according to the reward model, while staying close to the SFT model to prevent reward hacking.
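Step 1 boils down to a pairwise ranking objective. Below is a sketch of the Bradley-Terry style loss commonly used for reward models, with plain floats standing in for the reward model’s scalar outputs:

```python
import math

def reward_model_loss(score_preferred: float, score_other: float) -> float:
    # -log sigmoid(score_preferred - score_other): small when the model
    # already scores the human-preferred response higher, large otherwise
    margin = score_preferred - score_other
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(reward_model_loss(2.0, 0.5))  # correct ranking -> small loss
print(reward_model_loss(0.5, 2.0))  # inverted ranking -> large loss
```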

DPO (Direct Preference Optimization) is a newer, simpler alternative. Instead of training a separate reward model and using RL, DPO directly optimizes the language model on preference pairs. It treats the LLM itself as an implicit reward model, which eliminates the complexity of the RL training loop. In practice, DPO achieves comparable alignment quality with less infrastructure.
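A sketch of the DPO objective for a single preference pair. The log-probabilities here are illustrative scalars; in a real implementation they would be summed token log-probs under the trained policy and a frozen reference (SFT) model:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    # How much more the policy prefers the chosen response than the
    # reference does, minus the same quantity for the rejected response
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log sigmoid(beta * margin): small when the policy ranks correctly
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

print(dpo_loss(-10.0, -14.0, -12.0, -13.0))  # positive margin -> low loss
print(dpo_loss(-14.0, -10.0, -13.0, -12.0))  # negative margin -> high loss
```

Note the role of the reference model: the loss rewards the policy for ranking responses better than the reference does, not merely for assigning high probability to the chosen response.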

A related technique is RLAIF (reinforcement learning from AI feedback), where the preference labels come from another LLM rather than from human annotators, trading annotation cost for scale.


Domain Adaptation Strategies

So far, we’ve discussed fine-tuning in the context of making a model more generally useful. But what about making it expert-level in a specific domain? This is domain adaptation, and there are several strategies depending on your needs and resources.

Figure 3: Choosing an adaptation strategy depends on your data, compute budget, and how different your domain is from the model’s pretraining distribution.

Continued Pretraining

If your domain has its own vocabulary and conventions (medical, legal, financial text), you can continue the base model’s pretraining on a large corpus of domain text. This teaches the model the “language” of your domain before any task-specific training. For example, Bloomberg trained BloombergGPT on financial text, and there are similar efforts for clinical, legal, and scientific domains.

Task-Specific Fine-Tuning

When you have labeled examples for a specific task (classify these radiology reports, extract clauses from contracts), you fine-tune on those examples — either fully or with LoRA. This is the most common approach and often the most practical.

RAG as an Alternative

Sometimes you don’t need to change the model at all. Retrieval-Augmented Generation (RAG) lets the model access external documents at inference time, grounding its responses in your data without any training. We’ll explore RAG in depth in Week 10, but it’s worth noting here as a key alternative to fine-tuning — especially when your knowledge base changes frequently.

When to Use What?

The choice depends on several factors:

| Factor | Fine-Tune | RAG | Prompting Only |
| --- | --- | --- | --- |
| Domain is very different from general text | Best | Good | Weak |
| Knowledge changes frequently | Retrain needed | Best | Best |
| Need consistent output format/style | Best | Good | Moderate |
| Limited labeled data | LoRA with few examples | Best | Best |
| Privacy: can’t send data to an API | Open model + FT | Open model + RAG | Open model only |
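The table can also be read as a rough decision procedure. The sketch below is one possible encoding of it, a rule of thumb rather than a definitive algorithm:

```python
def suggest_strategy(domain_shift: bool, volatile_knowledge: bool,
                     labeled_data: bool) -> str:
    """Rough adaptation-strategy picker mirroring the comparison table."""
    if volatile_knowledge:
        # Fine-tuned knowledge goes stale; retrieval stays current
        return "RAG: retraining on every knowledge change is costly"
    if domain_shift and labeled_data:
        return "fine-tune (LoRA if compute is limited)"
    if domain_shift:
        return "RAG, adding LoRA on a small labeled set if style matters"
    return "prompting first; escalate only if it falls short"

print(suggest_strategy(domain_shift=True, volatile_knowledge=False,
                       labeled_data=True))  # -> fine-tune
```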

Wrap-Up

Key Takeaways

What’s Next

In Part 03, we’ll shift from customizing models to using them. We’ll explore the LLM API landscape — OpenAI, Anthropic, and Google — learning how to make API calls, manage authentication, and understand token pricing. This is where foundation models become practical tools you can integrate into applications.