Lab — RAG Builder
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L10.01: RAG foundations — vector databases, ChromaDB, embedding models
L10.02: Building RAG pipelines — chunking, hybrid search, reranking with cross-encoders
Outcomes
Build a complete RAG system over a real multi-document corpus (Wikipedia articles)
Apply and compare the chunking, retrieval, and reranking techniques from Parts 01–02
Implement HyDE (Hypothetical Document Embeddings) as a query transformation technique
Evaluate RAG pipeline quality using manual scoring on a test query set
References
Setup¶
We’ll reuse the same tools from Parts 01 and 02. This cell loads everything we need.
import os
import re
from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim
from rank_bm25 import BM25Okapi
import chromadb
import wikipedia
load_dotenv()
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"
def get_model(model_name: str) -> OpenAIChatModel:
"""Create a model connection through our LiteLLM proxy."""
return OpenAIChatModel(
model_name,
provider=OpenAIProvider(
base_url=PROXY_URL,
api_key=os.environ["CAP6640_API_KEY"],
),
)
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
Part A: Build Your Corpus¶
In Parts 01 and 02, we worked with toy documents — a handful of hand-written sentences, then a single multi-paragraph text. Today we’ll build a RAG system over a real corpus: 10 Wikipedia articles covering topics we’ve studied this semester.
Fetching the Articles¶
topics = [
"Natural language processing",
"Tokenization (lexical analysis)",
"Word embedding",
"Tf–idf",
"Named-entity recognition",
"Recurrent neural network",
"Transformer (deep learning architecture)",
"BERT (language model)",
"GPT-4",
"Retrieval-augmented generation",
]
articles = {}
for topic in topics:
try:
page = wikipedia.page(topic, auto_suggest=False)
articles[topic] = page.content
print(f" {topic}: {len(page.content.split())} words")
except Exception as e:
print(f" {topic}: FAILED ({e})")
print(f"\nLoaded {len(articles)} articles")
Natural language processing: 4538 words
Tokenization (lexical analysis): 3154 words
Word embedding: 1476 words
Tf–idf: 2699 words
Named-entity recognition: 2080 words
Recurrent neural network: 5859 words
Transformer (deep learning architecture): 9894 words
BERT (language model): 2481 words
GPT-4: 2268 words
Retrieval-augmented generation: 1616 words
Loaded 10 articles
Chunking the Corpus¶
We’ll use paragraph-based chunking — the best default from Part 02. Each chunk gets metadata tracking which article it came from, so we can cite sources later.
def chunk_paragraphs(text, max_words=100):
"""Split on paragraph boundaries, grouping short paragraphs."""
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip() and len(p.split()) > 10]
chunks = []
current = []
current_size = 0
for para in paragraphs:
words = len(para.split())
if current_size + words > max_words and current:
chunks.append(" ".join(current))
current = [para]
current_size = words
else:
current.append(para)
current_size += words
if current:
chunks.append(" ".join(current))
return chunks
# Chunk all articles, tracking source
all_chunks = []
chunk_sources = []
for topic, content in articles.items():
topic_chunks = chunk_paragraphs(content, max_words=100)
for chunk in topic_chunks:
all_chunks.append(chunk)
chunk_sources.append(topic)
print(f"Total chunks: {len(all_chunks)}")
print(f"Average chunk size: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f} words")
print(f"\nChunks per article:")
for topic in articles:
count = chunk_sources.count(topic)
print(f" {topic}: {count} chunks")
Total chunks: 246
Average chunk size: 144 words
Chunks per article:
Natural language processing: 26 chunks
Tokenization (lexical analysis): 21 chunks
Word embedding: 9 chunks
Tf–idf: 31 chunks
Named-entity recognition: 14 chunks
Recurrent neural network: 43 chunks
Transformer (deep learning architecture): 53 chunks
BERT (language model): 20 chunks
GPT-4: 15 chunks
Retrieval-augmented generation: 14 chunks
Indexing: Vector Store + BM25¶
Now we set up both retrieval backends — dense (ChromaDB) and sparse (BM25) — so we can run hybrid search.
# Dense index (ChromaDB)
client = chromadb.Client()
# Delete collection if it exists from a previous run
try:
client.delete_collection("wiki_rag")
except Exception:
pass
collection = client.create_collection("wiki_rag", metadata={"hnsw:space": "cosine"})
# Embed and store all chunks
print("Embedding chunks... ", end="")
chunk_embeddings = embed_model.encode(all_chunks, show_progress_bar=True).tolist()
collection.add(
ids=[f"chunk_{i}" for i in range(len(all_chunks))],
documents=all_chunks,
embeddings=chunk_embeddings,
metadatas=[{"source": src} for src in chunk_sources],
)
print(f"Indexed {collection.count()} chunks in ChromaDB")
# Sparse index (BM25)
tokenized_chunks = [chunk.lower().split() for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_chunks)
print(f"Built BM25 index over {len(tokenized_chunks)} chunks")
Embedding chunks... Indexed 246 chunks in ChromaDB
Built BM25 index over 246 chunks
Part B: The Full Advanced RAG Pipeline¶
Let’s wire up the complete pipeline from Part 02: hybrid search with RRF fusion, cross-encoder reranking, and LLM generation with citations.
Pipeline Functions¶
def reciprocal_rank_fusion(ranked_lists, k=60):
"""Merge multiple ranked lists using RRF."""
rrf_scores = {}
for ranked_list in ranked_lists:
for rank, doc_id in enumerate(ranked_list):
if doc_id not in rrf_scores:
rrf_scores[doc_id] = 0
rrf_scores[doc_id] += 1 / (k + rank + 1)
return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
def hybrid_search(query, n=20):
"""Run BM25 + dense search and fuse with RRF."""
# BM25
bm25_scores = bm25.get_scores(query.lower().split())
bm25_ranked = bm25_scores.argsort()[::-1][:n].tolist()
# Dense
dense_results = collection.query(
query_embeddings=embed_model.encode([query]).tolist(),
n_results=n,
)
dense_ranked = [int(id.split("_")[1]) for id in dense_results["ids"][0]]
# Fuse
return reciprocal_rank_fusion([bm25_ranked, dense_ranked])
def rerank(query, candidate_indices, top_k=5):
"""Rerank candidates with cross-encoder."""
pairs = [[query, all_chunks[idx]] for idx in candidate_indices]
scores = reranker.predict(pairs)
reranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
return reranked[:top_k]
def rag_pipeline(query, top_k=5):
"""Full pipeline: hybrid search → rerank → format context."""
# Retrieve wide
hybrid_results = hybrid_search(query, n=30)
candidate_indices = [idx for idx, _ in hybrid_results]
# Rerank narrow
top_chunks = rerank(query, candidate_indices, top_k=top_k)
# Format context with source attribution
context_parts = []
sources = []
for i, (idx, score) in enumerate(top_chunks):
source = chunk_sources[idx]
context_parts.append(f"[{i+1}] (Source: {source})\n{all_chunks[idx]}")
sources.append(source)
context = "\n\n".join(context_parts)
return context, top_chunks, sources
Test the Pipeline¶
query = "How do transformer models handle long-range dependencies in text?"
context, top_chunks, sources = rag_pipeline(query)
print(f"Query: '{query}'\n")
print(f"Retrieved {len(top_chunks)} chunks from: {', '.join(dict.fromkeys(sources))}\n")
for i, (idx, score) in enumerate(top_chunks):
print(f"[{i+1}] (rerank: {score:.2f}, source: {chunk_sources[idx]})")
print(f" {all_chunks[idx][:100]}...")
print()
Query: 'How do transformer models handle long-range dependencies in text?'
Retrieved 5 chunks from: Transformer (deep learning architecture)
[1] (rerank: 0.71, source: Transformer (deep learning architecture))
One set of
(
W
...
[2] (rerank: 0.37, source: Transformer (deep learning architecture))
In deep learning, the transformer is an artificial neural network architecture based on the multi-he...
[3] (rerank: -0.72, source: Transformer (deep learning architecture))
=== Terminology ===
The transformer architecture, being modular, allows variations. Several common v...
[4] (rerank: -0.99, source: Transformer (deep learning architecture))
=== Sub-quadratic transformers ===
Training transformer-based architectures can be expensive, especi...
[5] (rerank: -1.11, source: Transformer (deep learning architecture))
The attention mechanism used in the transformer architecture are scaled dot-product attention units....
# Generate a grounded answer
rag_agent = Agent(
get_model("claude-sonnet-4-6"),
instructions=(
"Answer the user's question based ONLY on the provided context. "
"Cite your sources using [1], [2], etc. "
"If the context doesn't contain enough information, say so."
),
)
result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
print(result.output)
Based on the provided context, transformer models handle long-range dependencies in text through several mechanisms:
1. **Multi-head attention across layers**: The scope of attention can expand as tokens pass through successive layers, allowing the model to "capture more complex and long-range dependencies in deeper layers" [1].
2. **Contextualization within the context window**: At each layer, each token is "contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished" [2].
3. **Multiple attention heads**: Each layer contains multiple attention heads, allowing the model to capture different definitions of "relevance" simultaneously. This means the model can track various types of relationships between tokens across the text at the same time [1].
4. **Scaled dot-product attention**: The attention mechanism computes relationships between any two tokens directly through query-key dot products, regardless of their distance in the sequence. The attention weight between token *i* and token *j* is computed as the dot product between their respective query and key vectors [5].
It is worth noting that for very long inputs, standard transformers can be computationally expensive, which has motivated the development of more efficient variants like the Swin transformer for images and SepTr for audio [4].
Let’s try a few more queries to see the pipeline in action across different topics:
test_queries = [
"What is the difference between BPE and WordPiece tokenization?",
"How does BERT differ from GPT in architecture?",
"What are the limitations of TF-IDF for text retrieval?",
]
for q in test_queries:
ctx, chunks, srcs = rag_pipeline(q, top_k=3)
print(f"Q: {q}")
print(f" Sources: {', '.join(dict.fromkeys(srcs))}")
print(f" Top chunk: {all_chunks[chunks[0][0]][:80]}...")
print()
Q: What is the difference between BPE and WordPiece tokenization?
Sources: BERT (language model), Tokenization (lexical analysis)
Top chunk: === Embedding ===
This section describes the embedding used by BERTBASE. The oth...
Q: How does BERT differ from GPT in architecture?
Sources: BERT (language model)
Top chunk: == Interpretation ==
Language models like ELMo, GPT-2, and BERT, spawned the stu...
Q: What are the limitations of TF-IDF for text retrieval?
Sources: Tf–idf
Top chunk: In information retrieval, tf–idf (term frequency–inverse document frequency, TF*...
Part C: Query Transformation with HyDE¶
So far, our queries have been well-formed questions that match the document content reasonably well. But what about vague or short queries? Consider: “attention” — is the user asking about the attention mechanism in transformers? Attention in cognitive science? The word’s dictionary definition?
Short queries produce poor embeddings because there’s not enough context to capture the user’s intent. HyDE (Hypothetical Document Embeddings) is a clever solution: instead of embedding the raw query, we first ask the LLM to generate a hypothetical answer, then embed that and use it for retrieval.
The intuition: a hypothetical answer looks much more like a real document passage than a short query does, so it will land closer to the right chunks in embedding space.
HyDE in Action¶
async def hyde_search(query, n=20):
"""Generate a hypothetical answer, embed it, and search."""
# Step 1: Generate hypothetical answer
hyde_agent = Agent(
get_model("claude-sonnet-4-6"),
instructions=(
"Write a short, factual paragraph (3-4 sentences) that would answer "
"the following question. Write as if you are a textbook. "
"Do not say 'I don't know' — just write your best answer."
),
)
hypothetical = await hyde_agent.run(query)
hypo_text = hypothetical.output
# Step 2: Embed the hypothetical answer (not the original query)
hypo_embedding = embed_model.encode([hypo_text]).tolist()
# Step 3: Search with the hypothetical embedding
results = collection.query(
query_embeddings=hypo_embedding,
n_results=n,
)
return results, hypo_text
# Compare standard vs. HyDE retrieval on a short query
short_query = "attention"
# Standard dense search
standard_results = collection.query(
query_embeddings=embed_model.encode([short_query]).tolist(),
n_results=3,
)
print(f"Query: '{short_query}'\n")
print("--- Standard Dense Search ---")
for i, (doc, dist) in enumerate(zip(standard_results["documents"][0], standard_results["distances"][0])):
print(f" [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")
# HyDE search
hyde_results, hypo_text = await hyde_search(short_query)
print(f"\n--- HyDE Search ---")
print(f"Hypothetical answer: {hypo_text[:120]}...\n")
for i, (doc, dist) in enumerate(zip(hyde_results["documents"][0][:3], hyde_results["distances"][0][:3])):
print(f" [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")
Query: 'attention'
--- Standard Dense Search ---
[1] (sim: 0.458) One set of
(
...
[2] (sim: 0.453) Seq2seq models with attention (including self-attention) still suffered from the...
[3] (sim: 0.440) Multihead Latent Attention (MLA) is a low-rank approximation to standard MHA. Sp...
--- HyDE Search ---
Hypothetical answer: **Attention** is a cognitive process that allows individuals to selectively focus on specific stimuli or information in ...
[1] (sim: 0.462) One set of
(
...
[2] (sim: 0.413) Each encoder layer contains 2 sublayers: the self-attention and the feedforward ...
[3] (sim: 0.412) === Multiple timescales model ===
A multiple timescales recurrent neural network...
In principle, HyDE retrieves more relevant results for the ambiguous query “attention”: the hypothetical answer supplies the context the raw query lacks, producing an embedding that lands closer to real passages. Notice the failure mode in the run above, though. Without any domain hint, the model interpreted “attention” as a cognitive process, so the hypothetical answer only partially matched our corpus. Constraining the HyDE prompt to the corpus domain (e.g., “answer in the context of NLP”) is the usual fix.
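To move beyond eyeballing the two result lists, we can quantify how much HyDE actually changed the retrieval. The helper below is a minimal sketch (`retrieval_overlap` is our own name, not a library function): it computes the Jaccard overlap between the top-k ids of two ranked lists. Values near 1 mean HyDE barely moved the results; values near 0 mean it retrieved a substantially different set.

```python
def retrieval_overlap(ids_a, ids_b, k=5):
    """Jaccard overlap between the top-k ids of two ranked result lists."""
    top_a, top_b = set(ids_a[:k]), set(ids_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# With the notebook's variables, compare the two ChromaDB result lists, e.g.:
# retrieval_overlap(standard_results["ids"][0], hyde_results["ids"][0], k=3)
```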
Part D: Evaluate Your Pipeline¶
Before Week 11’s deep dive into evaluation frameworks, let’s build intuition for what “good RAG” looks like with a simple manual evaluation.
Create a Test Suite¶
The idea: write queries where you know what a good answer should contain, then score your pipeline’s actual answers.
# Test suite: queries with expected content
test_suite = [
{
"query": "What is tokenization and why is it important for NLP?",
"expected_keywords": ["token", "subword", "BPE"],
},
{
"query": "How do word embeddings capture semantic meaning?",
"expected_keywords": ["vector", "Word2Vec", "semantic"],
},
{
"query": "What problem does the attention mechanism solve?",
"expected_keywords": ["attention", "long-range", "parallel"],
},
{
"query": "How does RAG reduce hallucination in language models?",
"expected_keywords": ["retriev", "hallucin", "knowledge"],
},
{
"query": "What are the main differences between BERT and GPT?",
"expected_keywords": ["BERT", "GPT", "encoder"],
},
]
print(f"Test suite: {len(test_suite)} queries")
for i, test in enumerate(test_suite):
print(f" {i+1}. {test['query']}")
Test suite: 5 queries
1. What is tokenization and why is it important for NLP?
2. How do word embeddings capture semantic meaning?
3. What problem does the attention mechanism solve?
4. How does RAG reduce hallucination in language models?
5. What are the main differences between BERT and GPT?
Run and Score¶
We’ll score each answer on two dimensions:
Relevance (1–3): Does the retrieved context contain the information needed to answer the question?
Faithfulness (1–3): Is the generated answer supported by the retrieved context (not hallucinated)?
The cell below automates a rough proxy for relevance by checking whether the expected keywords appear in the retrieved context; faithfulness still requires reading each answer against its context by hand.
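After reading each answer, you can record the two rubric scores and summarize them. A minimal sketch, assuming one hand-filled dict of scores per query (the dict shape is our own convention, not part of any framework):

```python
def summarize_scores(scores):
    """Average per-dimension rubric scores (each score is an int from 1 to 3)."""
    n = len(scores)
    return {
        "relevance": sum(s["relevance"] for s in scores) / n,
        "faithfulness": sum(s["faithfulness"] for s in scores) / n,
    }

# Filled in by hand after reading each answer against its context:
manual_scores = [
    {"relevance": 3, "faithfulness": 3},
    {"relevance": 2, "faithfulness": 3},
]
```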
print("Running pipeline on test suite...\n")
results = []
for test in test_suite:
query = test["query"]
context, top_chunks, sources = rag_pipeline(query, top_k=3)
result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
answer = result.output
# Check how many expected keywords appear in the context
context_lower = context.lower()
keywords_found = sum(
1 for kw in test["expected_keywords"]
if kw.lower() in context_lower
)
results.append({
"query": query,
"answer": answer,
"sources": sources,
"keywords_found": keywords_found,
"keywords_total": len(test["expected_keywords"]),
})
print(f"Q: {query}")
print(f" Sources: {', '.join(dict.fromkeys(sources))}")
print(f" Keywords in context: {keywords_found}/{len(test['expected_keywords'])}")
print(f" Answer: {answer[:120]}...")
print()
Running pipeline on test suite...
Q: What is tokenization and why is it important for NLP?
Sources: Natural language processing
Keywords in context: 1/3
Answer: ## Tokenization in NLP
Based on the provided context, **tokenization** is mentioned as a preprocessing step in NLP pipe...
Q: How do word embeddings capture semantic meaning?
Sources: Word embedding, Natural language processing
Keywords in context: 3/3
Answer: Based on the provided context, word embeddings capture semantic meaning in the following ways:
**Core Representation**
...
Q: What problem does the attention mechanism solve?
Sources: Transformer (deep learning architecture)
Keywords in context: 2/3
Answer: Based on the provided context, the attention mechanism primarily helps solve the **parallelization problem** that plague...
Q: How does RAG reduce hallucination in language models?
Sources: Retrieval-augmented generation
Keywords in context: 3/3
Answer: Based on the provided context, RAG helps **reduce** hallucinations, though it does not eliminate them entirely.
RAG red...
Q: What are the main differences between BERT and GPT?
Sources: Transformer (deep learning architecture), BERT (language model)
Keywords in context: 3/3
Answer: Based on the provided context, I can identify some differences between BERT and GPT, though the information is limited:
...
# Summary
total_found = sum(r["keywords_found"] for r in results)
total_expected = sum(r["keywords_total"] for r in results)
print(f"Context coverage: {total_found}/{total_expected} expected keywords found ({total_found/total_expected*100:.0f}%)")
print()
print("Per-query breakdown:")
for r in results:
score = r["keywords_found"] / r["keywords_total"]
status = "GOOD" if score >= 0.67 else "WEAK" if score >= 0.33 else "POOR"
print(f" [{status}] {r['query'][:50]}... ({r['keywords_found']}/{r['keywords_total']})")
Context coverage: 12/15 expected keywords found (80%)
Per-query breakdown:
[WEAK] What is tokenization and why is it important for N... (1/3)
[GOOD] How do word embeddings capture semantic meaning?... (3/3)
[WEAK] What problem does the attention mechanism solve?... (2/3)
[GOOD] How does RAG reduce hallucination in language mode... (3/3)
[GOOD] What are the main differences between BERT and GPT... (3/3)
Part E: Build Your Own RAG System¶
Wrap-Up¶
Key Takeaways¶
What’s Next¶
In Week 11, we’ll formalize everything we did by hand today. You’ll learn the RAGAS framework for automated RAG evaluation — metrics like faithfulness, answer relevance, context precision, and context recall that let you score your pipeline without manually reading every answer. We’ll also explore LLM-as-judge techniques and build evaluation into your development workflow.
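As a small preview of the LLM-as-judge idea, here is a hedged sketch of a faithfulness judge. The prompt wording and the `parse_judge` helper are our own, not RAGAS; in this notebook you would format `JUDGE_PROMPT` with a real context/question/answer and send it through an `Agent` built with `get_model`, as in Part B.

```python
import re

JUDGE_PROMPT = (
    "You are grading a RAG answer.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer: {answer}\n\n"
    "Rate faithfulness from 1 to 3: is every claim in the answer supported by the context?\n"
    "Reply with exactly one line: faithfulness: <score>"
)

def parse_judge(reply: str):
    """Extract the 1-3 faithfulness score from a judge reply, or None if missing."""
    match = re.search(r"faithfulness:\s*([123])", reply.lower())
    return int(match.group(1)) if match else None
```

Parsing the judge's reply defensively matters: LLM judges do not always follow the output format, and returning `None` is safer than crashing mid-evaluation.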