Building RAG Pipelines
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L10.01: RAG foundations — vector databases, ChromaDB basics, embedding models
Outcomes
Compare chunking strategies (fixed-size, recursive, sentence-aware) and explain how chunk design affects retrieval quality
Implement hybrid search combining BM25 keyword matching with dense retrieval
Apply cross-encoder reranking to improve retrieval precision using the “retrieve wide, rerank narrow” pattern
Build a complete Advanced RAG pipeline that chains chunking, hybrid retrieval, and reranking
References
Why 80% of RAG Failures Start Before the LLM¶
In Part 01, we built a working RAG system in about 30 lines of code. It retrieved relevant documents and generated grounded answers. So... are we done?
Not quite. Our demo used six hand-crafted, single-sentence documents. Real RAG systems ingest thousands of pages — PDFs, HTML, markdown files — and the way you prepare those documents for retrieval matters enormously.
Here’s a striking finding from production RAG research: 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM. The model is usually capable of answering the question — it just never gets the right context. Poor chunking leads to irrelevant retrieval, which leads to bad answers, and no amount of prompt engineering can fix that.
Today we’ll tackle the three techniques that transform a naive RAG pipeline into a production-quality one: better chunking, hybrid search, and reranking.
Setup¶
We’ll reuse the same model helper from Part 01, and add a few new imports.
import os
import re
from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
import chromadb
load_dotenv()
PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"
def get_model(model_name: str) -> OpenAIChatModel:
"""Create a model connection through our LiteLLM proxy."""
return OpenAIChatModel(
model_name,
provider=OpenAIProvider(
base_url=PROXY_URL,
api_key=os.environ["CAP6640_API_KEY"],
),
)
# Load models we'll use throughout
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
We’ll also need a real document to work with — something longer than six sentences. Here’s an excerpt covering several NLP topics, long enough to demonstrate how chunking choices matter:
document = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them. NLP combines computational linguistics with statistical, machine learning, and deep learning models to process human language.
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, or characters depending on the chosen strategy. Modern NLP systems typically use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece, which balance vocabulary size with the ability to handle rare and out-of-vocabulary words. The choice of tokenization strategy affects all downstream tasks.
Text can be represented as sparse vectors using methods like bag-of-words or TF-IDF, which count word occurrences. Alternatively, dense representations like word embeddings (Word2Vec, GloVe) capture semantic meaning in continuous vector spaces. The key advantage of dense representations is that semantically similar words have similar vectors, enabling operations like analogy solving and similarity search.
The attention mechanism, introduced in the landmark 2017 paper "Attention Is All You Need," allows each position in a sequence to attend to all other positions. This enables transformer models to capture long-range dependencies without the sequential processing bottleneck of recurrent neural networks. Multi-head attention runs multiple attention operations in parallel, letting the model focus on different aspects of the input simultaneously.
Retrieval-augmented generation (RAG) combines information retrieval with language model generation. In a RAG system, relevant documents are first retrieved from an external knowledge base using embedding similarity search. These retrieved passages are then provided as context to a large language model, which generates a response grounded in the specific documents rather than relying solely on its parametric knowledge. This approach reduces hallucination and enables citation of sources.
Large language models like GPT-4, Claude, and Gemini are trained on massive text corpora and can perform a wide variety of NLP tasks through prompting alone. However, they have important limitations: their knowledge is frozen at training time, they can hallucinate plausible-sounding but incorrect information, and they cannot access private or domain-specific data unless it is provided in the prompt or through techniques like RAG.
"""
print(f"Document length: {len(document.split())} words, {len(document)} characters")
Document length: 370 words, 2657 characters
Chunking Strategies¶
Chunking is how we split a large document into the smaller pieces that get embedded and stored in our vector database. It sounds straightforward, but the choice of chunking strategy has an outsized effect on retrieval quality. A chunk that’s too large buries the relevant sentence in noise. A chunk that’s too small loses the context needed to understand it.
Let’s compare three common strategies on the same document.
Figure 1: Three chunking strategies applied to the same document produce very different results. The right choice depends on your document structure and query patterns.
Strategy 1: Fixed-Size Chunking¶
The simplest approach — split every N words, regardless of sentence or paragraph boundaries.
def chunk_fixed(text, chunk_size=50, overlap=10):
"""Split text into fixed-size word chunks with overlap."""
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = " ".join(words[i : i + chunk_size])
if chunk.strip():
chunks.append(chunk.strip())
return chunks
fixed_chunks = chunk_fixed(document, chunk_size=50, overlap=10)
print(f"Fixed-size chunking: {len(fixed_chunks)} chunks\n")
for i, chunk in enumerate(fixed_chunks[:3]):
print(f"Chunk {i+1} ({len(chunk.split())} words): {chunk[:80]}...")
    print()
Fixed-size chunking: 10 chunks
Chunk 1 (50 words): Natural language processing (NLP) is a subfield of linguistics, computer science...
Chunk 2 (50 words): the language within them. NLP combines computational linguistics with statistica...
Chunk 3 (50 words): depending on the chosen strategy. Modern NLP systems typically use subword token...
Fixed-size chunks are simple and predictable, but notice the problem: they can split mid-sentence. A chunk might end with “methods like Byte Pair” and the next one starts with “Encoding (BPE) or WordPiece.” Neither chunk alone makes sense for that concept.
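We can reproduce that failure concretely with a toy version of the same idea. This sketch shrinks the chunk size to 11 words purely to force the bad split on the sentence quoted above:

```python
# Toy illustration: fixed-size splitting can cut a concept in half.
text = ("Modern NLP systems typically use subword tokenization methods "
        "like Byte Pair Encoding (BPE) or WordPiece.")
words = text.split()
chunk_size = 11  # deliberately small so the boundary lands mid-concept
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
for c in chunks:
    print(c)
```

The first chunk ends with "Byte Pair" and the second begins with "Encoding (BPE)" — neither chunk, embedded on its own, carries the full concept.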
Strategy 2: Sentence-Aware Chunking¶
Split on sentence boundaries, then group sentences until we hit a size threshold.
def chunk_sentences(text, max_words=60):
"""Split into sentence groups, respecting sentence boundaries."""
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
chunks = []
current_chunk = []
current_size = 0
for sent in sentences:
sent_words = len(sent.split())
if current_size + sent_words > max_words and current_chunk:
chunks.append(" ".join(current_chunk))
current_chunk = [sent]
current_size = sent_words
else:
current_chunk.append(sent)
current_size += sent_words
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
sentence_chunks = chunk_sentences(document, max_words=60)
print(f"Sentence-aware chunking: {len(sentence_chunks)} chunks\n")
for i, chunk in enumerate(sentence_chunks[:3]):
print(f"Chunk {i+1} ({len(chunk.split())} words): {chunk[:80]}...")
    print()
Sentence-aware chunking: 8 chunks
Chunk 1 (60 words): Natural language processing (NLP) is a subfield of linguistics, computer science...
Chunk 2 (53 words): Tokenization is the process of breaking text into smaller units called tokens. T...
Chunk 3 (41 words): The choice of tokenization strategy affects all downstream tasks. Text can be re...
Better — no sentence is split in half. But we might still merge unrelated sentences if they happen to be short enough to fit together.
Strategy 3: Paragraph-Based (Recursive) Chunking¶
Respect the document’s natural structure. Split on paragraph boundaries first, then on sentences if a paragraph is too long.
def chunk_paragraphs(text, max_words=80):
"""Split on paragraph boundaries, with sentence fallback for long paragraphs."""
paragraphs = [p.strip() for p in text.strip().split("\n\n") if p.strip()]
chunks = []
for para in paragraphs:
if len(para.split()) <= max_words:
chunks.append(para)
else:
# Fall back to sentence grouping for long paragraphs
chunks.extend(chunk_sentences(para, max_words=max_words))
return chunks
para_chunks = chunk_paragraphs(document, max_words=80)
print(f"Paragraph-based chunking: {len(para_chunks)} chunks\n")
for i, chunk in enumerate(para_chunks[:3]):
print(f"Chunk {i+1} ({len(chunk.split())} words): {chunk[:80]}...")
    print()
Paragraph-based chunking: 6 chunks
Chunk 1 (60 words): Natural language processing (NLP) is a subfield of linguistics, computer science...
Chunk 2 (62 words): Tokenization is the process of breaking text into smaller units called tokens. T...
Chunk 3 (54 words): Text can be represented as sparse vectors using methods like bag-of-words or TF-...
Paragraph-based chunks preserve topical coherence — each chunk is about one concept. This is the same idea behind LangChain’s RecursiveCharacterTextSplitter, which tries progressively finer separators (paragraph breaks, then line breaks, then words) until chunks fit under the size limit.
Which Strategy Wins?¶
Let’s test all three on the same query:
query = "How does RAG reduce hallucination?"
def retrieve_top(chunks, query, embed_model, n=2):
"""Quick dense retrieval over a list of chunks."""
q_emb = embed_model.encode([query])
c_embs = embed_model.encode(chunks)
from sentence_transformers.util import cos_sim
sims = cos_sim(q_emb, c_embs)[0]
top_indices = sims.argsort(descending=True)[:n]
return [(chunks[i], sims[i].item()) for i in top_indices]
print("=" * 70)
print(f"Query: '{query}'\n")
for name, chunks in [("Fixed-size", fixed_chunks), ("Sentence-aware", sentence_chunks), ("Paragraph-based", para_chunks)]:
results = retrieve_top(chunks, query, embed_model, n=1)
chunk_text, score = results[0]
print(f"--- {name} (top result, similarity: {score:.3f}) ---")
print(f"{chunk_text[:120]}...")
    print()
======================================================================
Query: 'How does RAG reduce hallucination?'
--- Fixed-size (top result, similarity: 0.512) ---
is provided in the prompt or through techniques like RAG....
--- Sentence-aware (top result, similarity: 0.227) ---
In a RAG system, relevant documents are first retrieved from an external knowledge base using embedding similarity searc...
--- Paragraph-based (top result, similarity: 0.203) ---
Retrieval-augmented generation (RAG) combines information retrieval with language model generation. In a RAG system, rel...
Look closely at the results. Fixed-size chunking may score highest on raw similarity — but examine the actual chunk content. Does it contain a complete, self-contained answer about how RAG reduces hallucination? Or is it a fragment that cuts off mid-thought?
The paragraph-based strategy typically returns a chunk that covers the full RAG concept with its relationship to hallucination intact, even if its similarity score is lower. This is a crucial insight: similarity score alone doesn’t tell you if the chunk will produce a good answer. A coherent, topically complete chunk at slightly lower similarity often beats a high-scoring fragment.
The takeaway: start with paragraph or recursive chunking as your default. Only move to more complex strategies (semantic chunking, document-aware splitting) when the document structure demands it.
Hybrid Search: Best of Both Worlds¶
In Part 01, we used dense vector search exclusively — embed the query, find the nearest document vectors. But recall the weakness: dense retrieval can miss exact keyword matches. If the user asks about “BPE tokenization” and the document says “Byte Pair Encoding (BPE),” a keyword-based system would nail it, while a dense retriever might rank a general passage about tokenization higher.
The solution is hybrid search: run both sparse (keyword) and dense (semantic) retrieval, then combine the results.
Figure 2: The Advanced RAG retrieval pipeline: run BM25 and dense search in parallel, merge with Reciprocal Rank Fusion, then rerank with a cross-encoder for maximum precision.
BM25: The Keyword Side¶
BM25 is the algorithm behind most traditional search engines. It scores documents by how well their words match the query, accounting for term frequency and document length. We introduced BM25 conceptually in Part 01 — now let’s use it.
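To make "term frequency and document length" concrete, here is a minimal from-scratch sketch of BM25 scoring (standard parameters k1 = 1.5, b = 0.75; the toy corpus and query are made up for illustration — rank_bm25 handles all of this for us below):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Minimal BM25: sum of IDF-weighted, length-normalized term frequencies."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rarer terms weigh more
        tf = doc.count(term)                              # term frequency in this doc
        norm = k1 * (1 - b + b * len(doc) / avgdl)        # length normalization
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score

corpus = [
    "tokenization splits text into subword tokens".split(),
    "attention lets positions attend to each other".split(),
]
q = "subword tokenization".split()
scores = [bm25_score(q, d, corpus) for d in corpus]
print(scores)  # the tokenization document scores higher; the other scores 0.0
```

A document that never mentions a query term contributes nothing for that term, which is exactly why BM25 is so sharp on exact-keyword queries — and so blind to paraphrases.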
# We'll use paragraph chunks for the rest of this lecture
chunks = para_chunks
print(f"Working with {len(chunks)} paragraph-based chunks\n")
# BM25 needs tokenized documents (just lowercase word splits for now)
tokenized_chunks = [chunk.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_chunks)
# Try a keyword-heavy query
query_keyword = "BPE tokenization subword"
bm25_scores = bm25.get_scores(query_keyword.lower().split())
print(f"Query: '{query_keyword}'\n")
print("BM25 ranking:")
for rank, idx in enumerate(bm25_scores.argsort()[::-1][:3]):
    print(f" [{rank+1}] (score: {bm25_scores[idx]:.2f}) {chunks[idx][:70]}...")
Working with 6 paragraph-based chunks
Query: 'BPE tokenization subword'
BM25 ranking:
[1] (score: 3.46) Tokenization is the process of breaking text into smaller units called...
[2] (score: 0.00) Large language models like GPT-4, Claude, and Gemini are trained on ma...
[3] (score: 0.00) Retrieval-augmented generation (RAG) combines information retrieval wi...
BM25 excels here because the query contains specific terms (“BPE”, “tokenization”, “subword”) that appear directly in the document.
Dense Search: The Semantic Side¶
Now let’s try the same query with dense retrieval:
# Set up ChromaDB with our paragraph chunks
client = chromadb.Client()
collection = client.create_collection("advanced_rag", metadata={"hnsw:space": "cosine"})
chunk_embeddings = embed_model.encode(chunks).tolist()
collection.add(
ids=[f"chunk_{i}" for i in range(len(chunks))],
documents=chunks,
embeddings=chunk_embeddings,
)
# Same query, dense retrieval
dense_results = collection.query(
query_embeddings=embed_model.encode([query_keyword]).tolist(),
n_results=3,
)
print(f"Query: '{query_keyword}'\n")
print("Dense ranking:")
for rank, (doc, dist) in enumerate(zip(dense_results["documents"][0], dense_results["distances"][0])):
    print(f" [{rank+1}] (similarity: {1-dist:.3f}) {doc[:70]}...")
Query: 'BPE tokenization subword'
Dense ranking:
[1] (similarity: 0.741) Tokenization is the process of breaking text into smaller units called...
[2] (similarity: 0.259) Natural language processing (NLP) is a subfield of linguistics, comput...
[3] (similarity: 0.258) Text can be represented as sparse vectors using methods like bag-of-wo...
Dense retrieval finds semantically related passages — but it might rank a general NLP overview higher than the specific BPE paragraph, because “tokenization subword” is semantically close to many NLP concepts.
Reciprocal Rank Fusion: Combining the Best of Both¶
How do we merge two ranked lists into one? The simplest and most effective method is Reciprocal Rank Fusion (RRF). For each document $d$, we sum a reciprocal-rank score across all ranking systems:

$$\text{RRF}(d) = \sum_{i} \frac{1}{k + \text{rank}_i(d)}$$

where $k$ is a constant (typically 60) that dampens the influence of high-ranking outliers.
def reciprocal_rank_fusion(ranked_lists, k=60):
"""Merge multiple ranked lists using RRF. Each list is [(doc_id, ...), ...]."""
rrf_scores = {}
for ranked_list in ranked_lists:
for rank, doc_id in enumerate(ranked_list):
if doc_id not in rrf_scores:
rrf_scores[doc_id] = 0
rrf_scores[doc_id] += 1 / (k + rank + 1) # rank is 0-indexed
return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
def hybrid_search(query, chunks, bm25, collection, embed_model, n=5):
"""Run BM25 + dense search and fuse with RRF."""
# BM25 ranking
bm25_scores = bm25.get_scores(query.lower().split())
bm25_ranked = bm25_scores.argsort()[::-1][:n].tolist()
# Dense ranking
dense_results = collection.query(
query_embeddings=embed_model.encode([query]).tolist(),
n_results=n,
)
    dense_ranked = [int(doc_id.split("_")[1]) for doc_id in dense_results["ids"][0]]
# Fuse
fused = reciprocal_rank_fusion([bm25_ranked, dense_ranked])
return fused
# Test hybrid search
query_test = "How does RAG use retrieved documents to reduce hallucination?"
hybrid_results = hybrid_search(query_test, chunks, bm25, collection, embed_model)
print(f"Query: '{query_test}'\n")
print("Hybrid (RRF) ranking:")
for rank, (idx, score) in enumerate(hybrid_results[:3]):
    print(f" [{rank+1}] (RRF score: {score:.4f}) {chunks[idx][:70]}...")
Query: 'How does RAG use retrieved documents to reduce hallucination?'
Hybrid (RRF) ranking:
[1] (RRF score: 0.0328) Retrieval-augmented generation (RAG) combines information retrieval wi...
[2] (RRF score: 0.0323) Tokenization is the process of breaking text into smaller units called...
[3] (RRF score: 0.0317) The attention mechanism, introduced in the landmark 2017 paper "Attent...
Hybrid search captures both exact keyword matches and semantic meaning. A query with specific terms benefits from BM25, while a paraphrased question benefits from dense retrieval. The fusion ensures you get the best of both.
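A tiny worked example shows why RRF rewards documents that at least one retriever loves. The document IDs here are hypothetical, and `rrf` simply mirrors the `reciprocal_rank_fusion` function above:

```python
# Worked RRF example with hypothetical doc IDs and the usual k=60.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked):  # rank is 0-indexed
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

bm25_order = ["A", "B", "C"]   # hypothetical BM25 ranking
dense_order = ["C", "A", "B"]  # hypothetical dense ranking
for doc, score in rrf([bm25_order, dense_order]):
    print(doc, round(score, 5))
```

Document A (1st and 2nd) wins, but note that C (3rd and 1st) edges out B (2nd and 3rd): topping even one list counts for more than being consistently middling, while the large $k$ keeps that advantage small.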
Reranking: Retrieve Wide, Rerank Narrow¶
Hybrid search gets us better recall — we’re less likely to miss relevant documents. But we still have a precision problem. Our top-5 results might include some noise, and the ordering within those top-5 isn’t optimal. This is where reranking comes in.
The Bi-Encoder vs. Cross-Encoder Trade-Off¶
In Part 01 and the hybrid search above, we used bi-encoders — models that encode the query and document separately, then compare their vectors. This is fast (encode once, compare many), but the model never sees the query and document together, so it can miss subtle interactions between them.
A cross-encoder takes a different approach: it feeds the query and document together through a transformer, producing a single relevance score. This is much more accurate — the model can attend to fine-grained interactions between query terms and document terms — but it’s slow, because you need a separate forward pass for every (query, document) pair.
The solution? Use both:
1. Retrieve wide — Use fast hybrid search to fetch the top 20–30 candidates (high recall, okay precision)
2. Rerank narrow — Score all candidates with a cross-encoder, keep the top 3–5 (high precision)
This “retrieve wide, rerank narrow” pattern is the single biggest precision gain you can add to a RAG pipeline.
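The cost asymmetry behind this pattern is easy to quantify with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not benchmarks — the point is the shape of the comparison, counting transformer forward passes:

```python
# Illustrative forward-pass counts (assumed corpus and query volumes).
n_docs = 10_000
n_queries = 100

# Bi-encoder: encode the corpus once, plus one encode per query;
# the vector comparisons afterwards are cheap (no transformer pass).
bi_encoder_passes = n_docs + n_queries

# Cross-encoder over the full corpus: one pass per (query, document) pair.
cross_encoder_passes = n_queries * n_docs

# "Retrieve wide, rerank narrow": cross-encode only ~30 candidates per query.
rerank_passes = n_queries * 30

print(bi_encoder_passes)     # 10100
print(cross_encoder_passes)  # 1000000
print(rerank_passes)         # 3000
```

Reranking only a short candidate list buys most of the cross-encoder's precision at a few hundredths of its full-corpus cost.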
Cross-Encoder Reranking in Practice¶
# Load a cross-encoder reranking model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query_rerank = "What are the key limitations of using LLMs without retrieval?"
# Step 1: Retrieve wide with hybrid search (all chunks in our small example)
hybrid_results = hybrid_search(query_rerank, chunks, bm25, collection, embed_model, n=len(chunks))
candidate_indices = [idx for idx, _ in hybrid_results]
# Step 2: Rerank with cross-encoder
pairs = [[query_rerank, chunks[idx]] for idx in candidate_indices]
rerank_scores = reranker.predict(pairs)
# Sort by cross-encoder score
reranked = sorted(
zip(candidate_indices, rerank_scores),
key=lambda x: x[1],
reverse=True,
)
print(f"Query: '{query_rerank}'\n")
print("Before reranking (hybrid order):")
for rank, (idx, rrf_score) in enumerate(hybrid_results[:3]):
print(f" [{rank+1}] {chunks[idx][:80]}...")
print("\nAfter cross-encoder reranking:")
for rank, (idx, score) in enumerate(reranked[:3]):
    print(f" [{rank+1}] (score: {score:.3f}) {chunks[idx][:80]}...")
Query: 'What are the key limitations of using LLMs without retrieval?'
Before reranking (hybrid order):
[1] Text can be represented as sparse vectors using methods like bag-of-words or TF-...
[2] Retrieval-augmented generation (RAG) combines information retrieval with languag...
[3] The attention mechanism, introduced in the landmark 2017 paper "Attention Is All...
After cross-encoder reranking:
[1] (score: -6.523) Large language models like GPT-4, Claude, and Gemini are trained on massive text...
[2] (score: -7.372) Retrieval-augmented generation (RAG) combines information retrieval with languag...
[3] (score: -10.907) The attention mechanism, introduced in the landmark 2017 paper "Attention Is All...
Watch what happens: hybrid search may rank a tangentially related passage at the top (e.g., one that mentions “models” but isn’t about LLM limitations). The cross-encoder, which sees query and document together, promotes the passage that directly discusses LLM limitations — frozen knowledge, hallucination, and the need for RAG. This reordering is the whole point of the “retrieve wide, rerank narrow” pattern.
Putting It All Together¶
Let’s chain the complete Advanced RAG pipeline: paragraph chunking → hybrid retrieval → cross-encoder reranking → LLM generation. Then we’ll compare the answer quality to a naive approach.
def advanced_rag(query, chunks, bm25, collection, embed_model, reranker, top_k=3):
"""Full Advanced RAG pipeline: hybrid search + reranking + generation."""
# 1. Hybrid retrieve (wide)
hybrid_results = hybrid_search(query, chunks, bm25, collection, embed_model, n=min(len(chunks), 20))
candidate_indices = [idx for idx, _ in hybrid_results]
# 2. Rerank (narrow)
pairs = [[query, chunks[idx]] for idx in candidate_indices]
scores = reranker.predict(pairs)
reranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
top_chunks = [(chunks[idx], score) for idx, score in reranked[:top_k]]
# 3. Build context
context = "\n".join(f"[{i+1}] {chunk}" for i, (chunk, _) in enumerate(top_chunks))
return context, top_chunks
query_final = "Why is RAG better than just using a language model alone?"
context, top_chunks = advanced_rag(
query_final, chunks, bm25, collection, embed_model, reranker
)
print("Retrieved context (after hybrid search + reranking):\n")
for i, (chunk, score) in enumerate(top_chunks):
print(f"[{i+1}] (rerank score: {score:.3f})")
print(f" {chunk[:100]}...")
    print()
Retrieved context (after hybrid search + reranking):
[1] (rerank score: 3.981)
Retrieval-augmented generation (RAG) combines information retrieval with language model generation. ...
[2] (rerank score: 2.349)
Large language models like GPT-4, Claude, and Gemini are trained on massive text corpora and can per...
[3] (rerank score: -10.200)
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial int...
# Generate a grounded answer
rag_agent = Agent(
get_model("claude-sonnet-4-6"),
instructions=(
"Answer the user's question based ONLY on the provided context. "
"Cite your sources using [1], [2], etc. "
"If the context doesn't contain enough information, say so."
),
)
prompt = f"""Context:
{context}
Question: {query_final}"""
result = await rag_agent.run(prompt)
print(f"Query: {query_final}\n")
print(f"Answer:\n{result.output}")
Query: Why is RAG better than just using a language model alone?
Answer:
Based on the provided context, RAG offers several advantages over using a language model alone:
1. **Reduces hallucination**: Standard language models "can hallucinate plausible-sounding but incorrect information" [2], whereas RAG grounds the model's response "in the specific documents rather than relying solely on its parametric knowledge," which "reduces hallucination" [1].
2. **Overcomes frozen knowledge**: Language models have knowledge that is "frozen at training time" [2], but RAG addresses this by retrieving information from an **external knowledge base** at query time, allowing access to more current information [1].
3. **Enables access to private/domain-specific data**: LLMs "cannot access private or domain-specific data unless it is provided in the prompt or through techniques like RAG" [2], making RAG a practical solution for specialized or proprietary information.
4. **Provides source citations**: RAG enables **citation of sources** [1], making responses more transparent and verifiable compared to a standalone language model.
In essence, RAG combines the language understanding and generation strengths of LLMs while compensating for their key limitations around knowledge currency, accuracy, and data access [1][2].
Compare this with the naive approach from Part 01 — using a single retrieval method, no reranking, potentially with poorly chunked documents. The Advanced RAG pipeline produces more focused context, which leads to more accurate, better-cited answers.
Wrap-Up¶
Key Takeaways¶
What’s Next¶
In Part 03, you’ll build and evaluate your own RAG system from scratch in the lab. You’ll choose a document corpus, experiment with the chunking and retrieval strategies we covered today, and measure how each improvement affects answer quality. We’ll also explore query transformation techniques like HyDE (Hypothetical Document Embeddings) that can further boost retrieval performance.