
Lab — RAG Builder

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Setup

We’ll reuse the same tools from Parts 01 and 02. This cell loads everything we need.

import os
import re
from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim
from rank_bm25 import BM25Okapi
import chromadb
import wikipedia

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"


def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )


embed_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Part A: Build Your Corpus

In Parts 01 and 02, we worked with toy documents — a handful of hand-written sentences, then a single multi-paragraph text. Today we’ll build a RAG system over a real corpus: 10 Wikipedia articles covering topics we’ve studied this semester.

Fetching the Articles

topics = [
    "Natural language processing",
    "Tokenization (lexical analysis)",
    "Word embedding",
    "Tf–idf",
    "Named-entity recognition",
    "Recurrent neural network",
    "Transformer (deep learning architecture)",
    "BERT (language model)",
    "GPT-4",
    "Retrieval-augmented generation",
]

articles = {}
for topic in topics:
    try:
        page = wikipedia.page(topic, auto_suggest=False)
        articles[topic] = page.content
        print(f"  {topic}: {len(page.content.split())} words")
    except Exception as e:
        print(f"  {topic}: FAILED ({e})")

print(f"\nLoaded {len(articles)} articles")
  Natural language processing: 4538 words
  Tokenization (lexical analysis): 3154 words
  Word embedding: 1476 words
  Tf–idf: 2699 words
  Named-entity recognition: 2080 words
  Recurrent neural network: 5859 words
  Transformer (deep learning architecture): 9894 words
  BERT (language model): 2481 words
  GPT-4: 2268 words
  Retrieval-augmented generation: 1616 words

Loaded 10 articles

Chunking the Corpus

We’ll use paragraph-based chunking — the best default from Part 02. Each chunk gets metadata tracking which article it came from, so we can cite sources later.

def chunk_paragraphs(text, max_words=100):
    """Split on paragraph boundaries, grouping short paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip() and len(p.split()) > 10]
    chunks = []
    current = []
    current_size = 0
    for para in paragraphs:
        words = len(para.split())
        if current_size + words > max_words and current:
            chunks.append(" ".join(current))
            current = [para]
            current_size = words
        else:
            current.append(para)
            current_size += words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Chunk all articles, tracking source
all_chunks = []
chunk_sources = []

for topic, content in articles.items():
    topic_chunks = chunk_paragraphs(content, max_words=100)
    for chunk in topic_chunks:
        all_chunks.append(chunk)
        chunk_sources.append(topic)

print(f"Total chunks: {len(all_chunks)}")
print(f"Average chunk size: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f} words")
print(f"\nChunks per article:")
for topic in articles:
    count = chunk_sources.count(topic)
    print(f"  {topic}: {count} chunks")
Total chunks: 246
Average chunk size: 144 words

Chunks per article:
  Natural language processing: 26 chunks
  Tokenization (lexical analysis): 21 chunks
  Word embedding: 9 chunks
  Tf–idf: 31 chunks
  Named-entity recognition: 14 chunks
  Recurrent neural network: 43 chunks
  Transformer (deep learning architecture): 53 chunks
  BERT (language model): 20 chunks
  GPT-4: 15 chunks
  Retrieval-augmented generation: 14 chunks
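Notice that the average chunk size (144 words) exceeds `max_words=100`: a single paragraph longer than the cap is kept whole rather than split. A quick sanity check on a toy document (re-stating the chunker so the snippet runs standalone) makes this visible:

```python
def chunk_paragraphs(text, max_words=100):
    """Split on paragraph boundaries, grouping short paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip() and len(p.split()) > 10]
    chunks = []
    current, current_size = [], 0
    for para in paragraphs:
        words = len(para.split())
        if current_size + words > max_words and current:
            chunks.append(" ".join(current))
            current, current_size = [para], words
        else:
            current.append(para)
            current_size += words
    if current:
        chunks.append(" ".join(current))
    return chunks

# One 150-word paragraph, then two 40-word paragraphs
toy = "\n\n".join(["alpha " * 150, "beta " * 40, "gamma " * 40])
print([len(c.split()) for c in chunk_paragraphs(toy)])  # [150, 80]
```

The oversize paragraph survives intact as its own 150-word chunk, while the two short paragraphs are grouped into one 80-word chunk. If hard caps matter for your context budget, you'd add a word-window split for oversize paragraphs.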

Indexing: Vector Store + BM25

Now we set up both retrieval backends — dense (ChromaDB) and sparse (BM25) — so we can run hybrid search.

# Dense index (ChromaDB)
client = chromadb.Client()

# Delete collection if it exists from a previous run
try:
    client.delete_collection("wiki_rag")
except Exception:
    pass

collection = client.create_collection("wiki_rag", metadata={"hnsw:space": "cosine"})

# Embed and store all chunks
print("Embedding chunks... ", end="")
chunk_embeddings = embed_model.encode(all_chunks, show_progress_bar=True).tolist()

collection.add(
    ids=[f"chunk_{i}" for i in range(len(all_chunks))],
    documents=all_chunks,
    embeddings=chunk_embeddings,
    metadatas=[{"source": src} for src in chunk_sources],
)
print(f"Indexed {collection.count()} chunks in ChromaDB")

# Sparse index (BM25)
tokenized_chunks = [chunk.lower().split() for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_chunks)
print(f"Built BM25 index over {len(tokenized_chunks)} chunks")
Embedding chunks... 
Indexed 246 chunks in ChromaDB
Built BM25 index over 246 chunks

Part B: The Full Advanced RAG Pipeline

Let’s wire up the complete pipeline from Part 02: hybrid search with RRF fusion, cross-encoder reranking, and LLM generation with citations.

Pipeline Functions

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    rrf_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = 0
            rrf_scores[doc_id] += 1 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


def hybrid_search(query, n=20):
    """Run BM25 + dense search and fuse with RRF."""
    # BM25
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranked = bm25_scores.argsort()[::-1][:n].tolist()

    # Dense
    dense_results = collection.query(
        query_embeddings=embed_model.encode([query]).tolist(),
        n_results=n,
    )
    dense_ranked = [int(doc_id.split("_")[1]) for doc_id in dense_results["ids"][0]]

    # Fuse
    return reciprocal_rank_fusion([bm25_ranked, dense_ranked])


def rerank(query, candidate_indices, top_k=5):
    """Rerank candidates with cross-encoder."""
    pairs = [[query, all_chunks[idx]] for idx in candidate_indices]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]


def rag_pipeline(query, top_k=5):
    """Full pipeline: hybrid search → rerank → format context."""
    # Retrieve wide
    hybrid_results = hybrid_search(query, n=30)
    candidate_indices = [idx for idx, _ in hybrid_results]

    # Rerank narrow
    top_chunks = rerank(query, candidate_indices, top_k=top_k)

    # Format context with source attribution
    context_parts = []
    sources = []
    for i, (idx, score) in enumerate(top_chunks):
        source = chunk_sources[idx]
        context_parts.append(f"[{i+1}] (Source: {source})\n{all_chunks[idx]}")
        sources.append(source)

    context = "\n\n".join(context_parts)
    return context, top_chunks, sources
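To see what RRF actually computes, here it is by hand on two toy ranked lists (hypothetical document IDs; the function is restated so the snippet runs standalone):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    rrf_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Doc 7 ranks 1st in BM25 and 2nd in dense; doc 9 ranks 3rd in both;
# docs 3 and 5 each appear in only one list.
bm25_ranked = [7, 3, 9]
dense_ranked = [5, 7, 9]
fused = reciprocal_rank_fusion([bm25_ranked, dense_ranked])
print([doc for doc, _ in fused])  # [7, 9, 5, 3]
```

Note that doc 9, which ranks only 3rd in both lists, beats doc 5, which ranks 1st in a single list (2/63 > 1/61). That consensus effect is what makes RRF a robust way to merge rankings with incomparable score scales.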

Test the Pipeline

query = "How do transformer models handle long-range dependencies in text?"
context, top_chunks, sources = rag_pipeline(query)

print(f"Query: '{query}'\n")
print(f"Retrieved {len(top_chunks)} chunks from: {', '.join(dict.fromkeys(sources))}\n")
for i, (idx, score) in enumerate(top_chunks):
    print(f"[{i+1}] (rerank: {score:.2f}, source: {chunk_sources[idx]})")
    print(f"    {all_chunks[idx][:100]}...")
    print()
Query: 'How do transformer models handle long-range dependencies in text?'

Retrieved 5 chunks from: Transformer (deep learning architecture)

[1] (rerank: 0.71, source: Transformer (deep learning architecture))
    One set of ( W ... [unrendered math markup from the Wikipedia source] ...

[2] (rerank: 0.37, source: Transformer (deep learning architecture))
    In deep learning, the transformer is an artificial neural network architecture based on the multi-he...

[3] (rerank: -0.72, source: Transformer (deep learning architecture))
    === Terminology ===
The transformer architecture, being modular, allows variations. Several common v...

[4] (rerank: -0.99, source: Transformer (deep learning architecture))
    === Sub-quadratic transformers ===
Training transformer-based architectures can be expensive, especi...

[5] (rerank: -1.11, source: Transformer (deep learning architecture))
    The attention mechanism used in the transformer architecture are scaled dot-product attention units....

# Generate a grounded answer
rag_agent = Agent(
    get_model("claude-sonnet-4-6"),
    instructions=(
        "Answer the user's question based ONLY on the provided context. "
        "Cite your sources using [1], [2], etc. "
        "If the context doesn't contain enough information, say so."
    ),
)

result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
print(result.output)
Based on the provided context, transformer models handle long-range dependencies in text through several mechanisms:

1. **Multi-head attention across layers**: The scope of attention can expand as tokens pass through successive layers, allowing the model to "capture more complex and long-range dependencies in deeper layers" [1].

2. **Contextualization within the context window**: At each layer, each token is "contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished" [2].

3. **Multiple attention heads**: Each layer contains multiple attention heads, allowing the model to capture different definitions of "relevance" simultaneously. This means the model can track various types of relationships between tokens across the text at the same time [1].

4. **Scaled dot-product attention**: The attention mechanism computes relationships between any two tokens directly through query-key dot products, regardless of their distance in the sequence. The attention weight between token *i* and token *j* is computed as the dot product between their respective query and key vectors [5].

It is worth noting that for very long inputs, standard transformers can be computationally expensive, which has motivated the development of more efficient variants like the Swin transformer for images and SepTr for audio [4].

Let’s try a few more queries to see the pipeline in action across different topics:

test_queries = [
    "What is the difference between BPE and WordPiece tokenization?",
    "How does BERT differ from GPT in architecture?",
    "What are the limitations of TF-IDF for text retrieval?",
]

for q in test_queries:
    ctx, chunks, srcs = rag_pipeline(q, top_k=3)
    print(f"Q: {q}")
    print(f"   Sources: {', '.join(dict.fromkeys(srcs))}")
    print(f"   Top chunk: {all_chunks[chunks[0][0]][:80]}...")
    print()
Q: What is the difference between BPE and WordPiece tokenization?
   Sources: BERT (language model), Tokenization (lexical analysis)
   Top chunk: === Embedding ===
This section describes the embedding used by BERTBASE. The oth...

Q: How does BERT differ from GPT in architecture?
   Sources: BERT (language model)
   Top chunk: == Interpretation ==
Language models like ELMo, GPT-2, and BERT, spawned the stu...

Q: What are the limitations of TF-IDF for text retrieval?
   Sources: Tf–idf
   Top chunk: In information retrieval, tf–idf (term frequency–inverse document frequency, TF*...


Part C: Query Transformation with HyDE

So far, our queries have been well-formed questions that match the document content reasonably well. But what about vague or short queries? Consider: “attention” — is the user asking about the attention mechanism in transformers? Attention in cognitive science? The word’s dictionary definition?

Short queries produce poor embeddings because there’s not enough context to capture the user’s intent. HyDE (Hypothetical Document Embeddings) is a clever solution: instead of embedding the raw query, we first ask the LLM to generate a hypothetical answer, then embed that and use it for retrieval.

The intuition: a hypothetical answer looks much more like a real document passage than a short query does, so it will land closer to the right chunks in embedding space.

HyDE in Action

async def hyde_search(query, n=20):
    """Generate a hypothetical answer, embed it, and search."""
    # Step 1: Generate hypothetical answer
    hyde_agent = Agent(
        get_model("claude-sonnet-4-6"),
        instructions=(
            "Write a short, factual paragraph (3-4 sentences) that would answer "
            "the following question. Write as if you are a textbook. "
            "Do not say 'I don't know' — just write your best answer."
        ),
    )
    hypothetical = await hyde_agent.run(query)
    hypo_text = hypothetical.output

    # Step 2: Embed the hypothetical answer (not the original query)
    hypo_embedding = embed_model.encode([hypo_text]).tolist()

    # Step 3: Search with the hypothetical embedding
    results = collection.query(
        query_embeddings=hypo_embedding,
        n_results=n,
    )

    return results, hypo_text
# Compare standard vs. HyDE retrieval on a short query
short_query = "attention"

# Standard dense search
standard_results = collection.query(
    query_embeddings=embed_model.encode([short_query]).tolist(),
    n_results=3,
)

print(f"Query: '{short_query}'\n")
print("--- Standard Dense Search ---")
for i, (doc, dist) in enumerate(zip(standard_results["documents"][0], standard_results["distances"][0])):
    print(f"  [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")

# HyDE search
hyde_results, hypo_text = await hyde_search(short_query)

print(f"\n--- HyDE Search ---")
print(f"Hypothetical answer: {hypo_text[:120]}...\n")
for i, (doc, dist) in enumerate(zip(hyde_results["documents"][0][:3], hyde_results["distances"][0][:3])):
    print(f"  [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")
Query: 'attention'

--- Standard Dense Search ---
  [1] (sim: 0.458) One set of ( ... [unrendered math markup from the Wikipedia source] ...
  [2] (sim: 0.453) Seq2seq models with attention (including self-attention) still suffered from the...
  [3] (sim: 0.440) Multihead Latent Attention (MLA) is a low-rank approximation to standard MHA. Sp...

--- HyDE Search ---
Hypothetical answer: **Attention** is a cognitive process that allows individuals to selectively focus on specific stimuli or information in ...

  [1] (sim: 0.462) One set of ( ... [unrendered math markup from the Wikipedia source] ...
  [2] (sim: 0.413) Each encoder layer contains 2 sublayers: the self-attention and the feedforward ...
  [3] (sim: 0.412) === Multiple timescales model ===
A multiple timescales recurrent neural network...

In principle, HyDE retrieves better results for an ambiguous query because the hypothetical answer supplies the missing context. In this run, though, the LLM interpreted "attention" as a cognitive process, so the embedding drifted toward psychology and the retrieved chunks barely improved. HyDE inherits the LLM's reading of the query: when the corpus domain is known, adding a domain hint to the generation prompt (e.g., "answer in the context of NLP") steers the hypothetical answer, and therefore the retrieval, back on target.


Part D: Evaluate Your Pipeline

Before Week 11’s deep dive into evaluation frameworks, let’s build intuition for what “good RAG” looks like with a simple manual evaluation.

Create a Test Suite

The idea: write queries where you know what a good answer should contain, then score your pipeline’s actual answers.

# Test suite: queries with expected content
test_suite = [
    {
        "query": "What is tokenization and why is it important for NLP?",
        "expected_keywords": ["token", "subword", "BPE"],
    },
    {
        "query": "How do word embeddings capture semantic meaning?",
        "expected_keywords": ["vector", "Word2Vec", "semantic"],
    },
    {
        "query": "What problem does the attention mechanism solve?",
        "expected_keywords": ["attention", "long-range", "parallel"],
    },
    {
        "query": "How does RAG reduce hallucination in language models?",
        "expected_keywords": ["retriev", "hallucin", "knowledge"],
    },
    {
        "query": "What are the main differences between BERT and GPT?",
        "expected_keywords": ["BERT", "GPT", "encoder"],
    },
]

print(f"Test suite: {len(test_suite)} queries")
for i, test in enumerate(test_suite):
    print(f"  {i+1}. {test['query']}")
Test suite: 5 queries
  1. What is tokenization and why is it important for NLP?
  2. How do word embeddings capture semantic meaning?
  3. What problem does the attention mechanism solve?
  4. How does RAG reduce hallucination in language models?
  5. What are the main differences between BERT and GPT?
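The `expected_keywords` are matched as lowercase substrings, so a stem like "retriev" covers "retrieval", "retrieved", and "retriever" without a real stemmer. A minimal sketch of the scoring logic as a helper (`keyword_coverage` is a hypothetical name; the next cell inlines the same check):

```python
def keyword_coverage(keywords, text):
    """Count how many keywords appear as case-insensitive substrings of text."""
    text_lower = text.lower()
    return sum(1 for kw in keywords if kw.lower() in text_lower)

ctx = "RAG grounds generation in retrieved passages, reducing hallucination."
print(keyword_coverage(["retriev", "hallucin", "knowledge"], ctx))  # 2
```

The substring match is deliberately crude: it can't tell "encoder" from "autoencoder", but for a quick manual evaluation it's a serviceable proxy for whether retrieval surfaced the right material.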

Run and Score

We’ll score each query with a simple heuristic: count how many of the expected keywords appear in the retrieved context. This checks whether retrieval surfaced the right material, a crude precursor to the context-recall metrics coming in Week 11.

print("Running pipeline on test suite...\n")

results = []
for test in test_suite:
    query = test["query"]
    context, top_chunks, sources = rag_pipeline(query, top_k=3)

    result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
    answer = result.output

    # Check how many expected keywords appear in the context
    context_lower = context.lower()
    keywords_found = sum(
        1 for kw in test["expected_keywords"]
        if kw.lower() in context_lower
    )

    results.append({
        "query": query,
        "answer": answer,
        "sources": sources,
        "keywords_found": keywords_found,
        "keywords_total": len(test["expected_keywords"]),
    })

    print(f"Q: {query}")
    print(f"   Sources: {', '.join(dict.fromkeys(sources))}")
    print(f"   Keywords in context: {keywords_found}/{len(test['expected_keywords'])}")
    print(f"   Answer: {answer[:120]}...")
    print()
Running pipeline on test suite...

Q: What is tokenization and why is it important for NLP?
   Sources: Natural language processing
   Keywords in context: 1/3
   Answer: ## Tokenization in NLP

Based on the provided context, **tokenization** is mentioned as a preprocessing step in NLP pipe...

Q: How do word embeddings capture semantic meaning?
   Sources: Word embedding, Natural language processing
   Keywords in context: 3/3
   Answer: Based on the provided context, word embeddings capture semantic meaning in the following ways:

**Core Representation**
...

Q: What problem does the attention mechanism solve?
   Sources: Transformer (deep learning architecture)
   Keywords in context: 2/3
   Answer: Based on the provided context, the attention mechanism primarily helps solve the **parallelization problem** that plague...

Q: How does RAG reduce hallucination in language models?
   Sources: Retrieval-augmented generation
   Keywords in context: 3/3
   Answer: Based on the provided context, RAG helps **reduce** hallucinations, though it does not eliminate them entirely.

RAG red...

Q: What are the main differences between BERT and GPT?
   Sources: Transformer (deep learning architecture), BERT (language model)
   Keywords in context: 3/3
   Answer: Based on the provided context, I can identify some differences between BERT and GPT, though the information is limited:
...

# Summary
total_found = sum(r["keywords_found"] for r in results)
total_expected = sum(r["keywords_total"] for r in results)
print(f"Context coverage: {total_found}/{total_expected} expected keywords found ({total_found/total_expected*100:.0f}%)")
print()
print("Per-query breakdown:")
for r in results:
    score = r["keywords_found"] / r["keywords_total"]
    status = "GOOD" if score >= 0.67 else "WEAK" if score >= 0.33 else "POOR"
    print(f"  [{status}] {r['query'][:50]}... ({r['keywords_found']}/{r['keywords_total']})")
Context coverage: 12/15 expected keywords found (80%)

Per-query breakdown:
  [WEAK] What is tokenization and why is it important for N... (1/3)
  [GOOD] How do word embeddings capture semantic meaning?... (3/3)
  [WEAK] What problem does the attention mechanism solve?... (2/3)
  [GOOD] How does RAG reduce hallucination in language mode... (3/3)
  [GOOD] What are the main differences between BERT and GPT... (3/3)

Part E: Build Your Own RAG System


Wrap-Up

Key Takeaways

What’s Next

In Week 11, we’ll formalize everything we did by hand today. You’ll learn the RAGAS framework for automated RAG evaluation — metrics like faithfulness, answer relevance, context precision, and context recall that let you score your pipeline without manually reading every answer. We’ll also explore LLM-as-judge techniques and build evaluation into your development workflow.