
Lab — RAG Builder

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Setup

We’ll reuse the same tools from Parts 01 and 02. This cell loads everything we need.

import os
import re
from dotenv import load_dotenv
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim
from rank_bm25 import BM25Okapi
import chromadb
import wikipedia

load_dotenv()

PROXY_URL = "https://litellm.6640.ucf.spencerlyon.com"


def get_model(model_name: str) -> OpenAIChatModel:
    """Create a model connection through our LiteLLM proxy."""
    return OpenAIChatModel(
        model_name,
        provider=OpenAIProvider(
            base_url=PROXY_URL,
            api_key=os.environ["CAP6640_API_KEY"],
        ),
    )


embed_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Part A: Build Your Corpus

In Parts 01 and 02, we worked with toy documents — a handful of hand-written sentences, then a single multi-paragraph text. Today we’ll build a RAG system over a real corpus: 10 Wikipedia articles covering topics we’ve studied this semester.

Fetching the Articles

topics = [
    "Natural language processing",
    "Tokenization (lexical analysis)",
    "Word embedding",
    "Tf–idf",
    "Named-entity recognition",
    "Recurrent neural network",
    "Transformer (deep learning architecture)",
    "BERT (language model)",
    "GPT-4",
    "Retrieval-augmented generation",
]

articles = {}
for topic in topics:
    try:
        page = wikipedia.page(topic, auto_suggest=False)
        articles[topic] = page.content
        print(f"  {topic}: {len(page.content.split())} words")
    except Exception as e:
        print(f"  {topic}: FAILED ({e})")

print(f"\nLoaded {len(articles)} articles")
  Natural language processing: 4538 words
  Tokenization (lexical analysis): 3154 words
  Word embedding: 1476 words
  Tf–idf: 2699 words
  Named-entity recognition: 2080 words
  Recurrent neural network: 5859 words
  Transformer (deep learning architecture): 9894 words
  BERT (language model): 2481 words
  GPT-4: 2268 words
  Retrieval-augmented generation: 1616 words

Loaded 10 articles

Chunking the Corpus

We’ll use paragraph-based chunking — the best default from Part 02. Each chunk gets metadata tracking which article it came from, so we can cite sources later.

def chunk_paragraphs(text, max_words=100):
    """Split on paragraph boundaries, grouping short paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip() and len(p.split()) > 10]
    chunks = []
    current = []
    current_size = 0
    for para in paragraphs:
        words = len(para.split())
        if current_size + words > max_words and current:
            chunks.append(" ".join(current))
            current = [para]
            current_size = words
        else:
            current.append(para)
            current_size += words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Chunk all articles, tracking source
all_chunks = []
chunk_sources = []

for topic, content in articles.items():
    topic_chunks = chunk_paragraphs(content, max_words=100)
    for chunk in topic_chunks:
        all_chunks.append(chunk)
        chunk_sources.append(topic)

print(f"Total chunks: {len(all_chunks)}")
print(f"Average chunk size: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f} words")
print(f"\nChunks per article:")
for topic in articles:
    count = chunk_sources.count(topic)
    print(f"  {topic}: {count} chunks")
Total chunks: 246
Average chunk size: 144 words

Chunks per article:
  Natural language processing: 26 chunks
  Tokenization (lexical analysis): 21 chunks
  Word embedding: 9 chunks
  Tf–idf: 31 chunks
  Named-entity recognition: 14 chunks
  Recurrent neural network: 43 chunks
  Transformer (deep learning architecture): 53 chunks
  BERT (language model): 20 chunks
  GPT-4: 15 chunks
  Retrieval-augmented generation: 14 chunks
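Notice that the average chunk size (144 words) exceeds `max_words=100`: a single paragraph longer than the cap is kept whole rather than split. A quick sanity check on a toy document (re-stating the chunker so the snippet runs standalone) makes this visible:

```python
def chunk_paragraphs(text, max_words=100):
    """Split on paragraph boundaries, grouping short paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip() and len(p.split()) > 10]
    chunks = []
    current, current_size = [], 0
    for para in paragraphs:
        words = len(para.split())
        if current_size + words > max_words and current:
            chunks.append(" ".join(current))
            current, current_size = [para], words
        else:
            current.append(para)
            current_size += words
    if current:
        chunks.append(" ".join(current))
    return chunks

# One 150-word paragraph, then two 40-word paragraphs
toy = "\n\n".join(["alpha " * 150, "beta " * 40, "gamma " * 40])
print([len(c.split()) for c in chunk_paragraphs(toy)])  # [150, 80]
```

The oversize paragraph survives intact as its own 150-word chunk, while the two short paragraphs are grouped into one 80-word chunk. If hard caps matter for your context budget, you'd add a word-window split for oversize paragraphs.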

Indexing: Vector Store + BM25

Now we set up both retrieval backends — dense (ChromaDB) and sparse (BM25) — so we can run hybrid search.

# Dense index (ChromaDB)
client = chromadb.Client()

# Delete collection if it exists from a previous run
try:
    client.delete_collection("wiki_rag")
except Exception:
    pass

collection = client.create_collection("wiki_rag", metadata={"hnsw:space": "cosine"})

# Embed and store all chunks
print("Embedding chunks... ", end="")
chunk_embeddings = embed_model.encode(all_chunks, show_progress_bar=True).tolist()

collection.add(
    ids=[f"chunk_{i}" for i in range(len(all_chunks))],
    documents=all_chunks,
    embeddings=chunk_embeddings,
    metadatas=[{"source": src} for src in chunk_sources],
)
print(f"Indexed {collection.count()} chunks in ChromaDB")

# Sparse index (BM25)
tokenized_chunks = [chunk.lower().split() for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_chunks)
print(f"Built BM25 index over {len(tokenized_chunks)} chunks")
Embedding chunks... 
Indexed 246 chunks in ChromaDB
Built BM25 index over 246 chunks

Part B: The Full Advanced RAG Pipeline

Let’s wire up the complete pipeline from Part 02: hybrid search with RRF fusion, cross-encoder reranking, and LLM generation with citations.

Pipeline Functions

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    rrf_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = 0
            rrf_scores[doc_id] += 1 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


def hybrid_search(query, n=20):
    """Run BM25 + dense search and fuse with RRF."""
    # BM25
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranked = bm25_scores.argsort()[::-1][:n].tolist()

    # Dense
    dense_results = collection.query(
        query_embeddings=embed_model.encode([query]).tolist(),
        n_results=n,
    )
    dense_ranked = [int(doc_id.split("_")[1]) for doc_id in dense_results["ids"][0]]

    # Fuse
    return reciprocal_rank_fusion([bm25_ranked, dense_ranked])


def rerank(query, candidate_indices, top_k=5):
    """Rerank candidates with cross-encoder."""
    pairs = [[query, all_chunks[idx]] for idx in candidate_indices]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]


def rag_pipeline(query, top_k=5):
    """Full pipeline: hybrid search → rerank → format context."""
    # Retrieve wide
    hybrid_results = hybrid_search(query, n=30)
    candidate_indices = [idx for idx, _ in hybrid_results]

    # Rerank narrow
    top_chunks = rerank(query, candidate_indices, top_k=top_k)

    # Format context with source attribution
    context_parts = []
    sources = []
    for i, (idx, score) in enumerate(top_chunks):
        source = chunk_sources[idx]
        context_parts.append(f"[{i+1}] (Source: {source})\n{all_chunks[idx]}")
        sources.append(source)

    context = "\n\n".join(context_parts)
    return context, top_chunks, sources
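To see what RRF actually computes, here it is by hand on two toy ranked lists (hypothetical document IDs; the function is restated so the snippet runs standalone):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists using RRF."""
    rrf_scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Doc 7 ranks 1st in BM25 and 2nd in dense; doc 9 ranks 3rd in both;
# docs 3 and 5 each appear in only one list.
bm25_ranked = [7, 3, 9]
dense_ranked = [5, 7, 9]
fused = reciprocal_rank_fusion([bm25_ranked, dense_ranked])
print([doc for doc, _ in fused])  # [7, 9, 5, 3]
```

Note that doc 9, which ranks only 3rd in both lists, beats doc 5, which ranks 1st in a single list (2/63 > 1/61). That consensus effect is what makes RRF a robust way to merge rankings with incomparable score scales.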

Test the Pipeline

query = "How do transformer models handle long-range dependencies in text?"
context, top_chunks, sources = rag_pipeline(query)

print(f"Query: '{query}'\n")
print(f"Retrieved {len(top_chunks)} chunks from: {', '.join(dict.fromkeys(sources))}\n")
for i, (idx, score) in enumerate(top_chunks):
    print(f"[{i+1}] (rerank: {score:.2f}, source: {chunk_sources[idx]})")
    print(f"    {all_chunks[idx][:100]}...")
    print()
Query: 'How do transformer models handle long-range dependencies in text?'

Retrieved 5 chunks from: Transformer (deep learning architecture)

[1] (rerank: 0.71, source: Transformer (deep learning architecture))
    One set of ( W ... [unrendered math markup from the Wikipedia source] ...

[2] (rerank: 0.37, source: Transformer (deep learning architecture))
    In deep learning, the transformer is an artificial neural network architecture based on the multi-he...

[3] (rerank: -0.72, source: Transformer (deep learning architecture))
    === Terminology ===
The transformer architecture, being modular, allows variations. Several common v...

[4] (rerank: -0.99, source: Transformer (deep learning architecture))
    === Sub-quadratic transformers ===
Training transformer-based architectures can be expensive, especi...

[5] (rerank: -1.11, source: Transformer (deep learning architecture))
    The attention mechanism used in the transformer architecture are scaled dot-product attention units....

# Generate a grounded answer
rag_agent = Agent(
    get_model("claude-sonnet-4-6"),
    instructions=(
        "Answer the user's question based ONLY on the provided context. "
        "Cite your sources using [1], [2], etc. "
        "If the context doesn't contain enough information, say so."
    ),
)

result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
print(result.output)
Based on the provided context, transformer models handle long-range dependencies in text through several mechanisms:

1. **Multi-head attention across layers**: The scope of attention can expand as tokens pass through successive layers, allowing the model to "capture more complex and long-range dependencies in deeper layers" [1].

2. **Contextualization within the context window**: At each layer, each token is "contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished" [2].

3. **Multiple attention heads**: Each layer contains multiple attention heads, allowing the model to capture different definitions of "relevance" simultaneously. This means the model can track various types of relationships between tokens across the text at the same time [1].

4. **Scaled dot-product attention**: The attention mechanism computes relationships between any two tokens directly through query-key dot products, regardless of their distance in the sequence. The attention weight between token *i* and token *j* is computed as the dot product between their respective query and key vectors [5].

It is worth noting that for very long inputs, standard transformers can be computationally expensive, which has motivated the development of more efficient variants like the Swin transformer for images and SepTr for audio [4].

Let’s try a few more queries to see the pipeline in action across different topics:

test_queries = [
    "What is the difference between BPE and WordPiece tokenization?",
    "How does BERT differ from GPT in architecture?",
    "What are the limitations of TF-IDF for text retrieval?",
]

for q in test_queries:
    ctx, chunks, srcs = rag_pipeline(q, top_k=3)
    print(f"Q: {q}")
    print(f"   Sources: {', '.join(dict.fromkeys(srcs))}")
    print(f"   Top chunk: {all_chunks[chunks[0][0]][:80]}...")
    print()
Q: What is the difference between BPE and WordPiece tokenization?
   Sources: BERT (language model), Tokenization (lexical analysis)
   Top chunk: === Embedding ===
This section describes the embedding used by BERTBASE. The oth...

Q: How does BERT differ from GPT in architecture?
   Sources: BERT (language model)
   Top chunk: == Interpretation ==
Language models like ELMo, GPT-2, and BERT, spawned the stu...

Q: What are the limitations of TF-IDF for text retrieval?
   Sources: Tf–idf
   Top chunk: In information retrieval, tf–idf (term frequency–inverse document frequency, TF*...


Part C: Query Transformation with HyDE

So far, our queries have been well-formed questions that match the document content reasonably well. But what about vague or short queries? Consider: “attention” — is the user asking about the attention mechanism in transformers? Attention in cognitive science? The word’s dictionary definition?

Short queries produce poor embeddings because there’s not enough context to capture the user’s intent. HyDE (Hypothetical Document Embeddings) is a clever solution: instead of embedding the raw query, we first ask the LLM to generate a hypothetical answer, then embed that and use it for retrieval.

The intuition: a hypothetical answer looks much more like a real document passage than a short query does, so it will land closer to the right chunks in embedding space.

HyDE in Action

async def hyde_search(query, n=20):
    """Generate a hypothetical answer, embed it, and search."""
    # Step 1: Generate hypothetical answer
    hyde_agent = Agent(
        get_model("claude-sonnet-4-6"),
        instructions=(
            "Write a short, factual paragraph (3-4 sentences) that would answer "
            "the following question. Write as if you are a textbook. "
            "Do not say 'I don't know' — just write your best answer."
        ),
    )
    hypothetical = await hyde_agent.run(query)
    hypo_text = hypothetical.output

    # Step 2: Embed the hypothetical answer (not the original query)
    hypo_embedding = embed_model.encode([hypo_text]).tolist()

    # Step 3: Search with the hypothetical embedding
    results = collection.query(
        query_embeddings=hypo_embedding,
        n_results=n,
    )

    return results, hypo_text
# Compare standard vs. HyDE retrieval on a short query
short_query = "attention"

# Standard dense search
standard_results = collection.query(
    query_embeddings=embed_model.encode([short_query]).tolist(),
    n_results=3,
)

print(f"Query: '{short_query}'\n")
print("--- Standard Dense Search ---")
for i, (doc, dist) in enumerate(zip(standard_results["documents"][0], standard_results["distances"][0])):
    print(f"  [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")

# HyDE search
hyde_results, hypo_text = await hyde_search(short_query)

print(f"\n--- HyDE Search ---")
print(f"Hypothetical answer: {hypo_text[:120]}...\n")
for i, (doc, dist) in enumerate(zip(hyde_results["documents"][0][:3], hyde_results["distances"][0][:3])):
    print(f"  [{i+1}] (sim: {1-dist:.3f}) {doc[:80]}...")
Query: 'attention'

--- Standard Dense Search ---
  [1] (sim: 0.458) One set of ( ... [unrendered math markup from the Wikipedia source] ...
  [2] (sim: 0.453) Seq2seq models with attention (including self-attention) still suffered from the...
  [3] (sim: 0.440) Multihead Latent Attention (MLA) is a low-rank approximation to standard MHA. Sp...

--- HyDE Search ---
Hypothetical answer: **Attention** is a cognitive process that allows individuals to selectively focus on specific stimuli or information in ...

  [1] (sim: 0.462) One set of ( ... [unrendered math markup from the Wikipedia source] ...
  [2] (sim: 0.413) Each encoder layer contains 2 sublayers: the self-attention and the feedforward ...
  [3] (sim: 0.412) === Multiple timescales model ===
A multiple timescales recurrent neural network...

In principle, HyDE retrieves better results for an ambiguous query because the hypothetical answer supplies the missing context. In this run, though, the LLM interpreted "attention" as a cognitive process, so the embedding drifted toward psychology and the retrieved chunks barely improved. HyDE inherits the LLM's reading of the query: when the corpus domain is known, adding a domain hint to the generation prompt (e.g., "answer in the context of NLP") steers the hypothetical answer, and therefore the retrieval, back on target.


Part D: Evaluate Your Pipeline

Before Week 11’s deep dive into evaluation frameworks, let’s build intuition for what “good RAG” looks like with a simple manual evaluation.

Create a Test Suite

The idea: write queries where you know what a good answer should contain, then score your pipeline’s actual answers.

# Test suite: queries with expected content
test_suite = [
    {
        "query": "What is tokenization and why is it important for NLP?",
        "expected_keywords": ["token", "subword", "BPE"],
    },
    {
        "query": "How do word embeddings capture semantic meaning?",
        "expected_keywords": ["vector", "Word2Vec", "semantic"],
    },
    {
        "query": "What problem does the attention mechanism solve?",
        "expected_keywords": ["attention", "long-range", "parallel"],
    },
    {
        "query": "How does RAG reduce hallucination in language models?",
        "expected_keywords": ["retriev", "hallucin", "knowledge"],
    },
    {
        "query": "What are the main differences between BERT and GPT?",
        "expected_keywords": ["BERT", "GPT", "encoder"],
    },
]

print(f"Test suite: {len(test_suite)} queries")
for i, test in enumerate(test_suite):
    print(f"  {i+1}. {test['query']}")
Test suite: 5 queries
  1. What is tokenization and why is it important for NLP?
  2. How do word embeddings capture semantic meaning?
  3. What problem does the attention mechanism solve?
  4. How does RAG reduce hallucination in language models?
  5. What are the main differences between BERT and GPT?
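The `expected_keywords` are matched as lowercase substrings, so a stem like "retriev" covers "retrieval", "retrieved", and "retriever" without a real stemmer. A minimal sketch of the scoring logic as a helper (`keyword_coverage` is a hypothetical name; the next cell inlines the same check):

```python
def keyword_coverage(keywords, text):
    """Count how many keywords appear as case-insensitive substrings of text."""
    text_lower = text.lower()
    return sum(1 for kw in keywords if kw.lower() in text_lower)

ctx = "RAG grounds generation in retrieved passages, reducing hallucination."
print(keyword_coverage(["retriev", "hallucin", "knowledge"], ctx))  # 2
```

The substring match is deliberately crude: it can't tell "encoder" from "autoencoder", but for a quick manual evaluation it's a serviceable proxy for whether retrieval surfaced the right material.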

Run and Score

We’ll score each query with a simple heuristic: count how many of the expected keywords appear in the retrieved context. This checks whether retrieval surfaced the right material, a crude precursor to the context-recall metrics coming in Week 11.

print("Running pipeline on test suite...\n")

results = []
for test in test_suite:
    query = test["query"]
    context, top_chunks, sources = rag_pipeline(query, top_k=3)

    result = await rag_agent.run(f"Context:\n{context}\n\nQuestion: {query}")
    answer = result.output

    # Check how many expected keywords appear in the context
    context_lower = context.lower()
    keywords_found = sum(
        1 for kw in test["expected_keywords"]
        if kw.lower() in context_lower
    )

    results.append({
        "query": query,
        "answer": answer,
        "sources": sources,
        "keywords_found": keywords_found,
        "keywords_total": len(test["expected_keywords"]),
    })

    print(f"Q: {query}")
    print(f"   Sources: {', '.join(dict.fromkeys(sources))}")
    print(f"   Keywords in context: {keywords_found}/{len(test['expected_keywords'])}")
    print(f"   Answer: {answer[:120]}...")
    print()
Running pipeline on test suite...

Q: What is tokenization and why is it important for NLP?
   Sources: Natural language processing
   Keywords in context: 1/3
   Answer: ## Tokenization in NLP

Based on the provided context, **tokenization** is mentioned as a preprocessing step in NLP pipe...

Q: How do word embeddings capture semantic meaning?
   Sources: Word embedding, Natural language processing
   Keywords in context: 3/3
   Answer: Based on the provided context, word embeddings capture semantic meaning in the following ways:

**Core Representation**
...

Q: What problem does the attention mechanism solve?
   Sources: Transformer (deep learning architecture)
   Keywords in context: 2/3
   Answer: Based on the provided context, the attention mechanism primarily helps solve the **parallelization problem** that plague...

Q: How does RAG reduce hallucination in language models?
   Sources: Retrieval-augmented generation
   Keywords in context: 3/3
   Answer: Based on the provided context, RAG helps **reduce** hallucinations, though it does not eliminate them entirely.

RAG red...

Q: What are the main differences between BERT and GPT?
   Sources: Transformer (deep learning architecture), BERT (language model)
   Keywords in context: 3/3
   Answer: Based on the provided context, I can identify some differences between BERT and GPT, though the information is limited:
...

# Summary
total_found = sum(r["keywords_found"] for r in results)
total_expected = sum(r["keywords_total"] for r in results)
print(f"Context coverage: {total_found}/{total_expected} expected keywords found ({total_found/total_expected*100:.0f}%)")
print()
print("Per-query breakdown:")
for r in results:
    score = r["keywords_found"] / r["keywords_total"]
    status = "GOOD" if score >= 0.67 else "WEAK" if score >= 0.33 else "POOR"
    print(f"  [{status}] {r['query'][:50]}... ({r['keywords_found']}/{r['keywords_total']})")
Context coverage: 12/15 expected keywords found (80%)

Per-query breakdown:
  [WEAK] What is tokenization and why is it important for N... (1/3)
  [GOOD] How do word embeddings capture semantic meaning?... (3/3)
  [WEAK] What problem does the attention mechanism solve?... (2/3)
  [GOOD] How does RAG reduce hallucination in language mode... (3/3)
  [GOOD] What are the main differences between BERT and GPT... (3/3)

Part E: Build Your Own RAG System


Wrap-Up

Key Takeaways

What’s Next

In Week 11, we’ll formalize everything we did by hand today. You’ll learn the RAGAS framework for automated RAG evaluation — metrics like faithfulness, answer relevance, context precision, and context recall that let you score your pipeline without manually reading every answer. We’ll also explore LLM-as-judge techniques and build evaluation into your development workflow.