Sparse Representations: From Words to Vectors
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Week 2: Tokenization, text normalization, SpaCy pipelines
Basic Python: lists, dictionaries, for loops
Outcomes
Construct bag of words and term-frequency representations from tokenized text
Derive and compute TF-IDF scores, explaining the role of each component
Use Scikit-learn’s CountVectorizer and TfidfVectorizer for feature extraction
Identify the limitations of sparse representations and when they fall short
References
J&M Chapter 5: Embeddings — Sections 5.1–5.4
J&M Chapter 11: Information Retrieval — TF-IDF and BM25
Which Numbers?¶
We already know that NLP models need numeric inputs — we’ve seen how tokenization converts raw text into discrete units. But there’s a big gap between a list of tokens and something we can feed into a machine learning model. The question isn’t whether to convert text to numbers, but how.
Suppose we have these four short documents:
“The cat sat on the mat”
“The dog sat on the log”
“The cat chased the dog”
“A bird sang on the branch”
How should we represent each document as a vector of numbers? What should the dimensions mean? How many dimensions do we need? These are the questions we’ll answer in this lecture.
We’ll start with the simplest possible approach and progressively refine it, building toward the TF-IDF representation that powered search engines and text classifiers for decades — and still serves as a strong baseline today.
Binary Presence: The Simplest Vector¶
The most straightforward idea: build a vocabulary of all unique words across our documents, then represent each document as a vector where each position corresponds to a word. If the word appears in the document, we put a 1. If not, a 0.
Let’s build this by hand. First, we need our corpus:
corpus = [
"the cat sat on the mat",
"the dog sat on the log",
"the cat chased the dog",
"a bird sang on the branch",
]
Now let’s construct a vocabulary, the set of all unique words:
# Tokenize by splitting on whitespace (simple but effective for clean text)
tokenized = [doc.split() for doc in corpus]
# Build vocabulary: sorted list of unique words
vocab = sorted(set(word for doc in tokenized for word in doc))
print(f"Vocabulary ({len(vocab)} words): {vocab}")Vocabulary (12 words): ['a', 'bird', 'branch', 'cat', 'chased', 'dog', 'log', 'mat', 'on', 'sang', 'sat', 'the']
With our vocabulary in hand, we can build a binary presence vector for each document:
# Build binary vectors
for i, doc in enumerate(tokenized):
    word_set = set(doc)
    binary_vec = [1 if word in word_set else 0 for word in vocab]
    print(f"Doc {i+1}: {binary_vec}")
Doc 1: [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]
Doc 2: [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
Doc 3: [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]
Doc 4: [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1]
# Let's make this easier to read with a DataFrame
import pandas as pd
binary_matrix = []
for doc in tokenized:
    word_set = set(doc)
    binary_matrix.append([1 if w in word_set else 0 for w in vocab])
df_binary = pd.DataFrame(binary_matrix, columns=vocab, index=[f"Doc {i+1}" for i in range(len(corpus))])
df_binary
This is called a term-document matrix: each row is a document, each column is a term from our vocabulary.
Notice anything? Documents 1 and 2 share several words (“the”, “sat”, “on”), while Document 4 is quite different from the others. We’ve turned text into geometry, and that’s powerful. But we’ve also lost something: the word “the” appears twice in Document 1, and our binary representation treats that the same as appearing once.
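To get a first feel for that geometry, here is a small hand-rolled cosine similarity check on the binary vectors we just built (a sketch reusing binary_matrix from the cell above; the values in the comments are what it should print):
import math
def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
print(f"Doc 1 vs Doc 2: {cosine(binary_matrix[0], binary_matrix[1]):.3f}")  # 0.600 (share 'on', 'sat', 'the')
print(f"Doc 1 vs Doc 4: {cosine(binary_matrix[0], binary_matrix[3]):.3f}")  # 0.365 (share only 'on', 'the')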
Term Frequency: Counting Occurrences¶
A natural improvement: instead of just recording whether a word appears, count how many times it appears. This gives us a term frequency (TF) representation.
from collections import Counter
# Build term frequency vectors
tf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    tf_vec = [counts.get(w, 0) for w in vocab]
    tf_matrix.append(tf_vec)
df_tf = pd.DataFrame(tf_matrix, columns=vocab, index=[f"Doc {i+1}" for i in range(len(corpus))])
df_tf
Now we can see that “the” appears twice in Documents 1, 2, and 3, while “a” only appears once in Document 4. We’re capturing more information than the binary representation.
But look at “the” — it appears in almost every document and has high counts. Is “the” really the most important word for understanding what a document is about? Probably not. We need a way to account for the fact that some words are common across all documents and thus not very informative.
The Bag of Words Model¶
What we’ve just built has a formal name: the bag of words (BoW) model. The name captures a key property — we treat each document as an unordered collection (a “bag”) of words. Word order is completely discarded.
Think about it: the sentences “the cat chased the dog” and “the dog chased the cat” would have identical BoW representations, even though they describe quite different events. That’s a significant limitation, and it’s worth keeping in mind as we work with this representation.
Despite this limitation, BoW is surprisingly effective for many tasks. For document classification, for example, the presence and frequency of certain words often matters more than their exact arrangement. A movie review containing “brilliant”, “masterpiece”, and “stunning” is probably positive regardless of word order.
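To see that order-blindness concretely, here is a quick sketch (it uses CountVectorizer, which we cover properly later in this lecture) confirming that the two “chased” sentences get identical vectors:
from sklearn.feature_extraction.text import CountVectorizer
# Two sentences with the same words but opposite meanings
pair = ["the cat chased the dog", "the dog chased the cat"]
vec = CountVectorizer()
X = vec.fit_transform(pair)
print(vec.get_feature_names_out())  # ['cat' 'chased' 'dog' 'the']
print(X.toarray())                  # both rows are [1 1 1 2]
print((X[0] != X[1]).nnz == 0)      # True -> identical bag-of-words vectors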
TF-IDF: Making Frequencies Smarter¶
The BoW model counts words, but it treats all words equally. The word “the” gets a high count in nearly every English document, but it tells us nothing about what the document is about. What we want is a way to highlight words that are distinctive to a particular document.
This is exactly the intuition behind TF-IDF — Term Frequency–Inverse Document Frequency. The idea is elegant: weight each word by how frequent it is in this document (that’s good) but penalize it for being frequent across all documents (that’s uninformative).
Term Frequency (TF)¶
We already have this. It’s just the count of how many times term $t$ appears in document $d$:
$$\mathrm{tf}(t, d) = \mathrm{count}(t, d)$$
Some implementations normalize by the total number of terms in the document, but the raw count is the simplest form.
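As a small sketch of those two variants (the helper names tf_raw and tf_normalized are just for illustration):
from collections import Counter
def tf_raw(term, doc_tokens):
    # Raw count: how many times the term occurs in the document
    return Counter(doc_tokens)[term]
def tf_normalized(term, doc_tokens):
    # Length-normalized variant: count divided by document length
    return Counter(doc_tokens)[term] / len(doc_tokens)
doc = "the cat sat on the mat".split()
print(tf_raw("the", doc))         # 2
print(tf_normalized("the", doc))  # 0.333...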
Inverse Document Frequency (IDF)¶
This is the key ingredient. IDF measures how rare a word is across the entire corpus. The formula is:
$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$$
where $N$ is the total number of documents and the denominator $\mathrm{df}(t)$ counts how many documents contain term $t$.
Let’s unpack this:
If a term appears in every document, IDF $= \log(N/N) = \log 1 = 0$. The term is completely uninformative.
If a term appears in only one document, IDF $= \log(N/1) = \log N$. It’s highly distinctive.
The logarithm prevents rare words from dominating too aggressively.
Putting It Together¶
The TF-IDF score for term $t$ in document $d$ is simply the product:
$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
Words that appear frequently in a document and rarely in the corpus get high scores. Words that appear everywhere get scores near zero.
Worked Example¶
Let’s compute TF-IDF for our corpus by hand, then verify with code. We’ll focus on a few interesting words.
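For example, take Document 3 (“the cat chased the dog”) with $N = 4$ and $\log$ taken as the natural logarithm, matching math.log in the code below:
$$
\begin{aligned}
\text{tf-idf}(\text{the}, d_3) &= 2 \times \log\tfrac{4}{4} = 2 \times 0 = 0 \\
\text{tf-idf}(\text{chased}, d_3) &= 1 \times \log\tfrac{4}{1} \approx 1 \times 1.386 = 1.386
\end{aligned}
$$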
import math
N = len(corpus) # 4 documents
# Compute document frequency (df) for each word
doc_freq = {}
for w in vocab:
    doc_freq[w] = sum(1 for doc in tokenized if w in doc)
print("Document frequencies:")
for w in vocab:
    print(f" '{w}': appears in {doc_freq[w]} of {N} documents")
Document frequencies:
'a': appears in 1 of 4 documents
'bird': appears in 1 of 4 documents
'branch': appears in 1 of 4 documents
'cat': appears in 2 of 4 documents
'chased': appears in 1 of 4 documents
'dog': appears in 2 of 4 documents
'log': appears in 1 of 4 documents
'mat': appears in 1 of 4 documents
'on': appears in 3 of 4 documents
'sang': appears in 1 of 4 documents
'sat': appears in 2 of 4 documents
'the': appears in 4 of 4 documents
# Compute IDF for each word
idf = {}
for w in vocab:
    idf[w] = math.log(N / doc_freq[w])
print("\nIDF scores:")
for w in sorted(idf, key=idf.get, reverse=True):
    print(f" '{w}': {idf[w]:.3f}")
IDF scores:
'a': 1.386
'bird': 1.386
'branch': 1.386
'chased': 1.386
'log': 1.386
'mat': 1.386
'sang': 1.386
'cat': 0.693
'dog': 0.693
'sat': 0.693
'on': 0.288
'the': 0.000
Notice how “the” and “on” get low IDF scores (they appear in many documents), while words like “bird”, “branch”, “sang” get high IDF scores (they appear in only one document).
Now let’s compute the full TF-IDF matrix:
# Compute TF-IDF
tfidf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    tfidf_vec = [counts.get(w, 0) * idf[w] for w in vocab]
    tfidf_matrix.append(tfidf_vec)
df_tfidf = pd.DataFrame(
    [[f"{v:.2f}" for v in row] for row in tfidf_matrix],
    columns=vocab,
    index=[f"Doc {i+1}" for i in range(len(corpus))],
)
df_tfidf
Look at what happened:
“the” is zeroed out entirely: it appears in every document, so its IDF is 0
Words like “cat”, “mat”, “bird” get meaningful scores that distinguish the documents
Document 4 stands out more clearly now, with high scores for its unique words
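A quick way to confirm this programmatically, reusing tfidf_matrix and vocab from the cells above (Document 4’s four unique words tie for the top score, so max simply reports the first of them):
# Highest-scoring term in each document under our manual TF-IDF
for i, row in enumerate(tfidf_matrix):
    j = max(range(len(vocab)), key=lambda k: row[k])
    print(f"Doc {i+1}: top term = '{vocab[j]}' ({row[j]:.2f})")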
Feature Extraction with Scikit-learn¶
Building TF-IDF from scratch is instructive, but in practice we use battle-tested implementations. Scikit-learn provides two key tools: CountVectorizer for BoW and TfidfVectorizer for TF-IDF.
CountVectorizer: Industrial-Strength BoW¶
from sklearn.feature_extraction.text import CountVectorizer
# Fit a CountVectorizer on our corpus
count_vec = CountVectorizer()
bow_matrix = count_vec.fit_transform(corpus)
# Inspect the vocabulary
print("Vocabulary mapping:")
print(count_vec.vocabulary_)
print(f"\nMatrix shape: {bow_matrix.shape}")
print(f"Matrix type: {type(bow_matrix)}")Vocabulary mapping:
{'the': 10, 'cat': 2, 'sat': 9, 'on': 7, 'mat': 6, 'dog': 4, 'log': 5, 'chased': 3, 'bird': 0, 'sang': 8, 'branch': 1}
Matrix shape: (4, 11)
Matrix type: <class 'scipy.sparse._csr.csr_matrix'>
# Convert to a dense DataFrame for inspection
feature_names = count_vec.get_feature_names_out()
df_bow = pd.DataFrame(bow_matrix.toarray(), columns=feature_names, index=[f"Doc {i+1}" for i in range(len(corpus))])
df_bow
Notice that CountVectorizer automatically tokenizes the text (splitting on whitespace and punctuation), lowercases everything, and builds the vocabulary. Its default token pattern keeps only tokens of two or more characters, which is why “a” was dropped and the matrix has 11 columns instead of 12. The result is a sparse matrix; more on that in a moment.
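If you want to keep single-character tokens like “a”, you can override the token pattern; this sketch just makes the default behavior visible:
# The default token_pattern r"(?u)\b\w\w+\b" keeps only tokens of 2+ characters.
# Relaxing it to \w+ brings 'a' back, restoring all 12 vocabulary words.
vec_keep_a = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vec_keep_a.fit(corpus)
print(vec_keep_a.get_feature_names_out())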
TfidfVectorizer: One-Stop TF-IDF¶
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit a TfidfVectorizer on our corpus
tfidf_vec = TfidfVectorizer()
tfidf_matrix_sk = tfidf_vec.fit_transform(corpus)
df_tfidf_sk = pd.DataFrame(
    tfidf_matrix_sk.toarray().round(3),
    columns=tfidf_vec.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))],
)
df_tfidf_sk
Wait, “the” is the highest-scoring word in Documents 1–3? Our manual version gave it zero. That seems to defeat the whole purpose of TF-IDF. What happened?
Scikit-learn’s formula differs from the textbook version in a way that matters on tiny corpora like ours:
IDF has a +1 floor: sklearn computes $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$. That trailing +1 means IDF never reaches zero. In our formula, “the” gets $\log(4/4) = 0$. In sklearn’s, it gets $\ln(5/5) + 1 = 1$.
High TF wins: “the” appears twice in Documents 1–3, so its raw (pre-normalization) score is $2 \times 1 = 2$, higher than “mat” ($1 \times (\ln(5/2) + 1) \approx 1.92$) even though “mat” has a much higher IDF. On our 4-document corpus, the frequency advantage of “the” overwhelms the IDF penalty.
L2 normalization: sklearn then normalizes each vector to unit length, which is why values fall in the 0–1 range, but this doesn’t change the ranking within a document (we verify the unit norms just below).
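As a quick check of that last point, every row of tfidf_matrix_sk should come back with unit length:
import numpy as np
# Each row of sklearn's TF-IDF matrix is L2-normalized to length 1
row_norms = np.linalg.norm(tfidf_matrix_sk.toarray(), axis=1)
print(row_norms.round(6))  # expect [1. 1. 1. 1.]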
This is an artifact of our tiny corpus. On a real corpus with thousands of documents, “the” would still get the minimum IDF, but content words would appear often enough that they’d dominate. The +1 floor is a practical choice: it prevents terms from being completely zeroed out, and L2 normalization ensures longer documents don’t dominate shorter ones in downstream ML tasks.
The takeaway: our manual implementation captures the core intuition of TF-IDF more clearly — common words get zeroed out, rare words get boosted. Sklearn’s version applies the same principle with practical adjustments that work better at scale but can look surprising on toy examples. In practice, you’ll typically combine TfidfVectorizer with stop word removal or max_df filtering to handle words like “the” — which we’ll see in a moment.
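Here is one sketch of that kind of filtering: max_df drops terms that appear in more than a given fraction of documents, which is enough to remove “the” even on this toy corpus.
# Ignore terms with document frequency above 75% ('the' appears in all 4 docs)
tfidf_filtered = TfidfVectorizer(max_df=0.75)
tfidf_filtered.fit(corpus)
print(tfidf_filtered.get_feature_names_out())  # vocabulary no longer contains 'the'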
Useful Parameters¶
Both vectorizers accept parameters that control how text is processed:
# max_features: limit vocabulary size to most frequent terms
vec_small = CountVectorizer(max_features=5)
vec_small.fit(corpus)
print(f"Top 5 features: {vec_small.get_feature_names_out()}")
# min_df / max_df: filter by document frequency
vec_filtered = CountVectorizer(min_df=2) # word must appear in >= 2 docs
vec_filtered.fit(corpus)
print(f"Features appearing in 2+ docs: {vec_filtered.get_feature_names_out()}")
# ngram_range: capture word pairs (bigrams) too
vec_bigram = CountVectorizer(ngram_range=(1, 2))
vec_bigram.fit(corpus)
print(f"\nWith bigrams ({len(vec_bigram.get_feature_names_out())} features):")
print(vec_bigram.get_feature_names_out())
Top 5 features: ['cat' 'dog' 'on' 'sat' 'the']
Features appearing in 2+ docs: ['cat' 'dog' 'on' 'sat' 'the']
With bigrams (24 features):
['bird' 'bird sang' 'branch' 'cat' 'cat chased' 'cat sat' 'chased'
'chased the' 'dog' 'dog sat' 'log' 'mat' 'on' 'on the' 'sang' 'sang on'
'sat' 'sat on' 'the' 'the branch' 'the cat' 'the dog' 'the log' 'the mat']
The ngram_range parameter is particularly interesting — by including bigrams (pairs of consecutive words), we start to capture some word order information that the basic BoW model throws away. The cost is a much larger vocabulary.
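In fact, bigrams are enough to separate the two “chased” sentences that plain BoW could not tell apart; a quick sketch:
# With unigrams + bigrams, the two 'chased' sentences are no longer identical
pair = ["the cat chased the dog", "the dog chased the cat"]
vec12 = CountVectorizer(ngram_range=(1, 2))
X12 = vec12.fit_transform(pair)
print((X12[0] != X12[1]).nnz == 0)  # False -> bigrams recover some word order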
Why Sparse Matrices?¶
You may have noticed that the vectorizers return scipy.sparse matrices rather than regular NumPy arrays. Let’s see why:
# Imagine a real corpus with a large vocabulary
print(f"Our small corpus: {bow_matrix.shape[0]} docs × {bow_matrix.shape[1]} features")
print(f"Non-zero entries: {bow_matrix.nnz} out of {bow_matrix.shape[0] * bow_matrix.shape[1]} total")
print(f"Sparsity: {1 - bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.1%}")Our small corpus: 4 docs × 11 features
Non-zero entries: 19 out of 44 total
Sparsity: 56.8%
Even with our tiny corpus, most entries are zero. In real-world applications with thousands of documents and tens of thousands of vocabulary words, the matrix would be over 99% zeros. Storing all those zeros in a dense array wastes memory. Sparse matrices store only the non-zero values, making them much more efficient.
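A back-of-the-envelope comparison shows why this matters at scale. The corpus size, vocabulary size, and terms-per-document figures below are illustrative assumptions, not measurements:
# Hypothetical scale: 10,000 docs, 50,000-term vocabulary, ~100 distinct terms per doc
n_docs, n_terms, nnz_per_doc = 10_000, 50_000, 100
dense_bytes = n_docs * n_terms * 8                               # float64 dense matrix
csr_bytes = n_docs * nnz_per_doc * (8 + 4) + (n_docs + 1) * 4    # CSR: data + column indices + row pointers
print(f"Dense:  ~{dense_bytes / 1e9:.1f} GB")   # ~4.0 GB
print(f"Sparse: ~{csr_bytes / 1e6:.1f} MB")     # ~12.0 MB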
Limitations of Sparse Representations¶
We’ve built up from binary presence to term frequency to TF-IDF — each step was an improvement. But sparse representations have fundamental limitations that no amount of clever weighting can fix.
High Dimensionality¶
The vocabulary size is the number of dimensions. A typical English corpus might have 50,000–100,000 unique words. That means each document is a point in a 50,000-dimensional space where almost all coordinates are zero. This is both wasteful and problematic for many ML algorithms.
No Semantic Similarity¶
In a sparse representation, every word gets its own dimension. The words “happy” and “joyful” are completely orthogonal — their vectors share no dimensions at all, so their similarity is zero:
from sklearn.metrics.pairwise import cosine_similarity
# Two semantically similar sentences
similar_docs = [
"I am happy and joyful today",
"I am glad and cheerful today",
]
vec = CountVectorizer()
vecs = vec.fit_transform(similar_docs)
sim = cosine_similarity(vecs[0], vecs[1])[0][0]
print(f"Cosine similarity: {sim:.3f}")
print(f"Vocabulary: {vec.get_feature_names_out()}")
print(f"Vectors:\n{vecs.toarray()}")Cosine similarity: 0.600
Vocabulary: ['am' 'and' 'cheerful' 'glad' 'happy' 'joyful' 'today']
Vectors:
[[1 1 0 0 1 1 1]
[1 1 1 1 0 0 1]]
These two sentences mean nearly the same thing, but their BoW vectors only overlap on the shared words “am”, “and”, and “today” (single-character “I” is dropped by the default tokenizer). The synonyms (“happy”/“glad”, “joyful”/“cheerful”) contribute zero similarity.
No Word Order¶
As we saw earlier, “the cat chased the dog” and “the dog chased the cat” are identical under BoW. For many tasks this is acceptable, but for tasks requiring understanding of who did what to whom, it’s a serious gap.
The Path Forward¶
These limitations point directly to what we’ll explore in the next lecture: dense representations. Instead of one dimension per word, what if we could represent words (and documents) in a compact space where semantically similar items are close together? That’s the promise of word embeddings — and it’s where we’re headed next.
Wrap-Up¶
Key Takeaways¶
Bag of words represents each document as a vector of word counts over a fixed vocabulary, discarding word order entirely.
TF-IDF re-weights those counts so that terms frequent in a document but rare across the corpus stand out, while ubiquitous words like “the” are pushed toward zero.
Scikit-learn’s CountVectorizer and TfidfVectorizer handle tokenization, vocabulary building, and weighting, and return memory-efficient sparse matrices.
Sparse representations are high-dimensional, ignore word order, and treat synonyms as unrelated dimensions, which is what motivates dense embeddings.
What’s Next¶
In the next lecture, we’ll move from sparse representations to dense representations. We’ll explore word embeddings — Word2Vec, GloVe, and subword methods like BPE — where words live in a compact space and “happy” is actually close to “joyful”.