Text Normalization: Stemming, Lemmatization, and Regex
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Week 1: What is NLP (tokens, SpaCy basics)
Week 2 Part 1: Tokenization
Basic Python string manipulation
Outcomes
Understand why text normalization reduces vocabulary and improves model generalization
Compare stemming vs. lemmatization and know when to use each
Apply NLTK stemmers and SpaCy’s lemmatizer
Write basic regular expressions for text cleaning
Handle common normalization tasks: case folding, punctuation, stop word filtering
References
J&M Chapter 2: Words and Tokens
Why Normalize Text?
In the last lecture, we learned that tokenization breaks text into pieces. But here’s a problem: the same concept appears in many surface forms.
# All of these express similar ideas about the verb "run"
variations = ["run", "runs", "running", "ran", "runner"]
print(f"5 tokens, but really 1-2 concepts: {variations}")5 tokens, but really 1-2 concepts: ['run', 'runs', 'running', 'ran', 'runner']
If our model treats each form as completely unrelated, it must learn everything separately. Normalization solves this by reducing words to a standard form, shrinking our vocabulary while preserving meaning.
We’ll cover three main techniques:
Case folding — the simplest normalization
Stemming — fast, crude suffix removal
Lemmatization — accurate, linguistic reduction
Plus regular expressions as a tool for pattern-based cleaning.
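To see the payoff concretely, here is a small sketch using SpaCy’s lemmatizer (covered in detail below). The example sentence is made up for illustration, but the pattern holds: after normalization there are fewer unique forms to learn.
# Sketch: count unique surface forms vs. unique lemmas in a short passage
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He runs daily. She ran yesterday. They are running now, and he runs fast.")

surface_forms = {t.text.lower() for t in doc if t.is_alpha}
lemmas = {t.lemma_.lower() for t in doc if t.is_alpha}
print(f"Unique surface forms: {len(surface_forms)}")
print(f"Unique lemmas:        {len(lemmas)}")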
Case Folding and Basic Cleanup
The simplest normalization: convert everything to lowercase.
text = "The Quick Brown Fox Jumps Over THE LAZY DOG"
print(text.lower())
the quick brown fox jumps over the lazy dog
This collapses “The”, “THE”, and “the” into one token. But be careful—sometimes case matters:
“US” (country) vs. “us” (pronoun)
“Apple” (company) vs. “apple” (fruit)
Named entities often use capitalization
For most tasks, lowercasing helps. For NER, you might keep it.
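One middle ground is selective case folding. The sketch below is an illustration rather than a standard recipe: it lowercases everything except tokens that SpaCy marks as part of a named entity (the example sentence is made up).
# Sketch: lowercase everything except tokens inside named entities
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new store in the US last week.")

tokens = [t.text if t.ent_type_ else t.text.lower() for t in doc]
print(tokens)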
import re
def basic_normalize(text):
    """Simple normalization: lowercase and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple spaces
    return text.strip()
messy = " The Quick\n\nBrown Fox "
print(f"Before: '{messy}'")
print(f"After: '{basic_normalize(messy)}'")Before: ' The Quick
Brown Fox '
After: 'the quick brown fox'
Stemming: The Fast and Crude Approach
Stemming removes suffixes using heuristic rules to find a word’s “stem.” It’s fast but imprecise.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "runs", "ran", "studies", "studying", "happier", "happiness"]
print(f"{'Word':<12} {'Porter':<12} {'Lancaster':<12} {'Snowball':<12}")
print("-" * 48)
for word in words:
print(f"{word:<12} {porter.stem(word):<12} {lancaster.stem(word):<12} {snowball.stem(word):<12}")Word Porter Lancaster Snowball
------------------------------------------------
running run run run
runs run run run
ran ran ran ran
studies studi study studi
studying studi study studi
happier happier happy happier
happiness happi happy happi
Notice the problems:
“studies” → “studi” (not a real word!)
“ran” → “ran” (irregular verb not handled)
Lancaster is the most aggressive of the three: here it maps both “happier” and “happiness” to “happy”, but on other words it can over-stem and collapse unrelated terms
When to Use Stemming
Stemming works well when:
Speed matters more than precision
Downstream task is robust to slight errors (search, basic classification)
You need language-agnostic processing (Snowball supports many languages)
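Snowball’s language coverage is easy to check. The sketch below builds a Spanish stemmer the same way we built the English one above; the Spanish words are just illustrative, and the stems come back as truncated forms rather than dictionary words.
# Snowball ships stemmers for many languages
print(SnowballStemmer.languages)

spanish = SnowballStemmer("spanish")
for w in ["corriendo", "corren", "correr"]:   # forms of "correr" (to run)
    print(w, "->", spanish.stem(w))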
# Stemming for search: "running shoes" matches "run shoe"
query = "running shoes"
document = "These shoes are great for runners who run daily"
query_stems = [porter.stem(w) for w in query.lower().split()]
doc_stems = [porter.stem(w) for w in document.lower().split()]
print(f"Query stems: {query_stems}")
print(f"Doc stems: {doc_stems}")
print(f"Overlap: {set(query_stems) & set(doc_stems)}")Query stems: ['run', 'shoe']
Doc stems: ['these', 'shoe', 'are', 'great', 'for', 'runner', 'who', 'run', 'daili']
Overlap: {'run', 'shoe'}
Lemmatization: The Linguistic Approach
Lemmatization uses vocabulary and morphological analysis to find a word’s dictionary form (lemma). It’s slower but more accurate.
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["running", "runs", "ran", "better", "studies", "mice", "are", "was"]
print(f"{'Word':<12} {'Lemma':<12} {'POS':<8}")
print("-" * 32)
for word in words:
    doc = nlp(word)
    token = doc[0]
    print(f"{word:<12} {token.lemma_:<12} {token.pos_:<8}")
Word         Lemma        POS
--------------------------------
running      run          VERB
runs         run          VERB
ran          run          VERB
better       well         ADV
studies      study        NOUN
mice         mouse        NOUN
are          be           AUX
was          be           AUX
Key differences from stemming:
“ran” → “run” (handles irregular verbs!)
“better” → “well” (handles irregular comparison; SpaCy treated the standalone word as an adverb here)
“mice” → “mouse” (handles irregular plurals!)
“are”/“was” → “be” (maps conjugations to infinitive)
Lemmatization Needs Context
Lemmatization depends on part-of-speech. The word “meeting” could be a noun or a verb:
sentences = [
    "I am meeting my friend.",   # meeting = verb
    "The meeting was boring.",   # meeting = noun
]
for sent in sentences:
    doc = nlp(sent)
    for token in doc:
        if token.text.lower() == "meeting":
            print(f"'{sent}'")
            print(f" 'meeting' -> lemma='{token.lemma_}', pos={token.pos_}")
'I am meeting my friend.'
'meeting' -> lemma='meet', pos=VERB
'The meeting was boring.'
'meeting' -> lemma='meeting', pos=NOUN
SpaCy handles this automatically because it runs POS tagging before lemmatization.
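You can confirm the ordering by listing the pipeline components (the exact component names vary a bit across model versions):
# The tagger comes before the lemmatizer in the pipeline
print(nlp.pipe_names)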
Stemming vs. Lemmatization: Quick Comparison
test_words = ["running", "better", "studies", "wolves", "caring", "happily"]
print(f"{'Word':<12} {'Porter Stem':<14} {'SpaCy Lemma':<12}")
print("-" * 38)
for word in test_words:
    stem = porter.stem(word)
    doc = nlp(word)
    lemma = doc[0].lemma_
    print(f"{word:<12} {stem:<14} {lemma:<12}")
Word         Porter Stem    SpaCy Lemma
--------------------------------------
running      run            run
better       better         well
studies      studi          study
wolves       wolv           wolf
caring       care           care
happily      happili        happily
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slower |
| Output | May be non-words | Always valid words |
| Accuracy | Heuristic rules | Linguistic analysis |
| Context | Ignores context | Uses POS tags |
| Best for | Search, IR | Chatbots, analysis |
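To see the speed gap yourself, here is a rough timing sketch that reuses the porter stemmer and nlp pipeline defined above; the absolute numbers depend on your machine and model, so treat them as illustrative only.
import time

sample = ["running", "studies", "wolves", "happily"] * 500  # 2,000 words

start = time.perf_counter()
_ = [porter.stem(w) for w in sample]
print(f"Stemming:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
_ = [tok.lemma_ for doc in nlp.pipe(sample) for tok in doc]
print(f"Lemmatization: {time.perf_counter() - start:.3f}s")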
Stop Words: To Remove or Not?
Stop words are common words (the, is, at, which) that often carry little semantic meaning.
# SpaCy has a built-in stop word list
print(f"SpaCy has {len(nlp.Defaults.stop_words)} stop words")
print(f"Sample: {list(nlp.Defaults.stop_words)[:15]}")SpaCy has 326 stop words
Sample: ['the', 'various', 'side', 'any', 'toward', 'besides', 'whoever', 'seems', "'m", 'mine', 'ourselves', 'not', 'against', 'well', 'due']
# Filtering stop words
text = "The quick brown fox jumps over the lazy dog"
doc = nlp(text)
content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(f"Original: {text}")
print(f"Content words: {content_words}")Original: The quick brown fox jumps over the lazy dog
Content words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
When Stop Words Matter
Don’t blindly remove stop words! They matter for:
Phrases: “to be or not to be” loses meaning without stop words
Negation: “not good” vs. “good”
Language models: Stop words are part of fluent text
# Negation example: stop word removal can flip meaning!
texts = ["This movie is not good", "This movie is good"]
for text in texts:
    doc = nlp(text)
    filtered = [t.text for t in doc if not t.is_stop]
    print(f"Original: '{text}' -> Filtered: {filtered}")
Original: 'This movie is not good' -> Filtered: ['movie', 'good']
Original: 'This movie is good' -> Filtered: ['movie', 'good']
Both sentences become ['movie', 'good']! For sentiment analysis, this is a disaster.
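One common workaround is to exempt negation words from filtering even though they are stop words. The sketch below reuses the nlp pipeline from above; the small negation set is just an example, not an exhaustive list.
# Sketch: keep negation words even though SpaCy marks them as stop words
negations = {"not", "no", "never", "n't"}

for text in ["This movie is not good", "This movie is good"]:
    doc = nlp(text)
    filtered = [t.text for t in doc
                if not t.is_stop or t.text.lower() in negations]
    print(f"'{text}' -> {filtered}")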
Regular Expressions for Text Cleaning
Regular expressions are patterns for matching text. They’re essential for cleaning messy data.
Core Patterns
import re
text = "Contact us at support@example.com or call 555-123-4567!"
# Character classes
print("Digits:", re.findall(r'\d+', text)) # \d = digit
print("Words:", re.findall(r'\w+', text)) # \w = word character
print("Non-word:", re.findall(r'\W+', text)) # \W = non-wordDigits: ['555', '123', '4567']
Words: ['Contact', 'us', 'at', 'support', 'example', 'com', 'or', 'call', '555', '123', '4567']
Non-word: [' ', ' ', ' ', '@', '.', ' ', ' ', ' ', '-', '-', '!']
Common NLP Cleaning Patterns
def clean_text(text):
    """Clean text using common regex patterns."""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+\.\S+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
messy = "Check <b>this</b> out: https://example.com and email me@test.com please!"
print(f"Before: {messy}")
print(f"After: {clean_text(messy)}")Before: Check <b>this</b> out: https://example.com and email me@test.com please!
After: Check this out: and email please!
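If you are cleaning a large corpus, precompiling the patterns keeps them named in one place and avoids repeated pattern lookups. The variant below is a sketch of the same clean_text logic, reusing the messy string above; clean_text_compiled is a made-up name for illustration.
# Same cleaning logic with precompiled patterns (sketch for large corpora)
URL_RE   = re.compile(r'https?://\S+|www\.\S+')
TAG_RE   = re.compile(r'<[^>]+>')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
WS_RE    = re.compile(r'\s+')

def clean_text_compiled(text):
    text = URL_RE.sub('', text)
    text = TAG_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    return WS_RE.sub(' ', text).strip()

print(clean_text_compiled(messy))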
Quick Regex Reference
| Pattern | Matches | Example |
|---|---|---|
| \d | Digit | \d+ matches “123” |
| \w | Word char (letter, digit, _) | \w+ matches “hello_world” |
| \s | Whitespace | \s+ matches spaces, tabs, newlines |
| . | Any character | a.c matches “abc”, “a1c” |
| * | 0 or more | ab* matches “a”, “ab”, “abbb” |
| + | 1 or more | ab+ matches “ab”, “abbb” (not “a”) |
| ? | 0 or 1 | colou?r matches “color”, “colour” |
| [abc] | Character set | [aeiou] matches vowels |
| ^ | Start of string | ^Hello matches “Hello world” |
| $ | End of string | world$ matches “Hello world” |
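A few of the table’s rows, tried in code with the re module imported above:
# Trying a few patterns from the reference table
print(re.findall(r'colou?r', "color colour colr"))   # optional 'u'
print(re.findall(r'[aeiou]', "sky"))                  # character set: no vowels in "sky"
print(bool(re.search(r'^Hello', "Hello world")))      # anchored at the start
print(bool(re.search(r'world$', "Hello world")))      # anchored at the end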
Putting It Together
Here’s a flexible normalization function combining our techniques:
import re
import spacy
nlp = spacy.load("en_core_web_sm")
def normalize_text(text, lowercase=True, remove_punct=True,
                   lemmatize=True, remove_stops=False):
    """Normalize text with configurable options."""
    # Basic cleaning
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text)          # Normalize whitespace
    if lowercase:
        text = text.lower()
    # Process with SpaCy
    doc = nlp(text)
    tokens = []
    for token in doc:
        if remove_punct and token.is_punct:
            continue
        if remove_stops and token.is_stop:
            continue
        if lemmatize:
            tokens.append(token.lemma_.lower() if lowercase else token.lemma_)
        else:
            tokens.append(token.text.lower() if lowercase else token.text)
    return tokens
# Test it
text = "The researchers were studying NLP. They ran many experiments!"
print("Default:", normalize_text(text))
print("No lemma:", normalize_text(text, lemmatize=False))
print("No stops:", normalize_text(text, remove_stops=True))Default: ['the', 'researcher', 'be', 'study', 'nlp', 'they', 'run', 'many', 'experiment']
No lemma: ['the', 'researchers', 'were', 'studying', 'nlp', 'they', 'ran', 'many', 'experiments']
No stops: ['researcher', 'study', 'nlp', 'run', 'experiment']
When to Use What
| Technique | Use When | Avoid When |
|---|---|---|
| Lowercasing | Most tasks | NER, case-sensitive domains |
| Stemming | Search, IR, speed critical | Need valid words, accuracy critical |
| Lemmatization | Analysis, chatbots, need valid words | Speed critical, no SpaCy available |
| Stop word removal | Classification, clustering, topic modeling | Language modeling, negation matters |
| Regex cleaning | URLs, HTML, formatting noise | Complex linguistic patterns |
Wrap-Up
Key Takeaways
Normalization reduces surface-form variation, shrinking the vocabulary while preserving meaning
Case folding is the simplest step, but keep the original casing when it carries signal (e.g., for NER)
Stemming is fast but crude; lemmatization is slower but produces valid, POS-aware dictionary forms
Stop word removal helps some tasks (classification, topic modeling) but can destroy meaning when negation or phrasing matters
Regular expressions handle pattern-based cleanup: URLs, HTML tags, emails, and whitespace
What’s Next
In the next lecture, we’ll tackle real-world messiness: encoding issues, multilingual text, social media content, and building robust preprocessing pipelines that handle whatever text gets thrown at them.