Text Normalization: Stemming, Lemmatization, and Regex
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Week 1: What is NLP (tokens, SpaCy basics)
Week 2 Part 1: Tokenization
Basic Python string manipulation
Outcomes
Understand why text normalization reduces vocabulary and improves model generalization
Compare stemming vs. lemmatization and know when to use each
Apply NLTK stemmers and SpaCy’s lemmatizer
Write basic regular expressions for text cleaning
Handle common normalization tasks: case folding, punctuation, stop word filtering
References
J&M Chapter 2: Words and Tokens
Why Normalize Text?
In the last lecture, we learned that tokenization breaks text into pieces. But here’s a problem: the same concept appears in many surface forms.
# All of these express similar ideas about the verb "run"
variations = ["run", "runs", "running", "ran", "runner"]
print(f"5 tokens, but really 1-2 concepts: {variations}")5 tokens, but really 1-2 concepts: ['run', 'runs', 'running', 'ran', 'runner']
If our model treats each form as completely unrelated, it must learn everything separately. Normalization solves this by reducing words to a standard form, shrinking our vocabulary while preserving meaning.
We’ll cover three main techniques:
Case folding — the simplest normalization
Stemming — fast, crude suffix removal
Lemmatization — accurate, linguistic reduction
Plus regular expressions as a tool for pattern-based cleaning.
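To see the payoff concretely, here is a small sketch using SpaCy’s lemmatizer (covered in detail below). The example sentence is made up for illustration, but the pattern holds: after normalization there are fewer unique forms to learn.
# Sketch: count unique surface forms vs. unique lemmas in a short passage
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He runs daily. She ran yesterday. They are running now, and he runs fast.")

surface_forms = {t.text.lower() for t in doc if t.is_alpha}
lemmas = {t.lemma_.lower() for t in doc if t.is_alpha}
print(f"Unique surface forms: {len(surface_forms)}")
print(f"Unique lemmas:        {len(lemmas)}")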
Case Folding and Basic Cleanup
The simplest normalization: convert everything to lowercase.
text = "The Quick Brown Fox Jumps Over THE LAZY DOG"
print(text.lower())
the quick brown fox jumps over the lazy dog
This collapses “The”, “THE”, and “the” into one token. But be careful—sometimes case matters:
“US” (country) vs. “us” (pronoun)
“Apple” (company) vs. “apple” (fruit)
Named entities often use capitalization
For most tasks, lowercasing helps. For NER, you might keep it.
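One middle ground is selective case folding. The sketch below is an illustration rather than a standard recipe: it lowercases everything except tokens that SpaCy marks as part of a named entity (the example sentence is made up).
# Sketch: lowercase everything except tokens inside named entities
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new store in the US last week.")

tokens = [t.text if t.ent_type_ else t.text.lower() for t in doc]
print(tokens)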
import re
def basic_normalize(text):
    """Simple normalization: lowercase and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple spaces
    return text.strip()
messy = " The Quick\n\nBrown Fox "
print(f"Before: '{messy}'")
print(f"After: '{basic_normalize(messy)}'")Before: ' The Quick
Brown Fox '
After: 'the quick brown fox'
Stemming: The Fast and Crude Approach
Stemming removes suffixes using heuristic rules to find a word’s “stem.” It’s fast but imprecise.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "runs", "ran", "studies", "studying", "happier", "happiness"]
print(f"{'Word':<12} {'Porter':<12} {'Lancaster':<12} {'Snowball':<12}")
print("-" * 48)
for word in words:
print(f"{word:<12} {porter.stem(word):<12} {lancaster.stem(word):<12} {snowball.stem(word):<12}")Word Porter Lancaster Snowball
------------------------------------------------
running run run run
runs run run run
ran ran ran ran
studies studi study studi
studying studi study studi
happier happier happy happier
happiness happi happy happi
Notice the problems:
“studies” → “studi” (not a real word!)
“ran” → “ran” (irregular verb not handled)
Lancaster is the most aggressive of the three: here it maps both “happier” and “happiness” to “happy”, but on other words it can over-stem and collapse unrelated terms
When to Use Stemming
Stemming works well when:
Speed matters more than precision
Downstream task is robust to slight errors (search, basic classification)
You need language-agnostic processing (Snowball supports many languages)
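Snowball’s language coverage is easy to check. The sketch below builds a Spanish stemmer the same way we built the English one above; the Spanish words are just illustrative, and the stems come back as truncated forms rather than dictionary words.
# Snowball ships stemmers for many languages
print(SnowballStemmer.languages)

spanish = SnowballStemmer("spanish")
for w in ["corriendo", "corren", "correr"]:   # forms of "correr" (to run)
    print(w, "->", spanish.stem(w))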
# Stemming for search: "running shoes" matches "run shoe"
query = "running shoes"
document = "These shoes are great for runners who run daily"
query_stems = [porter.stem(w) for w in query.lower().split()]
doc_stems = [porter.stem(w) for w in document.lower().split()]
print(f"Query stems: {query_stems}")
print(f"Doc stems: {doc_stems}")
print(f"Overlap: {set(query_stems) & set(doc_stems)}")Query stems: ['run', 'shoe']
Doc stems: ['these', 'shoe', 'are', 'great', 'for', 'runner', 'who', 'run', 'daili']
Overlap: {'run', 'shoe'}
Lemmatization: The Linguistic Approach
Lemmatization uses vocabulary and morphological analysis to find a word’s dictionary form (lemma). It’s slower but more accurate.
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["running", "runs", "ran", "better", "studies", "mice", "are", "was"]
print(f"{'Word':<12} {'Lemma':<12} {'POS':<8}")
print("-" * 32)
for word in words:
    doc = nlp(word)
    token = doc[0]
    print(f"{word:<12} {token.lemma_:<12} {token.pos_:<8}")
Word         Lemma        POS
--------------------------------
running      run          VERB
runs         run          VERB
ran          run          VERB
better       well         ADV
studies      study        NOUN
mice         mouse        NOUN
are          be           AUX
was          be           AUX
Key differences from stemming:
“ran” → “run” (handles irregular verbs!)
“better” → “well” (handles irregular comparison; SpaCy treated the standalone word as an adverb here)
“mice” → “mouse” (handles irregular plurals!)
“are”/“was” → “be” (maps conjugations to infinitive)
Lemmatization Needs Context
Lemmatization depends on part-of-speech. The word “meeting” could be a noun or a verb:
sentences = [
    "I am meeting my friend.",   # meeting = verb
    "The meeting was boring.",   # meeting = noun
]
for sent in sentences:
    doc = nlp(sent)
    for token in doc:
        if token.text.lower() == "meeting":
            print(f"'{sent}'")
            print(f" 'meeting' -> lemma='{token.lemma_}', pos={token.pos_}")
'I am meeting my friend.'
'meeting' -> lemma='meet', pos=VERB
'The meeting was boring.'
'meeting' -> lemma='meeting', pos=NOUN
SpaCy handles this automatically because it runs POS tagging before lemmatization.
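You can confirm the ordering by listing the pipeline components (the exact component names vary a bit across model versions):
# The tagger comes before the lemmatizer in the pipeline
print(nlp.pipe_names)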
Stemming vs. Lemmatization: Quick Comparison
test_words = ["running", "better", "studies", "wolves", "caring", "happily"]
print(f"{'Word':<12} {'Porter Stem':<14} {'SpaCy Lemma':<12}")
print("-" * 38)
for word in test_words:
    stem = porter.stem(word)
    doc = nlp(word)
    lemma = doc[0].lemma_
    print(f"{word:<12} {stem:<14} {lemma:<12}")
Word         Porter Stem    SpaCy Lemma
--------------------------------------
running      run            run
better       better         well
studies      studi          study
wolves       wolv           wolf
caring       care           care
happily      happili        happily
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slower |
| Output | May be non-words | Always valid words |
| Accuracy | Heuristic rules | Linguistic analysis |
| Context | Ignores context | Uses POS tags |
| Best for | Search, IR | Chatbots, analysis |
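To see the speed gap yourself, here is a rough timing sketch that reuses the porter stemmer and nlp pipeline defined above; the absolute numbers depend on your machine and model, so treat them as illustrative only.
import time

sample = ["running", "studies", "wolves", "happily"] * 500  # 2,000 words

start = time.perf_counter()
_ = [porter.stem(w) for w in sample]
print(f"Stemming:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
_ = [tok.lemma_ for doc in nlp.pipe(sample) for tok in doc]
print(f"Lemmatization: {time.perf_counter() - start:.3f}s")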
Stop Words: To Remove or Not?
Stop words are common words (the, is, at, which) that often carry little semantic meaning.
# SpaCy has a built-in stop word list
print(f"SpaCy has {len(nlp.Defaults.stop_words)} stop words")
print(f"Sample: {list(nlp.Defaults.stop_words)[:15]}")SpaCy has 326 stop words
Sample: ['the', 'various', 'side', 'any', 'toward', 'besides', 'whoever', 'seems', "'m", 'mine', 'ourselves', 'not', 'against', 'well', 'due']
# Filtering stop words
text = "The quick brown fox jumps over the lazy dog"
doc = nlp(text)
content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(f"Original: {text}")
print(f"Content words: {content_words}")Original: The quick brown fox jumps over the lazy dog
Content words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
When Stop Words Matter
Don’t blindly remove stop words! They matter for:
Phrases: “to be or not to be” loses meaning without stop words
Negation: “not good” vs. “good”
Language models: Stop words are part of fluent text
# Negation example: stop word removal can flip meaning!
texts = ["This movie is not good", "This movie is good"]
for text in texts:
    doc = nlp(text)
    filtered = [t.text for t in doc if not t.is_stop]
    print(f"Original: '{text}' -> Filtered: {filtered}")
Original: 'This movie is not good' -> Filtered: ['movie', 'good']
Original: 'This movie is good' -> Filtered: ['movie', 'good']
Both sentences become ['movie', 'good']! For sentiment analysis, this is a disaster.
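One common workaround is to exempt negation words from filtering even though they are stop words. The sketch below reuses the nlp pipeline from above; the small negation set is just an example, not an exhaustive list.
# Sketch: keep negation words even though SpaCy marks them as stop words
negations = {"not", "no", "never", "n't"}

for text in ["This movie is not good", "This movie is good"]:
    doc = nlp(text)
    filtered = [t.text for t in doc
                if not t.is_stop or t.text.lower() in negations]
    print(f"'{text}' -> {filtered}")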
Regular Expressions for Text Cleaning
Regular expressions are patterns for matching text. They’re essential for cleaning messy data.
Core Patterns
import re
text = "Contact us at support@example.com or call 555-123-4567!"
# Character classes
print("Digits:", re.findall(r'\d+', text)) # \d = digit
print("Words:", re.findall(r'\w+', text)) # \w = word character
print("Non-word:", re.findall(r'\W+', text)) # \W = non-wordDigits: ['555', '123', '4567']
Words: ['Contact', 'us', 'at', 'support', 'example', 'com', 'or', 'call', '555', '123', '4567']
Non-word: [' ', ' ', ' ', '@', '.', ' ', ' ', ' ', '-', '-', '!']
Common NLP Cleaning Patterns
def clean_text(text):
    """Clean text using common regex patterns."""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+\.\S+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
messy = "Check <b>this</b> out: https://example.com and email me@test.com please!"
print(f"Before: {messy}")
print(f"After: {clean_text(messy)}")Before: Check <b>this</b> out: https://example.com and email me@test.com please!
After: Check this out: and email please!
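If you are cleaning a large corpus, precompiling the patterns keeps them named in one place and avoids repeated pattern lookups. The variant below is a sketch of the same clean_text logic, reusing the messy string above; clean_text_compiled is a made-up name for illustration.
# Same cleaning logic with precompiled patterns (sketch for large corpora)
URL_RE   = re.compile(r'https?://\S+|www\.\S+')
TAG_RE   = re.compile(r'<[^>]+>')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
WS_RE    = re.compile(r'\s+')

def clean_text_compiled(text):
    text = URL_RE.sub('', text)
    text = TAG_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    return WS_RE.sub(' ', text).strip()

print(clean_text_compiled(messy))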
Quick Regex Reference
| Pattern | Matches | Example |
|---|---|---|
| \d | Digit | \d+ matches “123” |
| \w | Word char (letter, digit, _) | \w+ matches “hello_world” |
| \s | Whitespace | \s+ matches spaces, tabs, newlines |
| . | Any character | a.c matches “abc”, “a1c” |
| * | 0 or more | ab* matches “a”, “ab”, “abbb” |
| + | 1 or more | ab+ matches “ab”, “abbb” (not “a”) |
| ? | 0 or 1 | colou?r matches “color”, “colour” |
| [abc] | Character set | [aeiou] matches vowels |
| ^ | Start of string | ^Hello matches “Hello world” |
| $ | End of string | world$ matches “Hello world” |
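A few of the table’s rows, tried in code with the re module imported above:
# Trying a few patterns from the reference table
print(re.findall(r'colou?r', "color colour colr"))   # optional 'u'
print(re.findall(r'[aeiou]', "sky"))                  # character set: no vowels in "sky"
print(bool(re.search(r'^Hello', "Hello world")))      # anchored at the start
print(bool(re.search(r'world$', "Hello world")))      # anchored at the end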
Putting It Together
Here’s a flexible normalization function combining our techniques:
import re
import spacy
nlp = spacy.load("en_core_web_sm")
def normalize_text(text, lowercase=True, remove_punct=True,
                   lemmatize=True, remove_stops=False):
    """Normalize text with configurable options."""
    # Basic cleaning
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text)          # Normalize whitespace
    if lowercase:
        text = text.lower()
    # Process with SpaCy
    doc = nlp(text)
    tokens = []
    for token in doc:
        if remove_punct and token.is_punct:
            continue
        if remove_stops and token.is_stop:
            continue
        if lemmatize:
            tokens.append(token.lemma_.lower() if lowercase else token.lemma_)
        else:
            tokens.append(token.text.lower() if lowercase else token.text)
    return tokens
# Test it
text = "The researchers were studying NLP. They ran many experiments!"
print("Default:", normalize_text(text))
print("No lemma:", normalize_text(text, lemmatize=False))
print("No stops:", normalize_text(text, remove_stops=True))Default: ['the', 'researcher', 'be', 'study', 'nlp', 'they', 'run', 'many', 'experiment']
No lemma: ['the', 'researchers', 'were', 'studying', 'nlp', 'they', 'ran', 'many', 'experiments']
No stops: ['researcher', 'study', 'nlp', 'run', 'experiment']
When to Use What
| Technique | Use When | Avoid When |
|---|---|---|
| Lowercasing | Most tasks | NER, case-sensitive domains |
| Stemming | Search, IR, speed critical | Need valid words, accuracy critical |
| Lemmatization | Analysis, chatbots, need valid words | Speed critical, no SpaCy available |
| Stop word removal | Classification, clustering, topic modeling | Language modeling, negation matters |
| Regex cleaning | URLs, HTML, formatting noise | Complex linguistic patterns |
Wrap-Up
Key Takeaways
Normalization reduces surface-form variation, shrinking the vocabulary while preserving meaning
Case folding is the simplest step, but keep the original casing when it carries signal (e.g., for NER)
Stemming is fast but crude; lemmatization is slower but produces valid, POS-aware dictionary forms
Stop word removal helps some tasks (classification, topic modeling) but can destroy meaning when negation or phrasing matters
Regular expressions handle pattern-based cleanup: URLs, HTML tags, emails, and whitespace
What’s Next
In the next lecture, we’ll tackle real-world messiness: encoding issues, multilingual text, social media content, and building robust preprocessing pipelines that handle whatever text gets thrown at them.