Text Normalization: Stemming, Lemmatization, and Regex

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon


Why Normalize Text?

In the last lecture, we learned that tokenization breaks text into pieces. But here’s a problem: the same concept appears in many surface forms.

# All of these express similar ideas about the verb "run"
variations = ["run", "runs", "running", "ran", "runner"]
print(f"5 tokens, but really 1-2 concepts: {variations}")
5 tokens, but really 1-2 concepts: ['run', 'runs', 'running', 'ran', 'runner']

If our model treats each form as completely unrelated, it must learn everything separately. Normalization solves this by reducing words to a standard form, shrinking our vocabulary while preserving meaning.

We’ll cover three main techniques:

  1. Case folding — the simplest normalization

  2. Stemming — fast, crude suffix removal

  3. Lemmatization — accurate, linguistic reduction

Plus regular expressions as a tool for pattern-based cleaning.


Case Folding and Basic Cleanup

The simplest normalization: convert everything to lowercase.

text = "The Quick Brown Fox Jumps Over THE LAZY DOG"
print(text.lower())
the quick brown fox jumps over the lazy dog

This collapses “The”, “THE”, and “the” into one token. But be careful: sometimes case matters. “US” (the country) and “us” (the pronoun) become indistinguishable after lowercasing, and so do proper names like “Apple” (the company) and “apple” (the fruit).

For most tasks, lowercasing helps. For case-sensitive tasks like named entity recognition (NER), you might keep the original case.
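A quick check makes the caveat concrete; this is plain Python over a few illustrative word pairs:

# Lowercasing erases distinctions that case-sensitive tasks rely on.
pairs = [("US", "us"), ("Apple", "apple"), ("Bill", "bill")]
for proper, common in pairs:
    same = proper.lower() == common.lower()
    print(f"{proper} vs {common}: identical after lowercasing? {same}")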

import re

def basic_normalize(text):
    """Simple normalization: lowercase and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple spaces
    return text.strip()

messy = "  The   Quick\n\nBrown   Fox  "
print(f"Before: '{messy}'")
print(f"After:  '{basic_normalize(messy)}'")
Before: '  The   Quick

Brown   Fox  '
After:  'the quick brown fox'
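One refinement worth knowing: Python’s str.casefold() is a more aggressive, Unicode-aware version of lower(). It makes no difference for plain ASCII, but it handles characters like the German ß:

# casefold() applies Unicode case folding, which lower() does not.
word = "Straße"
print(word.lower())      # straße  (ß is unchanged by lower())
print(word.casefold())   # strasse (ß folds to "ss")
print("STRASSE".casefold() == word.casefold())  # True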

Stemming: The Fast and Crude Approach

Stemming removes suffixes using heuristic rules to find a word’s “stem.” It’s fast but imprecise.

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

words = ["running", "runs", "ran", "studies", "studying", "happier", "happiness"]

print(f"{'Word':<12} {'Porter':<12} {'Lancaster':<12} {'Snowball':<12}")
print("-" * 48)
for word in words:
    print(f"{word:<12} {porter.stem(word):<12} {lancaster.stem(word):<12} {snowball.stem(word):<12}")
Word         Porter       Lancaster    Snowball    
------------------------------------------------
running      run          run          run         
runs         run          run          run         
ran          ran          ran          ran         
studies      studi        study        studi       
studying     studi        study        studi       
happier      happier      happy        happier     
happiness    happi        happy        happi       

Notice the problems: none of the stemmers maps the irregular form “ran” to “run”; Porter and Snowball produce non-words like “studi” and “happi”; and the stemmers disagree with one another (“happier” is left untouched by Porter and Snowball but becomes “happy” under Lancaster).
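Errors run in both directions: under-stemming leaves related forms apart (as with “ran” above), while over-stemming collapses unrelated words. A small sketch with the Porter stemmer from above (exact stems may vary slightly by NLTK version):

# Over-stemming: distinct concepts end up sharing one stem.
for word in ["universe", "university", "universal"]:
    print(f"{word} -> {porter.stem(word)}")   # all three typically become "univers"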

When to Use Stemming

Stemming works well when speed matters more than linguistic precision and the stems are never shown to users, as in search and information retrieval: the query and the documents only need to reduce to matching strings, as the example below shows.

# Stemming for search: "running shoes" matches "run shoe"
query = "running shoes"
document = "These shoes are great for runners who run daily"

query_stems = [porter.stem(w) for w in query.lower().split()]
doc_stems = [porter.stem(w) for w in document.lower().split()]

print(f"Query stems: {query_stems}")
print(f"Doc stems:   {doc_stems}")
print(f"Overlap:     {set(query_stems) & set(doc_stems)}")
Query stems: ['run', 'shoe']
Doc stems:   ['these', 'shoe', 'are', 'great', 'for', 'runner', 'who', 'run', 'daili']
Overlap:     {'run', 'shoe'}
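The same overlap idea extends to a toy retrieval score. This sketch, with two made-up documents, simply counts the stems each document shares with the query:

# Rank hypothetical documents by how many stems they share with the query.
documents = [
    "These shoes are great for runners who run daily",
    "A guide to baking bread at home",
]
query_stems = {porter.stem(w) for w in "running shoes".lower().split()}

for doc_text in documents:
    doc_stems = {porter.stem(w) for w in doc_text.lower().split()}
    overlap = len(query_stems & doc_stems)
    print(f"score={overlap}  {doc_text}")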

Lemmatization: The Linguistic Approach

Lemmatization uses vocabulary and morphological analysis to find a word’s dictionary form (lemma). It’s slower but more accurate.

import spacy

nlp = spacy.load("en_core_web_sm")

words = ["running", "runs", "ran", "better", "studies", "mice", "are", "was"]

print(f"{'Word':<12} {'Lemma':<12} {'POS':<8}")
print("-" * 32)
for word in words:
    doc = nlp(word)
    token = doc[0]
    print(f"{word:<12} {token.lemma_:<12} {token.pos_:<8}")
Word         Lemma        POS     
--------------------------------
running      run          VERB    
runs         run          VERB    
ran          run          VERB    
better       well         ADV     
studies      study        NOUN    
mice         mouse        NOUN    
are          be           AUX     
was          be           AUX     

Key differences from stemming: the lemmatizer returns real dictionary words (study, mouse, be) rather than truncated strings; it handles irregular forms such as ran → run, mice → mouse, and better → well; and it uses part-of-speech information, which makes it slower but more accurate.

Lemmatization Needs Context

Lemmatization depends on part-of-speech. The word “meeting” could be a noun or a verb:

sentences = [
    "I am meeting my friend.",  # meeting = verb
    "The meeting was boring.",  # meeting = noun
]

for sent in sentences:
    doc = nlp(sent)
    for token in doc:
        if token.text.lower() == "meeting":
            print(f"'{sent}'")
            print(f"  'meeting' -> lemma='{token.lemma_}', pos={token.pos_}")
'I am meeting my friend.'
  'meeting' -> lemma='meet', pos=VERB
'The meeting was boring.'
  'meeting' -> lemma='meeting', pos=NOUN

SpaCy handles this automatically because it runs POS tagging before lemmatization.
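You can inspect the pipeline order directly. The exact component names depend on the model and SpaCy version, but the tagger sits before the lemmatizer:

# Components run in order, so POS tags exist by the time lemmas are assigned.
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']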

Stemming vs. Lemmatization: Quick Comparison

test_words = ["running", "better", "studies", "wolves", "caring", "happily"]

print(f"{'Word':<12} {'Porter Stem':<14} {'SpaCy Lemma':<12}")
print("-" * 38)
for word in test_words:
    stem = porter.stem(word)
    doc = nlp(word)
    lemma = doc[0].lemma_
    print(f"{word:<12} {stem:<14} {lemma:<12}")
Word         Porter Stem    SpaCy Lemma 
--------------------------------------
running      run            run         
better       better         well        
studies      studi          study       
wolves       wolv           wolf        
caring       care           care        
happily      happili        happily     

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slower |
| Output | May be non-words | Always valid words |
| Accuracy | Heuristic rules | Linguistic analysis |
| Context | Ignores context | Uses POS tags |
| Best for | Search, IR | Chatbots, analysis |
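The speed difference is easy to check yourself. Here is a rough timing sketch, reusing porter and nlp from above (absolute numbers depend on your machine and model; only the ratio is meaningful):

import time

sample = ["running", "studies", "happier", "wolves", "caring"] * 200

# Stemming: pure string heuristics, no model involved.
start = time.perf_counter()
_ = [porter.stem(w) for w in sample]
stem_seconds = time.perf_counter() - start

# Lemmatization: runs the full SpaCy pipeline over the same words.
start = time.perf_counter()
_ = [tok.lemma_ for tok in nlp(" ".join(sample))]
lemma_seconds = time.perf_counter() - start

print(f"Stemming:      {stem_seconds:.3f}s")
print(f"Lemmatization: {lemma_seconds:.3f}s")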

Stop Words: To Remove or Not?

Stop words are common words (the, is, at, which) that often carry little semantic meaning.

# SpaCy has a built-in stop word list
print(f"SpaCy has {len(nlp.Defaults.stop_words)} stop words")
print(f"Sample: {list(nlp.Defaults.stop_words)[:15]}")
SpaCy has 326 stop words
Sample: ['the', 'various', 'side', 'any', 'toward', 'besides', 'whoever', 'seems', "'m", 'mine', 'ourselves', 'not', 'against', 'well', 'due']
# Filtering stop words
text = "The quick brown fox jumps over the lazy dog"
doc = nlp(text)

content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(f"Original: {text}")
print(f"Content words: {content_words}")
Original: The quick brown fox jumps over the lazy dog
Content words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

When Stop Words Matter

Don’t blindly remove stop words! They matter for negation handling and sentiment, for language modeling, and for any task where function words carry meaning. The example below shows how removal can flip a sentence’s meaning:

# Negation example: stop word removal can flip meaning!
texts = ["This movie is not good", "This movie is good"]

for text in texts:
    doc = nlp(text)
    filtered = [t.text for t in doc if not t.is_stop]
    print(f"Original: '{text}' -> Filtered: {filtered}")
Original: 'This movie is not good' -> Filtered: ['movie', 'good']
Original: 'This movie is good' -> Filtered: ['movie', 'good']

Both sentences become ['movie', 'good']! For sentiment analysis, this is a disaster.
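A common workaround is to whitelist negation words when filtering. A minimal sketch, reusing the nlp pipeline from above with a small hand-picked negation set:

# Keep negation words even though SpaCy marks them as stop words.
negations = {"not", "no", "never", "n't"}

for text in ["This movie is not good", "This movie is good"]:
    doc = nlp(text)
    filtered = [t.text for t in doc
                if not t.is_stop or t.text.lower() in negations]
    print(f"'{text}' -> {filtered}")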


Regular Expressions for Text Cleaning

Regular expressions are patterns for matching text. They’re essential for cleaning messy data.

Core Patterns

import re

text = "Contact us at support@example.com or call 555-123-4567!"

# Character classes
print("Digits:", re.findall(r'\d+', text))        # \d = digit
print("Words:", re.findall(r'\w+', text))         # \w = word character
print("Non-word:", re.findall(r'\W+', text))      # \W = non-word
Digits: ['555', '123', '4567']
Words: ['Contact', 'us', 'at', 'support', 'example', 'com', 'or', 'call', '555', '123', '4567']
Non-word: [' ', ' ', ' ', '@', '.', ' ', ' ', ' ', '-', '-', '!']

Common NLP Cleaning Patterns

def clean_text(text):
    """Clean text using common regex patterns."""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+\.\S+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

messy = "Check <b>this</b> out: https://example.com and email me@test.com  please!"
print(f"Before: {messy}")
print(f"After:  {clean_text(messy)}")
Before: Check <b>this</b> out: https://example.com and email me@test.com  please!
After:  Check this out: and email please!

Quick Regex Reference

| Pattern | Matches | Example |
|---|---|---|
| \d | Digit | \d+ matches “123” |
| \w | Word char (letter, digit, _) | \w+ matches “hello_world” |
| \s | Whitespace | \s+ matches spaces, tabs, newlines |
| . | Any character | a.c matches “abc”, “a1c” |
| * | 0 or more | ab* matches “a”, “ab”, “abbb” |
| + | 1 or more | ab+ matches “ab”, “abbb” (not “a”) |
| ? | 0 or 1 | colou?r matches “color”, “colour” |
| [abc] | Character set | [aeiou] matches vowels |
| ^ | Start of string | ^Hello matches “Hello world” |
| $ | End of string | world$ matches “Hello world” |
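A few of the table’s rows in action (expected results shown as comments):

import re

print(re.findall(r'colou?r', 'color or colour'))   # ['color', 'colour']  (? = optional u)
print(re.findall(r'[aeiou]', 'regex'))             # ['e', 'e']           (character set)
print(bool(re.search(r'^Hello', 'Hello world')))   # True                 (start anchor)
print(bool(re.search(r'world$', 'Hello world')))   # True                 (end anchor)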

Putting It Together

Here’s a flexible normalization function combining our techniques:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize_text(text, lowercase=True, remove_punct=True,
                   lemmatize=True, remove_stops=False):
    """Normalize text with configurable options."""
    # Basic cleaning
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text)          # Normalize whitespace

    if lowercase:
        text = text.lower()

    # Process with SpaCy
    doc = nlp(text)

    tokens = []
    for token in doc:
        if remove_punct and token.is_punct:
            continue
        if remove_stops and token.is_stop:
            continue

        if lemmatize:
            tokens.append(token.lemma_.lower() if lowercase else token.lemma_)
        else:
            tokens.append(token.text.lower() if lowercase else token.text)

    return tokens

# Test it
text = "The researchers were studying NLP. They ran many experiments!"
print("Default:", normalize_text(text))
print("No lemma:", normalize_text(text, lemmatize=False))
print("No stops:", normalize_text(text, remove_stops=True))
Default: ['the', 'researcher', 'be', 'study', 'nlp', 'they', 'run', 'many', 'experiment']
No lemma: ['the', 'researchers', 'were', 'studying', 'nlp', 'they', 'ran', 'many', 'experiments']
No stops: ['researcher', 'study', 'nlp', 'run', 'experiment']

When to Use What

| Technique | Use When | Avoid When |
|---|---|---|
| Lowercasing | Most tasks | NER, case-sensitive domains |
| Stemming | Search, IR, speed critical | Need valid words, accuracy critical |
| Lemmatization | Analysis, chatbots, need valid words | Speed critical, no SpaCy available |
| Stop word removal | Classification, clustering, topic modeling | Language modeling, negation matters |
| Regex cleaning | URLs, HTML, formatting noise | Complex linguistic patterns |

Wrap-Up

Key Takeaways

  1. Normalization maps many surface forms to one standard form, shrinking the vocabulary while preserving meaning.

  2. Case folding is the simplest step; keep case when it carries signal (e.g., NER).

  3. Stemming is fast but crude and can produce non-words; lemmatization is slower but returns valid, POS-aware dictionary forms.

  4. Stop word removal helps bag-of-words tasks but can flip meaning when negation matters.

  5. Regular expressions handle pattern-based cleanup such as URLs, HTML tags, emails, and whitespace.

What’s Next

In the next lecture, we’ll tackle real-world messiness: encoding issues, multilingual text, social media content, and building robust preprocessing pipelines that handle whatever text gets thrown at them.