Tokenization: Breaking Text into Pieces
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Basic Python programming
Week 1: What is NLP (understanding of tokens and SpaCy basics)
Outcomes
Understand why tokenization is the critical first step in any NLP pipeline
Compare word-level, character-level, and subword tokenization strategies
Use SpaCy’s tokenizer and understand how it makes decisions
Handle edge cases: contractions, punctuation, URLs, and special characters
Recognize how tokenization choices affect downstream NLP tasks
References
J&M Chapter 2: Words and Tokens
The First Cut is the Deepest
Every NLP pipeline begins the same way: you have raw text, and you need to turn it into something a computer can process. But here’s a deceptively simple question — where do you split the text?
Consider this sentence:
“Dr. Smith’s AI-powered chatbot didn’t work on the $100M project.”
Quick — how many words are there? If you counted 10, 11, or 12, you’re not alone. Is “didn’t” one word or two? What about “AI-powered”? Is “$100M” a single unit or three?
This is the problem of tokenization — and it turns out to be one of the most consequential decisions in NLP. Get it wrong, and everything downstream suffers.
Why Tokenization Matters
Before we dive into how to tokenize, let’s understand why it matters so much.
Tokens as the Atomic Unit
A token is the smallest unit of meaning we process. Tokens become the building blocks for everything:
Counting: Word frequency, document length, vocabulary size
Embedding: Each token gets mapped to a vector
Classification: The model sees tokens, not raw characters
Generation: Language models predict the next token
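To make these roles concrete, here is a minimal sketch in plain Python (the tiny vocabulary and 2-dimensional "embedding" vectors are invented for illustration) showing tokens as the unit for counting and for lookup into an embedding table:

from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Counting: word frequencies and vocabulary size fall out of the token list
counts = Counter(tokens)
print(counts.most_common(2))       # [('the', 2), ('cat', 1)]
print("vocab size:", len(counts))  # 5

# Embedding: each token maps to an integer id, which indexes a row of a
# (toy) embedding table of 2-dimensional vectors
vocab = {tok: i for i, tok in enumerate(sorted(counts))}
embedding_table = [[0.1 * i, 0.2 * i] for i in range(len(vocab))]
ids = [vocab[tok] for tok in tokens]
print("ids:", ids)  # [4, 0, 3, 2, 4, 1]
print("vector for 'the':", embedding_table[vocab["the"]])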
Think of tokenization as deciding the resolution of your camera. Zoom in too much (character-level), and you lose the big picture. Zoom out too much, and you lose flexibility.
The Vocabulary Problem
Suppose you’re building a sentiment analysis model trained on “happy.” What happens when it encounters “happier”, “unhappy”, “happy-go-lucky”, or “happyyyy”?
If your tokenizer treats each of these as completely separate, unrelated units, your model has to learn everything from scratch for each variant. But if your tokenizer can recognize that “happier” = “happy” + “-er”, the model can leverage what it already knows.
This is the vocabulary explosion problem: natural language has essentially infinite surface forms, but we need a finite vocabulary to build practical models.
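As a toy illustration of that sharing (the vocabularies and the suffix rule below are invented for the example and are not how real subword tokenizers work), compare giving every surface form its own id with splitting off known affixes so variants reuse the “happy” entry:

# Whole-word vocabulary: every variant gets an unrelated id, nothing is shared
word_vocab = {"happy": 0, "happier": 1, "unhappy": 2}
print([word_vocab[w] for w in ["happy", "happier", "unhappy"]])  # [0, 1, 2]

# Crude affix splitting: variants reuse the id for "happy"
subword_vocab = {"happy": 0, "un": 1, "ier": 2}

def toy_subword_split(word):
    """Strip a known prefix/suffix if present (illustration only)."""
    pieces = []
    if word.startswith("un"):
        pieces.append("un")
        word = word[2:]
    if word.endswith("ier"):
        pieces += [word[:-3] + "y", "ier"]  # happier -> happy + ier
    else:
        pieces.append(word)
    return pieces

for w in ["happy", "happier", "unhappy"]:
    pieces = toy_subword_split(w)
    print(w, "->", pieces, "->", [subword_vocab[p] for p in pieces])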
Three Approaches to Tokenization
Word-Level Tokenization
The intuitive approach: split on whitespace and punctuation.
import re
def simple_tokenize(text):
"""Split on whitespace and separate punctuation."""
text = re.sub(r'([.,!?;:])', r' \1 ', text)
return text.split()
# Test it
text = "Hello, world! How are you today?"
print("Simple split:", text.split())
print("With punct: ", simple_tokenize(text))Simple split: ['Hello,', 'world!', 'How', 'are', 'you', 'today?']
With punct: ['Hello', ',', 'world', '!', 'How', 'are', 'you', 'today', '?']
Advantages: Intuitive, fast, preserves whole words
Disadvantages:
Vocabulary explosion: Every misspelling, inflection, and compound creates a new token
Out-of-vocabulary (OOV) problem: What do you do with words never seen in training? (see the sketch after this list)
Language-dependent: Whitespace splitting fails for Chinese, Japanese, and Thai, which do not separate words with spaces
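Here is a minimal sketch of the OOV problem mentioned above (the tiny training vocabulary and the <UNK> placeholder are assumptions for illustration): every unseen word collapses to the same unknown token, so the model cannot tell them apart.

train_vocab = {"i", "am", "happy", "today"}

def lookup(tokens, vocab):
    """Keep known tokens; map anything unseen to a shared <UNK> placeholder."""
    return [tok if tok in vocab else "<UNK>" for tok in tokens]

print(lookup("i am happier".split(), train_vocab))   # ['i', 'am', '<UNK>']
print(lookup("i am ecstatic".split(), train_vocab))  # ['i', 'am', '<UNK>'] -- indistinguishable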
# Demonstrating vocabulary explosion
sample_texts = [
"I am happy",
"I am happier",
"I am happiest",
"I am unhappy",
"I am happy-go-lucky",
"I am sooo happy",
"I am HAPPY",
]
all_tokens = set()
for text in sample_texts:
all_tokens.update(simple_tokenize(text.lower()))
print(f"6 sentences, similar meaning -> {len(all_tokens)} unique tokens")
print("Tokens:", sorted(all_tokens))6 sentences, similar meaning -> 8 unique tokens
Tokens: ['am', 'happier', 'happiest', 'happy', 'happy-go-lucky', 'i', 'sooo', 'unhappy']
Character-Level Tokenization
The opposite extreme: every character is a token.
text = "Hello, world!"
char_tokens = list(text)
print(f"'{text}' -> {len(char_tokens)} tokens: {char_tokens}")'Hello, world!' -> 13 tokens: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Advantages: Tiny vocabulary (~100 chars), no OOV problem, language-agnostic
Disadvantages: Long sequences, no semantic units (‘c’,‘a’,‘t’ vs. “cat”), harder to learn
Subword Tokenization: The Best of Both Worlds
Modern NLP has converged on a middle ground: subword tokenization. The key insight is elegant:
Keep frequent words as single tokens. Break rare words into meaningful pieces.
“unhappiness” might become ["un", "happy", "ness"]. The model learns that “un-” means negation and “-ness” indicates a noun. This elegantly solves the vocabulary problem:
Common words like “the”, “happy”, “run” stay as single tokens
Rare words get broken into known pieces
Misspellings and new words can still be represented
Vocabulary stays manageable (typically 30K-50K tokens)
| Algorithm | Used By | Key Idea |
|---|---|---|
| BPE (Byte Pair Encoding) | GPT, LLaMA | Merge most frequent character pairs |
| WordPiece | BERT, mBERT | Similar to BPE, but chooses merges by likelihood rather than raw frequency |
| SentencePiece | T5, XLM-R | Language-agnostic; works on raw text with no pre-tokenization step |
We’ll explore BPE in detail in Week 3 when we study text representation. For now, the key insight is that subword tokenization is the industry standard for neural NLP.
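To preview the BPE idea from the table above, here is a minimal sketch of the merge loop on a toy corpus (the word counts are made up, and real implementations add details like end-of-word markers and byte-level fallback): count adjacent symbol pairs and merge the most frequent pair everywhere, repeating until the vocabulary is big enough.

from collections import Counter

# Toy corpus: each word is a tuple of symbols (characters to start), with a count
corpus = Counter({
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
})

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair}")
print("corpus after 3 merges:", list(corpus))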
# Let's see how a real subword tokenizer handles our examples
# (This requires the transformers library - we'll use it more in later weeks)
try:
from transformers import AutoTokenizer
# Load GPT-2's tokenizer (uses BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
examples = ["happy", "happier", "unhappy", "unhappiness", "happy-go-lucky"]
for word in examples:
tokens = tokenizer.tokenize(word)
print(f"{word:20} -> {tokens}")
except ImportError:
print("transformers library not installed. Run: uv add transformers")
print("We'll explore this more in Week 7!")Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
happy -> ['happy']
happier -> ['h', 'app', 'ier']
unhappy -> ['un', 'happy']
unhappiness -> ['un', 'h', 'appiness']
happy-go-lucky -> ['happy', '-', 'go', '-', 'l', 'ucky']
Notice how frequent words like “happy” stay whole while rarer forms are split into smaller pieces: the prefix “un” is peeled off of “unhappy” and “unhappiness”. The splits come from corpus statistics rather than morphology, which is why “happier” breaks into less intuitive pieces than a linguist would choose.
SpaCy’s Tokenizer
SpaCy uses a sophisticated rule-based tokenizer that handles most of English’s quirks.
import spacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
# Process some text
text = "Dr. Smith's AI-powered chatbot didn't work on the $100M project."
doc = nlp(text)
# Extract tokens
for token in doc:
print(f"{token.text:15} | {token.pos_:6} | {token.dep_:10}")Dr. | PROPN | compound
Smith | PROPN | poss
's | PART | case
AI | PROPN | npadvmod
- | PUNCT | punct
powered | VERB | amod
chatbot | NOUN | nsubj
did | AUX | aux
n't | PART | neg
work | VERB | ROOT
on | ADP | prep
the | DET | det
$ | SYM | nmod
100 | NUM | nummod
M | PROPN | compound
project | NOUN | pobj
. | PUNCT | punct
Notice how SpaCy handles tricky cases:
Dr. stays together (abbreviation)
Smith's → Smith + 's (possessive)
didn't → did + n't (contraction)
How It Works
SpaCy’s tokenizer follows a principled process:
Split on whitespace to get initial chunks
Apply exception rules for special cases (abbreviations, contractions)
Apply prefix/suffix rules to separate punctuation
Check against exceptions to prevent over-splitting
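To see which rule fired for each token, spaCy provides a debugging helper, tokenizer.explain. A minimal sketch, reusing the nlp object loaded above (the exact rule labels depend on your spaCy version):

# Each tuple pairs the rule that matched (e.g. TOKEN, PREFIX, SUFFIX, SPECIAL-*)
# with the token text it produced
for rule, token_text in nlp.tokenizer.explain("Dr. Smith didn't pay $100M."):
    print(f"{rule:12} {token_text}")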
# SpaCy handles many edge cases automatically
edge_cases = [
"Dr. Smith went to Washington, D.C.",
"I bought 10,000 shares at $42.50",
"Email me at test@example.com",
"It's a win-win situation, isn't it?",
]
for text in edge_cases:
doc = nlp(text)
print(f"{text}")
print(f" -> {[t.text for t in doc]}\n")Dr. Smith went to Washington, D.C.
-> ['Dr.', 'Smith', 'went', 'to', 'Washington', ',', 'D.C.']
I bought 10,000 shares at $42.50
-> ['I', 'bought', '10,000', 'shares', 'at', '$', '42.50']
Email me at test@example.com
-> ['Email', 'me', 'at', 'test@example.com']
It's a win-win situation, isn't it?
-> ['It', "'s", 'a', 'win', '-', 'win', 'situation', ',', 'is', "n't", 'it', '?']
Customizing the Tokenizer
Sometimes you need special handling for domain-specific patterns.
from spacy.lang.en import English
from spacy.symbols import ORTH
# Create a blank English tokenizer
nlp_custom = English()
tokenizer = nlp_custom.tokenizer
# Default: "lemme" stays as one token
text = "lemme see that"
print("Default:", [t.text for t in nlp_custom(text)])
# Add special case: split "lemme" -> "lem" + "me"
nlp_custom.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])
print("Custom: ", [t.text for t in doc])Default: ['lemme', 'see', 'that']
Custom: ['It', "'s", 'a', 'win', '-', 'win', 'situation', ',', 'is', "n't", 'it', '?']
# Another common need: keeping certain patterns together
# Let's say we don't want to split hashtags
from spacy.lang.en import English
nlp_hashtag = English()
# A fully general solution would modify the tokenizer's prefix rules so '#'
# is never split from the word that follows it; for a known set of tags, a
# simpler approach is to register each hashtag as a special case
for hashtag in ["#NLP", "#MachineLearning", "#AI"]:
nlp_hashtag.tokenizer.add_special_case(hashtag, [{ORTH: hashtag}])
text = "Learning #NLP and #MachineLearning is fun!"
doc = nlp_hashtag(text)
print([t.text for t in doc])
['Learning', '#NLP', 'and', '#MachineLearning', 'is', 'fun', '!']
Edge Cases in the Wild
Real-world text is messy. Let’s explore common challenges.
Contractions
The same surface form can have different meanings:
contractions = [
"I'll go tomorrow", # I + will
"She's happy", # She + is (or has?)
"They'd better hurry", # They + had (or would?)
"Won't you join us?", # will + not (irregular!)
]
for text in contractions:
doc = nlp(text)
print(f"{text:25} -> {[t.text for t in doc]}")I'll go tomorrow -> ['I', "'ll", 'go', 'tomorrow']
She's happy -> ['She', "'s", 'happy']
They'd better hurry -> ['They', "'d", 'better', 'hurry']
Won't you join us? -> ['Wo', "n't", 'you', 'join', 'us', '?']
Numbers, URLs, and Technical Content
technical = [
"$1,234.56",
"https://example.com/path?q=1",
"192.168.1.1",
"foo_bar() returns None",
]
for text in technical:
doc = nlp(text)
print(f"{text:30} -> {[t.text for t in doc]}")$1,234.56 -> ['$', '1,234.56']
https://example.com/path?q=1 -> ['https://example.com/path?q=1']
192.168.1.1 -> ['192.168.1.1']
foo_bar() returns None -> ['foo_bar', '(', ')', 'returns', 'None']
Social Media
Social media breaks every rule: emojis, slang, non-standard spelling, platform conventions.
social = [
"OMG this is sooo good!!! 😍😍😍",
"Just watched #TheMatrix4 🤯",
"@username check this out lol",
"ngl this hits different fr fr",
]
for text in social:
doc = nlp(text)
print(f"{text}")
print(f" -> {[t.text for t in doc]}\n")OMG this is sooo good!!! 😍😍😍
-> ['OMG', 'this', 'is', 'sooo', 'good', '!', '!', '!', '😍', '😍', '😍']
Just watched #TheMatrix4 🤯
-> ['Just', 'watched', '#', 'TheMatrix4', '🤯']
@username check this out lol
-> ['@username', 'check', 'this', 'out', 'lol']
ngl this hits different fr fr
-> ['ngl', 'this', 'hits', 'different', 'fr', 'fr']
Challenges: emojis as tokens, slang (“ngl”, “fr fr”), non-standard spelling (“sooo”), @mentions, and #hashtags. For social media NLP, consider specialized libraries like ekphrasis or tweet-preprocessor.
Tokenization Affects Everything Downstream
To see why tokenization matters so much, let’s look at how different choices affect word counting—a foundational NLP task.
from collections import Counter
sample_text = """
Natural language processing (NLP) is a subfield of artificial intelligence.
NLP combines computational linguistics with machine learning. The goal of NLP
is to enable computers to understand, interpret, and generate human language.
"""
# Approach 1: Simple whitespace splitting
simple_tokens = sample_text.lower().split()
simple_counts = Counter(simple_tokens).most_common(5)
# Approach 2: SpaCy (excludes punctuation)
doc = nlp(sample_text)
spacy_tokens = [t.text.lower() for t in doc if not t.is_punct and not t.is_space]
spacy_counts = Counter(spacy_tokens).most_common(5)
print("Simple split top 5:", simple_counts)
print("SpaCy top 5: ", spacy_counts)Simple split top 5: [('is', 2), ('of', 2), ('nlp', 2), ('to', 2), ('natural', 1)]
SpaCy top 5: [('nlp', 3), ('language', 2), ('is', 2), ('of', 2), ('to', 2)]
# The differences become clearer with lemmatization
lemma_tokens = [t.lemma_.lower() for t in doc if not t.is_punct and not t.is_space]
lemma_counts = Counter(lemma_tokens).most_common(5)
print("With lemmas:", lemma_counts)With lemmas: [('nlp', 3), ('language', 2), ('be', 2), ('of', 2), ('to', 2)]
The differences might seem small, but they compound across millions of documents. In information retrieval (IR), proper tokenization can mean the difference between finding relevant results and missing them entirely, as sketched below.
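As a minimal sketch of that IR point (the two-document “index” and exact-match lookup are invented for illustration), a query for “run” misses a document that only contains “running” when we match raw surface tokens, but finds it when we match spaCy lemmas:

docs = ["She was running late", "They run a small shop"]

def build_index(texts, use_lemmas):
    """Map each doc id to its set of lowercased tokens (or lemmas)."""
    index = {}
    for i, text in enumerate(texts):
        doc = nlp(text)
        index[i] = {t.lemma_.lower() if use_lemmas else t.text.lower() for t in doc}
    return index

query = "run"
surface_index = build_index(docs, use_lemmas=False)
lemma_index = build_index(docs, use_lemmas=True)
print("surface match:", [i for i, toks in surface_index.items() if query in toks])
print("lemma match:  ", [i for i, toks in lemma_index.items() if query in toks])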
Choosing the Right Approach
| Use Case | Recommended Approach |
|---|---|
| Search engines | Aggressive normalization, small vocabulary |
| Language models | Subword tokenization (BPE/WordPiece) |
| Domain-specific | Custom rules for domain patterns |
| Quick prototyping | SpaCy defaults |
Wrap-Up
Key Takeaways
Tokenization is the first step in every NLP pipeline, and the choices made there propagate to every downstream task.
Word-level tokenization is intuitive but suffers from vocabulary explosion and OOV words; character-level avoids OOV but loses semantic units; subword methods (BPE, WordPiece, SentencePiece) balance the two and are the standard for neural NLP.
SpaCy’s rule-based tokenizer handles most English quirks (abbreviations, contractions, URLs, numbers) and can be extended with special cases for domain-specific patterns.
Real-world text is messy; inspect your tokenizer’s output on your own data before trusting downstream results.
What’s Next
In the next lecture, we’ll tackle text normalization: stemming vs. lemmatization, stop word removal, and regular expressions for text cleaning.