Tokenization: Breaking Text into Pieces

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon



The First Cut is the Deepest

Every NLP pipeline begins the same way: you have raw text, and you need to turn it into something a computer can process. But here’s a deceptively simple question — where do you split the text?

Consider this sentence:

“Dr. Smith’s AI-powered chatbot didn’t work on the $100M project.”

Quick — how many words are there? If you counted 10, 11, or 12, you’re not alone. Is “didn’t” one word or two? What about “AI-powered”? Is “$100M” a single unit or three?

This is the problem of tokenization — and it turns out to be one of the most consequential decisions in NLP. Get it wrong, and everything downstream suffers.


Why Tokenization Matters

Before we dive into how to tokenize, let’s understand why it matters so much.

Tokens as the Atomic Unit

A token is the smallest unit of meaning we process, and tokens become the building blocks for everything that follows: the vocabulary we build, the sequences a model consumes, and the counts and statistics we compute over text.

Think of tokenization as deciding the resolution of your camera. Zoom in too much (character-level), and you lose the big picture. Zoom out too much, and you lose flexibility.

The Vocabulary Problem

Suppose you’re building a sentiment analysis model trained on “happy.” What happens when it encounters “happier”, “unhappy”, “happy-go-lucky”, or “happyyyy”?

If your tokenizer treats each of these as completely separate, unrelated units, your model has to learn everything from scratch for each variant. But if your tokenizer can recognize that “happier” = “happy” + “-er”, the model can leverage what it already knows.

This is the vocabulary explosion problem: natural language has essentially infinite surface forms, but we need a finite vocabulary to build practical models.


Three Approaches to Tokenization

Word-Level Tokenization

The intuitive approach: split on whitespace and punctuation.

import re

def simple_tokenize(text):
    """Split on whitespace and separate punctuation."""
    text = re.sub(r'([.,!?;:])', r' \1 ', text)
    return text.split()

# Test it
text = "Hello, world! How are you today?"
print("Simple split:", text.split())
print("With punct:  ", simple_tokenize(text))
Simple split: ['Hello,', 'world!', 'How', 'are', 'you', 'today?']
With punct:   ['Hello', ',', 'world', '!', 'How', 'are', 'you', 'today', '?']

Advantages: Intuitive, fast, preserves whole words

Disadvantages: Vocabulary explodes as inflections, compounds, and misspellings each get their own token (demonstrated below), and any word not seen during training becomes an out-of-vocabulary token.

# Demonstrating vocabulary explosion
sample_texts = [
    "I am happy",
    "I am happier",
    "I am happiest",
    "I am unhappy",
    "I am happy-go-lucky",
    "I am sooo happy",
    "I am HAPPY",
]

all_tokens = set()
for text in sample_texts:
    all_tokens.update(simple_tokenize(text.lower()))

print(f"6 sentences, similar meaning -> {len(all_tokens)} unique tokens")
print("Tokens:", sorted(all_tokens))
6 sentences, similar meaning -> 8 unique tokens
Tokens: ['am', 'happier', 'happiest', 'happy', 'happy-go-lucky', 'i', 'sooo', 'unhappy']

Character-Level Tokenization

The opposite extreme: every character is a token.

text = "Hello, world!"
char_tokens = list(text)
print(f"'{text}' -> {len(char_tokens)} tokens: {char_tokens}")
'Hello, world!' -> 13 tokens: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']

Advantages: Tiny vocabulary (~100 chars), no OOV problem, language-agnostic

Disadvantages: Long sequences, no semantic units ("c", "a", "t" vs. "cat"), harder to learn
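To make the OOV contrast concrete, here is a tiny illustrative sketch. The vocabularies below are hypothetical, built from a made-up training set: an unseen spelling like "happyyyy" falls outside the word-level vocabulary but is fully covered by the character-level one.

# Hypothetical vocabularies from a tiny "training set" (illustrative only)
word_vocab = {"i", "am", "happy"}
char_vocab = set("iamhappy ")

unseen = "happyyyy"
print("Word-level:", unseen if unseen in word_vocab else "<UNK>")           # unseen word -> <UNK>
print("Char-level:", [c if c in char_vocab else "<UNK>" for c in unseen])  # every character is known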

Subword Tokenization: The Best of Both Worlds

Modern NLP has converged on a middle ground: subword tokenization. The key insight is elegant:

Keep frequent words as single tokens. Break rare words into meaningful pieces.

“unhappiness” might become ["un", "happy", "ness"]. The model can then learn that “un-” signals negation and “-ness” marks a noun, and the vocabulary stays finite no matter how many surface forms appear. Several algorithms implement this idea:

Algorithm                   Used By       Key Idea
BPE (Byte Pair Encoding)    GPT, LLaMA    Iteratively merge the most frequent symbol pairs
WordPiece                   BERT          Like BPE, but picks merges by likelihood rather than raw frequency
SentencePiece               T5, mT5       Language-agnostic; operates on raw text with no pre-tokenization step

We’ll explore BPE in detail in Week 3 when we study text representation. For now, the takeaway is that subword tokenization is the industry standard for neural NLP.
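To make the “merge the most frequent pairs” idea concrete before Week 3, here is a minimal, illustrative sketch of the core BPE merge loop on a toy corpus. Real implementations (such as GPT-2’s) add details like byte-level handling, end-of-word markers, and pre-tokenization.

from collections import Counter

# Toy corpus: each word starts as a sequence of characters
corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace each occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Apply a handful of merges and watch subword units emerge
for step in range(5):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print(f"Merge {step + 1}: {pair}")

print("Segmentations:", [" ".join(w) for w in words])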

# Let's see how a real subword tokenizer handles our examples
# (This requires the transformers library - we'll use it more in later weeks)

try:
    from transformers import AutoTokenizer

    # Load GPT-2's tokenizer (uses BPE)
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    examples = ["happy", "happier", "unhappy", "unhappiness", "happy-go-lucky"]

    for word in examples:
        tokens = tokenizer.tokenize(word)
        print(f"{word:20} -> {tokens}")

except ImportError:
    print("transformers library not installed. Run: uv add transformers")
    print("We'll explore this more in Week 7!")
happy                -> ['happy']
happier              -> ['h', 'app', 'ier']
unhappy              -> ['un', 'happy']
unhappiness          -> ['un', 'h', 'appiness']
happy-go-lucky       -> ['happy', '-', 'go', '-', 'l', 'ucky']

Notice how the tokenizer keeps the frequent word “happy” intact and splits “unhappy” into a recognizable prefix and root. The merges are driven by corpus frequency rather than morphology, though, so “unhappiness” comes out as ['un', 'h', 'appiness'] instead of a clean prefix-root-suffix split.
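Under the hood, these token strings are mapped to integer IDs, which is what the model actually consumes. A quick round trip with the GPT-2 tokenizer loaded above (skip this if transformers isn’t installed):

# Token strings map to integer IDs; decode() reverses the mapping
# (reuses the `tokenizer` loaded in the previous cell).
ids = tokenizer.encode("unhappiness")
print("IDs:    ", ids)
print("Decoded:", tokenizer.decode(ids))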


SpaCy’s Tokenizer

SpaCy uses a sophisticated rule-based tokenizer that handles most of English’s quirks.

import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Process some text
text = "Dr. Smith's AI-powered chatbot didn't work on the $100M project."
doc = nlp(text)

# Extract tokens
for token in doc:
    print(f"{token.text:15} | {token.pos_:6} | {token.dep_:10}")
Dr.             | PROPN  | compound  
Smith           | PROPN  | poss      
's              | PART   | case      
AI              | PROPN  | npadvmod  
-               | PUNCT  | punct     
powered         | VERB   | amod      
chatbot         | NOUN   | nsubj     
did             | AUX    | aux       
n't             | PART   | neg       
work            | VERB   | ROOT      
on              | ADP    | prep      
the             | DET    | det       
$               | SYM    | nmod      
100             | NUM    | nummod    
M               | PROPN  | compound  
project         | NOUN   | pobj      
.               | PUNCT  | punct     

Notice how SpaCy handles the tricky cases from our opening sentence: “Dr.” stays a single token, “Smith’s” splits into “Smith” + “’s”, “AI-powered” splits at the hyphen, “didn’t” becomes “did” + “n’t”, and “$100M” becomes three tokens (“$”, “100”, “M”).

How It Works

SpaCy’s tokenizer follows a principled process:

  1. Split on whitespace to get initial chunks

  2. Apply exception rules for special cases (abbreviations, contractions)

  3. Apply prefix/suffix rules to separate punctuation

  4. Check against exceptions to prevent over-splitting
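If you want to see these steps in action, recent spaCy versions expose Tokenizer.explain, which reports the rule responsible for each split. A quick sketch using the pipeline loaded earlier:

# Show which rule produced each token (SPECIAL-* = exception entry,
# PREFIX/SUFFIX/INFIX = punctuation rules, TOKEN = plain whitespace chunk).
for rule, token_text in nlp.tokenizer.explain("Dr. Smith didn't work on the $100M project."):
    print(f"{token_text:10} <- {rule}")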

# SpaCy handles many edge cases automatically
edge_cases = [
    "Dr. Smith went to Washington, D.C.",
    "I bought 10,000 shares at $42.50",
    "Email me at test@example.com",
    "It's a win-win situation, isn't it?",
]

for text in edge_cases:
    doc = nlp(text)
    print(f"{text}")
    print(f"  -> {[t.text for t in doc]}\n")
Dr. Smith went to Washington, D.C.
  -> ['Dr.', 'Smith', 'went', 'to', 'Washington', ',', 'D.C.']

I bought 10,000 shares at $42.50
  -> ['I', 'bought', '10,000', 'shares', 'at', '$', '42.50']

Email me at test@example.com
  -> ['Email', 'me', 'at', 'test@example.com']

It's a win-win situation, isn't it?
  -> ['It', "'s", 'a', 'win', '-', 'win', 'situation', ',', 'is', "n't", 'it', '?']

Customizing the Tokenizer

Sometimes you need special handling for domain-specific patterns.

from spacy.lang.en import English
from spacy.symbols import ORTH

# Create a blank English tokenizer
nlp_custom = English()
tokenizer = nlp_custom.tokenizer

# Default: "lemme" stays as one token
text = "lemme see that"
print("Default:", [t.text for t in nlp_custom(text)])

# Add special case: split "lemme" -> "lem" + "me"
nlp_custom.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])
print("Custom: ", [t.text for t in doc])
Default: ['lemme', 'see', 'that']
Custom:  ['It', "'s", 'a', 'win', '-', 'win', 'situation', ',', 'is', "n't", 'it', '?']
# Another common need: keeping certain patterns together
# Let's say we don't want to split hashtags

from spacy.lang.en import English
from spacy.symbols import ORTH

nlp_hashtag = English()

# By default, '#' is treated as a prefix and split off the following word.
# You could remove '#' from the prefix rules and recompile them with
# spacy.util.compile_prefix_regex, but for a handful of known patterns the
# simplest fix is to register them as special cases:
for hashtag in ["#NLP", "#MachineLearning", "#AI"]:
    nlp_hashtag.tokenizer.add_special_case(hashtag, [{ORTH: hashtag}])

text = "Learning #NLP and #MachineLearning is fun!"
doc = nlp_hashtag(text)
print([t.text for t in doc])
['Learning', '#NLP', 'and', '#MachineLearning', 'is', 'fun', '!']

Edge Cases in the Wild

Real-world text is messy. Let’s explore common challenges.

Contractions

The same surface form can have different meanings:

contractions = [
    "I'll go tomorrow",      # I + will
    "She's happy",           # She + is (or has?)
    "They'd better hurry",   # They + had (or would?)
    "Won't you join us?",    # will + not (irregular!)
]

for text in contractions:
    doc = nlp(text)
    print(f"{text:25} -> {[t.text for t in doc]}")
I'll go tomorrow          -> ['I', "'ll", 'go', 'tomorrow']
She's happy               -> ['She', "'s", 'happy']
They'd better hurry       -> ['They', "'d", 'better', 'hurry']
Won't you join us?        -> ['Wo', "n't", 'you', 'join', 'us', '?']

Numbers, URLs, and Technical Content

technical = [
    "$1,234.56",
    "https://example.com/path?q=1",
    "192.168.1.1",
    "foo_bar() returns None",
]

for text in technical:
    doc = nlp(text)
    print(f"{text:30} -> {[t.text for t in doc]}")
$1,234.56                      -> ['$', '1,234.56']
https://example.com/path?q=1   -> ['https://example.com/path?q=1']
192.168.1.1                    -> ['192.168.1.1']
foo_bar() returns None         -> ['foo_bar', '(', ')', 'returns', 'None']

Social Media

Social media breaks every rule: emojis, slang, non-standard spelling, platform conventions.

social = [
    "OMG this is sooo good!!! 😍😍😍",
    "Just watched #TheMatrix4 🤯",
    "@username check this out lol",
    "ngl this hits different fr fr",
]

for text in social:
    doc = nlp(text)
    print(f"{text}")
    print(f"  -> {[t.text for t in doc]}\n")
OMG this is sooo good!!! 😍😍😍
  -> ['OMG', 'this', 'is', 'sooo', 'good', '!', '!', '!', '😍', '😍', '😍']

Just watched #TheMatrix4 🤯
  -> ['Just', 'watched', '#', 'TheMatrix4', '🤯']

@username check this out lol
  -> ['@username', 'check', 'this', 'out', 'lol']

ngl this hits different fr fr
  -> ['ngl', 'this', 'hits', 'different', 'fr', 'fr']

Challenges: emojis as tokens, slang (“ngl”, “fr fr”, “lol”), non-standard spelling (“sooo”), @mentions, #hashtags. For social media NLP, consider specialized libraries like ekphrasis or tweet-preprocessor.


Tokenization Affects Everything Downstream

To see why tokenization matters so much, let’s look at how different choices affect word counting—a foundational NLP task.

from collections import Counter

sample_text = """
Natural language processing (NLP) is a subfield of artificial intelligence.
NLP combines computational linguistics with machine learning. The goal of NLP
is to enable computers to understand, interpret, and generate human language.
"""

# Approach 1: Simple whitespace splitting
simple_tokens = sample_text.lower().split()
simple_counts = Counter(simple_tokens).most_common(5)

# Approach 2: SpaCy (excludes punctuation)
doc = nlp(sample_text)
spacy_tokens = [t.text.lower() for t in doc if not t.is_punct and not t.is_space]
spacy_counts = Counter(spacy_tokens).most_common(5)

print("Simple split top 5:", simple_counts)
print("SpaCy top 5:       ", spacy_counts)
Simple split top 5: [('is', 2), ('of', 2), ('nlp', 2), ('to', 2), ('natural', 1)]
SpaCy top 5:        [('nlp', 3), ('language', 2), ('is', 2), ('of', 2), ('to', 2)]
# The differences become clearer with lemmatization
lemma_tokens = [t.lemma_.lower() for t in doc if not t.is_punct and not t.is_space]
lemma_counts = Counter(lemma_tokens).most_common(5)

print("With lemmas:", lemma_counts)
With lemmas: [('nlp', 3), ('language', 2), ('be', 2), ('of', 2), ('to', 2)]

The differences might seem small, but they compound across millions of documents. In information retrieval (IR), proper tokenization can mean the difference between finding relevant results and missing them entirely.
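As an illustrative sketch of the retrieval point (hypothetical two-document corpus, reusing the nlp pipeline loaded earlier): a whitespace-token index stores “(nlp)” verbatim, so a query for “nlp” misses that document, while a properly tokenized index finds both.

# Build two tiny inverted indexes and compare what a query for "nlp" retrieves.
docs = {
    "d1": "Natural language processing (NLP) is a subfield of AI.",
    "d2": "NLP combines linguistics with machine learning.",
}

def build_index(tokenize):
    """Map each token to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text.lower()):
            index.setdefault(tok, set()).add(doc_id)
    return index

whitespace_index = build_index(str.split)
spacy_index = build_index(lambda t: [tok.text for tok in nlp(t) if not tok.is_punct])

print("Whitespace index:", sorted(whitespace_index.get("nlp", set())))  # misses d1, which only has "(nlp)"
print("SpaCy index:     ", sorted(spacy_index.get("nlp", set())))       # finds both documents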

Choosing the Right Approach

Use Case             Recommended Approach
Search engines       Aggressive normalization, small vocabulary
Language models      Subword tokenization (BPE/WordPiece)
Domain-specific      Custom rules for domain patterns
Quick prototyping    SpaCy defaults

Wrap-Up

Key Takeaways

Tokenization is the first, and one of the most consequential, decisions in an NLP pipeline: the choice of units propagates to everything downstream. Word-level tokenization is intuitive but suffers from vocabulary explosion; character-level tokenization avoids out-of-vocabulary problems but discards semantic units. Subword tokenization strikes the balance and is the industry standard for neural NLP. SpaCy’s rule-based tokenizer handles most English edge cases out of the box and can be customized with special cases for domain-specific patterns.

What’s Next

In the next lecture, we’ll tackle text normalization: stemming vs. lemmatization, stop word removal, and regular expressions for text cleaning.