SpaCy Pipelines: From Text to Annotations
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
Week 2 Part 1: Tokenization
Week 2 Part 2: Text Normalization (lemmatization)
Outcomes
Understand what happens under the hood when you call nlp(text)
Identify built-in pipeline components and the annotations they produce
Inspect and modify pipeline components (add, remove, disable)
Create custom pipeline components using @Language.component
Use extension attributes to attach custom metadata to documents and tokens
Optimize processing speed with nlp.pipe() and selective component disabling
References
J&M Chapter 2: Words and Tokens
What Actually Happens When You Call nlp()?
We’ve been using SpaCy throughout this course, casually writing doc = nlp(text) and then accessing attributes like token.pos_, token.lemma_, and doc.ents. But have you stopped to wonder — where do all these annotations come from?
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
# All of these "just work" - but how?
for token in doc[:5]:
print(f"{token.text:10} | POS: {token.pos_:6} | Lemma: {token.lemma_:10} | Dep: {token.dep_}")Apple | POS: PROPN | Lemma: Apple | Dep: nsubj
is | POS: AUX | Lemma: be | Dep: aux
looking | POS: VERB | Lemma: look | Dep: ROOT
at | POS: ADP | Lemma: at | Dep: prep
buying | POS: VERB | Lemma: buy | Dep: pcomp
The answer is the processing pipeline — a sequence of components that transform raw text into richly annotated documents. Understanding this pipeline is key to using SpaCy effectively and customizing it for your specific needs.
The Pipeline Mental Model
When you call nlp(text), SpaCy doesn’t do everything at once. Instead, it follows a two-stage process:
Tokenization: The text string becomes a Doc object (a sequence of Token objects)
Pipeline Components: The Doc passes through each component in order, with each one adding annotations
Think of it like an assembly line. The tokenizer creates the basic product (tokens), and each subsequent station (component) adds more features: part-of-speech tags, dependency labels, named entities, and so on.
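To make the two stages concrete, here is a minimal sketch of what nlp(text) does internally (minus SpaCy's error handling): tokenize first, then apply each component in order.
# A rough equivalent of calling nlp(text)
doc = nlp.make_doc("Apple is looking at buying a U.K. startup.")  # stage 1: tokenizer only
for name, component in nlp.pipeline:
    doc = component(doc)  # stage 2: each component annotates and returns the Doc
print(doc[2].pos_, doc[2].lemma_)  # VERB look - same as nlp(text) would give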
# Let's see what's in our pipeline
print("Pipeline components:", nlp.pipe_names)Pipeline components: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# We can also see the component objects themselves
for name, component in nlp.pipeline:
print(f"{name:20} -> {type(component).__name__}")tok2vec -> Tok2Vec
tagger -> Tagger
parser -> DependencyParser
attribute_ruler -> AttributeRuler
lemmatizer -> EnglishLemmatizer
ner -> EntityRecognizer
The order matters! Some components depend on others. For example, the lemmatizer needs POS tags to distinguish “meeting” (noun) from “meeting” (verb).
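You don't have to trace these dependencies by hand. SpaCy can report them for you: nlp.analyze_pipes() lists which attributes each component assigns and requires, and flags ordering problems (the exact report formatting varies by SpaCy version).
# Ask SpaCy which attributes each component assigns and requires,
# and whether the current ordering leaves any requirement unmet
nlp.analyze_pipes(pretty=True)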
Built-in Pipeline Components
SpaCy’s trained pipelines include several standard components. Here’s what each one does:
| Component | Description | Creates |
|---|---|---|
| tok2vec | Shared token-to-vector embeddings | Internal vectors for other components |
| tagger | Part-of-speech tagging | Token.pos_, Token.tag_ |
| parser | Dependency parsing | Token.dep_, Token.head, Doc.sents |
| attribute_ruler | Rule-based attribute assignment | Various token attributes |
| lemmatizer | Base form assignment | Token.lemma_ |
| ner | Named entity recognition | Doc.ents, Token.ent_type_ |
Let’s see these in action:
doc = nlp("Microsoft announced quarterly earnings in Seattle.")
print("=== Token Annotations ===")
print(f"{'Token':<12} {'POS':<6} {'Tag':<6} {'Dep':<10} {'Head':<12} {'Lemma':<12}")
print("-" * 60)
for token in doc:
print(f"{token.text:<12} {token.pos_:<6} {token.tag_:<6} {token.dep_:<10} {token.head.text:<12} {token.lemma_:<12}")=== Token Annotations ===
Token POS Tag Dep Head Lemma
------------------------------------------------------------
Microsoft PROPN NNP nsubj announced Microsoft
announced VERB VBD ROOT announced announce
quarterly ADJ JJ amod earnings quarterly
earnings NOUN NNS dobj announced earning
in ADP IN prep earnings in
Seattle PROPN NNP pobj in Seattle
. PUNCT . punct announced .
print("\n=== Named Entities ===")
for ent in doc.ents:
print(f"{ent.text:<20} -> {ent.label_:<10} ({spacy.explain(ent.label_)})")
=== Named Entities ===
Microsoft            -> ORG        (Companies, agencies, institutions, etc.)
quarterly            -> DATE       (Absolute or relative dates or periods)
Seattle              -> GPE        (Countries, cities, states)
print("\n=== Sentences ===")
for i, sent in enumerate(doc.sents):
print(f"Sentence {i}: {sent.text}")
=== Sentences ===
Sentence 0: Microsoft announced quarterly earnings in Seattle.
Which Component Produces What?
A common source of confusion: if you get an error about missing attributes, it usually means a required component isn’t in your pipeline.
# Let's trace which component produces which attribute
# by checking what a blank pipeline gives us
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Apple is a company.")
print("Blank pipeline - no components:")
print(f" pipe_names: {nlp_blank.pipe_names}")
print(f" Token POS available: {doc_blank[0].pos_}") # Empty string - no tagger!
print(f" Entities: {list(doc_blank.ents)}") # Empty - no NER!Blank pipeline - no components:
pipe_names: []
Token POS available:
Entities: []
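The fix is always the same: add the component that produces the attribute you need. For example, sentence boundaries normally come from the parser, but a blank pipeline can get them from the lightweight rule-based sentencizer; a minimal sketch:
# Add a rule-based sentence splitter to the blank pipeline
nlp_blank.add_pipe("sentencizer")  # built-in, punctuation-based
doc_blank = nlp_blank("Apple is a company. It makes phones.")
print([sent.text for sent in doc_blank.sents])
# -> ['Apple is a company.', 'It makes phones.']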
Inspecting and Modifying the Pipeline
SpaCy gives you full control over your pipeline. You can inspect it, disable components, or remove them entirely.
Inspecting Components
# Detailed pipeline info
print(f"Pipeline names: {nlp.pipe_names}")
print(f"Number of components: {len(nlp.pipe_names)}")
# Check if a specific component exists
print(f"\nHas 'ner': {'ner' in nlp.pipe_names}")
print(f"Has 'textcat': {'textcat' in nlp.pipe_names}")Pipeline names: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Number of components: 6
Has 'ner': True
Has 'textcat': False
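Beyond checking names, you can fetch a component object with nlp.get_pipe() and inspect it directly; a minimal sketch, here looking at the NER component's label set:
# Fetch a component object by name and inspect it
ner = nlp.get_pipe("ner")
print(type(ner).__name__)  # EntityRecognizer
print(ner.labels)          # the entity labels this model can predict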
Disabling Components Temporarily
Sometimes you only need certain annotations. Running the full pipeline wastes time. Use nlp.select_pipes() to temporarily disable components:
import time
text = "Apple Inc. reported strong earnings. The CEO Tim Cook announced new products."
# Full pipeline
start = time.perf_counter()
for _ in range(100):
doc = nlp(text)
full_time = time.perf_counter() - start
# Only tokenization and NER
start = time.perf_counter()
with nlp.select_pipes(enable=["ner"]):
for _ in range(100):
doc = nlp(text)
partial_time = time.perf_counter() - start
print(f"Full pipeline: {full_time:.3f}s")
print(f"NER only: {partial_time:.3f}s")
print(f"Speedup: {full_time/partial_time:.1f}x faster")Full pipeline: 0.491s
NER only: 0.204s
Speedup: 2.4x faster
# You can also disable specific components
with nlp.select_pipes(disable=["parser", "attribute_ruler"]):
doc = nlp("Testing with fewer components.")
print(f"Active components: {nlp.pipe_names}")
# Note: lemmatizer may still work but with reduced accuracy
Active components: ['tok2vec', 'tagger', 'lemmatizer', 'ner']
/home/runner/work/ucf-cap-6640-book/ucf-cap-6640-book/.venv/lib/python3.12/site-packages/spacy/pipeline/lemmatizer.py:188: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
warnings.warn(Warnings.W108)
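select_pipes() only disables components for the duration of the with block. If you never need a component at all, you can remove it permanently with nlp.remove_pipe(), which returns the (name, component) pair in case you want to re-add it later. A small sketch on a throwaway pipeline object:
# Permanent removal - done on a fresh pipeline so our main nlp stays intact
nlp_small = spacy.load("en_core_web_sm")
removed_name, removed_component = nlp_small.remove_pipe("parser")
print(nlp_small.pipe_names)  # 'parser' is gone for good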
When to Disable Components
| Scenario | Recommended Approach |
|---|---|
| Only need tokenization | Use nlp.make_doc(text) instead (see the sketch below) |
| Only need NER | enable=["ner"] |
| Only need POS tags | enable=["tagger"] |
| Processing millions of docs | Disable everything you don’t need |
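For the tokenization-only row, nlp.make_doc() runs just the tokenizer and skips every pipeline component entirely; a quick sketch:
# Tokenization only: no tagger, parser, or NER ever runs
doc = nlp.make_doc("Just tokenize this text, please.")
print([token.text for token in doc])
print(repr(doc[0].pos_))  # '' - empty, because no components ran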
Creating Custom Pipeline Components
The real power of SpaCy’s pipeline architecture is that you can add your own components. Custom components let you:
Run custom logic automatically when processing text
Add metadata to documents and tokens
Integrate external tools into the SpaCy workflow
Basic Component Structure
A pipeline component is a function that takes a Doc and returns a Doc:
from spacy.language import Language
# Register the component with a name
@Language.component("doc_length_logger")
def doc_length_logger(doc):
"""Log the document length."""
print(f"Processing document with {len(doc)} tokens")
return doc # Always return the doc!
# Add to pipeline
nlp_custom = spacy.load("en_core_web_sm")
nlp_custom.add_pipe("doc_length_logger", first=True) # Add at the beginning
print("Pipeline:", nlp_custom.pipe_names)Pipeline: ['doc_length_logger', 'tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# Now it runs automatically
doc = nlp_custom("This is a test sentence.")
Processing document with 6 tokens
Controlling Component Position
Where you add a component matters:
# Different positioning options
nlp_demo = spacy.load("en_core_web_sm")
@Language.component("position_demo")
def position_demo(doc):
return doc
# Add at specific positions
# nlp_demo.add_pipe("position_demo", first=True) # At the beginning
# nlp_demo.add_pipe("position_demo", last=True) # At the end (default)
# nlp_demo.add_pipe("position_demo", before="ner") # Before NER
# nlp_demo.add_pipe("position_demo", after="tagger") # After tagger
nlp_demo.add_pipe("position_demo", after="ner")
print("Pipeline order:", nlp_demo.pipe_names)Pipeline order: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'position_demo']
A Practical Component: Loaded Language Detector
Let’s build something useful — a component that detects loaded language in news articles. Loaded language uses words that carry strong emotional connotations or implicit judgments, potentially revealing bias in reporting.
Consider the difference between:
“The senator said...” (neutral)
“The senator claimed...” (implies doubt)
“The senator admitted...” (implies guilt)
@Language.component("loaded_language_detector")
def loaded_language_detector(doc):
"""Detect loaded/biased language in text."""
# Words that imply doubt or skepticism (use LEMMA forms!)
doubt_markers = {"claim", "allege", "purport", "supposedly"} # "claimed" -> "claim"
# Words that imply guilt or wrongdoing (LEMMA forms)
guilt_markers = {"admit", "confess", "concede"} # "admitted" -> "admit"
# Emotionally charged descriptors (LEMMA forms)
charged_words = {"radical", "extremist", "regime", "slam", "blast",
"destroy", "crush", "controversial", "embattled"} # "slammed" -> "slam"
# Some words need TEXT matching (hyphenated, don't lemmatize well)
text_markers = {"so-called"}
loaded_tokens = []
for token in doc:
lemma = token.lemma_.lower()
text = token.text.lower()
# Check lemma-based markers
if lemma in doubt_markers:
loaded_tokens.append((token.text, token.i, "DOUBT"))
elif lemma in guilt_markers:
loaded_tokens.append((token.text, token.i, "GUILT"))
elif lemma in charged_words:
loaded_tokens.append((token.text, token.i, "CHARGED"))
# Check text-based markers (for hyphenated words, etc.)
elif text in text_markers:
loaded_tokens.append((token.text, token.i, "DOUBT"))
if loaded_tokens:
print(f"⚠️ Loaded language detected: {[(t[0], t[2]) for t in loaded_tokens]}")
return doc
nlp_bias = spacy.load("en_core_web_sm")
nlp_bias.add_pipe("loaded_language_detector", last=True)
# Test with different phrasings of similar content
examples = [
"The CEO announced the quarterly results.",
"The CEO claimed the company was profitable.",
"The embattled CEO admitted to the accounting errors.",
"Critics slammed the controversial policy as radical.",
]
for text in examples:
print(f"\n'{text}'")
doc = nlp_bias(text)
'The CEO announced the quarterly results.'
'The CEO claimed the company was profitable.'
⚠️ Loaded language detected: [('claimed', 'DOUBT')]
'The embattled CEO admitted to the accounting errors.'
⚠️ Loaded language detected: [('embattled', 'CHARGED'), ('admitted', 'GUILT')]
'Critics slammed the controversial policy as radical.'
⚠️ Loaded language detected: [('slammed', 'CHARGED'), ('controversial', 'CHARGED'), ('radical', 'CHARGED')]
A Complex Example: Source Attribution Analyzer
In journalism, who is cited and how they’re introduced matters enormously. Let’s build a sophisticated component that extracts source attributions — phrases like “According to experts”, “Officials said”, or “Sources familiar with the matter claim”.
This component will use SpaCy’s Matcher to find patterns that indicate attribution, then analyze the language used.
from spacy.matcher import Matcher
from spacy.tokens import Span
# Attribution patterns we want to detect
# These capture common ways journalists attribute information
@Language.factory("source_attribution_analyzer")
def create_attribution_analyzer(nlp, name):
"""Factory that creates an attribution analyzer with pattern matching."""
matcher = Matcher(nlp.vocab)
# Pattern: "According to [ENTITY/noun phrase]"
matcher.add("ACCORDING_TO", [
[{"LOWER": "according"}, {"LOWER": "to"}, {"POS": {"IN": ["PROPN", "NOUN"]}, "OP": "+"}]
])
# Pattern: "[Someone] said/stated/claimed/argued"
matcher.add("SPEECH_VERB", [
[{"POS": "PROPN", "OP": "+"}, {"LEMMA": {"IN": ["say", "state", "claim", "argue", "assert", "contend", "insist"]}}],
[{"POS": "NOUN", "OP": "+"}, {"LEMMA": {"IN": ["say", "state", "claim", "argue", "assert", "contend", "insist"]}}]
])
# Pattern: "Sources [familiar with / close to] ... said"
matcher.add("ANONYMOUS_SOURCE", [
[{"LOWER": "sources"}, {"OP": "*", "IS_ALPHA": True}, {"LEMMA": {"IN": ["say", "claim", "report", "indicate"]}}]
])
# Pattern: "[officials/experts/analysts] [verb]"
matcher.add("EXPERT_CITE", [
[{"LOWER": {"IN": ["officials", "experts", "analysts", "researchers", "scientists", "observers"]}},
{"LEMMA": {"IN": ["say", "believe", "warn", "suggest", "note", "argue"]}}]
])
def attribution_analyzer(doc):
matches = matcher(doc)
attributions = []
for match_id, start, end in matches:
pattern_name = nlp.vocab.strings[match_id]
span = doc[start:end]
attributions.append({
"text": span.text,
"type": pattern_name,
"start": start,
"end": end
})
# Store for later use (we'll add proper extensions soon)
if attributions:
print(f"📰 Found {len(attributions)} attribution(s):")
for attr in attributions:
print(f" [{attr['type']}] \"{attr['text']}\"")
return doc
return attribution_analyzer
# Create the pipeline
nlp_news = spacy.load("en_core_web_sm")
nlp_news.add_pipe("source_attribution_analyzer", last=True)
print("Pipeline:", nlp_news.pipe_names)Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'source_attribution_analyzer']
# Test with real news-style sentences
news_examples = [
"According to White House officials, the policy will take effect immediately.",
"Critics claimed the proposal was too aggressive.",
"Sources familiar with the matter said negotiations had stalled.",
"Dr. Smith stated that the results were preliminary.",
"Experts warn that climate change poses significant risks.",
"The company announced record profits yesterday.", # No attribution - direct statement
]
print("=" * 60)
for text in news_examples:
print(f"\n\"{text}\"")
doc = nlp_news(text)
============================================================
"According to White House officials, the policy will take effect immediately."
📰 Found 3 attribution(s):
[ACCORDING_TO] "According to White"
[ACCORDING_TO] "According to White House"
[ACCORDING_TO] "According to White House officials"
"Critics claimed the proposal was too aggressive."
📰 Found 1 attribution(s):
[SPEECH_VERB] "Critics claimed"
"Sources familiar with the matter said negotiations had stalled."
📰 Found 2 attribution(s):
[SPEECH_VERB] "matter said"
[ANONYMOUS_SOURCE] "Sources familiar with the matter said"
"Dr. Smith stated that the results were preliminary."
📰 Found 2 attribution(s):
[SPEECH_VERB] "Smith stated"
[SPEECH_VERB] "Dr. Smith stated"
"Experts warn that climate change poses significant risks."
📰 Found 1 attribution(s):
[EXPERT_CITE] "Experts warn"
"The company announced record profits yesterday."
Two things stand out in this output. First, the greedy "OP": "+" patterns generate overlapping matches ("According to White", "According to White House", ...) because the Matcher returns every match, not just the longest one. Second, notice how the last example has no attribution — it’s presented as direct fact. In media analysis, the absence of attribution can be just as significant as its presence.
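When you only want the longest non-overlapping matches, spacy.util.filter_spans() handles the deduplication. A standalone sketch (rebuilding just the ACCORDING_TO pattern, since the analyzer's matcher is hidden inside the factory closure):
from spacy.matcher import Matcher
from spacy.util import filter_spans
matcher = Matcher(nlp.vocab)
matcher.add("ACCORDING_TO", [
    [{"LOWER": "according"}, {"LOWER": "to"}, {"POS": {"IN": ["PROPN", "NOUN"]}, "OP": "+"}]
])
doc = nlp("According to White House officials, the policy will take effect immediately.")
spans = [doc[start:end] for _, start, end in matcher(doc)]
print([span.text for span in filter_spans(spans)])  # keeps only the longest span
# -> ['According to White House officials']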
Extension Attributes: Custom Metadata
Sometimes you need to store custom data on documents, tokens, or spans. SpaCy’s extension attributes let you attach arbitrary metadata using the ._ namespace.
Setting Up Extensions
There are three types of extensions. The most commonly used is the property extension, which computes its value through a getter function:
from spacy.tokens import Doc, Token, Span
# Property extension with a getter function
def get_is_hedge_word(token):
"""Check if a token is a hedge word (indicates uncertainty)."""
hedge_words = {"may", "might", "could", "possibly", "potentially",
"perhaps", "likely", "unlikely", "appears", "seems",
"suggests", "reportedly", "allegedly", "purportedly"}
return token.lemma_.lower() in hedge_words
# Register the extension
Token.set_extension("is_hedge", getter=get_is_hedge_word, force=True)
doc = nlp("The policy may potentially affect millions, experts suggest.")
print("Hedge words found:")
for token in doc:
if token._.is_hedge:
print(f" '{token.text}' at position {token.i}")Hedge words found:
'may' at position 2
'potentially' at position 3
# Property extension on Span - check hedging in a sentence
def span_hedge_count(span):
return sum(1 for token in span if token._.is_hedge)
Span.set_extension("hedge_count", getter=span_hedge_count, force=True)
doc = nlp("The results are conclusive. However, they may possibly change. Time will tell.")
for sent in doc.sents:
hedges = sent._.hedge_count
certainty = "uncertain" if hedges > 0 else "direct"
print(f"[{certainty}] ({hedges} hedges) '{sent.text}'")[direct] (0 hedges) 'The results are conclusive.'
[uncertain] (2 hedges) 'However, they may possibly change.'
[direct] (0 hedges) 'Time will tell.'
Other Extension Types
While property extensions (with getters) are most common, SpaCy also supports:
Attribute extensions — simple default values you can overwrite:
# Attribute extension - stores a value directly
Doc.set_extension("news_source", default=None, force=True)
Doc.set_extension("publish_date", default=None, force=True)
Doc.set_extension("bias_rating", default=None, force=True)
doc = nlp("The controversial bill passed despite opposition.")
doc._.news_source = "Reuters"
doc._.publish_date = "2025-01-20"
doc._.bias_rating = "center"
print(f"Source: {doc._.news_source}")
print(f"Date: {doc._.publish_date}")
print(f"Bias: {doc._.bias_rating}")Source: Reuters
Date: 2025-01-20
Bias: center
Method extensions — callable functions with arguments:
# Method extension - can take arguments
def count_word_category(doc, category):
"""Count words from a specific category."""
categories = {
"attribution": {"said", "stated", "claimed", "argued", "noted"},
"hedging": {"may", "might", "could", "possibly", "perhaps"},
"intensifiers": {"very", "extremely", "absolutely", "totally"}
}
words = categories.get(category, set())
return sum(1 for token in doc if token.lemma_.lower() in words)
Doc.set_extension("count_category", method=count_word_category, force=True)
doc = nlp("Officials said the very controversial policy may possibly be revised.")
print(f"Attribution words: {doc._.count_category('attribution')}")
print(f"Hedging words: {doc._.count_category('hedging')}")
print(f"Intensifiers: {doc._.count_category('intensifiers')}")Attribution words: 0
Hedging words: 2
Intensifiers: 1
Combining Components with Extensions
The real power comes from using extensions inside custom components. Let’s build an objectivity analyzer that scores how “objective” or “opinionated” a piece of text appears:
# Register extensions for our objectivity analyzer
Doc.set_extension("loaded_words", default=[], force=True)
Doc.set_extension("hedge_words_found", default=[], force=True)
Doc.set_extension("objectivity_score", getter=lambda doc:
max(0, 100 - (len(doc._.loaded_words) * 15) - (len(doc._.hedge_words_found) * 5)),
force=True
)
@Language.component("objectivity_analyzer")
def objectivity_analyzer(doc):
"""Analyze text for markers of bias and hedging."""
# Loaded/biased language - use LEMMA forms! (reduces objectivity significantly)
loaded = {"claim", "admit", "radical", "extremist", "slam",
"blast", "controversial", "embattled", "regime"} # lemmas
# Hedge words - use LEMMA forms! (slightly reduces objectivity)
hedges = {"may", "might", "could", "possibly", "perhaps", "allegedly",
"reportedly", "appear", "seem", "suggest"} # "appears" -> "appear"
found_loaded = []
found_hedges = []
for token in doc:
lemma = token.lemma_.lower()
if lemma in loaded:
found_loaded.append(token.text)
elif lemma in hedges:
found_hedges.append(token.text)
doc._.loaded_words = found_loaded
doc._.hedge_words_found = found_hedges
return doc
nlp_obj = spacy.load("en_core_web_sm")
nlp_obj.add_pipe("objectivity_analyzer", last=True)
# Test with articles of varying objectivity
articles = [
# Relatively objective
"The company reported quarterly earnings of $2.5 billion, exceeding analyst expectations.",
# Some hedging
"The policy may possibly affect healthcare costs, according to preliminary estimates.",
# Loaded language
"The radical proposal was slammed by critics as controversial and extreme.",
# Very loaded
"The embattled CEO admitted the so-called innovation was a failure after critics blasted the controversial decision.",
]
print("Objectivity Analysis")
print("=" * 70)
for text in articles:
doc = nlp_obj(text)
print(f"\nText: {text[:60]}...")
print(f" Loaded words: {doc._.loaded_words}")
print(f" Hedge words: {doc._.hedge_words_found}")
print(f" Objectivity score: {doc._.objectivity_score}/100")Objectivity Analysis
======================================================================
Text: The company reported quarterly earnings of $2.5 billion, exc...
Loaded words: []
Hedge words: []
Objectivity score: 100/100
Text: The policy may possibly affect healthcare costs, according t...
Loaded words: []
Hedge words: ['may', 'possibly']
Objectivity score: 90/100
Text: The radical proposal was slammed by critics as controversial...
Loaded words: ['radical', 'slammed', 'controversial']
Hedge words: []
Objectivity score: 55/100
Text: The embattled CEO admitted the so-called innovation was a fa...
Loaded words: ['embattled', 'admitted', 'blasted', 'controversial']
Hedge words: []
Objectivity score: 40/100
Scaling Up: Processing Large Volumes
When processing thousands or millions of documents, efficiency matters. SpaCy’s nlp.pipe() method processes texts in batches, which is much faster than calling nlp() on each text individually.
The Wrong Way vs. The Right Way
texts = [
"Apple announced new products.",
"Google released an AI update.",
"Microsoft acquired a startup.",
"Amazon expanded cloud services.",
"Meta launched new features."
] * 100 # 500 texts
# SLOW: Processing one at a time
start = time.perf_counter()
docs_slow = [nlp(text) for text in texts]
slow_time = time.perf_counter() - start
# FAST: Using nlp.pipe()
start = time.perf_counter()
docs_fast = list(nlp.pipe(texts))
fast_time = time.perf_counter() - start
print(f"One at a time: {slow_time:.3f}s")
print(f"Using pipe(): {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x faster")One at a time: 1.714s
Using pipe(): 0.395s
Speedup: 4.3x faster
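Two more knobs worth knowing about: batch_size controls how many texts are processed per batch, and n_process spreads the work across multiple processes. Multiprocessing has real startup overhead, so it only pays off on substantial workloads; a hedged sketch:
# Batch size and multiprocessing knobs for big jobs
# (n_process spawns worker processes - only worthwhile for large corpora)
docs = list(nlp.pipe(texts, batch_size=64, n_process=2))
print(len(docs))  # -> 500, same results as before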
Passing Context with as_tuples
Often you need to track metadata alongside your documents:
# Documents with metadata
data = [
("Apple stock rose 5% today.", {"source": "Reuters", "date": "2025-01-20"}),
("New iPhone features announced.", {"source": "TechCrunch", "date": "2025-01-19"}),
("Tim Cook speaks at conference.", {"source": "Bloomberg", "date": "2025-01-18"}),
]
# Process while keeping context
Doc.set_extension("source", default=None, force=True)
Doc.set_extension("date", default=None, force=True)
for doc, context in nlp.pipe(data, as_tuples=True):
doc._.source = context["source"]
doc._.date = context["date"]
entities = [ent.text for ent in doc.ents]
print(f"[{doc._.source}] {doc._.date}: {entities}")[Reuters] 2025-01-20: ['Apple', '5%', 'today']
[TechCrunch] 2025-01-19: []
[Bloomberg] 2025-01-18: ['Tim Cook']
Combining Optimizations
For maximum speed, combine nlp.pipe() with disabled components:
large_texts = ["Sample text about technology companies."] * 1000
# Maximum optimization: batch processing + minimal pipeline
start = time.perf_counter()
with nlp.select_pipes(enable=["ner"]):
docs = list(nlp.pipe(large_texts, batch_size=50))
optimized_time = time.perf_counter() - start
print(f"Processed {len(docs)} documents in {optimized_time:.3f}s")
print(f"Rate: {len(docs)/optimized_time:.0f} docs/second")Processed 1000 documents in 0.452s
Rate: 2213 docs/second
Putting It All Together
Let’s combine everything we’ve learned into a complete Media Bias Analyzer — a custom pipeline that processes news articles and provides a comprehensive bias analysis.
from spacy.matcher import Matcher
# Complete custom pipeline for media bias analysis
# 1. Set up all extensions
Doc.set_extension("bias_indicators", default=[], force=True)
Doc.set_extension("attribution_count", default=0, force=True)
Doc.set_extension("anonymous_sources", default=0, force=True)
Doc.set_extension("bias_score", getter=lambda doc:
min(100, len(doc._.bias_indicators) * 20 + doc._.anonymous_sources * 10),
force=True
)
Doc.set_extension("analysis_summary", getter=lambda doc:
f"Bias score: {doc._.bias_score}/100 | "
f"{len(doc._.bias_indicators)} loaded terms | "
f"{doc._.attribution_count} attributions ({doc._.anonymous_sources} anonymous)",
force=True
)
# 2. Create comprehensive analyzer component
@Language.factory("media_bias_analyzer")
def create_media_bias_analyzer(nlp, name):
"""Complete media bias analysis component."""
# Loaded language categories - use LEMMA forms!
bias_lexicon = {
"doubt": {"claim", "allege", "purport"}, # "claimed" -> "claim"
"guilt": {"admit", "confess", "concede"}, # "admitted" -> "admit"
"charged": {"radical", "extremist", "regime", "slam", "blast",
"controversial", "embattled", "disgrace"}, # "slammed" -> "slam"
"praise": {"praise", "hail", "celebrate", "laud", "acclaim"} # "praised" -> "praise"
}
# Set up matcher for attribution patterns
matcher = Matcher(nlp.vocab)
matcher.add("ANONYMOUS", [
[{"LOWER": "sources"}, {"OP": "*"}, {"LEMMA": {"IN": ["say", "claim", "indicate"]}}],
[{"LOWER": "according"}, {"LOWER": "to"}, {"LOWER": {"IN": ["sources", "officials"]}}]
])
matcher.add("ATTRIBUTION", [
[{"POS": "PROPN", "OP": "+"}, {"LEMMA": {"IN": ["say", "state", "claim", "argue"]}}],
[{"LOWER": "according"}, {"LOWER": "to"}, {"POS": "PROPN", "OP": "+"}]
])
def media_bias_analyzer(doc):
# Find loaded language
indicators = []
for token in doc:
lemma = token.lemma_.lower()
for category, words in bias_lexicon.items():
if lemma in words:
indicators.append({
"word": token.text,
"category": category,
"position": token.i
})
# Find attribution patterns
matches = matcher(doc)
attribution_count = 0
anonymous_count = 0
for match_id, start, end in matches:
pattern_name = nlp.vocab.strings[match_id]
attribution_count += 1
if pattern_name == "ANONYMOUS":
anonymous_count += 1
# Store results
doc._.bias_indicators = indicators
doc._.attribution_count = attribution_count
doc._.anonymous_sources = anonymous_count
return doc
return media_bias_analyzer
# 3. Build the complete pipeline
nlp_analyzer = spacy.load("en_core_web_sm")
nlp_analyzer.add_pipe("media_bias_analyzer", last=True)
print("Media Bias Pipeline:", nlp_analyzer.pipe_names)Media Bias Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'media_bias_analyzer']
# 4. Analyze a collection of news articles
news_corpus = [
# Relatively neutral reporting
("The Federal Reserve announced a 0.25% interest rate increase on Wednesday. "
"Fed Chair Powell stated that inflation remains a concern.",
{"source": "Reuters", "topic": "Economy"}),
# Some bias markers
("Critics claimed the controversial policy would harm small businesses. "
"Sources familiar with the negotiations said talks had stalled.",
{"source": "Unknown", "topic": "Policy"}),
# Heavy bias
("The embattled senator admitted to the so-called ethics violations after "
"opponents slammed the radical proposal. Sources say more revelations are coming.",
{"source": "Partisan News", "topic": "Politics"}),
# Positive bias
("The acclaimed CEO was praised for the groundbreaking innovation. "
"Industry experts hailed the announcement as transformative.",
{"source": "Industry Mag", "topic": "Business"}),
]
print("=" * 70)
print("MEDIA BIAS ANALYSIS REPORT")
print("=" * 70)
for doc, meta in nlp_analyzer.pipe(news_corpus, as_tuples=True):
print(f"\n📰 Source: {meta['source']} | Topic: {meta['topic']}")
print(f" Text: \"{doc.text[:70]}...\"")
print(f" {doc._.analysis_summary}")
if doc._.bias_indicators:
print(f" Loaded terms: {[(i['word'], i['category']) for i in doc._.bias_indicators]}")
# Rating based on score
score = doc._.bias_score
if score < 20:
rating = "✅ Low bias"
elif score < 50:
rating = "⚠️ Moderate bias"
else:
rating = "🚨 High bias"
print(f" Rating: {rating}")
print("-" * 70)======================================================================
MEDIA BIAS ANALYSIS REPORT
======================================================================
📰 Source: Reuters | Topic: Economy
Text: "The Federal Reserve announced a 0.25% interest rate increase on Wednes..."
Bias score: 0/100 | 0 loaded terms | 3 attributions (0 anonymous)
Rating: ✅ Low bias
----------------------------------------------------------------------
📰 Source: Unknown | Topic: Policy
Text: "Critics claimed the controversial policy would harm small businesses. ..."
Bias score: 50/100 | 2 loaded terms | 1 attributions (1 anonymous)
Loaded terms: [('claimed', 'doubt'), ('controversial', 'charged')]
Rating: 🚨 High bias
----------------------------------------------------------------------
📰 Source: Partisan News | Topic: Politics
Text: "The embattled senator admitted to the so-called ethics violations afte..."
Bias score: 70/100 | 3 loaded terms | 1 attributions (1 anonymous)
Loaded terms: [('admitted', 'guilt'), ('slammed', 'charged'), ('radical', 'charged')]
Rating: 🚨 High bias
----------------------------------------------------------------------
📰 Source: Industry Mag | Topic: Business
Text: "The acclaimed CEO was praised for the groundbreaking innovation. Indus..."
Bias score: 40/100 | 2 loaded terms | 0 attributions (0 anonymous)
Loaded terms: [('praised', 'praise'), ('hailed', 'praise')]
Rating: ⚠️ Moderate bias
----------------------------------------------------------------------
This pipeline demonstrates the full power of SpaCy’s architecture:
Multiple detection methods: lexicon matching + pattern matching
Quantified output: numeric scores for comparison
Rich metadata: detailed breakdown of bias indicators
Batch processing: efficient analysis of document collections
Wrap-Up
Key Takeaways
Calling nlp(text) runs the tokenizer, then passes the Doc through an ordered sequence of components, each adding annotations.
Every annotation comes from a specific component; a missing attribute usually means a missing (or disabled) component.
You can disable, remove, and reorder components, and add your own with @Language.component or @Language.factory.
Extension attributes on the ._ namespace attach custom metadata to Doc, Token, and Span objects.
nlp.pipe() with batching and selective component disabling is the key to processing large volumes efficiently.
Pipeline Design Checklist
When building custom pipelines, consider:
What annotations do I actually need?
Can I disable any built-in components?
Where should my custom component go in the pipeline?
What data should I store using extensions?
Am I processing in batches for efficiency?
What’s Next
In Week 3, we’ll move from processing individual tokens to representing entire documents as vectors. We’ll explore text representation: bag of words, TF-IDF, and word embeddings — the foundation for machine learning on text.