Lab: Domain Detective
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L01.01: What is NLP?
L01.02: The Evolution and Practice of NLP
Working spaCy environment
Outcomes
Apply spaCy to analyze text from multiple domains
Identify where NLP tools succeed and fail across different text types
Discover domain-specific challenges in entity recognition and parsing
Build intuition for how text structure varies across domains
Practice collaborative problem-solving in small groups
Time: 40-60 minutes
Overview
In this lab, you’ll work in small groups (2-3 people) to investigate how spaCy performs across different domains. Real-world NLP isn’t just about processing clean, well-formed text — it’s about handling the messy diversity of human communication.
You’ll analyze four text samples from very different domains:
News articles
Clinical/medical notes
Social media
Legal documents
Your mission: discover what works, what breaks, and why.
Setup
import spacy
from spacy import displacy
from collections import Counter, defaultdict
# Load the model
nlp = spacy.load("en_core_web_sm")
# Our text samples from different domains
texts = {
"news": """
Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
headquarters in Brussels next month to discuss privacy regulations. The tech
giant has been at odds with EU regulators over the Digital Markets Act, which
requires major platforms to allow third-party app stores. Apple's stock rose
2.3% on the news, closing at $187.50.
""",
"clinical": """
Pt is a 67 y/o M presenting w/ SOB x 3 days. PMHx significant for HTN, DM2,
and CHF (EF 35%). Currently on metformin 1000mg BID, lisinopril 20mg daily,
and Lasix 40mg PRN. Vitals: BP 145/92, HR 88, RR 22, SpO2 94% on RA. Lungs
with bilateral crackles. Assessment: CHF exacerbation. Plan: IV Lasix,
cardiology consult, repeat echo.
""",
"social": """
just tried the new chatgpt update and WOW 🤯 it literally wrote my entire
essay in like 5 mins lmaooo. @openai really outdid themselves this time ngl.
anyone else think AI is gonna replace writers soon?? kinda scary tbh but also
lowkey excited for the future 🚀 #AI #ChatGPT #TheFutureIsNow
""",
"legal": """
WHEREAS, the Licensor owns certain intellectual property rights in and to
the Software (as defined herein), and WHEREAS, the Licensee desires to obtain
a non-exclusive, non-transferable license to use the Software subject to the
terms and conditions set forth in this Agreement; NOW, THEREFORE, in consideration
of the mutual covenants contained herein and for other good and valuable
consideration, the receipt and sufficiency of which are hereby acknowledged,
the parties agree as follows.
"""
}
# Helper function to process and display basic info
def analyze_text(text, domain_name):
    """Process text and return basic analysis."""
    doc = nlp(text.strip())
    print(f"\n{'='*60}")
    print(f"DOMAIN: {domain_name.upper()}")
    print(f"{'='*60}")
    print(f"Tokens: {len(doc)}")
    print(f"Sentences: {len(list(doc.sents))}")
    print(f"Entities: {len(doc.ents)}")
    return doc
Task 1: Entity Extraction Challenge (15 min)
Your first task is to run NER across all four domains and critically evaluate the results.
1.1 Extract and Compare Entities
def extract_entities(doc):
    """Extract entities grouped by label."""
    entities_by_type = defaultdict(list)
    for ent in doc.ents:
        entities_by_type[ent.label_].append(ent.text)
    return dict(entities_by_type)
# Process all texts
docs = {}
for domain, text in texts.items():
    docs[domain] = analyze_text(text, domain)
============================================================
DOMAIN: NEWS
============================================================
Tokens: 70
Sentences: 3
Entities: 11
============================================================
DOMAIN: CLINICAL
============================================================
Tokens: 90
Sentences: 7
Entities: 13
============================================================
DOMAIN: SOCIAL
============================================================
Tokens: 61
Sentences: 4
Entities: 5
============================================================
DOMAIN: LEGAL
============================================================
Tokens: 95
Sentences: 1
Entities: 4
# Show entities for each domain
for domain, doc in docs.items():
    print(f"\n--- {domain.upper()} ENTITIES ---")
    entities = extract_entities(doc)
    for label, ents in sorted(entities.items()):
        print(f"  {label}: {ents}")
--- NEWS ENTITIES ---
DATE: ['Tuesday', 'next month']
GPE: ['Brussels']
MONEY: ['187.50']
ORDINAL: ['third']
ORG: ['Apple Inc.', 'the European Union', 'EU', 'Apple']
PERCENT: ['2.3%']
PERSON: ['Tim Cook']
--- CLINICAL ENTITIES ---
CARDINAL: ['67', '1000', '20']
DATE: ['3 days', '145/92, HR 88']
ORG: ['SOB', 'CHF', 'BID', 'RA']
PERCENT: ['35%', '94%']
PERSON: ['Lasix 40', 'IV Lasix']
--- SOCIAL ENTITIES ---
CARDINAL: ['5']
GPE: ['AI']
MONEY: ['🚀 #AI #ChatGPT #']
ORG: ['WOW']
PERSON: ['kinda']
--- LEGAL ENTITIES ---
DATE: ['Licensor']
ORG: ['Software', 'Software']
PERSON: ['Licensee']
1.2 Discussion Questions
Work with your group to answer these questions:
Which domain yielded the most named entities? Is this what you expected?
Find at least 2 entities that were correctly identified. What made them easy for spaCy?
Find at least 2 entities that were missed or incorrectly labeled. What made them hard?
Look at the clinical text specifically. Which entities should have been found but weren’t? Why might a general-purpose model struggle here?
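On that last question: one common fix is to layer hand-written domain patterns on top of (or instead of) a statistical model. Below is a minimal sketch using spaCy's EntityRuler on a blank pipeline — the CONDITION and DRUG labels and the handful of patterns are our own illustrative choices, not part of any standard spaCy model:

```python
import spacy

# Start from a blank English pipeline so no statistical NER interferes
nlp_clinical = spacy.blank("en")
ruler = nlp_clinical.add_pipe("entity_ruler")

# Hand-written patterns for a few clinical abbreviations (illustrative only)
ruler.add_patterns([
    {"label": "CONDITION", "pattern": "HTN"},
    {"label": "CONDITION", "pattern": "DM2"},
    {"label": "CONDITION", "pattern": "CHF"},
    {"label": "DRUG", "pattern": "metformin"},
    {"label": "DRUG", "pattern": "lisinopril"},
])

demo_doc = nlp_clinical("PMHx significant for HTN, DM2, and CHF. Currently on metformin.")
for ent in demo_doc.ents:
    print(ent.text, "->", ent.label_)
```

A real clinical system would use a model trained on medical text (or a much larger pattern set), but even this toy ruler catches abbreviations the general-purpose model labeled as ORG.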
# Space for your exploration
# Try visualizing entities for a specific domain:
# displacy.render(docs["clinical"], style="ent", jupyter=True)
Task 2: Ambiguity Hunt (15 min)
In this task, you’ll find sentences that are ambiguous or challenging, predict how spaCy will handle them, and then test your predictions.
2.1 Find Ambiguous Sentences
Look through the texts (or create your own examples) to find sentences with:
Unclear pronoun references
Words that could have multiple meanings
Unusual structure or formatting
2.2 Test and Analyze
def analyze_sentence(sentence):
    """Analyze a single sentence in detail."""
    doc = nlp(sentence)
    print(f"Sentence: {sentence}")
    print(f"\nTokens and Dependencies:")
    for token in doc:
        print(f"  {token.text:15} | POS: {token.pos_:6} | DEP: {token.dep_:12} | HEAD: {token.head.text}")
    print(f"\nEntities:")
    if doc.ents:
        for ent in doc.ents:
            print(f"  {ent.text:20} -> {ent.label_}")
    else:
        print("  (none detected)")
    return doc
# Example: Test an ambiguous sentence from the social media text
test_sentence = "anyone else think AI is gonna replace writers soon??"
analyze_sentence(test_sentence)
Sentence: anyone else think AI is gonna replace writers soon??
Tokens and Dependencies:
anyone | POS: PRON | DEP: nsubj | HEAD: think
else | POS: ADV | DEP: advmod | HEAD: anyone
think | POS: VERB | DEP: ROOT | HEAD: think
AI | POS: PROPN | DEP: nsubj | HEAD: gon
is | POS: AUX | DEP: aux | HEAD: gon
gon | POS: VERB | DEP: ccomp | HEAD: think
na | POS: PART | DEP: aux | HEAD: replace
replace | POS: VERB | DEP: xcomp | HEAD: gon
writers | POS: NOUN | DEP: dobj | HEAD: replace
soon | POS: ADV | DEP: advmod | HEAD: replace
? | POS: PUNCT | DEP: punct | HEAD: think
? | POS: PUNCT | DEP: punct | HEAD: think
Entities:
AI -> GPE
2.3 Your Turn
Test at least 3 challenging sentences. For each one:
Predict what spaCy will do (before running)
Test your prediction
Explain any surprises
# Example sentences to try (or use your own):
challenge_sentences = [
    # From clinical: abbreviations
    "Pt is a 67 y/o M presenting w/ SOB.",
    # Pronoun ambiguity
    "Apple's stock rose because it announced new privacy features.",
    # Social media style
    "@openai really outdid themselves this time ngl",
    # Legal complexity
    "The Licensor owns certain intellectual property rights in and to the Software.",
]
# Test one of them:
# analyze_sentence(challenge_sentences[0])
# Your exploration here
Task 3: Cross-Domain Comparison (10 min)
Now let’s quantify the differences across domains with some basic statistics.
3.1 Calculate Domain Statistics
def domain_statistics(doc, domain_name):
    """Calculate comprehensive statistics for a document."""
    sentences = list(doc.sents)
    # Basic counts
    stats = {
        "domain": domain_name,
        "num_tokens": len(doc),
        "num_sentences": len(sentences),
        "num_entities": len(doc.ents),
    }
    # Average sentence length
    stats["avg_sentence_length"] = sum(len(s) for s in sentences) / len(sentences) if sentences else 0
    # POS distribution (top 5)
    pos_counts = Counter(token.pos_ for token in doc)
    stats["pos_distribution"] = pos_counts.most_common(5)
    # Entity density (entities per sentence)
    stats["entity_density"] = stats["num_entities"] / stats["num_sentences"] if sentences else 0
    # Vocabulary richness (unique tokens / total tokens)
    unique_tokens = len(set(token.text.lower() for token in doc if token.is_alpha))
    alpha_tokens = len([t for t in doc if t.is_alpha])
    stats["vocab_richness"] = unique_tokens / alpha_tokens if alpha_tokens else 0
    # Stop word ratio
    stop_words = sum(1 for token in doc if token.is_stop)
    stats["stop_word_ratio"] = stop_words / len(doc) if doc else 0
    return stats
# Calculate stats for all domains
all_stats = []
for domain, doc in docs.items():
    stats = domain_statistics(doc, domain)
    all_stats.append(stats)

# Display comparison
print(f"{'Domain':<10} | {'Tokens':<7} | {'Sents':<6} | {'Ents':<5} | {'Avg Sent Len':<12} | {'Entity Dens':<11} | {'Vocab Rich':<10}")
print("-" * 85)
for stats in all_stats:
    print(f"{stats['domain']:<10} | {stats['num_tokens']:<7} | {stats['num_sentences']:<6} | {stats['num_entities']:<5} | {stats['avg_sentence_length']:<12.1f} | {stats['entity_density']:<11.2f} | {stats['vocab_richness']:<10.2f}")
Domain     | Tokens  | Sents  | Ents  | Avg Sent Len | Entity Dens | Vocab Rich
-------------------------------------------------------------------------------------
news       | 70      | 3      | 11    | 23.3         | 3.67        | 0.89
clinical   | 90      | 7      | 13    | 12.9         | 1.86        | 0.88
social     | 61      | 4      | 5     | 15.2         | 1.25        | 0.94
legal      | 95      | 1      | 4     | 95.0         | 4.00        | 0.68
3.2 Analysis Questions
# Explore POS distributions
print("\n=== POS DISTRIBUTIONS BY DOMAIN ===")
for stats in all_stats:
    print(f"\n{stats['domain'].upper()}:")
    for pos, count in stats['pos_distribution']:
        print(f"  {pos}: {count}")
=== POS DISTRIBUTIONS BY DOMAIN ===
NEWS:
NOUN: 16
PROPN: 13
VERB: 7
ADP: 6
PUNCT: 6
CLINICAL:
PROPN: 23
PUNCT: 21
NOUN: 16
NUM: 11
ADP: 4
SOCIAL:
NOUN: 13
ADV: 7
VERB: 6
PROPN: 5
PUNCT: 5
LEGAL:
NOUN: 15
PUNCT: 12
ADJ: 12
DET: 10
VERB: 10
Work with your group to answer:
Which domain has the longest sentences on average? Why might this be?
Which domain has the highest entity density? What does this tell you about the content?
Look at the POS distributions. How does social media differ from legal text in terms of word types used?
Vocabulary richness measures unique words vs. total words. Which domain is most repetitive? Most varied? Why?
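To make the vocabulary-richness question concrete, here is the same unique-over-total calculation on two toy strings. This is a simplified whitespace-split sketch of the metric in `domain_statistics` (no spaCy model needed); the example strings are made up for illustration:

```python
def type_token_ratio(text):
    """Unique lowercase words divided by total words (alphabetic only)."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    return len(set(words)) / len(words) if words else 0.0

# Legal-style boilerplate repeats the same terms over and over
repetitive = "the party the party the party agrees the party agrees"
# News-style prose tends to use each word once
varied = "Apple announced Tim Cook will visit Brussels next month"

print(type_token_ratio(repetitive))  # low ratio: only 3 unique words in 10
print(type_token_ratio(varied))      # high ratio: every word is unique
```

A low ratio signals repetition (as in the legal sample), while a ratio near 1.0 signals varied vocabulary — the same contrast visible in the table above.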
# Your exploration here
Share-Out (10 min)
Prepare a 2-minute presentation for the class covering:
Your most interesting finding: What surprised you most about how spaCy handled these texts?
A domain-specific challenge: Pick one domain and explain the biggest NLP challenge you discovered.
A practical recommendation: If you were building an NLP system for one of these domains, what would you do differently than using spaCy out-of-the-box?
Bonus Challenge: Entity Co-occurrence (If time permits)
For groups that finish early: find which entities appear together in the same sentence. This is a preview of knowledge graph extraction — a technique we’ll explore later in the course.
def find_entity_pairs(doc):
    """Find pairs of entities that co-occur in the same sentence."""
    pairs = []
    for sent in doc.sents:
        # Get entities in this sentence
        sent_ents = [ent for ent in doc.ents
                     if ent.start >= sent.start and ent.end <= sent.end]
        # Find all unique pairs
        for i, ent1 in enumerate(sent_ents):
            for ent2 in sent_ents[i+1:]:
                pairs.append({
                    "entity1": ent1.text,
                    "label1": ent1.label_,
                    "entity2": ent2.text,
                    "label2": ent2.label_,
                    "sentence": sent.text.strip()
                })
    return pairs
# Find entity pairs in the news text
pairs = find_entity_pairs(docs["news"])
print("=== ENTITY CO-OCCURRENCES (News) ===")
for pair in pairs:
    print(f"\n{pair['entity1']} ({pair['label1']}) <-> {pair['entity2']} ({pair['label2']})")
    print(f"  Context: {pair['sentence'][:80]}...")
=== ENTITY CO-OCCURRENCES (News) ===
Apple Inc. (ORG) <-> Tuesday (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Apple Inc. (ORG) <-> Tim Cook (PERSON)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Apple Inc. (ORG) <-> the European Union (ORG)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Apple Inc. (ORG) <-> Brussels (GPE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Apple Inc. (ORG) <-> next month (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tuesday (DATE) <-> Tim Cook (PERSON)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tuesday (DATE) <-> the European Union (ORG)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tuesday (DATE) <-> Brussels (GPE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tuesday (DATE) <-> next month (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tim Cook (PERSON) <-> the European Union (ORG)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tim Cook (PERSON) <-> Brussels (GPE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Tim Cook (PERSON) <-> next month (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
the European Union (ORG) <-> Brussels (GPE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
the European Union (ORG) <-> next month (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
Brussels (GPE) <-> next month (DATE)
Context: Apple Inc. announced Tuesday that CEO Tim Cook will visit the European Union
...
EU (ORG) <-> third (ORDINAL)
Context: The tech
giant has been at odds with EU regulators over the Digital Markets ...
Apple (ORG) <-> 2.3% (PERCENT)
Context: Apple's stock rose
2.3% on the news, closing at $187.50....
Apple (ORG) <-> 187.50 (MONEY)
Context: Apple's stock rose
2.3% on the news, closing at $187.50....
2.3% (PERCENT) <-> 187.50 (MONEY)
Context: Apple's stock rose
2.3% on the news, closing at $187.50....
Discussion: What relationships do these co-occurrences reveal? How could this be useful for building a knowledge base?
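One small step toward a knowledge base is to collapse the raw pairs into counts of label combinations, which reveals the kinds of relationships a domain tends to express. A sketch operating on the same list-of-dicts shape `find_entity_pairs` returns; the sample data below is made up to mirror the news output, not computed from it:

```python
from collections import Counter

# Sample pairs in the same shape find_entity_pairs produces (illustrative)
sample_pairs = [
    {"entity1": "Apple Inc.", "label1": "ORG", "entity2": "Tim Cook", "label2": "PERSON"},
    {"entity1": "Apple Inc.", "label1": "ORG", "entity2": "Brussels", "label2": "GPE"},
    {"entity1": "EU", "label1": "ORG", "entity2": "Brussels", "label2": "GPE"},
    {"entity1": "Tim Cook", "label1": "PERSON", "entity2": "Brussels", "label2": "GPE"},
]

# Count label combinations, sorting each pair so order doesn't matter
label_counts = Counter(
    tuple(sorted((p["label1"], p["label2"]))) for p in sample_pairs
)

for labels, count in label_counts.most_common():
    print(f"{labels[0]} <-> {labels[1]}: {count}")
```

Running the same aggregation over `find_entity_pairs(docs["news"])` would show which entity-type relationships (ORG-PERSON, ORG-GPE, …) dominate news text — a crude but useful schema for a knowledge graph.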
Wrap-Up
What You Practiced
Applied NLP to real-world text from diverse domains
Critically evaluated model performance — not just running code, but understanding outputs
Discovered domain-specific challenges that general-purpose models face
Built intuition for when NLP works well and when it struggles
Key Insights
Domain matters: A model trained on news struggles with clinical abbreviations and social media slang
Evaluation is crucial: Counting entities isn’t enough — you need to verify correctness
Text structure varies wildly: Sentence length, vocabulary, and POS distributions differ dramatically across domains
No model is perfect: Knowing limitations is as important as knowing capabilities
Looking Ahead
Next week, we’ll dive into text preprocessing — the techniques for cleaning and normalizing text before analysis. You’ll learn how to handle many of the challenges you discovered today: abbreviations, informal language, domain-specific terminology.