The Evolution and Practice of NLP
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
Prerequisites
L01.01: What is NLP?
Basic Python and SpaCy setup complete
Outcomes
Trace the evolution of NLP from rule-based systems to modern LLMs
Understand why each paradigm shift occurred (not just what changed)
Identify major NLP application areas and their industry impact
Use SpaCy to perform tokenization, POS tagging, NER, and dependency parsing
Interpret SpaCy’s linguistic annotations on real text
References
Jurafsky & Martin, Speech and Language Processing (3rd ed. draft), Chapters 1 and 2
A Brief History of NLP
Understanding where NLP came from helps us appreciate where it’s going. The history of the field is a story of big dreams, spectacular failures, and hard-won insights about what actually works.
The 1950s-1960s: The Dream Begins
The field of NLP began with remarkable optimism. In 1950, Alan Turing proposed his famous test: could a machine converse so convincingly that a human couldn’t tell it wasn’t human? The goal was set.
In 1954, the Georgetown-IBM experiment demonstrated automatic translation of 60 Russian sentences into English. The researchers boldly predicted that machine translation would be “solved” within 3-5 years.
They were... wrong.
By 1966, the ALPAC report delivered a devastating assessment: machine translation wasn’t anywhere close to working. Funding dried up. The first “AI Winter” had begun.
But that same year, something interesting happened. Joseph Weizenbaum created ELIZA, a simple chatbot that simulated a Rogerian psychotherapist:
Human: I am feeling sad today.
ELIZA: How long have you been feeling sad today?
Human: My mother doesn't understand me.
ELIZA: Tell me more about your family.
ELIZA used simple pattern matching — no real understanding. Yet people found it compelling, even knowing it was a machine. Weizenbaum was disturbed by how easily humans anthropomorphized his creation.
What this era established: The goal of machine language understanding, and the sobering realization that it was much harder than anyone expected.
The 1970s-1980s: Knowledge Engineering
If we can’t learn language from data, perhaps we can encode it manually. This era was defined by knowledge engineering — the painstaking process of hand-crafting rules, grammars, and knowledge bases for rule-based systems.
The poster child was SHRDLU (1970), Terry Winograd’s system that could understand and respond to commands about a simulated “blocks world”:
Human: Put the red block on the blue block.
SHRDLU: OK.
Human: What is on the blue block?
SHRDLU: The red block.
Human: Why did you put the red block there?
SHRDLU: Because you asked me to.
Impressive! But notice the constraint: a tiny world with just colored blocks. In this microworld, SHRDLU worked beautifully.
The problem? The real world isn’t a blocks world. Attempts to scale these approaches to unrestricted language failed. The rules became unwieldy, the exceptions multiplied, and the systems grew brittle — breaking on inputs their designers hadn’t anticipated.
By the late 1980s, another AI Winter had set in.
The 1990s-2000s: The Statistical Revolution
The breakthrough came from an unexpected direction: speech recognition research at IBM. Fred Jelinek famously quipped (paraphrased):
“Every time I fire a linguist, the performance of our speech recognizer goes up.”
The insight was radical: instead of encoding linguistic rules, let the data decide. Count how often words appear together. Calculate probabilities. Let statistics do the work.
N-gram models exemplify this approach. Consider the phrase:
“recognize speech” vs “wreck a nice beach”
These sound almost identical when spoken. How does a computer choose? By calculating which sequence of words is more probable in English:
P(“recognize speech”) = pretty high (common phrase)
P(“wreck a nice beach”) = pretty low (weird phrase)
No linguistic rules needed — just word frequency statistics from millions of documents.
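To make that concrete, here is a minimal sketch of a bigram model over a tiny invented corpus (the sentences and counts are made up for illustration; real systems estimate these probabilities from millions of documents and smooth the counts so unseen word pairs do not get probability zero):

```python
# Minimal bigram-model sketch: score a phrase by multiplying P(word | previous word),
# where each probability comes from raw co-occurrence counts.
from collections import Counter

# Tiny invented corpus (a real model would use millions of documents)
corpus = (
    "we recognize speech on phones "
    "systems recognize speech well "
    "people recognize speech easily "
    "a nice beach is fun "
    "we wreck nothing"
).split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def phrase_probability(phrase: str) -> float:
    """Approximate P(phrase) as the product of bigram probabilities."""
    words = phrase.split()
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        prob *= bigram_counts[(prev, curr)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
    return prob

print(phrase_probability("recognize speech"))     # high: this pair is frequent in the toy corpus
print(phrase_probability("wreck a nice beach"))   # 0.0 here, since "wreck a" never occurs
```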
This era also saw the rise of machine learning for NLP: Naive Bayes for text classification, Hidden Markov Models for part-of-speech tagging, statistical parsing algorithms.
The key insight: More data beats clever algorithms. A simple model trained on massive data often outperforms a complex model with limited data.
The 2010s: Deep Learning Changes Everything
The statistical revolution relied on hand-crafted features. Someone had to decide what to count — which word combinations, which patterns. Feature engineering was an art.
Deep learning changed that. Instead of hand-crafted features, neural networks learn their own representations from raw data.
The watershed moment was Word2Vec (2013). Tomas Mikolov and colleagues at Google showed that you could represent words as dense vectors — points in a high-dimensional space — learned purely from text.
The magic? These vectors captured meaning. Words with similar meanings clustered together. And you could do arithmetic:
king - man + woman ≈ queen
paris - france + italy ≈ rome
This wasn’t programmed. It emerged from patterns in billions of words.
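You can reproduce these analogies with pretrained vectors, for example through gensim's downloader. This is a sketch: the first call downloads the vectors, and the exact nearest neighbors depend on which embedding set you load (small GloVe vectors are used here; the original word2vec-google-news-300 vectors work the same way but are a much larger download).

```python
# Sketch: word-vector arithmetic with pretrained embeddings loaded through gensim.
# Assumes gensim is installed; the vectors are downloaded on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # small pretrained GloVe vectors

# king - man + woman: the nearest remaining word is typically "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy: typically "rome"
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```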
Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs) added the ability to process sequences — essential for language, which unfolds word by word.
The key insight: Learned representations beat hand-crafted features. Let the model discover what matters.
2017-Present: The Transformer Revolution
In 2017, a paper from Google with a provocative title changed everything: “Attention Is All You Need.”
The Transformer architecture abandoned the sequential processing of RNNs for a mechanism called self-attention that could look at all words in a sentence simultaneously, weighing their relevance to each other.
This enabled unprecedented parallelization and scaling. Models could now be trained on more data, with more parameters, faster than ever before.
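The core computation behind self-attention is compact enough to sketch. Below is a toy, single-head scaled dot-product attention in NumPy, with random matrices standing in for learned weights, just to show that every position attends to every other position in a single matrix multiplication:

```python
# Toy single-head scaled dot-product self-attention (random data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                         # 4 "words", 8-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))         # stand-in for word embeddings

# Learned projections in a real model; random matrices here just to show the shapes
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)             # all-pairs relevance scores
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                            # each position mixes info from all positions

print(weights.round(2))   # row i shows how much word i attends to each word j
```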
What followed was an explosion:
BERT (2018): Bidirectional understanding of text. State-of-the-art on virtually every NLP benchmark.
GPT-2 (2019): Generated such convincing text that OpenAI initially withheld it.
GPT-3 (2020): 175 billion parameters. Few-shot learning. The beginning of “prompt engineering.”
ChatGPT (2022): NLP goes mainstream. Your grandmother knows what a chatbot is now.
Remember the Winograd Schema Challenge from the previous lecture?
"The trophy wouldn't fit in the suitcase because it was too big."Modern LLMs solve these effortlessly. They’ve seen so much text that common-sense knowledge emerges from patterns.
The key insight: Scale + architecture + data = emergent capabilities. Things that seemed impossible become possible at sufficient scale.
NLP Applications in the Real World
Now that we understand where NLP came from, let’s survey where it’s used today. NLP is everywhere — often invisibly.
Core NLP Tasks
| Task | Description | Examples |
|---|---|---|
| Machine Translation | Convert text between languages | Google Translate, DeepL |
| Sentiment Analysis | Determine emotional tone | Brand monitoring, stock sentiment |
| Named Entity Recognition | Find people, places, organizations | News extraction, legal discovery |
| Question Answering | Answer questions from text or knowledge | Search engines, customer support |
| Summarization | Condense long text | News digests, meeting notes |
| Chatbots/Dialogue | Conversational interaction | Customer service, virtual assistants |
| Text Classification | Assign categories to text | Spam detection, topic tagging |
Each of these is a multi-billion dollar industry. And each builds on the techniques we’ll learn in this course.
Industry Impact
Technology: Search engines are NLP at massive scale. Every Google query is a natural language understanding problem. Recommendations (“customers who liked X also liked Y”) often involve text analysis. Content moderation on social platforms is impossible without NLP.
Healthcare: Clinical notes are unstructured text. Extracting diagnoses, medications, and procedures requires NLP. Drug discovery involves reading millions of research papers. Patient chatbots handle routine questions.
Finance: Sentiment analysis moves markets. Hedge funds analyze news, earnings calls, and social media to predict stock movements. Compliance requires reviewing contracts for specific clauses. Fraud detection examines transaction descriptions.
Legal: Contract analysis extracts key terms and risks. Discovery in litigation involves searching millions of documents. Legal research means finding relevant precedents in case law.
Hands-On: SpaCy Basics
Let’s return to code and explore SpaCy’s core concepts in detail. In the previous lecture, we saw the “magic” — now let’s understand what’s actually happening.
The Doc Object
When you process text with SpaCy, you get back a Doc object — a container for all the linguistic annotations.
import spacy
nlp = spacy.load("en_core_web_sm")
# Processing text creates a Doc object
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")A Doc is a sequence of Token objects. Each token has rich annotations:
print("=== TOKENS ===")
for token in doc:
print(f"{token.text:12} | POS: {token.pos_:6} | DEP: {token.dep_:10} | HEAD: {token.head.text}")=== TOKENS ===
Apple | POS: PROPN | DEP: nsubj | HEAD: looking
is | POS: AUX | DEP: aux | HEAD: looking
looking | POS: VERB | DEP: ROOT | HEAD: looking
at | POS: ADP | DEP: prep | HEAD: looking
buying | POS: VERB | DEP: pcomp | HEAD: at
a | POS: DET | DEP: det | HEAD: U.K.
U.K. | POS: PROPN | DEP: dobj | HEAD: buying
startup | POS: NOUN | DEP: advcl | HEAD: looking
for | POS: ADP | DEP: prep | HEAD: startup
$ | POS: SYM | DEP: quantmod | HEAD: billion
1 | POS: NUM | DEP: compound | HEAD: billion
billion | POS: NUM | DEP: pobj | HEAD: for
Let’s break down these attributes:
text: The actual word
pos_: Part-of-speech tag (NOUN, VERB, ADJ, etc.)
dep_: Dependency label (how this word relates to its head)
head: The word this token is syntactically attached to
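Because every token records its head, you can walk the parse as a tree. A quick sketch using the same doc from above:

```python
# Sketch: navigating the dependency tree via the head/children links on the same doc.
root = [token for token in doc if token.dep_ == "ROOT"][0]
print("Root:", root.text)

# Tokens attached directly to the root
print("Children of root:", [child.text for child in root.children])

# Noun chunks are flat noun phrases spaCy derives from the parse
print("Noun chunks:", [chunk.text for chunk in doc.noun_chunks])
```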
Named Entities
Named Entity Recognition identifies real-world objects: people, organizations, locations, dates, money, etc.
print("=== ENTITIES ===")
for ent in doc.ents:
print(f"{ent.text:20} | {ent.label_:10} | {spacy.explain(ent.label_)}")=== ENTITIES ===
Apple | ORG | Companies, agencies, institutions, etc.
U.K. | GPE | Countries, cities, states
$1 billion | MONEY | Monetary values, including unit
The spacy.explain() function gives you human-readable descriptions of the labels.
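It works for part-of-speech and dependency labels too:

```python
# spacy.explain also covers POS tags and dependency labels
print(spacy.explain("PROPN"))   # proper noun
print(spacy.explain("nsubj"))   # nominal subject
```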
Sentence Segmentation
SpaCy automatically splits text into sentences:
text = "NLP is fascinating. It powers many applications we use daily. Let's explore further."
doc = nlp(text)
print("=== SENTENCES ===")
for sent in doc.sents:
print(f"- {sent.text}")=== SENTENCES ===
- NLP is fascinating.
- It powers many applications we use daily.
- Let's explore further.
Exploring Token Attributes
Tokens have many more useful attributes. Let’s explore a few:
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")
print("=== EXTENDED TOKEN INFO ===")
for token in doc:
print(f"{token.text:10} | lemma: {token.lemma_:10} | is_stop: {token.is_stop} | is_alpha: {token.is_alpha}")=== EXTENDED TOKEN INFO ===
The | lemma: the | is_stop: True | is_alpha: True
quick | lemma: quick | is_stop: False | is_alpha: True
brown | lemma: brown | is_stop: False | is_alpha: True
foxes | lemma: fox | is_stop: False | is_alpha: True
are | lemma: be | is_stop: True | is_alpha: True
jumping | lemma: jump | is_stop: False | is_alpha: True
over | lemma: over | is_stop: True | is_alpha: True
the | lemma: the | is_stop: True | is_alpha: True
lazy | lemma: lazy | is_stop: False | is_alpha: True
dogs | lemma: dog | is_stop: False | is_alpha: True
. | lemma: . | is_stop: False | is_alpha: False
lemma_: The base form of the word (“foxes” → “fox”, “jumping” → “jump”)
is_stop: Is this a common stop word (the, a, is, etc.)?
is_alpha: Does the token consist only of alphabetic characters?
These attributes become crucial for text preprocessing, which we’ll cover in depth next week.
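As a preview, here is a sketch of one common preprocessing pass that combines these attributes: keep the lowercase lemmas of content words and drop stop words and punctuation.

```python
# Sketch: keep lowercase lemmas of content words; drop stop words and punctuation.
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")

content_lemmas = [
    token.lemma_.lower()
    for token in doc
    if token.is_alpha and not token.is_stop
]
print(content_lemmas)   # ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```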
Wrap-Up
What We Covered Today
The history of NLP — from rule-based dreams to the transformer revolution
Why each paradigm shift mattered — data beats rules, learned features beat hand-crafted ones, scale unlocks emergence
NLP applications — everywhere in tech, healthcare, finance, legal
SpaCy fundamentals — tokens, entities, sentences, and their attributes
What’s Next
In the next session, you’ll apply what you’ve learned in a hands-on lab. You’ll analyze text from different domains, discover what SpaCy gets right and wrong, and start building intuition for the challenges of real-world NLP.