NLP Glossary & Quick Reference
CAP-6640: Computational Understanding of Natural Language
Spencer Lyon
This glossary serves as a living reference document for key terminology encountered throughout the course. Terms are organized by the week in which they are first introduced. Use this page to quickly look up unfamiliar concepts.
Week 1: Foundations & Overview
- Natural Language Processing (NLP)
- The field of computer science and artificial intelligence concerned with enabling computers to understand, interpret, and generate human language. NLP bridges linguistics, computer science, and machine learning.
- corpus
- A large, structured collection of text used for linguistic analysis or training language models. Plural: corpora. Examples include Wikipedia dumps, news archives, or social media datasets.
- token
- A single unit of text, typically a word, subword, or character, that serves as input to an NLP model. The process of splitting text into tokens is called tokenization.
- tokenization
- The process of breaking raw text into smaller units called tokens. Different strategies exist: word-level, character-level, and subword-level (e.g., Byte Pair Encoding).
- structured data
- Data organized in a predefined format with clear relationships, typically stored in tables with rows and columns. Examples: databases, spreadsheets, CSV files. Contrast with unstructured data.
- unstructured data
- Data without a predefined format or organization, such as free-form text, images, audio, or video. Most human-generated content (emails, articles, social media posts) is unstructured. Contrast with structured data.
- rule-based system
- An NLP approach that relies on hand-crafted linguistic rules and patterns rather than learned statistical models. Common in early NLP systems. Example: using regular expressions to extract phone numbers (a runnable sketch appears at the end of this section).
- foundation model
- A large-scale model trained on broad data that can be adapted to many downstream tasks. Examples include BERT, GPT, and LLaMA. Also called pretrained models. See also: LLM.
- Large Language Model, LLM
- A type of foundation model specifically trained on text data, typically with billions of parameters. LLMs can generate, summarize, translate, and reason about text. Examples: GPT-4, Claude, LLaMA.
- machine translation
- The task of automatically translating text from one language to another. One of the oldest NLP applications, now dominated by neural approaches.
- sentiment analysis
- The task of determining the emotional tone or opinion expressed in text (e.g., positive, negative, neutral). Common in social media monitoring and customer feedback analysis.
- named entity recognition, NER
- The task of identifying and classifying named entities (people, organizations, locations, dates, etc.) in text. Example: extracting “UCF” as an organization from a sentence.
- chatbot
- A software application that simulates human conversation through text or voice. Modern chatbots are often powered by LLMs.
- question answering, QA
- The task of automatically answering questions posed in natural language, either from a given context (extractive QA) or from learned knowledge (generative QA).
- summarization
- The task of producing a shorter version of a document while preserving its key information. Can be extractive (selecting sentences) or abstractive (generating new text).
- information retrieval, IR
- The task of finding relevant documents or passages from a large collection given a query. Search engines are the most common IR application.
- spaCy
- An open-source Python library for industrial-strength NLP. Provides efficient implementations of tokenization, POS tagging, NER, dependency parsing, and more via processing pipelines. A short usage example appears at the end of this section.
- dependency parsing
- The task of analyzing the grammatical structure of a sentence to determine how words relate to each other. Creates a tree structure showing syntactic dependencies between words.
- part-of-speech tagging, POS tagging
- The task of labeling each word in a sentence with its grammatical role (noun, verb, adjective, etc.). Essential for understanding sentence structure.
- Doc
- In spaCy, the primary container object that holds processed text and all linguistic annotations. Created when text is processed through an nlp pipeline (e.g., doc = nlp(text)).
- text classification
- The task of assigning predefined categories or labels to text documents. Examples include spam detection, sentiment analysis, and topic classification.
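To make the rule-based system entry concrete, here is a minimal sketch that uses a hand-crafted regular expression to pull phone numbers out of text. The pattern is illustrative only and far from exhaustive.

```python
import re

text = "Call us at 555-123-4567 or (555) 987-6543 to register."

# Hand-crafted pattern for US-style phone numbers (illustrative, not exhaustive)
phone_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}"

print(re.findall(phone_pattern, text))
# ['555-123-4567', '(555) 987-6543']
```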
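And a sketch tying together the spaCy, Doc, POS tagging, dependency parsing, and NER entries. It assumes spaCy is installed and the small English model has been downloaded (python -m spacy download en_core_web_sm).

```python
import spacy

# Load a pretrained English pipeline (tokenizer, tagger, parser, NER, ...)
nlp = spacy.load("en_core_web_sm")

# Calling the pipeline on text returns a Doc holding all annotations
doc = nlp("UCF offers an NLP course in Orlando.")

# Token-level annotations: part-of-speech tags and dependency relations
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities found by the NER component
for ent in doc.ents:
    print(ent.text, ent.label_)
```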
Week 2: Text Processing
- stemming
- A text normalization technique that reduces words to their root form by removing suffixes using heuristic rules. Example: “running” → “run”. Faster but less accurate than lemmatization.
- lemmatization
- A text normalization technique that reduces words to their dictionary form (lemma) using vocabulary and morphological analysis. Example: “better” → “good”. More accurate than stemming but slower; the two are compared in a sketch at the end of this section.
- stop word
- A common word (e.g., “the”, “is”, “at”) often filtered out during text preprocessing because it carries little semantic meaning. Stop word lists are language-specific.
- normalization
- The process of transforming text into a standard, consistent format. Includes lowercasing, removing punctuation, expanding contractions, and applying stemming or lemmatization. A minimal example appears at the end of this section.
- regular expression, regex
- A sequence of characters defining a search pattern, used for text matching, extraction, and substitution. Essential for text cleaning and pattern-based tokenization.
- pipeline
- In NLP, a sequence of processing steps applied to text. In spaCy, a pipeline consists of components (tokenizer, tagger, parser, NER) that process a document in order.
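A quick comparison of stemming and lemmatization. The sketch assumes NLTK (for the Porter stemmer) and the spaCy English model from Week 1 are available; lemma output depends on the POS tags the model assigns.

```python
import spacy
from nltk.stem import PorterStemmer

# Stemming: heuristic suffix stripping -- fast, but can produce non-words
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "studies", "better"]])
# ['run', 'studi', 'better']

# Lemmatization: dictionary form via vocabulary and morphology -- needs a model
nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running faster than the geese.")
print([token.lemma_ for token in doc])
# e.g. ['the', 'child', 'be', 'run', 'fast', 'than', 'the', 'goose', '.']
```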
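And a minimal normalization sketch combining lowercasing, punctuation removal with a regular expression, and stop-word filtering. The stop-word list is a toy subset for illustration; real lists are language-specific and much longer.

```python
import re

# Toy stop-word list; libraries such as spaCy and NLTK ship full lists
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "to"}

def normalize(text: str) -> list[str]:
    text = text.lower()                       # lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation and symbols
    tokens = text.split()                     # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The quick brown fox is at the edge of the forest!"))
# ['quick', 'brown', 'fox', 'edge', 'forest']
```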
Week 3: Text Representation
- bag of words
- A text representation that treats a document as an unordered collection of words, ignoring grammar and word order. Each document becomes a vector of word counts. Also called BoW.
- TF-IDF
- A numerical statistic reflecting how important a word is to a document within a corpus. Combines term frequency (how often a word appears) with inverse document frequency (how rare it is across documents). Stands for Term Frequency–Inverse Document Frequency. See the scikit-learn sketch at the end of this section.
- sparse representation
- A vector representation where most values are zero. Bag of words and TF-IDF produce sparse vectors. Contrast with dense representation.
- dense representation
- A vector representation where most or all values are non-zero, typically with lower dimensionality than sparse representations. Word embeddings are dense.
- word embedding
- A dense vector representation of a word that captures semantic meaning. Words with similar meanings have similar vectors. See: Word2Vec, GloVe.
- Word2Vec
- A neural network-based method for learning word embeddings from text. Uses either the CBOW architecture (predict a word from its context) or the Skip-gram architecture (predict context words from a word). A gensim sketch appears at the end of this section.
- GloVe
- A method for learning word embeddings by factorizing word co-occurrence matrices. Captures both local and global statistical information. Stands for Global Vectors for Word Representation.
- BPE
- A subword tokenization algorithm that iteratively merges the most frequent adjacent symbol pairs, starting from characters. Used by GPT models. Balances vocabulary size with handling of rare words. Stands for Byte Pair Encoding. A toy merge loop is sketched at the end of this section.
- subword tokenization
- A tokenization approach that breaks words into smaller units (subwords). Handles rare and out-of-vocabulary words effectively. See: BPE, WordPiece, SentencePiece.
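To ground the bag of words, TF-IDF, and sparse representation entries, a short sketch assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag of words: each document becomes a vector of raw word counts
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)   # sparse matrix of shape (3, vocab_size)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF: counts reweighted by how rare each word is across the corpus
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```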
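A Word2Vec sketch using the gensim library (assumed available). The corpus below is only large enough to show the API; meaningful embeddings require far more text.

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus -- real embeddings need millions of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects Skip-gram (predict context from word); sg=0 would be CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)                 # (50,) -- a dense vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # neighbors are noisy on toy data
```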
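And a stripped-down sketch of the core BPE merge loop: count adjacent symbol pairs across a word-frequency table and merge the most frequent pair. Production tokenizers add byte-level handling, special tokens, and many other details omitted here.

```python
from collections import Counter

# Word frequencies with words split into symbols; "</w>" marks end of word
vocab = {("l", "o", "w", "</w>"): 5,
         ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6,
         ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def merge_once(vocab):
    # Count all adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)           # most frequent pair
    merged = {}
    for symbols, freq in vocab.items():        # rewrite each word with the merge
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(3):
    vocab, best = merge_once(vocab)
    print("merged", best)
# On this data: ('e', 's'), then ('es', 't'), then ('est', '</w>')
```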
Week 4: Classical NLP Tasks
- Naive Bayes
- A probabilistic classifier based on Bayes’ theorem with strong independence assumptions between features. Despite its simplicity, often effective for text classification (see the example at the end of this section).
- SVM, Support Vector Machine
- A supervised learning algorithm that finds the maximum-margin hyperplane separating classes. Effective for high-dimensional text data.
- precision
- The fraction of positive predictions that are correct. Precision = TP / (TP + FP). See also: recall, F1 score.
- recall
- The fraction of actual positives that are correctly identified. Recall = TP / (TP + FN). See also: precision, F1 score.
- F1 score
- The harmonic mean of precision and recall. F1 = 2 * (precision * recall) / (precision + recall). Balances both metrics.
- confusion matrix
- A table showing the counts of true positives, true negatives, false positives, and false negatives for a classifier. Used to compute precision, recall, and other metrics; worked numbers appear at the end of this section.
- sequence labeling
- The task of assigning a label to each token in a sequence, rather than a single label to an entire document. Key examples include POS tagging and NER.
- BIO tagging
- A labeling scheme for sequence labeling that marks each token as B (beginning of an entity), I (inside/continuation of an entity), or O (outside any entity). Enables the representation of multi-word entities at the token level; an example appears at the end of this section.
- K-means clustering
- An unsupervised learning algorithm that partitions data into K groups by minimizing the squared distance from each point to its cluster centroid. For text, it operates on TF-IDF or embedding vectors. A sketch combining K-means and LDA appears at the end of this section.
- topic modeling
- An unsupervised approach to discovering latent themes (topics) in a collection of documents. Each topic is a distribution over words, and each document is a mixture of topics. See: LDA.
- LDA, Latent Dirichlet Allocation
- A generative probabilistic model for topic modeling. Assumes each document is a mixture of topics and each topic is a distribution over words. Discovers latent themes from word co-occurrence patterns.
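A compact text-classification sketch covering Naive Bayes and SVM with scikit-learn. The labeled data is a toy sentiment set, far too small for real use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled data: 1 = positive sentiment, 0 = negative
texts = ["great movie, loved it", "fantastic acting and plot",
         "terrible film, waste of time", "boring and way too long"]
labels = [1, 1, 0, 0]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)  # vectorize, then classify
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["what a great plot", "so boring"]))
```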
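Worked numbers for precision, recall, and F1, computed directly from the formulas above; the confusion-matrix counts are made up for illustration.

```python
# Hypothetical counts from a binary classifier's confusion matrix
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)                          # 40 / 50 = 0.80
recall = tp / (tp + fn)                             # 40 / 60 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.73

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```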
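The BIO scheme spelled out on one sentence, with hand-assigned labels (PER = person, ORG = organization):

```python
tokens = ["Ada", "Lovelace", "visited", "the", "University", "of", "Central", "Florida"]
bio    = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]

for tok, tag in zip(tokens, bio):
    print(f"{tok:12} {tag}")
```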
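And an unsupervised sketch covering K-means clustering and LDA topic modeling with scikit-learn. With only four documents the clusters and topics are illustrative at best.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat chased the mouse", "dogs and cats make good pets",
        "the stock market fell sharply", "investors sold shares in a panic"]

# K-means on TF-IDF vectors: partition the documents into 2 clusters
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)

# LDA on raw counts: 2 latent topics, each a distribution over words
cv = CountVectorizer()
counts = cv.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
vocab = cv.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]   # indices of the 4 highest-weight words
    print(f"topic {k}:", [vocab[i] for i in top])
```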
Week 5: Neural Networks for NLP
- neural network
- A computational model inspired by biological neurons, consisting of layers of interconnected nodes that learn to transform inputs into outputs through training.
- feed-forward network
- A neural network where information flows in one direction — from input through hidden layers to output — with no cycles or loops. The simplest type of neural network architecture.
- activation function
- A mathematical function applied to a neuron’s output to introduce non-linearity. Common choices include ReLU, sigmoid, and tanh. Without activation functions, a multi-layer network would collapse to a single linear transformation. These functions are written out in NumPy at the end of this section.
- backpropagation
- An algorithm for computing gradients of the loss function with respect to network weights, enabling training via gradient descent. Works by applying the chain rule backward through the network layers.
- RNN, Recurrent Neural Network
- A neural network architecture that processes sequences by maintaining a hidden state that captures information from previous steps. Suffers from vanishing gradient problems on long sequences.
- LSTM, Long Short-Term Memory
- A type of RNN with gating mechanisms that allow it to learn long-range dependencies by controlling what information to remember or forget. A shape-level PyTorch sketch appears at the end of this section.
- GRU, Gated Recurrent Unit
- A simplified variant of LSTM with fewer parameters. Uses reset and update gates to control information flow.
- vanishing gradient
- A problem in deep networks where gradients become very small during backpropagation, preventing earlier layers from learning. Particularly problematic for RNNs on long sequences.
- sequence-to-sequence, seq2seq
- A model architecture that maps an input sequence to an output sequence, potentially of different length. Used for machine translation, summarization, etc.
- attention mechanism
- A technique allowing models to focus on relevant parts of the input when producing each part of the output. Foundation for transformer architectures; a minimal NumPy implementation is sketched below.
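The activation functions named above, written out with NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # zero for negative inputs, identity otherwise

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)              # squashes values into (-1, 1)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x).round(3), tanh(x).round(3))
```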
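A shape-level sketch of an LSTM forward pass, assuming PyTorch is installed; no training loop, just the tensor shapes.

```python
import torch
import torch.nn as nn

# A batch of 4 sequences, each 10 steps long, with 8 features per step
x = torch.randn(4, 10, 8)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# output: the hidden state at every step; (h_n, c_n): final hidden and cell states
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([4, 10, 16])
print(h_n.shape)     # torch.Size([1, 4, 16]) -- (num_layers, batch, hidden_size)
```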
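Finally, a minimal NumPy sketch of scaled dot-product attention, the core computation behind the attention mechanism entry: each output is a weighted average of value vectors, with weights given by query-key similarity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 queries, dimension 8
K = rng.normal(size=(7, 8))   # 7 keys, dimension 8
V = rng.normal(size=(7, 16))  # 7 values, dimension 16
print(attention(Q, K, V).shape)  # (5, 16): one output vector per query
```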