Lab: Neural Text Classification

University of Central Florida
Arete Capital Partners

CAP-6640: Computational Understanding of Natural Language

Spencer Lyon

Prerequisites

Outcomes

References


Lab Overview

In Parts 01 and 02 we covered the theory behind feed-forward and recurrent neural networks. Now it’s time to put them to the test.

Here’s our game plan:

  1. Reuse the IMDB data pipeline from Part 01 — vocabulary, encoding, data loaders

  2. Build the feed-forward classifier (from Part 01) and a new LSTM classifier

  3. Train both on the same data with the same hyperparameters

  4. Compare them head-to-head — accuracy, loss curves, training time, and how they handle tricky sentences

The big question: does the LSTM’s ability to process word order actually produce better results? The answer may surprise you.


Setup: Data Pipeline

We’ll reuse the exact data pipeline from Part 01. If this code looks familiar, it should — the only new piece is a reusable train_model function that works with any architecture.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from datasets import load_dataset
from collections import Counter
import matplotlib.pyplot as plt
import time
# Load IMDB — same subset and seed as Part 01
dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42).select(range(5000))
test_data = dataset["test"].shuffle(seed=42).select(range(1000))

def build_vocab(texts, max_vocab=10000):
    """Build a vocabulary mapping words to integer indices."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counter.most_common(max_vocab - 2):
        vocab[word] = len(vocab)
    return vocab

def encode_texts(texts, vocab, max_len=256):
    """Convert texts to padded integer sequences."""
    encoded = []
    for text in texts:
        tokens = text.lower().split()[:max_len]
        indices = [vocab.get(t, vocab["<unk>"]) for t in tokens]
        indices += [vocab["<pad>"]] * (max_len - len(indices))
        encoded.append(indices)
    return torch.tensor(encoded)

vocab = build_vocab(train_data["text"])
X_train = encode_texts(train_data["text"], vocab)
y_train = torch.tensor(train_data["label"])
X_test = encode_texts(test_data["text"], vocab)
y_test = torch.tensor(test_data["label"])

train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=64)

print(f"Vocabulary size: {len(vocab):,}")
print(f"Training set:    {X_train.shape}")
print(f"Test set:        {X_test.shape}")
Vocabulary size: 10,000
Training set:    torch.Size([5000, 256])
Test set:        torch.Size([1000, 256])
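
Before moving on, a quick sanity check: encode a short made-up sentence (not from the dataset) and confirm it comes back as a padded integer sequence of length max_len.

sample_text = "this movie was great"               # hypothetical example sentence, not from IMDB
sample_encoded = encode_texts([sample_text], vocab)
print(sample_encoded.shape)                        # torch.Size([1, 256]) — padded to max_len
print(sample_encoded[0, :8])                       # first 8 indices; everything past the 4 tokens is <pad> (0)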

A Reusable Training Function

Instead of writing a training loop twice, let’s write one function that works with any nn.Module. It returns a history dictionary with loss curves, accuracy, training time, and parameter count — everything we need for comparison.

def train_model(model, train_loader, test_loader, epochs=10, lr=0.001):
    """Train a model and return training history with metrics."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    history = {"train_loss": [], "test_loss": [], "test_acc": []}
    start_time = time.time()

    for epoch in range(epochs):
        # Training
        model.train()
        epoch_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        history["train_loss"].append(epoch_loss / len(train_loader))

        # Evaluation
        model.eval()
        test_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                output = model(X_batch)
                test_loss += criterion(output, y_batch).item()
                preds = output.argmax(dim=1)
                correct += (preds == y_batch).sum().item()
                total += len(y_batch)

        history["test_loss"].append(test_loss / len(test_loader))
        history["test_acc"].append(correct / total)

        print(
            f"  Epoch {epoch+1:2d} | "
            f"Train Loss: {history['train_loss'][-1]:.4f} | "
            f"Test Loss: {history['test_loss'][-1]:.4f} | "
            f"Test Acc: {history['test_acc'][-1]:.3f}"
        )

    history["time"] = time.time() - start_time
    history["params"] = sum(p.numel() for p in model.parameters())
    return history

Model 1: Feed-Forward Baseline

This is the same FeedForwardClassifier from Part 01: embedding → mean pooling → hidden layer → output. It ignores word order entirely.

class FeedForwardClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embeds = self.embedding(x)
        mask = (x != 0).unsqueeze(-1).float()
        pooled = (embeds * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        hidden = self.relu(self.fc1(pooled))
        return self.fc2(hidden)
torch.manual_seed(42)
ff_model = FeedForwardClassifier(len(vocab), embed_dim=64, hidden_dim=128, num_classes=2)

print("=== Training Feed-Forward Classifier ===\n")
ff_history = train_model(ff_model, train_loader, test_loader)
print(f"\nTraining time: {ff_history['time']:.1f}s | Parameters: {ff_history['params']:,}")
=== Training Feed-Forward Classifier ===

  Epoch  1 | Train Loss: 0.6800 | Test Loss: 0.6661 | Test Acc: 0.611
  Epoch  2 | Train Loss: 0.6127 | Test Loss: 0.5922 | Test Acc: 0.689
  Epoch  3 | Train Loss: 0.5032 | Test Loss: 0.5227 | Test Acc: 0.731
  Epoch  4 | Train Loss: 0.4096 | Test Loss: 0.4828 | Test Acc: 0.767
  Epoch  5 | Train Loss: 0.3349 | Test Loss: 0.4687 | Test Acc: 0.781
  Epoch  6 | Train Loss: 0.2750 | Test Loss: 0.4638 | Test Acc: 0.789
  Epoch  7 | Train Loss: 0.2244 | Test Loss: 0.4654 | Test Acc: 0.799
  Epoch  8 | Train Loss: 0.1922 | Test Loss: 0.4881 | Test Acc: 0.792
  Epoch  9 | Train Loss: 0.1467 | Test Loss: 0.5075 | Test Acc: 0.804
  Epoch 10 | Train Loss: 0.1175 | Test Loss: 0.5077 | Test Acc: 0.809

Training time: 6.3s | Parameters: 648,578

Model 2: LSTM Classifier

Now let’s build the LSTM version. The key architectural change: instead of averaging all word embeddings into one vector (which discards word order), we pass the embeddings through an LSTM and use the final hidden state as our document representation.

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embeds = self.embedding(x)                 # (batch, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(embeds)     # h_n: (1, batch, hidden_dim)
        hidden = h_n.squeeze(0)                    # (batch, hidden_dim)
        return self.fc(hidden)                     # (batch, num_classes)

Notice how simple the swap is — we replaced mean pooling + linear + ReLU + linear with LSTM + linear. The LSTM’s hidden state already contains a non-linear, sequence-aware representation, so a single output layer is enough.
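
The swap is not free in parameter terms, though: the LSTM's four gates each carry input-to-hidden and hidden-to-hidden weights. A quick back-of-the-envelope sketch with the dimensions we train with (vocab_size=10,000, embed_dim=64, hidden_dim=128) shows where the totals reported later come from:

# Rough parameter tally (a sketch; the exact totals are printed by train_model later)
vocab_size, embed_dim, hidden_dim, num_classes = 10_000, 64, 128, 2

embedding = vocab_size * embed_dim                       # shared by both models
ff_head = (embed_dim * hidden_dim + hidden_dim) + (hidden_dim * num_classes + num_classes)
lstm_head = (
    4 * (hidden_dim * embed_dim + hidden_dim * hidden_dim + 2 * hidden_dim)  # 4 gates
    + (hidden_dim * num_classes + num_classes)
)

print(f"Feed-forward total: {embedding + ff_head:,}")    # 648,578
print(f"LSTM total:         {embedding + lstm_head:,}")  # 739,586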

Let’s trace the shapes to make sure everything connects:

temp_model = LSTMClassifier(len(vocab), embed_dim=64, hidden_dim=128, num_classes=2)
sample = X_train[:2]

with torch.no_grad():
    print(f"Input:           {sample.shape}")
    embeds = temp_model.embedding(sample)
    print(f"After embedding: {embeds.shape}")
    output, (h_n, c_n) = temp_model.lstm(embeds)
    print(f"LSTM outputs:    {output.shape}")
    print(f"Final hidden:    {h_n.shape}")
    hidden = h_n.squeeze(0)
    print(f"Squeezed:        {hidden.shape}")
    logits = temp_model.fc(hidden)
    print(f"Final output:    {logits.shape}")

del temp_model
Input:           torch.Size([2, 256])
After embedding: torch.Size([2, 256, 64])
LSTM outputs:    torch.Size([2, 256, 128])
Final hidden:    torch.Size([1, 2, 128])
Squeezed:        torch.Size([2, 128])
Final output:    torch.Size([2, 2])

Now let’s train it:

torch.manual_seed(42)
lstm_model = LSTMClassifier(len(vocab), embed_dim=64, hidden_dim=128, num_classes=2)

print("=== Training LSTM Classifier ===\n")
lstm_history = train_model(lstm_model, train_loader, test_loader)
print(f"\nTraining time: {lstm_history['time']:.1f}s | Parameters: {lstm_history['params']:,}")
=== Training LSTM Classifier ===

  Epoch  1 | Train Loss: 0.6936 | Test Loss: 0.6960 | Test Acc: 0.494
  Epoch  2 | Train Loss: 0.6861 | Test Loss: 0.7019 | Test Acc: 0.494
  Epoch  3 | Train Loss: 0.6725 | Test Loss: 0.7046 | Test Acc: 0.503
  Epoch  4 | Train Loss: 0.6453 | Test Loss: 0.7238 | Test Acc: 0.499
  Epoch  5 | Train Loss: 0.6043 | Test Loss: 0.7555 | Test Acc: 0.495
  Epoch  6 | Train Loss: 0.5582 | Test Loss: 0.8113 | Test Acc: 0.498
  Epoch  7 | Train Loss: 0.5266 | Test Loss: 0.8899 | Test Acc: 0.505
  Epoch  8 | Train Loss: 0.5004 | Test Loss: 0.9725 | Test Acc: 0.505
  Epoch  9 | Train Loss: 0.4864 | Test Loss: 0.9580 | Test Acc: 0.507
  Epoch 10 | Train Loss: 0.5034 | Test Loss: 0.9906 | Test Acc: 0.511

Training time: 152.9s | Parameters: 739,586

The LSTM takes noticeably longer to train — it must process each of the 256 time steps sequentially, while the feed-forward model processes the entire sequence in one matrix operation. This is exactly the scalability limitation that motivates Transformers (Week 6).
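
To make this concrete, here is a minimal sketch that times a single forward pass of each trained model on one batch. The wall-clock numbers depend on your hardware and are only meant to illustrate the gap.

# Time one forward pass per model on a single batch (illustration only)
X_batch, _ = next(iter(train_loader))

with torch.no_grad():
    t0 = time.time()
    ff_model(X_batch)
    ff_ms = (time.time() - t0) * 1000

    t0 = time.time()
    lstm_model(X_batch)
    lstm_ms = (time.time() - t0) * 1000

print(f"Feed-forward forward pass: {ff_ms:.1f} ms")
print(f"LSTM forward pass:         {lstm_ms:.1f} ms")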


Head-to-Head Comparison

Let’s see how our two architectures stack up.

Loss Curves and Accuracy

epochs = range(1, 11)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Training loss
axes[0].plot(epochs, ff_history["train_loss"], marker="o", label="Feed-Forward")
axes[0].plot(epochs, lstm_history["train_loss"], marker="s", label="LSTM")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].set_title("Training Loss")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test loss
axes[1].plot(epochs, ff_history["test_loss"], marker="o", label="Feed-Forward")
axes[1].plot(epochs, lstm_history["test_loss"], marker="s", label="LSTM")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Loss")
axes[1].set_title("Test Loss")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Test accuracy
axes[2].plot(epochs, ff_history["test_acc"], marker="o", label="Feed-Forward")
axes[2].plot(epochs, lstm_history["test_acc"], marker="s", label="LSTM")
axes[2].set_xlabel("Epoch")
axes[2].set_ylabel("Accuracy")
axes[2].set_title("Test Accuracy")
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
[Figure: training loss, test loss, and test accuracy by epoch for the feed-forward and LSTM models]

Summary Table

print(f"{'Metric':<25} {'Feed-Forward':>15} {'LSTM':>15}")
print("-" * 57)
print(f"{'Parameters':<25} {ff_history['params']:>15,} {lstm_history['params']:>15,}")
print(f"{'Training time':<25} {ff_history['time']:>14.1f}s {lstm_history['time']:>14.1f}s")
print(f"{'Best test accuracy':<25} {max(ff_history['test_acc']):>15.3f} {max(lstm_history['test_acc']):>15.3f}")
print(f"{'Final test accuracy':<25} {ff_history['test_acc'][-1]:>15.3f} {lstm_history['test_acc'][-1]:>15.3f}")
print(f"{'Final train loss':<25} {ff_history['train_loss'][-1]:>15.4f} {lstm_history['train_loss'][-1]:>15.4f}")
print(f"{'Final test loss':<25} {ff_history['test_loss'][-1]:>15.4f} {lstm_history['test_loss'][-1]:>15.4f}")
Metric                       Feed-Forward            LSTM
---------------------------------------------------------
Parameters                        648,578         739,586
Training time                        6.3s          152.9s
Best test accuracy                  0.809           0.511
Final test accuracy                 0.809           0.511
Final train loss                   0.1175          0.5034
Final test loss                    0.5077          0.9906

What Do We See?

A few observations worth discussing:

  1. The feed-forward model reaches about 81% test accuracy in roughly 6 seconds of training; the LSTM never climbs much above chance (~51%) despite taking over 150 seconds.

  2. The LSTM's training loss does fall, but its test loss climbs steadily, so it is fitting the training reviews without learning a representation that generalizes, at least within 10 epochs on 5,000 examples.

  3. The LSTM also carries about 90,000 more parameters, so the extra capacity and compute buy nothing here.

The lesson here is important: a more complex model isn’t automatically better. The LSTM’s ability to model word order is a genuine advantage, but whether it matters depends on the task and data.
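
One factor worth flagging (a hypothesis we have not isolated here): every review is right-padded to 256 tokens, so the LSTM's final hidden state is computed only after reading a long run of <pad> tokens, which can dilute the signal. A common mitigation is to pack the sequences so the recurrence stops at each review's true length. A minimal sketch, assuming the LSTMClassifier defined above (lstm_forward_packed is a hypothetical helper, not used elsewhere in this lab):

from torch.nn.utils.rnn import pack_padded_sequence

def lstm_forward_packed(model, x):
    """Hypothetical forward pass that stops the recurrence at each review's real length."""
    lengths = (x != 0).sum(dim=1).clamp(min=1)   # count of non-<pad> tokens per review
    embeds = model.embedding(x)
    packed = pack_padded_sequence(
        embeds, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = model.lstm(packed)             # h_n now reflects each review's last real token
    return model.fc(h_n.squeeze(0))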

Testing on Tricky Sentences

Let’s see how both models handle sentences where word order really matters:

def predict_review(model, text, vocab):
    """Predict sentiment for a single review."""
    model.eval()
    encoded = encode_texts([text], vocab)
    with torch.no_grad():
        output = model(encoded)
        probs = torch.softmax(output, dim=1)
        pred = output.argmax(dim=1).item()
    label = "positive" if pred == 1 else "negative"
    confidence = probs[0, pred].item()
    return label, confidence

test_sentences = [
    "This movie was absolutely brilliant and moving",
    "This movie was terrible and boring",
    "This movie was not good",
    "Not my favorite, but I would watch it again",
    "Despite great acting, the plot was weak and unconvincing",
]

print(f"{'Sentence':<55} {'FF Pred':>10} {'FF Conf':>8} {'LSTM Pred':>10} {'LSTM Conf':>8}")
print("-" * 95)

for sent in test_sentences:
    ff_label, ff_conf = predict_review(ff_model, sent, vocab)
    lstm_label, lstm_conf = predict_review(lstm_model, sent, vocab)
    short = sent[:52] + "..." if len(sent) > 55 else sent
    print(f"{short:<55} {ff_label:>10} {ff_conf:>8.3f} {lstm_label:>10} {lstm_conf:>8.3f}")
Sentence                                                   FF Pred  FF Conf  LSTM Pred LSTM Conf
-----------------------------------------------------------------------------------------------
This movie was absolutely brilliant and moving            positive    0.998   negative    0.500
This movie was terrible and boring                        negative    1.000   negative    0.500
This movie was not good                                   negative    0.972   negative    0.500
Not my favorite, but I would watch it again               positive    0.997   negative    0.500
Despite great acting, the plot was weak and unconvin...   negative    1.000   negative    0.500

Pay attention to the sentences with negation (“not good”) and contrast (“despite great acting... weak”). These are the cases where word order matters most, and where we’d expect the LSTM to have an advantage. In this run, though, the LSTM predicts “negative” with a confidence of roughly 0.500 on every sentence: it is effectively guessing, which is consistent with its near-chance test accuracy.


Hyperparameter Experiments

Which hyperparameters matter most for neural text classifiers? Let’s run a quick experiment varying the hidden dimension size for the LSTM. We’ll keep everything else fixed (embed_dim=64, lr=0.001, 10 epochs) and compare hidden sizes of 32, 64, 128, and 256.

hidden_dims = [32, 64, 128, 256]
results = {}

for hd in hidden_dims:
    print(f"\n--- Hidden dim = {hd} ---")
    torch.manual_seed(42)
    model = LSTMClassifier(len(vocab), embed_dim=64, hidden_dim=hd, num_classes=2)
    history = train_model(model, train_loader, test_loader, epochs=10)
    results[hd] = history

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

for hd in hidden_dims:
    ax1.plot(range(1, 11), results[hd]["test_acc"], marker="o", label=f"hidden={hd}")
    ax2.plot(range(1, 11), results[hd]["test_loss"], marker="o", label=f"hidden={hd}")

ax1.set_xlabel("Epoch")
ax1.set_ylabel("Test Accuracy")
ax1.set_title("Effect of Hidden Dimension on Accuracy")
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.set_xlabel("Epoch")
ax2.set_ylabel("Test Loss")
ax2.set_title("Effect of Hidden Dimension on Test Loss")
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()

--- Hidden dim = 32 ---
  Epoch  1 | Train Loss: 0.6943 | Test Loss: 0.6959 | Test Acc: 0.490
  Epoch  2 | Train Loss: 0.6884 | Test Loss: 0.6953 | Test Acc: 0.499
  Epoch  3 | Train Loss: 0.6819 | Test Loss: 0.6968 | Test Acc: 0.503
  Epoch  4 | Train Loss: 0.6702 | Test Loss: 0.7008 | Test Acc: 0.508
  Epoch  5 | Train Loss: 0.6480 | Test Loss: 0.7123 | Test Acc: 0.506
  Epoch  6 | Train Loss: 0.6197 | Test Loss: 0.7318 | Test Acc: 0.514
  Epoch  7 | Train Loss: 0.5865 | Test Loss: 0.7603 | Test Acc: 0.518
  Epoch  8 | Train Loss: 0.5524 | Test Loss: 0.7673 | Test Acc: 0.520
  Epoch  9 | Train Loss: 0.5299 | Test Loss: 0.8093 | Test Acc: 0.524
  Epoch 10 | Train Loss: 0.5154 | Test Loss: 0.8455 | Test Acc: 0.518

--- Hidden dim = 64 ---
  Epoch  1 | Train Loss: 0.6946 | Test Loss: 0.6944 | Test Acc: 0.500
  Epoch  2 | Train Loss: 0.6869 | Test Loss: 0.6957 | Test Acc: 0.501
  Epoch  3 | Train Loss: 0.6752 | Test Loss: 0.6971 | Test Acc: 0.506
  Epoch  4 | Train Loss: 0.6526 | Test Loss: 0.7142 | Test Acc: 0.508
  Epoch  5 | Train Loss: 0.6222 | Test Loss: 0.7309 | Test Acc: 0.514
  Epoch  6 | Train Loss: 0.6339 | Test Loss: 0.7231 | Test Acc: 0.502
  Epoch  7 | Train Loss: 0.5623 | Test Loss: 0.7618 | Test Acc: 0.512
  Epoch  8 | Train Loss: 0.5304 | Test Loss: 0.8050 | Test Acc: 0.512
  Epoch  9 | Train Loss: 0.5094 | Test Loss: 0.8285 | Test Acc: 0.526
  Epoch 10 | Train Loss: 0.4904 | Test Loss: 0.8636 | Test Acc: 0.530

--- Hidden dim = 128 ---
  Epoch  1 | Train Loss: 0.6936 | Test Loss: 0.6960 | Test Acc: 0.494
  Epoch  2 | Train Loss: 0.6861 | Test Loss: 0.7019 | Test Acc: 0.494
  Epoch  3 | Train Loss: 0.6725 | Test Loss: 0.7046 | Test Acc: 0.503
  Epoch  4 | Train Loss: 0.6453 | Test Loss: 0.7238 | Test Acc: 0.499
  Epoch  5 | Train Loss: 0.6043 | Test Loss: 0.7555 | Test Acc: 0.495
  Epoch  6 | Train Loss: 0.5582 | Test Loss: 0.8113 | Test Acc: 0.498
  Epoch  7 | Train Loss: 0.5266 | Test Loss: 0.8899 | Test Acc: 0.505
  Epoch  8 | Train Loss: 0.5004 | Test Loss: 0.9725 | Test Acc: 0.505
  Epoch  9 | Train Loss: 0.4864 | Test Loss: 0.9580 | Test Acc: 0.507
  Epoch 10 | Train Loss: 0.5034 | Test Loss: 0.9906 | Test Acc: 0.511

--- Hidden dim = 256 ---
  Epoch  1 | Train Loss: 0.6943 | Test Loss: 0.6949 | Test Acc: 0.500
  Epoch  2 | Train Loss: 0.6835 | Test Loss: 0.7002 | Test Acc: 0.496
  Epoch  3 | Train Loss: 0.6642 | Test Loss: 0.7134 | Test Acc: 0.506
  Epoch  4 | Train Loss: 0.6310 | Test Loss: 0.7387 | Test Acc: 0.509
  Epoch  5 | Train Loss: 0.5887 | Test Loss: 0.7678 | Test Acc: 0.512
  Epoch  6 | Train Loss: 0.5366 | Test Loss: 0.8709 | Test Acc: 0.509
  Epoch  7 | Train Loss: 0.5103 | Test Loss: 0.8714 | Test Acc: 0.503
  Epoch  8 | Train Loss: 0.4988 | Test Loss: 0.9551 | Test Acc: 0.513
  Epoch  9 | Train Loss: 0.4852 | Test Loss: 0.9570 | Test Acc: 0.510
  Epoch 10 | Train Loss: 0.4813 | Test Loss: 1.0496 | Test Acc: 0.522
[Figure: effect of LSTM hidden dimension on test accuracy and test loss by epoch]
print(f"\n{'Hidden Dim':<12} {'Params':>10} {'Time':>8} {'Best Acc':>10} {'Final Acc':>10}")
print("-" * 54)
for hd in hidden_dims:
    r = results[hd]
    print(f"{hd:<12} {r['params']:>10,} {r['time']:>7.1f}s {max(r['test_acc']):>10.3f} {r['test_acc'][-1]:>10.3f}")

Hidden Dim       Params     Time   Best Acc  Final Acc
------------------------------------------------------
32              652,610    45.5s      0.524      0.518
64              673,410    86.4s      0.530      0.530
128             739,586   194.3s      0.511      0.511
256             970,242   489.0s      0.522      0.522

Notice the trade-off: larger hidden dimensions give the model more capacity but also more parameters and longer training time. At some point you hit diminishing returns — or even overfitting, where a bigger model memorizes training data without generalizing better.
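
To see the diminishing returns at a glance, here is a small sketch that plots parameter count against the best test accuracy each configuration reached, using the results dictionary we just built:

# Sketch: capacity vs. accuracy for the four LSTM configurations above
fig, ax = plt.subplots(figsize=(5, 4))
params = [results[hd]["params"] for hd in hidden_dims]
best_acc = [max(results[hd]["test_acc"]) for hd in hidden_dims]
ax.plot(params, best_acc, marker="o")
for hd, p, a in zip(hidden_dims, params, best_acc):
    ax.annotate(f"hidden={hd}", (p, a), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Parameters")
ax.set_ylabel("Best test accuracy")
ax.set_title("LSTM capacity vs. accuracy")
ax.grid(True, alpha=0.3)
plt.tight_layout()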


Wrap-Up

Key Takeaways

  1. One generic train_model loop works for any nn.Module, so swapping architectures only requires changing the model definition.

  2. The LSTM can model word order, but it pays for it: stepping through 256 tokens sequentially made it far slower to train than the feed-forward model, which pools the whole sequence at once.

  3. On this 5,000-review subset, the simpler bag-of-embeddings baseline clearly outperformed the LSTM. A more complex model is not automatically a better one.

  4. Hyperparameters such as the hidden dimension trade capacity against parameter count and training time, and larger is not reliably better.

What’s Next

We’ve now explored the full arc from classical machine learning (Week 4) through feed-forward networks to LSTMs. Along the way, we’ve seen a recurring trade-off: more powerful architectures capture more structure but cost more to train.

In Week 6, we’ll meet the architecture that changed everything: the Transformer. By replacing recurrence with self-attention, Transformers process all tokens in parallel while maintaining the ability to capture long-range dependencies. We’ll build the attention mechanism from scratch and see exactly how it solves the limitations we encountered this week.