Overview
In our previous lessons, we've explored word representations from static embeddings to contextual embeddings. But a critical question remains: how do we effectively process sequences of these word representations to understand the full meaning of sentences, paragraphs, and documents?
This lesson introduces Recurrent Neural Networks (RNNs), the foundational architecture for sequential data processing in NLP. Before transformers became the dominant paradigm, RNNs and their variants (LSTM, GRU) were the state-of-the-art for tasks like language modeling, machine translation, and sentiment analysis.
Learning Objectives
After completing this lesson, you will be able to:
- Understand why sequential data requires specialized neural architectures
- Explain the basic RNN architecture and its recurrence mechanism
- Describe the vanishing/exploding gradient problems in vanilla RNNs
- Compare LSTM and GRU architectures and their advantages
- Implement RNN variants for common NLP tasks
- Recognize the limitations that led to the transformer revolution
The Sequential Nature of Language
The Challenge of Variable-Length Input
Traditional neural networks expect fixed-size inputs, but language is inherently variable in length:
- Sentences can be short ("I agree.") or very long
- Documents can range from tweets to novels
- Conversations can have arbitrary turns and lengths
How do we design neural networks that can handle this variability while preserving the sequential relationships?
Analogy: Understanding Music
Consider how you understand music. A single note in isolation gives limited information, but as you hear sequences of notes, you build an understanding of the melody, rhythm, and emotional tone.
If you were to hear only random isolated notes, you'd lose the temporal patterns that make music meaningful. Similarly, to understand language, we need to process words not in isolation, but as part of a meaningful sequence while maintaining the memory of what came before.
Why Feed-Forward Networks Fall Short
| Requirement | Feed-Forward Networks | Recurrent Networks |
|---|---|---|
| Variable-length input | Require a fixed input size | Can handle any sequence length |
| Parameter sharing across positions | Separate parameters for each input position | Same weights reused at each time step |
| Memory of previous inputs | No memory mechanism | Hidden state carries information forward |
| Order sensitivity | Order agnostic | Order matters |
| Position awareness | No positional awareness | Position implicitly encoded through recurrence |
Recurrent Neural Networks: The Basic Architecture
The Recurrence Mechanism
The key innovation in RNNs is the recurrence mechanism: the network maintains a hidden state (or "memory") that is updated at each time step based on both the current input and the previous hidden state.
RNN Architecture Explorer
Explore different RNN architectures and see how they evolved to solve various problems. Use the tabs below to compare vanilla RNNs, LSTMs, GRUs, and bidirectional variants.
Vanilla Recurrent Neural Network
Vanilla RNNs pass information forward in time, but struggle with long-term dependencies due to vanishing gradients.
💡 Tip: Switch between the different architecture types using the buttons above to see how each variant addresses the limitations of vanilla RNNs.
Mathematical Formulation
At each time step $t$, the vanilla RNN computes:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = g(W_{hy} h_t + b_y)$$

Where:
- $x_t$ is the input at time step $t$ (e.g., a word embedding)
- $h_t$ is the hidden state at time step $t$
- $h_{t-1}$ is the hidden state from the previous time step
- $y_t$ is the output at time step $t$
- $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices
- $b_h$ and $b_y$ are bias vectors
- $f$ is typically a tanh or ReLU activation function
- $g$ is an output activation function (e.g., softmax for classification)
Parameter Sharing
A key advantage of RNNs is parameter sharing across time steps. The same weights are used at each step, which:
- Drastically reduces the number of parameters
- Allows processing sequences of any length
- Enables the network to recognize patterns regardless of position
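To make the recurrence and the weight sharing concrete, here is a minimal sketch of the forward pass written directly from the equations above. The layer names and sizes are illustrative assumptions; the point is that the same $W_{xh}$ and $W_{hh}$ are applied at every step, so the loop accepts a sequence of any length.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 8, 16                         # illustrative sizes
W_xh = nn.Linear(input_dim, hidden_dim, bias=True)    # input-to-hidden weights (with b_h)
W_hh = nn.Linear(hidden_dim, hidden_dim, bias=False)  # hidden-to-hidden weights

def rnn_forward(inputs):
    # inputs: (seq_len, input_dim); the SAME W_xh and W_hh are reused at every step
    h = torch.zeros(hidden_dim)
    hidden_states = []
    for x_t in inputs:
        h = torch.tanh(W_xh(x_t) + W_hh(h))   # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        hidden_states.append(h)
    return torch.stack(hidden_states)

states = rnn_forward(torch.randn(5, input_dim))   # works for any sequence length
print(states.shape)                               # torch.Size([5, 16])
```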
Training RNNs: Backpropagation Through Time (BPTT)
RNNs are trained using an extension of backpropagation called Backpropagation Through Time (BPTT), which unfolds the recurrent network through time and treats it as a deep feed-forward network.
Training Visualization and RNN Comparison
Explore how different RNN architectures handle training challenges. Use the "Backpropagation" tab to see gradient flow problems, and the "RNN Comparison" tab to compare performance across architectures.
Backpropagation Through Time (BPTT)
Visualizing how gradients flow backward through an unfolded RNN during training.
In the forward-pass view you can see how a vanilla RNN propagates information step by step; in the backward pass, the visualization highlights where gradients shrink and eventually vanish.
💡 Tip: Try switching between "Backpropagation" and "RNN Comparison" tabs to see both training dynamics and performance differences.
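To see the vanishing-gradient problem outside the visualization, the short sketch below (an illustrative experiment, not part of the original lesson code) backpropagates a loss on the final hidden state of a vanilla nn.RNN and compares the gradient that reaches the first input with the gradient at the last input. As the sequence grows, the gradient at the earliest step typically shrinks toward zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, nonlinearity="tanh", batch_first=True)

for seq_len in [5, 20, 80]:
    x = torch.randn(1, seq_len, 4, requires_grad=True)
    out, _ = rnn(x)
    # The loss depends only on the final hidden state, so gradients must flow back through time
    out[:, -1].sum().backward()
    grad_first = x.grad[0, 0].norm().item()   # gradient reaching the first input
    grad_last = x.grad[0, -1].norm().item()   # gradient at the last input
    print(f"len={seq_len:3d}  |dL/dx_1|={grad_first:.2e}  |dL/dx_T|={grad_last:.2e}")
```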
Long Short-Term Memory (LSTM): Solving the Long-Term Dependency Problem
To address the vanishing gradient problem, Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) architecture in 1997. LSTMs use a more complex recurrent unit with gates that control information flow.
LSTM Architecture
👆 Use the Architecture Explorer above and select "LSTM" to see the detailed gate structure and how it differs from vanilla RNNs.
The Gate Mechanism
An LSTM cell contains three gates that regulate information flow:
- Forget Gate: Decides what information to discard from the cell state
- Input Gate: Decides what new information to store in the cell state
- Output Gate: Decides what parts of the cell state to output
Mathematical Formulation
For input $x_t$ at time step $t$:

Forget Gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

Input Gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, with candidate values $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

Cell State Update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

Output Gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, with new hidden state $h_t = o_t \odot \tanh(C_t)$

Where:
- $\sigma$ is the sigmoid function
- $\odot$ represents element-wise multiplication
- $C_t$ is the cell state at time $t$
- $h_t$ is the hidden state at time $t$
- $W_*$ and $b_*$ are the gate-specific weight matrices and bias vectors
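A single LSTM step can be written almost verbatim from these equations. In this minimal sketch the concatenation $[h_{t-1}, x_t]$ is implemented with torch.cat, each gate gets its own linear layer, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 8, 16
concat_dim = hidden_dim + input_dim
# One linear layer per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = (nn.Linear(concat_dim, hidden_dim) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(W_f(z))             # forget gate
    i_t = torch.sigmoid(W_i(z))             # input gate
    C_tilde = torch.tanh(W_C(z))            # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde      # cell state update
    o_t = torch.sigmoid(W_o(z))             # output gate
    h_t = o_t * torch.tanh(C_t)             # new hidden state
    return h_t, C_t

h, C = torch.zeros(hidden_dim), torch.zeros(hidden_dim)
h, C = lstm_step(torch.randn(input_dim), h, C)
print(h.shape, C.shape)   # torch.Size([16]) torch.Size([16])
```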
Memory Management Analogy
Think of the LSTM cell as a skilled personal assistant managing your information flow:
- Forget Gate: Like clearing your desk of irrelevant papers
- Input Gate: Like deciding which new information deserves to be filed away
- Cell State: Like your organized filing cabinet of important information
- Output Gate: Like preparing a briefing of only the relevant information you need right now
Addressing Long-Term Dependencies
LSTMs excel at capturing long-term dependencies through their explicit memory mechanism. The combination of the cell state (long-term memory) and hidden state (working memory) allows LSTMs to maintain relevant information across many time steps while forgetting irrelevant details.
Gated Recurrent Unit (GRU): A Streamlined Alternative
Introduced in 2014 by Cho et al., the Gated Recurrent Unit (GRU) is a simplified variant of the LSTM that combines the forget and input gates into a single "update gate."
GRU Architecture
👆 Use the Architecture Explorer above and select "GRU" to see how it simplifies the LSTM design while maintaining effectiveness.
Mathematical Formulation
For input $x_t$ at time step $t$:

Update Gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$

Reset Gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$

Candidate Hidden State: $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$

Final Hidden State: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
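The same exercise for one GRU step, translated directly from the equations above (a minimal sketch with illustrative dimensions):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 8, 16
W_z = nn.Linear(hidden_dim + input_dim, hidden_dim)   # update gate
W_r = nn.Linear(hidden_dim + input_dim, hidden_dim)   # reset gate
W_h = nn.Linear(hidden_dim + input_dim, hidden_dim)   # candidate state

def gru_step(x_t, h_prev):
    z_in = torch.cat([h_prev, x_t], dim=-1)
    z_t = torch.sigmoid(W_z(z_in))                                       # update gate
    r_t = torch.sigmoid(W_r(z_in))                                       # reset gate
    h_tilde = torch.tanh(W_h(torch.cat([r_t * h_prev, x_t], dim=-1)))    # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                            # final hidden state

h = gru_step(torch.randn(input_dim), torch.zeros(hidden_dim))
print(h.shape)   # torch.Size([16])
```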
LSTM vs. GRU: Comparison
| Feature | LSTM | GRU |
|---|---|---|
| Parameters | More (4 sets of weights and biases) | Fewer (3 sets of weights and biases) |
| Memory unit | Cell state and hidden state | Hidden state only |
| Gates | Forget, input, and output gates | Update and reset gates |
| Training speed | Slower | Faster |
| Performance on very long dependencies | Slightly better | Good |
| Computational efficiency | More computation | Less computation |
Note: GRUs typically train faster and require fewer parameters, but LSTMs may perform better on certain tasks, especially those requiring fine-grained memory control.
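The parameter difference is easy to verify with PyTorch's built-in modules; the sizes below are illustrative:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)
print("LSTM parameters:", n_params(lstm))   # 4 gates' worth of weights
print("GRU parameters: ", n_params(gru))    # 3 gates' worth of weights (about 25% fewer)
```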
Bidirectional RNNs: Capturing Context from Both Directions
In many NLP tasks, understanding a word requires context from both past and future words. Bidirectional RNNs process the sequence in both forward and backward directions.
Bidirectional Architecture
👆 Use the Architecture Explorer above and select "Bidirectional" to see how information flows in both directions.
Benefits for NLP Tasks
Bidirectional processing is especially valuable for:
- Named Entity Recognition
- Part-of-Speech Tagging
- Machine Translation
- Question Answering
Example: Disambiguating Word Sense
The word "bank" has different meanings depending on context. Bidirectional RNNs can use both past and future context to determine the correct interpretation.
Example contexts:
- "I went to the bank to deposit money" (financial institution)
- "We sat by the river bank watching the sunset" (edge of water)
- "The pilot had to bank the airplane to the left" (to tilt)
Bidirectional RNNs excel at these disambiguation tasks because they can consider the full sentence context.
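In PyTorch, bidirectionality is a single constructor flag. The output at each position concatenates the forward and backward hidden states, so the feature size doubles. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(2, 10, 50)            # batch of 2 sequences of 10 word embeddings
out, (h_n, c_n) = bilstm(x)
print(out.shape)   # torch.Size([2, 10, 128]): forward and backward states concatenated
print(h_n.shape)   # torch.Size([2, 2, 64]): final hidden state for each direction
```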
Common NLP Applications of RNNs
Language Modeling
Language modeling is the task of predicting the next word given a sequence of previous words. This is a fundamental NLP task with applications in:
- Speech recognition
- Machine translation
- Text generation
- Spelling correction
Code Example: Simple Character-Level Language Model
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from collections import Counter
import random

# Sample text data
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
```
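A compact, self-contained sketch of such a character-level language model: an LSTM trained with cross-entropy to predict the next character. The tiny corpus, hyperparameters, and training loop here are illustrative assumptions rather than the original listing.

```python
import torch
import torch.nn as nn

# Illustrative tiny corpus (assumption; any plain-text string works)
text = "natural language processing with recurrent neural networks "
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
vocab_size = len(chars)

# Encode the corpus as a tensor of character indices
data = torch.tensor([char2idx[c] for c in text], dtype=torch.long)

class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embed(x)                   # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        return self.fc(out), hidden           # logits over next characters

model = CharRNN(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

seq_len = 16
for step in range(200):
    # Random training window: predict each next character within the window
    start = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    inputs = data[start:start + seq_len].unsqueeze(0)            # (1, seq_len)
    targets = data[start + 1:start + seq_len + 1].unsqueeze(0)   # shifted by one character
    logits, _ = model(inputs)
    loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.3f}")
```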
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text, often used for customer reviews, social media monitoring, and brand analysis.
Code Example: Sentiment Classification with LSTM
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data
texts = [
```
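A compact, self-contained sketch of an LSTM sentiment classifier in the same spirit: a trainable embedding layer feeds an LSTM whose final hidden state is mapped to a single logit. The toy texts, labels, and hyperparameters are illustrative assumptions rather than the original data.

```python
import torch
import torch.nn as nn

# Illustrative toy data (assumption; real use would load a labelled review dataset)
texts = ["i loved this movie", "terrible plot and acting", "great performance", "boring and slow"]
labels = torch.tensor([1, 0, 1, 0], dtype=torch.float)

# Build a tiny vocabulary; index 0 is reserved for padding
vocab = {"<pad>": 0}
for t in texts:
    for w in t.split():
        vocab.setdefault(w, len(vocab))

def encode(batch, max_len=6):
    ids = [[vocab[w] for w in t.split()][:max_len] for t in batch]
    return torch.tensor([seq + [0] * (max_len - len(seq)) for seq in ids])

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        emb = self.embed(x)
        _, (h_n, _) = self.lstm(emb)          # h_n: final hidden state, (1, batch, hidden_dim)
        return self.fc(h_n[-1]).squeeze(1)    # one logit per example

model = SentimentLSTM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

x = encode(texts)
for step in range(100):
    loss = criterion(model(x), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print((torch.sigmoid(model(x)) > 0.5).float())  # predicted labels on the toy data
```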
Machine Translation with Encoder-Decoder Architecture
Machine translation uses a sequence-to-sequence (Seq2Seq) architecture with an encoder RNN and a decoder RNN.
Interactive Translation Demo
See how RNN encoder-decoder models with attention work for machine translation. This demonstrates the attention mechanism we discussed earlier in practice.
Sequence-to-Sequence Attention
Visualizing attention mechanism in sequence-to-sequence models with RNNs.
How Attention Works:
The LSTM encoder processes the input sequence and generates attention weights for each word. Higher weights (darker blue) indicate words that are more relevant to answering the query. The decoder uses these weights to focus on important parts of the input.
Code Example: Simple Encoder-Decoder for Translation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random

# Simple Encoder-Decoder with Attention for Neural Machine Translation
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
```
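A compact sketch of the encoder-decoder-with-attention pattern the listing introduces. GRUs, dot-product attention, and all dimensions are illustrative choices here; the original example uses a multi-layer encoder with dropout, and a real system would add teacher forcing, masking, and a training loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions (assumptions, not from the original listing)
SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) -> outputs: (batch, src_len, hid), hidden: (1, batch, hid)
        return self.rnn(self.embed(src))

class Attention(nn.Module):
    """Dot-product attention between the decoder state and all encoder states."""
    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, hid), enc_outputs: (batch, src_len, hid)
        scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)                                   # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)    # (batch, hid)
        return context, weights

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.attn = Attention()
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, tgt_token, hidden, enc_outputs):
        # tgt_token: (batch,) current target token; hidden: (1, batch, hid)
        emb = self.embed(tgt_token).unsqueeze(1)                  # (batch, 1, emb)
        context, weights = self.attn(hidden[-1], enc_outputs)     # attend over the source
        rnn_in = torch.cat([emb, context.unsqueeze(1)], dim=2)    # feed the context into the GRU
        output, hidden = self.rnn(rnn_in, hidden)
        logits = self.out(torch.cat([output.squeeze(1), context], dim=1))
        return logits, hidden, weights

# One decoding step on random toy data
src = torch.randint(0, SRC_VOCAB, (2, 7))        # batch of 2 source sentences, length 7
enc, dec = Encoder(SRC_VOCAB, EMB, HID), Decoder(TGT_VOCAB, EMB, HID)
enc_outputs, hidden = enc(src)
logits, hidden, attn_weights = dec(torch.zeros(2, dtype=torch.long), hidden, enc_outputs)
print(logits.shape, attn_weights.shape)          # (2, TGT_VOCAB), (2, 7)
```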
RNNs with Attention Mechanism: A Step Toward Transformers
The attention mechanism, introduced in 2014, was a critical advancement that addressed limitations in the encoder-decoder architecture, particularly for long sequences.
The Problem: Information Bottleneck
In the basic encoder-decoder architecture, the entire source sequence is compressed into a fixed-size vector, creating an information bottleneck.
Attention Mechanism: The Bridge to Transformers
Attention allows the decoder to "focus" on different parts of the source sequence at each decoding step. This was the conceptual breakthrough that led to transformers.
Note: This is encoder-decoder attention between RNNs. In our next lesson on transformers, we'll see how this concept evolved into self-attention, where sequences attend to themselves.
Mathematical Formulation
1. Calculate alignment scores between the decoder state $s_{t-1}$ and all encoder states $h_i$:
   $e_{t,i} = \text{score}(s_{t-1}, h_i)$
2. Normalize to get attention weights:
   $\alpha_{t,i} = \dfrac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$
3. Calculate the context vector as a weighted sum:
   $c_t = \sum_{i} \alpha_{t,i} h_i$
4. Generate the output using the context vector and the current decoder state:
   $s_t = f(s_{t-1}, y_{t-1}, c_t)$
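The four steps in tensor form, with a simple dot product as an illustrative choice of score function:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder_states = torch.randn(5, 64)        # h_1 ... h_5 (source length 5)
decoder_state = torch.randn(64)            # s_{t-1}

scores = encoder_states @ decoder_state    # 1. alignment scores e_{t,i}
weights = F.softmax(scores, dim=0)         # 2. attention weights alpha_{t,i}
context = weights @ encoder_states         # 3. context vector c_t (weighted sum)
print(weights, context.shape)              # 4. c_t is combined with s_{t-1} to produce the output
```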
The Bridge to Transformers
The attention mechanism was a crucial step toward the transformer architecture:
- Eliminated the bottleneck of fixed-size context vectors
- Allowed direct connections between distant positions
- Introduced the concept of weighted importance between elements
- Provided a foundation for self-attention in transformers
Coming up: In our next lesson, we'll see how this encoder-decoder attention evolved into self-attention, where sequences attend to themselves, leading to the revolutionary transformer architecture.
Limitations of RNNs and the Path to Transformers
Despite their innovations, RNNs (even with LSTM/GRU and attention) have several limitations:
Sequential Processing and Limited Context
RNNs process tokens sequentially, making them inherently difficult to parallelize. Even with gating mechanisms, RNNs struggle to maintain very long-range dependencies.
👆 Use the "RNN Comparison" tab in the Training Visualization above to see how different architectures perform on various metrics like training speed, memory usage, and dependency modeling.
Emergence of Transformers
The transformer architecture addressed these limitations by:
- Parallelization: Processing all tokens simultaneously
- Direct connections: Allowing each position to attend to all positions
- Multi-head attention: Capturing different types of relationships
- Positional encoding: Maintaining sequence order without recurrence
Summary
In this lesson, we've covered:
- The sequential nature of language and why it requires specialized architectures
- Vanilla RNN architecture and its limitations
- LSTM and GRU cells that address the vanishing gradient problem
- Bidirectional RNNs for capturing context from both directions
- Applications in language modeling, sentiment analysis, and machine translation
- Attention mechanisms that paved the way for transformers
- Limitations of RNNs that led to the transformer revolution
RNNs represent a crucial chapter in the evolution of NLP architectures. While transformers have largely superseded them for many tasks, understanding RNNs is essential for appreciating the motivations behind modern architectures, and they remain relevant in settings where their sequential processing and efficiency are an advantage.
In our next lesson, we'll explore transformers in depth, understanding how they revolutionized NLP and enabled the powerful language models we use today.
Practice Exercises
1. RNN from Scratch:
   - Implement a vanilla RNN in PyTorch
   - Observe the vanishing gradient problem firsthand
   - Compare training stability across different sequence lengths
2. LSTM Language Model:
   - Build a character-level language model using LSTMs
   - Generate text samples and analyze coherence
   - Experiment with temperature settings in sampling
3. Sentiment Analysis Comparison:
   - Implement sentiment classifiers using:
     - Bag-of-words + Logistic Regression
     - Word embeddings + Vanilla RNN
     - Word embeddings + LSTM
     - Word embeddings + Bidirectional LSTM
   - Compare performance and training time
4. Neural Machine Translation:
   - Implement a simple encoder-decoder model for translation
   - Add an attention mechanism
   - Analyze which source words receive attention for different target words
Additional Resources
- Understanding LSTM Networks by Christopher Olah
- The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy
- Sequence to Sequence Learning with Neural Networks by Sutskever et al.
- Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau et al.
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Chung et al.
- Deep Learning for NLP and Speech Recognition by Kamath et al. (Chapters 7-9)