RNNs & Sequence Models ⏱️
Modeling sequences over time
Understanding Sequential Data
Not all data is static. Language, music, stock prices, and videos unfold over time. Recurrent Neural Networks (RNNs) are designed to process such sequences by maintaining a memory of what came before.
Why Sequences Are Special
Consider reading this sentence. Each word's meaning depends on previous words:
- "The bank was steep" → riverbank
- "The bank was closed" → financial institution
Context accumulates as you read. RNNs capture this accumulation.
The Core Idea: Hidden State
An RNN processes inputs one at a time, updating an internal hidden state that summarizes everything seen so far:
- Read first word → update memory
- Read second word + memory → update memory
- Read third word + memory → update memory
- ... and so on
The hidden state is the network's "working memory."
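The update above can be sketched as a single step of a vanilla RNN. This is a minimal illustration with hypothetical sizes and randomly initialized weights; real formulations vary (some use ReLU instead of tanh, or separate bias terms).

```python
import numpy as np

# One vanilla RNN step: combine the new input with the previous memory.
# Sizes and weights here are illustrative, not from any real model.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """Fold the new input into the running summary (the hidden state)."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)          # empty memory before the first word
x = rng.normal(size=input_size)    # stand-in for an embedded word
h = rnn_step(x, h)                 # memory now summarizes the first input
```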
Unrolling Through Time
Imagine the same network being copied for each time step, connected by the hidden state:
- Time 1: Process "The" → Hidden state H1
- Time 2: Process "cat" + H1 → Hidden state H2
- Time 3: Process "sat" + H2 → Hidden state H3
The same weights are reused at every step. This keeps the parameter count fixed regardless of sequence length, and lets one network handle sequences of any length.
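Weight sharing shows up clearly when the loop is written out. In this sketch (hypothetical sizes, random weights) the same two matrices are applied at every time step, producing the chain H1, H2, ... from the example above.

```python
import numpy as np

# Unrolling sketch: one set of weights, applied at every time step.
rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 4, 3, 5

W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))

sequence = rng.normal(size=(seq_len, input_size))  # e.g. 5 embedded words
h = np.zeros(hidden_size)
states = []
for x in sequence:                 # the SAME W_xh and W_hh every step
    h = np.tanh(W_xh @ x + W_hh @ h)
    states.append(h)               # H1, H2, ..., one per time step
```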
The Vanishing Gradient Problem
Basic RNNs have trouble with long sequences. When training, gradients that flow backward through many time steps tend to shrink toward zero (or, less often, blow up). The network "forgets" early inputs.
Imagine whispering a message through 100 people—by the end, it's garbled. That's the vanishing gradient problem.
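The shrinking can be illustrated with a toy scalar computation: backpropagating through T steps multiplies the gradient by roughly the same recurrent factor each time. The weight value here is purely illustrative.

```python
# Scalar illustration of vanishing gradients: a recurrent factor with
# magnitude < 1, applied once per time step, drives the gradient to zero.
w = 0.5          # illustrative recurrent weight, |w| < 1
grad = 1.0
for _ in range(100):   # 100 time steps, like 100 whisperers
    grad *= w
print(grad)      # ~7.9e-31: the signal from step 1 is effectively gone
```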
LSTM: Long Short-Term Memory
LSTMs (1997) solve the vanishing gradient problem with a sophisticated memory cell:
- Cell state: A highway for information to flow unchanged across many steps
- Forget gate: Decides what old information to discard
- Input gate: Decides what new information to store
- Output gate: Decides what to output from the cell
Think of it like a conveyor belt with selective additions and removals.
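The three gates and the conveyor-belt cell state can be sketched in a few lines. Shapes and weights are hypothetical; real framework implementations typically fuse the four matrix multiplies into one for speed and include bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step, sketched with illustrative shapes (biases omitted).
rng = np.random.default_rng(2)
n_in, n_h = 4, 3
Wf, Wi, Wo, Wc = (rng.normal(size=(n_h, n_in + n_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)            # forget gate: what old info to discard
    i = sigmoid(Wi @ z)            # input gate: what new info to store
    o = sigmoid(Wo @ z)            # output gate: what to reveal
    c_tilde = np.tanh(Wc @ z)      # candidate new content
    c = f * c_prev + i * c_tilde   # the "conveyor belt" cell state
    h = o * np.tanh(c)
    return h, c

h1, c1 = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h))
```

The key line is the cell-state update: because `c` is modified only by elementwise gating and addition, gradients can flow across many steps without repeatedly passing through a squashing matrix multiply.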
GRU: Gated Recurrent Unit
GRUs (2014) are a simpler alternative with two gates:
- Update gate: Controls how much of the past to keep
- Reset gate: Controls how much of the past to ignore when computing new content
GRUs often perform similarly to LSTMs with fewer parameters.
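A GRU step looks like a trimmed-down LSTM: two gates, no separate cell state. Shapes are again illustrative, and note that the update-gate convention varies between papers (some swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step with illustrative shapes (biases omitted).
rng = np.random.default_rng(3)
n_in, n_h = 4, 3
Wz, Wr, Wh = (rng.normal(size=(n_h, n_in + n_h)) for _ in range(3))

def gru_step(x, h_prev):
    zcat = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ zcat)                         # update gate: keep how much past?
    r = sigmoid(Wr @ zcat)                         # reset gate: ignore how much past?
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde          # blend old memory and new content

h1 = gru_step(rng.normal(size=n_in), np.zeros(n_h))
```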
Bidirectional RNNs
Sometimes the future matters too. In "The movie was good, I really enjoyed it," the later phrase "really enjoyed it" helps confirm how "good" was meant.
Bidirectional RNNs process the sequence both forward and backward, then combine the results. Useful for tasks where you have the complete sequence upfront (not real-time).
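A bidirectional pass can be sketched as two independent RNNs over the same sequence, one reading left-to-right and one right-to-left, with their per-position outputs concatenated. Weights and sizes here are hypothetical.

```python
import numpy as np

# Bidirectional sketch: two separate RNNs, outputs concatenated per step.
rng = np.random.default_rng(4)
n_in, n_h, T = 4, 3, 5
Wx_f, Wx_b = rng.normal(size=(2, n_h, n_in))   # forward / backward input weights
Wh_f, Wh_b = rng.normal(size=(2, n_h, n_h))    # forward / backward recurrent weights

def run(seq, Wx, Wh):
    h, out = np.zeros(n_h), []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

seq = rng.normal(size=(T, n_in))
fwd = run(seq, Wx_f, Wh_f)                 # each state sees only the past
bwd = run(seq[::-1], Wx_b, Wh_b)[::-1]     # each state sees only the future
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
# every position now carries context from both directions (length 2 * n_h)
```

Note the backward pass requires the full sequence before it can start, which is why bidirectional models do not suit real-time streaming.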
Sequence-to-Sequence Models
Many tasks map one sequence to another:
- Translation: English sentence → French sentence
- Summarization: Long article → Short summary
- Speech-to-text: Audio waveform → Text transcript
Encoder-decoder architecture:
- Encoder: Reads input sequence, produces a summary vector
- Decoder: Generates output sequence from the summary
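The two halves can be sketched end to end. Everything here is illustrative: random weights, a toy feedback rule instead of real token embedding, and no vocabulary or training.

```python
import numpy as np

# Encoder-decoder sketch: compress the input into one summary vector,
# then unroll a second RNN from that vector to emit outputs.
rng = np.random.default_rng(5)
n_in, n_h, n_out, T_in, T_out = 4, 3, 4, 5, 4
We_x, Wd_x = rng.normal(size=(2, n_h, n_in))
We_h, Wd_h = rng.normal(size=(2, n_h, n_h))
W_out = rng.normal(size=(n_out, n_h))

# Encoder: read the whole input, keep only the final hidden state.
h = np.zeros(n_h)
for x in rng.normal(size=(T_in, n_in)):
    h = np.tanh(We_x @ x + We_h @ h)
summary = h                              # the single summary vector

# Decoder: start from the summary, feed back its own output each step.
y = np.zeros(n_in)                       # stand-in for a start-of-sequence token
outputs = []
for _ in range(T_out):
    h = np.tanh(Wd_x @ y + Wd_h @ h)
    logits = W_out @ h
    outputs.append(logits)
    y = np.tanh(logits)[:n_in]           # toy feedback, not a real embedding
```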
Attention: The Game Changer
The encoder's single summary vector is a bottleneck. Attention allows the decoder to look back at all encoder states:
- For each output word, focus on relevant input words
- "Look" at the right parts of the input dynamically
Attention proved so powerful that it led to Transformers, which have largely replaced RNNs for many sequence tasks.
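The "look back at all encoder states" step can be sketched with dot-product scoring, one common choice (additive scoring is another). States here are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Dot-product attention sketch: the decoder state queries every encoder
# state, and the scores become a focus distribution over input positions.
rng = np.random.default_rng(6)
n_h, T = 3, 5
encoder_states = rng.normal(size=(T, n_h))   # one vector per input word
decoder_state = rng.normal(size=n_h)

scores = encoder_states @ decoder_state      # relevance of each input word
weights = softmax(scores)                    # focus weights, summing to 1
context = weights @ encoder_states           # weighted look-back over the input
```

The `context` vector replaces the single fixed summary: it is recomputed for every output step, so the decoder can attend to different input words each time.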
Where RNNs Still Shine
Despite Transformers' dominance, RNNs remain useful for:
- Streaming data: Process one element at a time with fixed memory
- Embedded systems: Smaller memory footprint
- Time-series with limited context: When long-range dependencies aren't needed
- Research: Understanding sequential dynamics