RNNs & Sequence Models ⏱️
Modeling sequences over time
Understanding Sequential Data
Not all data is static. Language, music, stock prices, and videos unfold over time. Recurrent Neural Networks (RNNs) are designed to process such sequences by maintaining a memory of what came before.
Why Sequences Are Special
Consider reading this sentence. Each word's meaning depends on previous words:
- "The bank was steep" → riverbank
- "The bank was closed" → financial institution
Context accumulates as you read. RNNs capture this accumulation.
The Core Idea: Hidden State
An RNN processes inputs one at a time, updating an internal hidden state that summarizes everything seen so far:
- Read first word → update memory
- Read second word + memory → update memory
- Read third word + memory → update memory
- ... and so on
The hidden state is the network's "working memory."
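The update above can be sketched as a single step of a vanilla RNN. This is a minimal illustration with hypothetical sizes and randomly initialized weights; real formulations vary (some use ReLU instead of tanh, or separate bias terms).

```python
import numpy as np

# One vanilla RNN step: combine the new input with the previous memory.
# Sizes and weights here are illustrative, not from any real model.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """Fold the new input into the running summary (the hidden state)."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)          # empty memory before the first word
x = rng.normal(size=input_size)    # stand-in for an embedded word
h = rnn_step(x, h)                 # memory now summarizes the first input
```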
Unrolling Through Time
Imagine the same network being copied for each time step, connected by the hidden state:
- Time 1: Process "The" → Hidden state H1
- Time 2: Process "cat" + H1 → Hidden state H2
- Time 3: Process "sat" + H2 → Hidden state H3
The same weights are reused at every step. This keeps the parameter count fixed regardless of sequence length, and lets one network handle sequences of any length.
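Weight sharing shows up clearly when the loop is written out. In this sketch (hypothetical sizes, random weights) the same two matrices are applied at every time step, producing the chain H1, H2, ... from the example above.

```python
import numpy as np

# Unrolling sketch: one set of weights, applied at every time step.
rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 4, 3, 5

W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))

sequence = rng.normal(size=(seq_len, input_size))  # e.g. 5 embedded words
h = np.zeros(hidden_size)
states = []
for x in sequence:                 # the SAME W_xh and W_hh every step
    h = np.tanh(W_xh @ x + W_hh @ h)
    states.append(h)               # H1, H2, ..., one per time step
```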
The Vanishing Gradient Problem
Basic RNNs have trouble with long sequences. When training, gradients that flow backward through many time steps tend to shrink toward zero (or, less often, blow up). The network "forgets" early inputs.
Imagine whispering a message through 100 people—by the end, it's garbled. That's the vanishing gradient problem.
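The shrinking can be illustrated with a toy scalar computation: backpropagating through T steps multiplies the gradient by roughly the same recurrent factor each time. The weight value here is purely illustrative.

```python
# Scalar illustration of vanishing gradients: a recurrent factor with
# magnitude < 1, applied once per time step, drives the gradient to zero.
w = 0.5          # illustrative recurrent weight, |w| < 1
grad = 1.0
for _ in range(100):   # 100 time steps, like 100 whisperers
    grad *= w
print(grad)      # ~7.9e-31: the signal from step 1 is effectively gone
```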
LSTM: Long Short-Term Memory
LSTMs (1997) solve the vanishing gradient problem with a sophisticated memory cell:
- Cell state: A highway for information to flow unchanged across many steps
- Forget gate: Decides what old information to discard
- Input gate: Decides what new information to store
- Output gate: Decides what to output from the cell
Think of it like a conveyor belt with selective additions and removals.
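The three gates and the conveyor-belt cell state can be sketched in a few lines. Shapes and weights are hypothetical; real framework implementations typically fuse the four matrix multiplies into one for speed and include bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step, sketched with illustrative shapes (biases omitted).
rng = np.random.default_rng(2)
n_in, n_h = 4, 3
Wf, Wi, Wo, Wc = (rng.normal(size=(n_h, n_in + n_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)            # forget gate: what old info to discard
    i = sigmoid(Wi @ z)            # input gate: what new info to store
    o = sigmoid(Wo @ z)            # output gate: what to reveal
    c_tilde = np.tanh(Wc @ z)      # candidate new content
    c = f * c_prev + i * c_tilde   # the "conveyor belt" cell state
    h = o * np.tanh(c)
    return h, c

h1, c1 = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h))
```

The key line is the cell-state update: because `c` is modified only by elementwise gating and addition, gradients can flow across many steps without repeatedly passing through a squashing matrix multiply.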
GRU: Gated Recurrent Unit
GRUs (2014) are a simpler alternative with two gates:
- Update gate: Controls how much of the past to keep
- Reset gate: Controls how much of the past to ignore when computing new content
GRUs often perform similarly to LSTMs with fewer parameters.
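A GRU step looks like a trimmed-down LSTM: two gates, no separate cell state. Shapes are again illustrative, and note that the update-gate convention varies between papers (some swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step with illustrative shapes (biases omitted).
rng = np.random.default_rng(3)
n_in, n_h = 4, 3
Wz, Wr, Wh = (rng.normal(size=(n_h, n_in + n_h)) for _ in range(3))

def gru_step(x, h_prev):
    zcat = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ zcat)                         # update gate: keep how much past?
    r = sigmoid(Wr @ zcat)                         # reset gate: ignore how much past?
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde          # blend old memory and new content

h1 = gru_step(rng.normal(size=n_in), np.zeros(n_h))
```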
Bidirectional RNNs
Sometimes the future matters too. In "The movie was good, I really enjoyed it," the later phrase "really enjoyed it" helps confirm how "good" was meant.
Bidirectional RNNs process the sequence both forward and backward, then combine the results. Useful for tasks where you have the complete sequence upfront (not real-time).
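A bidirectional pass can be sketched as two independent RNNs over the same sequence, one reading left-to-right and one right-to-left, with their per-position outputs concatenated. Weights and sizes here are hypothetical.

```python
import numpy as np

# Bidirectional sketch: two separate RNNs, outputs concatenated per step.
rng = np.random.default_rng(4)
n_in, n_h, T = 4, 3, 5
Wx_f, Wx_b = rng.normal(size=(2, n_h, n_in))   # forward / backward input weights
Wh_f, Wh_b = rng.normal(size=(2, n_h, n_h))    # forward / backward recurrent weights

def run(seq, Wx, Wh):
    h, out = np.zeros(n_h), []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

seq = rng.normal(size=(T, n_in))
fwd = run(seq, Wx_f, Wh_f)                 # each state sees only the past
bwd = run(seq[::-1], Wx_b, Wh_b)[::-1]     # each state sees only the future
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
# every position now carries context from both directions (length 2 * n_h)
```

Note the backward pass requires the full sequence before it can start, which is why bidirectional models do not suit real-time streaming.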
Sequence-to-Sequence Models
Many tasks map one sequence to another:
- Translation: English sentence → French sentence
- Summarization: Long article → Short summary
- Speech-to-text: Audio waveform → Text transcript
Encoder-decoder architecture:
- Encoder: Reads input sequence, produces a summary vector
- Decoder: Generates output sequence from the summary
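The two halves can be sketched end to end. Everything here is illustrative: random weights, a toy feedback rule instead of real token embedding, and no vocabulary or training.

```python
import numpy as np

# Encoder-decoder sketch: compress the input into one summary vector,
# then unroll a second RNN from that vector to emit outputs.
rng = np.random.default_rng(5)
n_in, n_h, n_out, T_in, T_out = 4, 3, 4, 5, 4
We_x, Wd_x = rng.normal(size=(2, n_h, n_in))
We_h, Wd_h = rng.normal(size=(2, n_h, n_h))
W_out = rng.normal(size=(n_out, n_h))

# Encoder: read the whole input, keep only the final hidden state.
h = np.zeros(n_h)
for x in rng.normal(size=(T_in, n_in)):
    h = np.tanh(We_x @ x + We_h @ h)
summary = h                              # the single summary vector

# Decoder: start from the summary, feed back its own output each step.
y = np.zeros(n_in)                       # stand-in for a start-of-sequence token
outputs = []
for _ in range(T_out):
    h = np.tanh(Wd_x @ y + Wd_h @ h)
    logits = W_out @ h
    outputs.append(logits)
    y = np.tanh(logits)[:n_in]           # toy feedback, not a real embedding
```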
Attention: The Game Changer
The encoder's single summary vector is a bottleneck. Attention allows the decoder to look back at all encoder states:
- For each output word, focus on relevant input words
- "Look" at the right parts of the input dynamically
Attention proved so powerful that it led to Transformers, which have largely replaced RNNs for many sequence tasks.
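The "look back at all encoder states" step can be sketched with dot-product scoring, one common choice (additive scoring is another). States here are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Dot-product attention sketch: the decoder state queries every encoder
# state, and the scores become a focus distribution over input positions.
rng = np.random.default_rng(6)
n_h, T = 3, 5
encoder_states = rng.normal(size=(T, n_h))   # one vector per input word
decoder_state = rng.normal(size=n_h)

scores = encoder_states @ decoder_state      # relevance of each input word
weights = softmax(scores)                    # focus weights, summing to 1
context = weights @ encoder_states           # weighted look-back over the input
```

The `context` vector replaces the single fixed summary: it is recomputed for every output step, so the decoder can attend to different input words each time.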
Where RNNs Still Shine
Despite Transformers' dominance, RNNs remain useful for:
- Streaming data: Process one element at a time with fixed memory
- Embedded systems: Smaller memory footprint
- Time-series with limited context: When long-range dependencies aren't needed
- Research: Understanding sequential dynamics