Reinforcement Learning
Learning by acting
Learning Through Experience
Reinforcement Learning (RL) is fundamentally different from supervised learning. Instead of learning from labeled examples, an RL agent learns by interacting with an environment, receiving rewards, and figuring out which actions lead to the best outcomes.
The Core Framework
Every RL problem has the same basic structure:
- Agent: The learner/decision-maker (e.g., a robot, game player)
- Environment: The world the agent interacts with
- State: Current situation (what the agent observes)
- Action: What the agent can do
- Reward: Feedback signal (higher = better)
At each step: Agent sees state → takes action → environment updates → agent receives reward → repeat.
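This loop can be sketched with a toy one-dimensional environment (the environment, states, and reward below are made up for illustration, not a real library):

```python
import random

def step_env(state, action):
    """Toy environment: move on a 1-D line [0, 4]; reward 1.0 at the right end."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state = 0
total_reward = 0.0
for t in range(20):
    action = random.choice([-1, 1])                # agent takes an action
    state, reward, done = step_env(state, action)  # environment updates
    total_reward += reward                         # agent receives reward
    if done:
        break
```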
The Goal
The agent's goal is to maximize cumulative reward over time (usually discounted, so nearer rewards count more), not just immediate reward. This is the key insight: sometimes you need to sacrifice short-term gains for long-term success.
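Cumulative reward is typically computed as a discounted return with a factor gamma in [0, 1]. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return: G = r0 + gamma*r1 + gamma^2*r2 + ...
    Computed backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sacrificing short-term gain: a delayed reward of 10 can still beat
# a steady stream of 1s when gamma is high.
```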
Exploration vs. Exploitation
One of RL's central challenges:
- Exploitation: Do what you know works well
- Exploration: Try new things to potentially find better strategies
Pure exploitation might miss better options. Pure exploration never uses what you've learned. Good RL balances both.
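One common way to balance the two is epsilon-greedy action selection, sketched below (epsilon is the exploration probability; the Q-value list is illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (random action);
    otherwise exploit (pick the action with the highest Q value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```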
Policies: The Agent's Strategy
A policy maps states to actions—it's the agent's decision-making strategy.
- Deterministic policy: "In state X, always do action A"
- Stochastic policy: "In state X, do A with 70% probability, B with 30%"
Training an RL agent means finding a good policy.
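Both kinds of policy can be sketched as simple lookup tables (the state and action names below are placeholders):

```python
import random

# Deterministic policy: a fixed mapping from state to action.
det_policy = {"X": "A"}

# Stochastic policy: a probability distribution over actions per state.
stoch_policy = {"X": [("A", 0.7), ("B", 0.3)]}

def act(policy, state, rng=random):
    """Return an action: directly for a deterministic policy,
    sampled by probability for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, str):      # deterministic
        return entry
    actions, probs = zip(*entry)    # stochastic: sample by weight
    return rng.choices(actions, weights=probs, k=1)[0]
```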
Value Functions: Predicting Future Rewards
Value functions estimate how good a state (or state-action pair) is:
- State value V(s): Expected total reward starting from state s
- Action value Q(s,a): Expected total reward after taking action a in state s
If you know Q(s,a) for all state-action pairs, the optimal policy is simple: always choose the action with the highest Q value!
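That greedy rule is a one-liner over a Q table (the table entries below are made up for illustration):

```python
def greedy_policy(Q, state, actions):
    """Given action values Q[(s, a)], pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy Q table: in state "s0", going right looks better than going left.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
```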
Model-Free vs. Model-Based RL
Model-free: Learn directly from experience
- Don't try to understand how the environment works
- Just learn what actions give good rewards
- Examples: Q-Learning, Policy Gradients
Model-based: Learn how the environment works
- Build a model that predicts next states and rewards
- Use the model to plan ahead
- Can be more sample-efficient
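A minimal illustration of the model-based idea, assuming a toy tabular model of next states and rewards (the model entries are invented):

```python
# Model: (state, action) -> (predicted next_state, predicted reward)
model = {
    ("s", "a1"): ("s1", 0.0),
    ("s", "a2"): ("s2", 1.0),
}

def plan_one_step(model, state, actions):
    """One-step lookahead: simulate each action in the model and
    pick the one with the highest predicted reward."""
    return max(actions, key=lambda a: model[(state, a)][1])
```

Real planners look many steps ahead (e.g. via tree search), but the principle is the same: use the model to simulate before acting.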
Classic Algorithms
Q-Learning: Learn Q(s,a) values directly
- Update Q values based on observed rewards
- Off-policy: can learn about the greedy policy while following a different (e.g. exploratory) behavior policy, including from past experience
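The tabular Q-Learning update can be sketched as follows; note the max over next actions in the target, which is what makes it off-policy:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
    Unseen entries default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```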
SARSA: Similar to Q-Learning but on-policy
- Updates based on actions actually taken
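For comparison, a SARSA-style update, where the target uses the next action the agent actually takes rather than the max:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: target uses Q(s', a') for the action a'
    actually taken in s'. Unseen entries default to 0."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```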
Policy Gradients: Learn the policy directly
- Adjust policy to increase probability of good actions
- Works with continuous action spaces
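A REINFORCE-style sketch for a softmax policy over two actions (a single-state bandit for brevity; the parameters and learning rate are illustrative):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, reward, lr=0.1):
    """Increase the log-probability of the sampled action in proportion
    to the reward: for a softmax policy, grad log pi(a) = one_hot(a) - pi."""
    probs = softmax(theta)
    for i in range(len(theta)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += lr * reward * grad
    return theta
```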
Deep Reinforcement Learning
When state spaces are huge (like images), use neural networks:
DQN (Deep Q-Network): Neural network approximates Q function
- Famously learned to play Atari games from pixels
- Uses experience replay and target networks for stability
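Experience replay can be sketched as a bounded buffer sampled uniformly at random (a minimal form, not the full DQN training loop); random minibatches break the correlation between consecutive transitions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions;
    old transitions are evicted once capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)
```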
PPO (Proximal Policy Optimization): Stable policy gradient method
- Used in OpenAI Five (Dota 2) and ChatGPT's RLHF
- Prevents too-large policy updates
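The clipping idea can be sketched as PPO's per-sample surrogate objective, where ratio is the new-to-old action probability ratio and advantage estimates how much better the action was than average:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Clipping removes the incentive to push the ratio outside
    [1-eps, 1+eps], preventing too-large policy updates."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```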
SAC (Soft Actor-Critic): Maximum entropy RL
- Encourages exploration through entropy bonus
- Great for continuous control tasks
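The entropy bonus rewards policies that stay stochastic. A sketch of the policy entropy term that SAC (scaled by a temperature coefficient) adds to its objective:

```python
import math

def entropy(probs):
    """Policy entropy H(pi) = -sum_a pi(a) * log pi(a).
    Maximal for a uniform policy, zero for a deterministic one."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```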
Famous RL Successes
- AlphaGo/AlphaZero: Beat world champions at Go, chess, shogi
- OpenAI Five: Defeated professional Dota 2 teams
- Robotics: Learning dexterous manipulation
- RLHF: Aligning language models with human preferences
Challenges
- Sample efficiency: Often needs millions of interactions
- Reward design: Hard to specify exactly what you want
- Stability: Training can be unstable and sensitive
- Sim-to-real: Policies learned in simulation may not transfer