Reinforcement Learning
Learning by acting
Learning Through Experience
Reinforcement Learning (RL) is fundamentally different from supervised learning. Instead of learning from labeled examples, an RL agent learns by interacting with an environment, receiving rewards, and figuring out which actions lead to the best outcomes.
The Core Framework
Every RL problem has the same basic structure:
- Agent: The learner/decision-maker (e.g., a robot, game player)
- Environment: The world the agent interacts with
- State: Current situation (what the agent observes)
- Action: What the agent can do
- Reward: Feedback signal (higher = better)
At each step: Agent sees state → takes action → environment updates → agent receives reward → repeat.
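This loop can be sketched with a toy one-dimensional environment (the environment, states, and reward below are made up for illustration, not a real library):

```python
import random

def step_env(state, action):
    """Toy environment: move on a 1-D line [0, 4]; reward 1.0 at the right end."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state = 0
total_reward = 0.0
for t in range(20):
    action = random.choice([-1, 1])                # agent takes an action
    state, reward, done = step_env(state, action)  # environment updates
    total_reward += reward                         # agent receives reward
    if done:
        break
```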
The Goal
The agent's goal is to maximize cumulative reward over time (usually discounted, so nearer rewards count more), not just immediate reward. This is the key insight: sometimes you need to sacrifice short-term gains for long-term success.
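Cumulative reward is typically computed as a discounted return with a factor gamma in [0, 1]. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return: G = r0 + gamma*r1 + gamma^2*r2 + ...
    Computed backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sacrificing short-term gain: a delayed reward of 10 can still beat
# a steady stream of 1s when gamma is high.
```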
Exploration vs. Exploitation
One of RL's central challenges:
- Exploitation: Do what you know works well
- Exploration: Try new things to potentially find better strategies
Pure exploitation might miss better options. Pure exploration never uses what you've learned. Good RL balances both.
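One common way to balance the two is epsilon-greedy action selection, sketched below (epsilon is the exploration probability; the Q-value list is illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore (random action);
    otherwise exploit (pick the action with the highest Q value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```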
Policies: The Agent's Strategy
A policy maps states to actions—it's the agent's decision-making strategy.
- Deterministic policy: "In state X, always do action A"
- Stochastic policy: "In state X, do A with 70% probability, B with 30%"
Training an RL agent means finding a good policy.
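Both kinds of policy can be sketched as simple lookup tables (the state and action names below are placeholders):

```python
import random

# Deterministic policy: a fixed mapping from state to action.
det_policy = {"X": "A"}

# Stochastic policy: a probability distribution over actions per state.
stoch_policy = {"X": [("A", 0.7), ("B", 0.3)]}

def act(policy, state, rng=random):
    """Return an action: directly for a deterministic policy,
    sampled by probability for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, str):      # deterministic
        return entry
    actions, probs = zip(*entry)    # stochastic: sample by weight
    return rng.choices(actions, weights=probs, k=1)[0]
```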
Value Functions: Predicting Future Rewards
Value functions estimate how good a state (or state-action pair) is:
- State value V(s): Expected total reward starting from state s
- Action value Q(s,a): Expected total reward after taking action a in state s
If you know Q(s,a) for all state-action pairs, the optimal policy is simple: always choose the action with the highest Q value!
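That greedy rule is a one-liner over a Q table (the table entries below are made up for illustration):

```python
def greedy_policy(Q, state, actions):
    """Given action values Q[(s, a)], pick argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy Q table: in state "s0", going right looks better than going left.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
```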
Model-Free vs. Model-Based RL
Model-free: Learn directly from experience
- Don't try to understand how the environment works
- Just learn what actions give good rewards
- Examples: Q-Learning, Policy Gradients
Model-based: Learn how the environment works
- Build a model that predicts next states and rewards
- Use the model to plan ahead
- Can be more sample-efficient
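A minimal illustration of the model-based idea, assuming a toy tabular model of next states and rewards (the model entries are invented):

```python
# Model: (state, action) -> (predicted next_state, predicted reward)
model = {
    ("s", "a1"): ("s1", 0.0),
    ("s", "a2"): ("s2", 1.0),
}

def plan_one_step(model, state, actions):
    """One-step lookahead: simulate each action in the model and
    pick the one with the highest predicted reward."""
    return max(actions, key=lambda a: model[(state, a)][1])
```

Real planners look many steps ahead (e.g. via tree search), but the principle is the same: use the model to simulate before acting.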
Classic Algorithms
Q-Learning: Learn Q(s,a) values directly
- Update Q values based on observed rewards
- Off-policy: can learn about the greedy policy while following a different (e.g. exploratory) behavior policy, including from past experience
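The tabular Q-Learning update can be sketched as follows; note the max over next actions in the target, which is what makes it off-policy:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
    Unseen entries default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```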
SARSA: Similar to Q-Learning but on-policy
- Updates based on actions actually taken
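For comparison, a SARSA-style update, where the target uses the next action the agent actually takes rather than the max:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: target uses Q(s', a') for the action a'
    actually taken in s'. Unseen entries default to 0."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```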
Policy Gradients: Learn the policy directly
- Adjust policy to increase probability of good actions
- Works with continuous action spaces
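A REINFORCE-style sketch for a softmax policy over two actions (a single-state bandit for brevity; the parameters and learning rate are illustrative):

```python
import math

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, reward, lr=0.1):
    """Increase the log-probability of the sampled action in proportion
    to the reward: for a softmax policy, grad log pi(a) = one_hot(a) - pi."""
    probs = softmax(theta)
    for i in range(len(theta)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += lr * reward * grad
    return theta
```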
Deep Reinforcement Learning
When state spaces are huge (like images), use neural networks:
DQN (Deep Q-Network): Neural network approximates Q function
- Famously learned to play Atari games from pixels
- Uses experience replay and target networks for stability
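Experience replay can be sketched as a bounded buffer sampled uniformly at random (a minimal form, not the full DQN training loop); random minibatches break the correlation between consecutive transitions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions;
    old transitions are evicted once capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)
```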
PPO (Proximal Policy Optimization): Stable policy gradient method
- Used in OpenAI Five (Dota 2) and ChatGPT's RLHF
- Prevents too-large policy updates
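The clipping idea can be sketched as PPO's per-sample surrogate objective, where ratio is the new-to-old action probability ratio and advantage estimates how much better the action was than average:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Clipping removes the incentive to push the ratio outside
    [1-eps, 1+eps], preventing too-large policy updates."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```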
SAC (Soft Actor-Critic): Maximum entropy RL
- Encourages exploration through entropy bonus
- Great for continuous control tasks
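The entropy bonus rewards policies that stay stochastic. A sketch of the policy entropy term that SAC (scaled by a temperature coefficient) adds to its objective:

```python
import math

def entropy(probs):
    """Policy entropy H(pi) = -sum_a pi(a) * log pi(a).
    Maximal for a uniform policy, zero for a deterministic one."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```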
Famous RL Successes
- AlphaGo/AlphaZero: Beat world champions at Go, chess, shogi
- OpenAI Five: Defeated professional Dota 2 teams
- Robotics: Learning dexterous manipulation
- RLHF: Aligning language models with human preferences
Challenges
- Sample efficiency: Often needs millions of interactions
- Reward design: Hard to specify exactly what you want
- Stability: Training can be unstable and sensitive
- Sim-to-real: Policies learned in simulation may not transfer