Fine-Tuning & RLHF
Adapt and align
From Generic to Specialized
Pre-trained language models are generalists—they can do many things reasonably well. But for specific applications, we often want better performance or particular behaviors. That's where fine-tuning and alignment come in.
The Training Pipeline
Modern LLMs go through multiple stages:
- Pre-training: Learn language from massive text (terabytes of web data)
- Supervised Fine-Tuning (SFT): Learn to follow instructions from curated examples
- RLHF/DPO: Learn to generate preferred responses using human feedback
- Task-specific fine-tuning (optional): Specialize for particular domains
Each stage builds on the previous one.
Supervised Fine-Tuning (SFT)
SFT teaches the model to follow instructions by training on example conversations:
[User]: Summarize this article in 3 bullet points...
[Assistant]: • First key point...
• Second insight...
• Third takeaway...
The model learns the format and style of helpful responses. This is like showing an employee examples of good work.
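Under the hood, SFT is ordinary next-token training, except the loss is usually computed only on the assistant's tokens, with the prompt tokens masked out. A minimal sketch of that masked loss (the function name and toy numbers are illustrative, not from any particular library):

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: model log-probability of each target token
    loss_mask: 1 for assistant/response tokens, 0 for prompt tokens
    """
    losses = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(losses) / len(losses)

# Prompt tokens (mask 0) are ignored; only the response contributes.
loss = sft_loss([-0.1, -1.0, -2.0, -0.5], [0, 0, 1, 1])
```

Masking the prompt matters: the model should learn to produce good responses, not to reproduce user inputs.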
Why SFT Isn't Enough
SFT has limitations:
- Only teaches what good responses look like
- Doesn't teach what to avoid
- Model may still generate harmful or unhelpful content
- Doesn't capture nuanced human preferences
We need something that says "this response is better than that one."
Reinforcement Learning from Human Feedback (RLHF)
RLHF adds a preference learning stage:
- Collect comparisons: Show humans two responses, ask which is better
- Train reward model: Learn to predict human preferences
- Optimize with RL: Adjust the LLM to maximize predicted reward
This aligns the model with human values and preferences.
The Reward Model
The reward model (RM) scores responses:
- Input: Prompt + Response
- Output: Score (higher = more preferred)
Trained on thousands of human comparison judgments:
- "Response A is better than Response B"
- "Response B is better than Response A"
Policy Optimization (PPO)
PPO (Proximal Policy Optimization) updates the LLM:
- Generate responses to prompts
- Score them with the reward model
- Adjust weights to increase scores
- Add a KL penalty so the model doesn't drift too far from the original
The KL penalty is crucial: without it, the model might "hack" the reward model, finding outputs that score highly but are actually poor.
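A common way to wire these pieces together is to subtract the KL term from the reward-model score before the policy update, so the policy is penalized for drifting from the frozen reference model. A simplified per-sequence sketch (beta and the single-number KL estimate are illustrative simplifications):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # Penalize the policy for assigning the response much higher
    # log-probability than the frozen reference model does.
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate
```

Raising beta keeps the model closer to the reference; lowering it lets the reward model dominate, at greater risk of reward hacking.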
Constitutional AI (CAI)
An alternative that reduces human labeling:
- Write principles (a "constitution")
- Generate responses, then critique them against principles
- Use AI-generated preferences to train
Example principle: "Choose the response that is more helpful while being harmless."
Direct Preference Optimization (DPO)
DPO simplifies RLHF by skipping the explicit reward model and RL loop:
- Take preference data directly
- Optimize a special objective that implicitly fits a reward
- Simpler, more stable training
DPO is becoming increasingly popular as an alternative to full RLHF.
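The DPO objective can be written directly in terms of the policy's and reference model's log-probabilities of the chosen and rejected responses. A minimal sketch following the published formula (variable names are illustrative):

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each response's implicit reward is beta * (log pi - log ref);
    # the loss is -log sigmoid of the chosen-minus-rejected margin.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss drops as the policy shifts probability toward chosen responses relative to rejected ones.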
Parameter-Efficient Fine-Tuning (PEFT)
Fine-tuning all parameters is expensive. PEFT methods freeze most weights:
LoRA (Low-Rank Adaptation):
- Add small trainable matrices beside frozen weights
- Train only these additions (0.1-1% of parameters)
- Merge after training for zero inference overhead
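The LoRA computation above can be sketched with plain matrices: the frozen weight W is used as-is, and a low-rank product A·B, scaled by alpha/r, adds a trainable correction. A pure-Python toy (dimensions and the alpha/r scaling convention are illustrative):

```python
def matmul(X, Y):
    # Naive matrix multiply over lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    # Frozen path: x @ W (W never updated during fine-tuning).
    base = matmul(x, W)
    # Trainable low-rank path: x @ A @ B, scaled by alpha / r.
    delta = matmul(matmul(x, A), B)
    return [[b + (alpha / r) * d for b, d in zip(rb, rd)]
            for rb, rd in zip(base, delta)]
```

Because the update is additive, A·B can be folded into W after training, which is why merged LoRA adds zero inference overhead.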
Prefix Tuning:
- Learn special tokens prepended to the input
- Original model weights unchanged
Adapters:
- Insert small networks between frozen layers
When to Fine-Tune vs. Use RAG
Fine-tune when:
- You need a specific style or format
- The task is well-defined and consistent
- You have good training data
- Response patterns should be "baked in"
Use RAG when:
- You need current/private information
- Facts may change over time
- You want source attribution
- You have a large knowledge base
Often the best solution uses both!
Evaluation
How do you know if fine-tuning worked?
- Automated metrics: Perplexity, task-specific accuracy
- Human evaluation: Quality ratings, preference tests
- Red-teaming: Adversarial testing for failures
- Benchmarks: Standard datasets for comparison
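Perplexity, the most common automated metric, is simply the exponential of the average negative log-likelihood over held-out tokens; lower is better. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    # exp(mean negative log-likelihood). A model that assigns every
    # token probability 0.5 has perplexity exactly 2.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```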
Safety Considerations
Fine-tuning can break safety guardrails:
- Malicious fine-tuning can remove restrictions
- Even well-intentioned training may have side effects
- Regular safety testing is essential
Responsible fine-tuning preserves helpful behaviors while adding capabilities.