Fine-Tuning & RLHF
Adapt and align
From Generic to Specialized
Pre-trained language models are generalists—they can do many things reasonably well. But for specific applications, we often want better performance or particular behaviors. That's where fine-tuning and alignment come in.
The Training Pipeline
Modern LLMs go through multiple stages:
- Pre-training: Learn language from massive text (terabytes of web data)
- Supervised Fine-Tuning (SFT): Learn to follow instructions from curated examples
- RLHF/DPO: Learn to generate preferred responses using human feedback
- Task-specific fine-tuning (optional): Specialize for particular domains
Each stage builds on the previous one.
Supervised Fine-Tuning (SFT)
SFT teaches the model to follow instructions by training on example conversations:
[User]: Summarize this article in 3 bullet points...
[Assistant]: • First key point...
• Second insight...
• Third takeaway...
The model learns the format and style of helpful responses. This is like showing an employee examples of good work.
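Under the hood, SFT is ordinary next-token training, except the loss is usually computed only on the assistant's tokens, with the prompt tokens masked out. A minimal sketch of that masked loss (the function name and toy numbers are illustrative, not from any particular library):

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: model log-probability of each target token
    loss_mask: 1 for assistant/response tokens, 0 for prompt tokens
    """
    losses = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(losses) / len(losses)

# Prompt tokens (mask 0) are ignored; only the response contributes.
loss = sft_loss([-0.1, -1.0, -2.0, -0.5], [0, 0, 1, 1])
```

Masking the prompt matters: the model should learn to produce good responses, not to reproduce user inputs.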
Why SFT Isn't Enough
SFT has limitations:
- Only teaches what good responses look like
- Doesn't teach what to avoid
- Model may still generate harmful or unhelpful content
- Doesn't capture nuanced human preferences
We need something that says "this response is better than that one."
Reinforcement Learning from Human Feedback (RLHF)
RLHF adds a preference learning stage:
- Collect comparisons: Show humans two responses, ask which is better
- Train reward model: Learn to predict human preferences
- Optimize with RL: Adjust the LLM to maximize predicted reward
This aligns the model with human values and preferences.
The Reward Model
The reward model (RM) scores responses:
- Input: Prompt + Response
- Output: Score (higher = more preferred)
Trained on thousands of human comparison judgments:
- "Response A is better than Response B"
- "Response B is better than Response A"
Policy Optimization (PPO)
PPO (Proximal Policy Optimization) updates the LLM:
- Generate responses to prompts
- Score them with the reward model
- Adjust weights to increase scores
- Add a KL penalty so the model doesn't drift too far from the original
The KL penalty is crucial: without it, the model might "hack" the reward model, finding outputs that score highly but are actually poor.
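A common way to wire these pieces together is to subtract the KL term from the reward-model score before the policy update, so the policy is penalized for drifting from the frozen reference model. A simplified per-sequence sketch (beta and the single-number KL estimate are illustrative simplifications):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # Penalize the policy for assigning the response much higher
    # log-probability than the frozen reference model does.
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate
```

Raising beta keeps the model closer to the reference; lowering it lets the reward model dominate, at greater risk of reward hacking.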
Constitutional AI (CAI)
An alternative that reduces human labeling:
- Write principles (a "constitution")
- Generate responses, then critique them against principles
- Use AI-generated preferences to train
Example principle: "Choose the response that is more helpful while being harmless."
Direct Preference Optimization (DPO)
DPO simplifies RLHF by skipping the explicit reward model and RL loop:
- Take preference data directly
- Optimize a special objective that implicitly fits a reward
- Simpler, more stable training
DPO is becoming increasingly popular as an alternative to full RLHF.
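The DPO objective can be written directly in terms of the policy's and reference model's log-probabilities of the chosen and rejected responses. A minimal sketch following the published formula (variable names are illustrative):

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each response's implicit reward is beta * (log pi - log ref);
    # the loss is -log sigmoid of the chosen-minus-rejected margin.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss drops as the policy shifts probability toward chosen responses relative to rejected ones.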
Parameter-Efficient Fine-Tuning (PEFT)
Fine-tuning all parameters is expensive. PEFT methods freeze most weights:
LoRA (Low-Rank Adaptation):
- Add small trainable matrices beside frozen weights
- Train only these additions (0.1-1% of parameters)
- Merge after training for zero inference overhead
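The LoRA computation above can be sketched with plain matrices: the frozen weight W is used as-is, and a low-rank product A·B, scaled by alpha/r, adds a trainable correction. A pure-Python toy (dimensions and the alpha/r scaling convention are illustrative):

```python
def matmul(X, Y):
    # Naive matrix multiply over lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    # Frozen path: x @ W (W never updated during fine-tuning).
    base = matmul(x, W)
    # Trainable low-rank path: x @ A @ B, scaled by alpha / r.
    delta = matmul(matmul(x, A), B)
    return [[b + (alpha / r) * d for b, d in zip(rb, rd)]
            for rb, rd in zip(base, delta)]
```

Because the update is additive, A·B can be folded into W after training, which is why merged LoRA adds zero inference overhead.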
Prefix Tuning:
- Learn special tokens prepended to the input
- Original model weights unchanged
Adapters:
- Insert small networks between frozen layers
When to Fine-Tune vs. Use RAG
Fine-tune when:
- You need a specific style or format
- The task is well-defined and consistent
- You have good training data
- Response patterns should be "baked in"
Use RAG when:
- You need current/private information
- Facts may change over time
- You want source attribution
- You have a large knowledge base
Often the best solution uses both!
Evaluation
How do you know if fine-tuning worked?
- Automated metrics: Perplexity, task-specific accuracy
- Human evaluation: Quality ratings, preference tests
- Red-teaming: Adversarial testing for failures
- Benchmarks: Standard datasets for comparison
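Perplexity, the most common automated metric, is simply the exponential of the average negative log-likelihood over held-out tokens; lower is better. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    # exp(mean negative log-likelihood). A model that assigns every
    # token probability 0.5 has perplexity exactly 2.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```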
Safety Considerations
Fine-tuning can break safety guardrails:
- Malicious fine-tuning can remove restrictions
- Even well-intentioned training may have side effects
- Regular safety testing is essential
Responsible fine-tuning preserves helpful behaviors while adding capabilities.