Preference Alignment and RLHF

Overview

In previous lessons, we covered training language models from scratch, fine-tuning pre-trained models, and distributed training infrastructure. However, even well-trained models may not always behave according to human preferences or values. This lesson explores techniques for aligning language models with human preferences, focusing on methods like RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and other approaches.

Preference alignment represents a crucial step in developing helpful, harmless, and honest AI systems. These techniques help reduce harmful outputs, make models more helpful, and create systems that better align with human values and expectations.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the fundamental challenges of language model alignment
  • Implement Reinforcement Learning from Human Feedback (RLHF) pipelines
  • Apply Direct Preference Optimization (DPO) and other preference learning methods
  • Design effective data collection processes for human feedback
  • Evaluate alignment quality using appropriate metrics
  • Compare different alignment approaches and their trade-offs

The Alignment Problem

Why Alignment Matters

Language models trained on internet-scale data can generate content that may be harmful, misleading, or misaligned with human values. Alignment techniques aim to address these issues.

Analogy: Alignment as Civic Education

Think of alignment as civic education for AI systems:

  • Pre-training: Like general education (reading, writing, facts about the world)
  • Fine-tuning: Like specialized education (professional skills, domain expertise)
  • Alignment: Like civic and ethical education (social norms, values, ethical conduct)

Just as societies invest in teaching ethics and values to citizens, we need to "teach" AI systems to behave in accordance with human preferences and values.

Types of Misalignment

  1. Goal Misalignment: When model objectives differ from human intentions
  2. Capability Misalignment: When models are optimized for raw capability without corresponding safety constraints
  3. Distributional Misalignment: When training data distributions differ from deployment contexts

Human Feedback Data Collection

Collecting High-Quality Feedback

The foundation of effective alignment is high-quality human feedback data (an example record format is sketched after the list below):

  1. Types of Human Feedback:

    • Ranking preferences between responses
    • Binary judgments (acceptable/unacceptable)
    • Scalar ratings (1-5 stars)
    • Free-form critiques and suggestions
  2. Key Considerations:

    • Annotator diversity and expertise
    • Clear guidelines and calibration
    • Quality control measures
    • Bias mitigation strategies
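
In practice, ranking preferences are often stored as simple prompt/chosen/rejected records. The sketch below is illustrative only; the field names are assumptions rather than a fixed standard.

python
# One illustrative preference record; field names vary between datasets
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "The sky appears blue because air molecules scatter shorter (blue) "
              "wavelengths of sunlight more strongly than longer ones.",
    "rejected": "The sky is blue because it reflects the ocean.",
    "annotator_id": "A17",   # useful for tracking inter-annotator agreement
    "confidence": 4,         # optional scalar rating of preference strength
}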

Example: Anthropic's Constitutional AI Approach

Anthropic's Constitutional AI approach uses a constitution (a set of principles) to guide feedback; a simplified sketch of the loop follows the list below:

  1. Red-teaming: Generate potentially harmful outputs
  2. Constitutional critique: Critique harmful outputs based on principles
  3. Revision: Generate improved responses based on critique
  4. Preference data: Create preference pairs from harmful and revised responses
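
The sketch below illustrates how such preference pairs might be assembled. It is a minimal, hypothetical outline, not Anthropic's actual implementation; `generate` is assumed to be any callable that maps a prompt string to a completion string.

python
# Hypothetical sketch of the critique-and-revision loop that produces preference pairs.
# `generate` is an assumed callable: prompt string -> completion string.
def build_preference_pair(generate, red_team_prompt, principle):
    harmful = generate(red_team_prompt)
    critique = generate(f"Critique this response using the principle: {principle}\n\n"
                        f"Response: {harmful}")
    revised = generate(f"Rewrite the response to address the critique.\n\n"
                       f"Response: {harmful}\n\nCritique: {critique}")
    # The revised answer is preferred over the original harmful one
    return {"prompt": red_team_prompt, "chosen": revised, "rejected": harmful}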

Reinforcement Learning from Human Feedback (RLHF)

The RLHF Pipeline

RLHF combines reinforcement learning with human feedback:

  1. Pre-trained Language Model: Starting point (SFT model)
  2. Human Preference Data: Pairs of responses with human preferences
  3. Reward Model Training: Learn to predict human preferences
  4. RL Fine-tuning: Optimize policy to maximize predicted reward

Reward Modeling

Training a reward model to predict human preferences:

python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, model_name='gpt2'):
        super().__init__()
        # A single regression head (num_labels=1) turns the LM into a scalar reward model
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 has no pad token by default; reuse EOS so batched inputs can be padded
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

    def forward(self, input_ids, attention_mask=None):
        # Returns one scalar reward per sequence in the batch
        return self.model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
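
The reward model is then trained with a pairwise ranking loss so that preferred responses score higher than rejected ones. A minimal sketch, assuming the RewardModel class above and pre-tokenized preference pairs:

python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    # Score both sides of each preference pair with the same reward model
    chosen_rewards = reward_model(chosen_ids, attention_mask=chosen_mask)
    rejected_rewards = reward_model(rejected_ids, attention_mask=rejected_mask)
    # Bradley-Terry style objective: push chosen scores above rejected scores
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()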

Proximal Policy Optimization (PPO)

Using PPO to optimize the language model policy:

python
import torch
from trl import PPOTrainer, PPOConfig

def train_with_ppo(policy_model, reward_model, dataset, device, tokenizer):
    # Configure PPO training (classic trl PPOTrainer API; in trl the policy is
    # typically wrapped as AutoModelForCausalLMWithValueHead to carry the value head)
    ppo_config = PPOConfig(batch_size=8, mini_batch_size=1, learning_rate=1.41e-5)
    # ref_model=None lets trl keep a frozen copy of the policy for the KL penalty
    ppo_trainer = PPOTrainer(ppo_config, policy_model, ref_model=None,
                             tokenizer=tokenizer, dataset=dataset)
    for batch in ppo_trainer.dataloader:
        queries = batch["input_ids"]
        # Sample responses from the current policy, then score them with the reward model
        responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=64)
        rewards = [reward_model(torch.cat([q, r]).unsqueeze(0).to(device)).squeeze()
                   for q, r in zip(queries, responses)]
        # One PPO step: maximize reward while penalizing divergence from the reference model
        ppo_trainer.step(queries, responses, rewards)
    return policy_model

Challenges with RLHF

  1. Reward Hacking: Models learn to exploit reward model weaknesses
  2. KL Penalty Tuning: Balancing task performance with alignment (see the sketch after this list)
  3. Computational Complexity: PPO requires many model calls
  4. Stability Issues: Training can be unstable without careful tuning
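
The KL penalty in point 2 is typically folded into the reward that PPO maximizes: the reward model's score minus a scaled estimate of the policy's divergence from the reference model. A minimal sequence-level sketch; kl_coef is an assumed hyperparameter name:

python
def kl_penalized_reward(rm_score, policy_logps, ref_logps, kl_coef=0.1):
    # Approximate the KL term from summed sequence log-probabilities under
    # the current policy and the frozen reference model
    kl = policy_logps - ref_logps
    # Larger kl_coef keeps the policy closer to the reference but limits reward gains
    return rm_score - kl_coef * kl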

Direct Preference Optimization (DPO)

The DPO Approach

DPO simplifies preference alignment by eliminating the need for a separate reward model:

  1. Theoretical Foundation: Reformulates RLHF as a direct optimization problem (the objective is written out after this list)
  2. Direct Training: Uses preference data to directly update policy
  3. Simpler Implementation: Eliminates RL complexity
  4. Comparable Results: Achieves similar performance to RLHF with less complexity
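
Written out, the DPO objective from Rafailov et al. (2023) is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the reference model $\pi_{\mathrm{ref}}$.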

DPO Implementation

python
import torch
import torch.nn.functional as F

def compute_logps(model, input_ids):
    # Summed log-probability of each sequence under the model
    # (prompt-token masking omitted for brevity)
    logits = model(input_ids).logits[:, :-1, :]
    labels = input_ids[:, 1:]
    return torch.log_softmax(logits, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy_model, ref_model, chosen_ids, rejected_ids, beta=0.1):
    # Get log probs from policy model
    policy_chosen_logps = compute_logps(policy_model, chosen_ids)
    policy_rejected_logps = compute_logps(policy_model, rejected_ids)
    # Get log probs from reference model (frozen, no gradients)
    with torch.no_grad():
        ref_chosen_logps = compute_logps(ref_model, chosen_ids)
        ref_rejected_logps = compute_logps(ref_model, rejected_ids)
    # DPO loss: the preferred completion should gain probability relative to the reference
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margin).mean()

Comparing DPO and RLHF

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Components | Policy model + reward model + RL algorithm | Policy model + reference model |
| Training Steps | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| Computational Cost | High (many model calls in PPO) | Lower (single forward/backward pass) |
| Implementation Complexity | High (RL setup, reward model) | Medium (preference pairs) |
| Stability | Can be unstable | Generally more stable |
| Exploration | Can explore beyond training distribution | Limited to preference data distribution |
| Performance | Strong when tuned properly | Comparable to RLHF in many cases |

Other Alignment Approaches

Iterative Constitutional AI (ICA)

A self-improvement approach using constitution-guided critique:

  1. Initial Model: Start with SFT model
  2. Red-teaming: Generate problematic responses
  3. Self-critique: Model critiques its own output based on constitutional principles
  4. Self-improvement: Model revises responses based on critique
  5. RLHF/DPO: Train on the preference pairs created through this process

Rejection Sampling and Best-of-N

Simple alignment techniques that avoid the complexity of RLHF (a best-of-N sketch follows the list below):

  1. Generate Multiple Completions: Create N different outputs
  2. Reward Scoring: Score each with reward model
  3. Select Best: Choose highest-scoring completion
  4. Advantages: Simple, no additional training
  5. Disadvantages: Computationally expensive at inference time
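
A minimal best-of-N sketch, assuming a Hugging Face causal LM and a reward model like the one trained earlier in this lesson:

python
import torch

@torch.no_grad()
def best_of_n(policy_model, reward_model, tokenizer, prompt, n=8, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample N candidate completions from the (unchanged) policy
    outputs = policy_model.generate(**inputs, do_sample=True, top_p=0.9,
                                    num_return_sequences=n, max_new_tokens=max_new_tokens)
    # Score every candidate with the reward model and keep the best one
    # (attention masking of padding omitted for brevity)
    scores = reward_model(outputs)
    best = outputs[scores.argmax()]
    return tokenizer.decode(best, skip_special_tokens=True)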

Contrastive Preference Learning (CPL)

An alternative approach focusing on contrastive learning:

  1. Embedding Space: Learn embeddings for responses
  2. Contrastive Signal: Pull preferred responses closer together and push rejected responses away
  3. Implementation: Similar to DPO but in embedding space
  4. Benefits: May generalize better to unseen scenarios

Evaluation and Metrics

Evaluating Alignment Quality

Proper evaluation is crucial for alignment techniques (one automated metric is sketched after the list below):

  1. Human Evaluation:

    • A/B testing between model versions
    • Absolute quality ratings
    • Specific alignment attribute scoring
  2. Automated Metrics:

    • Reward model scores on holdout data
    • Classification accuracy on helpful/harmful content
    • Benchmark performance (e.g., TruthfulQA, RealToxicityPrompts)
  3. Capability Preservation:

    • Performance on capability benchmarks
    • Balance between alignment and capabilities
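
As one concrete automated metric, reward-model accuracy on held-out preference pairs can be computed directly. A minimal sketch, assuming pre-tokenized chosen/rejected batches:

python
import torch

@torch.no_grad()
def preference_accuracy(reward_model, holdout_pairs):
    # holdout_pairs: iterable of (chosen_ids, rejected_ids) batches not seen in training
    correct, total = 0, 0
    for chosen_ids, rejected_ids in holdout_pairs:
        # The reward model "agrees" with annotators when the chosen response scores higher
        correct += (reward_model(chosen_ids) > reward_model(rejected_ids)).sum().item()
        total += chosen_ids.shape[0]
    return correct / total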


Practical Alignment Challenges

Reward Hacking

Models can learn to exploit reward model weaknesses:

  1. Gaming the Metric: Optimizing for proxy rewards rather than true goals
  2. Sycophancy: Telling humans what they want to hear
  3. Hidden Harmful Content: Concealing harmful intent
  4. Detection Methods: Adversarial testing, red-teaming

Balancing Alignment and Capabilities

A key challenge is maintaining capabilities while improving alignment:

  1. KL Divergence Control: Prevent excessive drift from base model
  2. Multi-objective Optimization: Balance multiple goals
  3. Progressive Alignment: Gradually increase alignment pressure

Diversity and Representativeness

Ensuring feedback data represents diverse perspectives:

  1. Demographic Diversity: Include annotators from varied backgrounds
  2. Viewpoint Diversity: Incorporate different ethical/political viewpoints
  3. Global Perspectives: Include non-Western cultural norms
  4. Mitigation Strategies: Weight sampling, targeted recruitment

Practical Exercises

Exercise 1: Reward Modeling

Implement a reward model using human preference data:

  1. Prepare a dataset of preferred/rejected response pairs
  2. Train a reward model to predict human preferences
  3. Evaluate model accuracy on holdout preferences
  4. Analyze where the model succeeds and fails

Exercise 2: DPO Implementation

Implement Direct Preference Optimization:

  1. Start with a fine-tuned language model
  2. Create a reference model (frozen copy)
  3. Implement DPO loss function
  4. Train on preference data
  5. Evaluate alignment improvement

Exercise 3: Alignment Evaluation

Design a comprehensive alignment evaluation suite:

  1. Create automated tests for helpfulness, harmlessness, and honesty
  2. Implement capability preservation metrics
  3. Design targeted red-team prompts
  4. Compare different alignment approaches using your evaluation suite

Conclusion

Preference alignment techniques represent a crucial step in developing language models that are not only capable but also aligned with human values and preferences. From RLHF to DPO and constitutional approaches, the field is rapidly evolving with simpler, more efficient methods.

As these models become more powerful and widespread, the importance of proper alignment only grows. The techniques discussed in this lesson provide a foundation for creating AI systems that are helpful, harmless, and honest—systems that augment human capabilities while respecting human values.

In our next lesson, we'll explore comprehensive model evaluation techniques, examining how to assess both the capabilities and alignment of language models using a variety of benchmarks and methodologies.

Additional Resources

Papers

  • "Training Language Models to Follow Instructions with Human Feedback" (OpenAI, 2022) - RLHF, InstructGPT
  • "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023) - DPO
  • "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022) - Constitutional approach
  • "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic, 2022) - RLHF
