Preference Alignment and RLHF

Overview

In previous lessons, we covered training language models from scratch, fine-tuning pre-trained models, and distributed training infrastructure. However, even well-trained models may not always behave according to human preferences or values. This lesson explores techniques for aligning language models with human preferences, focusing on methods like RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), and other approaches.

Preference alignment represents a crucial step in developing helpful, harmless, and honest AI systems. These techniques help reduce harmful outputs, make models more helpful, and create systems that better align with human values and expectations.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the fundamental challenges of language model alignment
  • Implement Reinforcement Learning from Human Feedback (RLHF) pipelines
  • Apply Direct Preference Optimization (DPO) and other preference learning methods
  • Design effective data collection processes for human feedback
  • Evaluate alignment quality using appropriate metrics
  • Compare different alignment approaches and their trade-offs

The Alignment Problem

Why Alignment Matters

Language models trained on internet-scale data can generate content that may be harmful, misleading, or misaligned with human values. Alignment techniques aim to address these issues.

Analogy: Alignment as Civic Education

Think of alignment as civic education for AI systems:

  • Pre-training: Like general education (reading, writing, facts about the world)
  • Fine-tuning: Like specialized education (professional skills, domain expertise)
  • Alignment: Like civic and ethical education (social norms, values, ethical conduct)

Just as societies invest in teaching ethics and values to citizens, we need to "teach" AI systems to behave in accordance with human preferences and values.

Types of Misalignment

  1. Goal Misalignment: When model objectives differ from human intentions
  2. Capability Misalignment: When models are optimized for raw capability without corresponding safety constraints
  3. Distributional Misalignment: When training data distributions differ from deployment contexts

Human Feedback Data Collection

Collecting High-Quality Feedback

The foundation of effective alignment is high-quality human feedback data (an example record format is sketched after the list below):

  1. Types of Human Feedback:

    • Ranking preferences between responses
    • Binary judgments (acceptable/unacceptable)
    • Scalar ratings (1-5 stars)
    • Free-form critiques and suggestions
  2. Key Considerations:

    • Annotator diversity and expertise
    • Clear guidelines and calibration
    • Quality control measures
    • Bias mitigation strategies
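
In practice, ranking preferences are often stored as simple prompt/chosen/rejected records. The sketch below is illustrative only; the field names are assumptions rather than a fixed standard.

python
# One illustrative preference record; field names vary between datasets
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "The sky appears blue because air molecules scatter shorter (blue) "
              "wavelengths of sunlight more strongly than longer ones.",
    "rejected": "The sky is blue because it reflects the ocean.",
    "annotator_id": "A17",   # useful for tracking inter-annotator agreement
    "confidence": 4,         # optional scalar rating of preference strength
}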

Example: Anthropic's Constitutional AI Approach

Anthropic's Constitutional AI approach uses a constitution (a set of principles) to guide feedback; a simplified sketch of the loop follows the list below:

  1. Red-teaming: Generate potentially harmful outputs
  2. Constitutional critique: Critique harmful outputs based on principles
  3. Revision: Generate improved responses based on critique
  4. Preference data: Create preference pairs from harmful and revised responses
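
The sketch below illustrates how such preference pairs might be assembled. It is a minimal, hypothetical outline, not Anthropic's actual implementation; `generate` is assumed to be any callable that maps a prompt string to a completion string.

python
# Hypothetical sketch of the critique-and-revision loop that produces preference pairs.
# `generate` is an assumed callable: prompt string -> completion string.
def build_preference_pair(generate, red_team_prompt, principle):
    harmful = generate(red_team_prompt)
    critique = generate(f"Critique this response using the principle: {principle}\n\n"
                        f"Response: {harmful}")
    revised = generate(f"Rewrite the response to address the critique.\n\n"
                       f"Response: {harmful}\n\nCritique: {critique}")
    # The revised answer is preferred over the original harmful one
    return {"prompt": red_team_prompt, "chosen": revised, "rejected": harmful}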

Reinforcement Learning from Human Feedback (RLHF)

The RLHF Pipeline

RLHF combines reinforcement learning with human feedback:

  1. Pre-trained Language Model: Starting point (SFT model)
  2. Human Preference Data: Pairs of responses with human preferences
  3. Reward Model Training: Learn to predict human preferences
  4. RL Fine-tuning: Optimize policy to maximize predicted reward

Reward Modeling

Training a reward model to predict human preferences:

python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, model_name='gpt2'):
        super().__init__()
        # A single regression head (num_labels=1) turns the LM into a scalar reward model
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 has no pad token by default; reuse EOS so batched inputs can be padded
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

    def forward(self, input_ids, attention_mask=None):
        # Returns one scalar reward per sequence in the batch
        return self.model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
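
The reward model is then trained with a pairwise ranking loss so that preferred responses score higher than rejected ones. A minimal sketch, assuming the RewardModel class above and pre-tokenized preference pairs:

python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    # Score both sides of each preference pair with the same reward model
    chosen_rewards = reward_model(chosen_ids, attention_mask=chosen_mask)
    rejected_rewards = reward_model(rejected_ids, attention_mask=rejected_mask)
    # Bradley-Terry style objective: push chosen scores above rejected scores
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()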

Proximal Policy Optimization (PPO)

Using PPO to optimize the language model policy:

python
import torch
from trl import PPOTrainer, PPOConfig

def train_with_ppo(policy_model, reward_model, dataset, device, tokenizer):
    # Configure PPO training (classic trl PPOTrainer API; in trl the policy is
    # typically wrapped as AutoModelForCausalLMWithValueHead to carry the value head)
    ppo_config = PPOConfig(batch_size=8, mini_batch_size=1, learning_rate=1.41e-5)
    # ref_model=None lets trl keep a frozen copy of the policy for the KL penalty
    ppo_trainer = PPOTrainer(ppo_config, policy_model, ref_model=None,
                             tokenizer=tokenizer, dataset=dataset)
    for batch in ppo_trainer.dataloader:
        queries = batch["input_ids"]
        # Sample responses from the current policy, then score them with the reward model
        responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=64)
        rewards = [reward_model(torch.cat([q, r]).unsqueeze(0).to(device)).squeeze()
                   for q, r in zip(queries, responses)]
        # One PPO step: maximize reward while penalizing divergence from the reference model
        ppo_trainer.step(queries, responses, rewards)
    return policy_model

Challenges with RLHF

  1. Reward Hacking: Models learn to exploit reward model weaknesses
  2. KL Penalty Tuning: Balancing task performance with alignment (see the sketch after this list)
  3. Computational Complexity: PPO requires many model calls
  4. Stability Issues: Training can be unstable without careful tuning
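
The KL penalty in point 2 is typically folded into the reward that PPO maximizes: the reward model's score minus a scaled estimate of the policy's divergence from the reference model. A minimal sequence-level sketch; kl_coef is an assumed hyperparameter name:

python
def kl_penalized_reward(rm_score, policy_logps, ref_logps, kl_coef=0.1):
    # Approximate the KL term from summed sequence log-probabilities under
    # the current policy and the frozen reference model
    kl = policy_logps - ref_logps
    # Larger kl_coef keeps the policy closer to the reference but limits reward gains
    return rm_score - kl_coef * kl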

Direct Preference Optimization (DPO)

The DPO Approach

DPO simplifies preference alignment by eliminating the need for a separate reward model:

  1. Theoretical Foundation: Reformulates RLHF as a direct optimization problem (the objective is written out after this list)
  2. Direct Training: Uses preference data to directly update policy
  3. Simpler Implementation: Eliminates RL complexity
  4. Comparable Results: Achieves similar performance to RLHF with less complexity
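
Written out, the DPO objective from Rafailov et al. (2023) is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the reference model $\pi_{\mathrm{ref}}$.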

DPO Implementation

python
import torch
import torch.nn.functional as F

def compute_logps(model, input_ids):
    # Summed log-probability of each sequence under the model
    # (prompt-token masking omitted for brevity)
    logits = model(input_ids).logits[:, :-1, :]
    labels = input_ids[:, 1:]
    return torch.log_softmax(logits, -1).gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(policy_model, ref_model, chosen_ids, rejected_ids, beta=0.1):
    # Get log probs from policy model
    policy_chosen_logps = compute_logps(policy_model, chosen_ids)
    policy_rejected_logps = compute_logps(policy_model, rejected_ids)
    # Get log probs from reference model (frozen, no gradients)
    with torch.no_grad():
        ref_chosen_logps = compute_logps(ref_model, chosen_ids)
        ref_rejected_logps = compute_logps(ref_model, rejected_ids)
    # DPO loss: the preferred completion should gain probability relative to the reference
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margin).mean()

Comparing DPO and RLHF

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Components | Policy model + reward model + RL algorithm | Policy model + reference model |
| Training Steps | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| Computational Cost | High (many model calls in PPO) | Lower (single forward/backward pass) |
| Implementation Complexity | High (RL setup, reward model) | Medium (preference pairs) |
| Stability | Can be unstable | Generally more stable |
| Exploration | Can explore beyond training distribution | Limited to preference data distribution |
| Performance | Strong when tuned properly | Comparable to RLHF in many cases |

Other Alignment Approaches

Iterative Constitutional AI (ICA)

A self-improvement approach using constitution-guided critique:

  1. Initial Model: Start with SFT model
  2. Red-teaming: Generate problematic responses
  3. Self-critique: Model critiques its own output based on constitutional principles
  4. Self-improvement: Model revises responses based on critique
  5. RLHF/DPO: Train on the preference pairs created through this process

Rejection Sampling and Best-of-N

Simple alignment techniques that avoid the complexity of RLHF (a best-of-N sketch follows the list below):

  1. Generate Multiple Completions: Create N different outputs
  2. Reward Scoring: Score each with reward model
  3. Select Best: Choose highest-scoring completion
  4. Advantages: Simple, no additional training
  5. Disadvantages: Computationally expensive at inference time
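
A minimal best-of-N sketch, assuming a Hugging Face causal LM and a reward model like the one trained earlier in this lesson:

python
import torch

@torch.no_grad()
def best_of_n(policy_model, reward_model, tokenizer, prompt, n=8, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample N candidate completions from the (unchanged) policy
    outputs = policy_model.generate(**inputs, do_sample=True, top_p=0.9,
                                    num_return_sequences=n, max_new_tokens=max_new_tokens)
    # Score every candidate with the reward model and keep the best one
    # (attention masking of padding omitted for brevity)
    scores = reward_model(outputs)
    best = outputs[scores.argmax()]
    return tokenizer.decode(best, skip_special_tokens=True)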

Contrastive Preference Learning (CPL)

An alternative approach focusing on contrastive learning:

  1. Embedding Space: Learn embeddings for responses
  2. Contrastive Signal: Pull preferred responses closer together and push rejected responses away
  3. Implementation: Similar to DPO but in embedding space
  4. Benefits: May generalize better to unseen scenarios

Evaluation and Metrics

Evaluating Alignment Quality

Proper evaluation is crucial for alignment techniques (one automated metric is sketched after the list below):

  1. Human Evaluation:

    • A/B testing between model versions
    • Absolute quality ratings
    • Specific alignment attribute scoring
  2. Automated Metrics:

    • Reward model scores on holdout data
    • Classification accuracy on helpful/harmful content
    • Benchmark performance (e.g., TruthfulQA, RealToxicityPrompts)
  3. Capability Preservation:

    • Performance on capability benchmarks
    • Balance between alignment and capabilities
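
As one concrete automated metric, reward-model accuracy on held-out preference pairs can be computed directly. A minimal sketch, assuming pre-tokenized chosen/rejected batches:

python
import torch

@torch.no_grad()
def preference_accuracy(reward_model, holdout_pairs):
    # holdout_pairs: iterable of (chosen_ids, rejected_ids) batches not seen in training
    correct, total = 0, 0
    for chosen_ids, rejected_ids in holdout_pairs:
        # The reward model "agrees" with annotators when the chosen response scores higher
        correct += (reward_model(chosen_ids) > reward_model(rejected_ids)).sum().item()
        total += chosen_ids.shape[0]
    return correct / total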


Practical Alignment Challenges

Reward Hacking

Models can learn to exploit reward model weaknesses:

  1. Gaming the Metric: Optimizing for proxy rewards rather than true goals
  2. Sycophancy: Telling humans what they want to hear
  3. Hidden Harmful Content: Concealing harmful intent
  4. Detection Methods: Adversarial testing, red-teaming

Balancing Alignment and Capabilities

A key challenge is maintaining capabilities while improving alignment:

  1. KL Divergence Control: Prevent excessive drift from base model
  2. Multi-objective Optimization: Balance multiple goals
  3. Progressive Alignment: Gradually increase alignment pressure

Diversity and Representativeness

Ensuring feedback data represents diverse perspectives:

  1. Demographic Diversity: Include annotators from varied backgrounds
  2. Viewpoint Diversity: Incorporate different ethical/political viewpoints
  3. Global Perspectives: Include non-Western cultural norms
  4. Mitigation Strategies: Weight sampling, targeted recruitment

Practical Exercises

Exercise 1: Reward Modeling

Implement a reward model using human preference data:

  1. Prepare a dataset of preferred/rejected response pairs
  2. Train a reward model to predict human preferences
  3. Evaluate model accuracy on holdout preferences
  4. Analyze where the model succeeds and fails

Exercise 2: DPO Implementation

Implement Direct Preference Optimization:

  1. Start with a fine-tuned language model
  2. Create a reference model (frozen copy)
  3. Implement DPO loss function
  4. Train on preference data
  5. Evaluate alignment improvement

Exercise 3: Alignment Evaluation

Design a comprehensive alignment evaluation suite:

  1. Create automated tests for helpfulness, harmlessness, and honesty
  2. Implement capability preservation metrics
  3. Design targeted red-team prompts
  4. Compare different alignment approaches using your evaluation suite

Conclusion

Preference alignment techniques represent a crucial step in developing language models that are not only capable but also aligned with human values and preferences. From RLHF to DPO and constitutional approaches, the field is rapidly evolving with simpler, more efficient methods.

As these models become more powerful and widespread, the importance of proper alignment only grows. The techniques discussed in this lesson provide a foundation for creating AI systems that are helpful, harmless, and honest—systems that augment human capabilities while respecting human values.

In our next lesson, we'll explore comprehensive model evaluation techniques, examining how to assess both the capabilities and alignment of language models using a variety of benchmarks and methodologies.

Additional Resources

Papers

  • "Training Language Models to Follow Instructions with Human Feedback" (OpenAI, 2022) - RLHF, InstructGPT
  • "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023) - DPO
  • "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022) - Constitutional approach
  • "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic, 2022) - RLHF
