Overview
In previous lessons, we explored how to train language models from scratch, monitor training, and engineer datasets. However, training models from scratch is resource-intensive and often unnecessary; fine-tuning existing pre-trained models is a more efficient approach for most applications.
This lesson focuses on fine-tuning techniques for large language models, with special emphasis on parameter-efficient methods. As models grow to billions of parameters, traditional fine-tuning becomes prohibitively expensive. We'll explore how methods like LoRA, QLoRA, and other PEFT (Parameter-Efficient Fine-Tuning) approaches make it possible to adapt these massive models with limited computational resources.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the differences between pre-training and fine-tuning
- Implement full fine-tuning for smaller models
- Apply parameter-efficient fine-tuning techniques like LoRA and adapters
- Select appropriate fine-tuning strategies based on available resources
- Diagnose and fix common fine-tuning issues
- Evaluate fine-tuned models effectively
From Pre-training to Fine-tuning
The Two-phase Learning Paradigm
Modern NLP follows a two-phase approach:
- Pre-training: Learning general language patterns from vast amounts of data
- Fine-tuning: Adapting the pre-trained model to specific tasks or domains
Analogy: Fine-tuning as Specialized Education
Think of pre-training and fine-tuning as education stages:
- Pre-training: General education that builds foundational knowledge (like K-12 and undergraduate studies)
- Fine-tuning: Specialized training for specific professions (like medical school, law school, or vocational training)
Just as a medical student builds upon general knowledge to develop specialized skills, fine-tuning builds upon a pre-trained model's general language understanding to develop task-specific capabilities.
Why Fine-tune?
Fine-tuning a pre-trained model is usually preferable to training from scratch for several reasons:
- Efficiency: it reuses the general language knowledge captured during pre-training, so it requires far less data and compute than training from scratch
- Performance: starting from pre-trained weights typically yields better downstream results than training a comparable model from random initialization
- Accessibility: with the parameter-efficient methods covered in this lesson, even billion-parameter models can be adapted on modest hardware
Full Fine-tuning: The Traditional Approach
How Full Fine-tuning Works
Full fine-tuning updates all parameters of a pre-trained model on a downstream task:
- Initialize with pre-trained weights
- Add task-specific head if needed (e.g., classification layer)
- Train on task-specific data with a lower learning rate
- Update all parameters throughout the network
Implementing Full Fine-tuning
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare dataset (example: IMDB sentiment analysis)
dataset = load_dataset('imdb')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized_dataset = dataset.map(tokenize, batched=True)
```
Challenges with Full Fine-tuning
As models grow larger, full fine-tuning faces significant challenges:
- Memory Requirements:
  - A 7B parameter model in FP16 requires ~14GB just to store the weights
  - Backpropagation requires additional memory for gradients and optimizer states
  - A rule of thumb: budget 3-4x the model size in GPU memory (see the rough estimate after this list)
- Computational Cost:
  - Training cost scales linearly with parameter count
  - Fine-tuning 175B parameter models can cost thousands of dollars
- Catastrophic Forgetting:
  - Aggressive fine-tuning can cause the model to "forget" general capabilities
  - Finding the right balance is challenging
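To make the memory numbers concrete, here is a rough back-of-the-envelope estimate based on the 3-4x rule of thumb above; the exact multiplier depends on the optimizer, precision, and activation checkpointing, so treat it as a sketch rather than an exact budget.

```python
# Rough GPU memory estimate for full fine-tuning, using the 3-4x rule of thumb above
num_params = 7e9                                  # 7B-parameter model
bytes_per_param = 2                               # FP16/BF16 weights
weights_gb = num_params * bytes_per_param / 1e9   # ~14 GB just to store the weights

# Gradients, optimizer states, and framework overhead push training memory to roughly 3-4x that
low, high = 3 * weights_gb, 4 * weights_gb
print(f"Weights: ~{weights_gb:.0f} GB, training estimate: ~{low:.0f}-{high:.0f} GB (plus activations)")
```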
Parameter-Efficient Fine-tuning (PEFT)
The PEFT Revolution
Parameter-Efficient Fine-Tuning methods fine-tune only a small subset of parameters while keeping most of the pre-trained model frozen.
Analogy: PEFT as Adding Specialized Tools
Think of PEFT as adding specialized tools to a well-equipped workshop:
- The workshop (pre-trained model) already has general-purpose tools
- Instead of rebuilding the entire workshop, you add a few specialized tools (trainable parameters)
- These specialized tools enable specific tasks while leveraging the existing equipment
Core PEFT Methods
Adapter-based Methods
How Adapters Work
Adapters are small neural network modules inserted between layers of a pre-trained model:
- Freeze the pre-trained model parameters
- Insert adapter modules after certain layers (typically attention or feed-forward)
- Train only the adapter parameters
- Adapters typically use bottleneck architecture to limit parameter count
Adapter Architecture
Adapters typically use a bottleneck architecture:
- Down-project to a small dimension (e.g., 64)
- Apply non-linearity (e.g., ReLU or GELU)
- Up-project back to original dimension
- Add a residual connection
```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64):
        super().__init__()
        self.down_project = nn.Linear(input_dim, bottleneck_dim)
        self.activation = nn.GELU()
        self.up_project = nn.Linear(bottleneck_dim, input_dim)
        self.layer_norm = nn.LayerNorm(input_dim)

    def forward(self, x):
        # Bottleneck transformation with a residual connection
        residual = x
        x = self.down_project(x)
        x = self.activation(x)
        x = self.up_project(x)
        return self.layer_norm(x + residual)
```
Implementing Adapters with Transformers
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.adapters import AdapterConfig, PfeifferConfig
from datasets import load_dataset

# Load pre-trained model
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add and activate adapters
adapter_name = 'sentiment_adapter'
model.add_adapter(adapter_name, config=PfeifferConfig())
model.train_adapter(adapter_name)  # Freeze the base model and train only the adapter
```
Low-Rank Adaptation (LoRA)
The LoRA Principle
LoRA is based on a key insight: the updates to pre-trained weights during fine-tuning often have a low "intrinsic rank".
Analogy: LoRA as Efficient Communication
Think of LoRA like compressing a high-resolution image:
- Instead of sending the full image (all parameter updates), you send a compressed version
- The compression works by capturing the most important patterns
- You can reconstruct a close approximation to the original image with much less data
How LoRA Works
- Freeze the pre-trained model weights
- For selected weight matrices, learn low-rank update matrices
- The original operation Y = WX becomes Y = WX + ΔWX, where:
  - W is the frozen pre-trained weight matrix
  - ΔW = BA is the low-rank update of rank r
  - B is a matrix of shape [original_dim, r]
  - A is a matrix of shape [r, original_dim]
Implementing LoRA
```python
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=32):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Low-rank factors: delta_W = B @ A; B starts at zero so the update begins as a no-op
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        # Returns only the low-rank update; the frozen base layer's output is added outside
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```
LoRA with PEFT Library
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load pre-trained model
model_name = 'facebook/opt-1.3b'  # Using a 1.3B parameter model as example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # Rank of the low-rank update
    lora_alpha=32,                        # Scaling factor
    target_modules=['q_proj', 'v_proj'],  # Apply LoRA to the attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # Only the LoRA parameters are trainable
```
Quantized LoRA (QLoRA)
Combining Quantization and LoRA
QLoRA combines two powerful techniques:
- Quantization: Reduces the precision of model weights (e.g., from FP16 to 4-bit)
- LoRA: Adds trainable low-rank adapters
Why QLoRA Works
- Memory Efficiency:
  - 4-bit quantization reduces the memory footprint by 4x compared to FP16
  - Only the small LoRA modules are kept in higher precision for training
- Minimal Performance Loss:
  - Quantization techniques such as Double Quantization minimize precision loss
  - LoRA updates compensate for any quantization artifacts
Implementing QLoRA
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
import torch
from datasets import load_dataset

# Load pre-trained model in 4-bit quantization
model_name = 'meta-llama/Llama-2-7b-hf'  # Example with a 7B parameter model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load weights in 4-bit precision
    bnb_4bit_quant_type='nf4',              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # Double quantization of the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = prepare_model_for_kbit_training(model)  # Prepare the quantized model for LoRA training
```
Other PEFT Methods
Prefix Tuning
Prefix tuning prepends trainable vectors (virtual tokens) to the input of each transformer layer:
- Freeze the pre-trained model
- Add trainable prefix tokens to each layer
- These prefix tokens influence the model's behavior through attention
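Below is a minimal sketch of prefix tuning using the Hugging Face PEFT library; the model name and the number of virtual tokens are illustrative choices rather than values prescribed by this lesson.

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model, TaskType

# Illustrative setup: wrap a frozen causal LM with trainable prefix vectors at every layer
model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')  # example model
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # Number of trainable prefix positions prepended at each layer
)
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()  # Only the prefix parameters are trainable
```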
Prompt Tuning and P-Tuning
- Prompt Tuning: Adds trainable tokens only to the input layer
- P-Tuning: Uses a small neural network to generate soft prompts
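For comparison, a prompt-tuning sketch with PEFT is shown below; the soft prompt is attached only at the input layer and can be initialized from a text string. The initialization text and model name are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')  # example model
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=16,
    prompt_tuning_init=PromptTuningInit.TEXT,  # Initialize the soft prompt from text
    prompt_tuning_init_text='Classify the sentiment of this review:',  # illustrative
    tokenizer_name_or_path='facebook/opt-1.3b',
)
model = get_peft_model(model, prompt_config)
```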
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
A highly parameter-efficient method that scales activations with learned vectors:
- Requires minimal additional parameters (often <0.1%)
- Simple element-wise multiplication operation
- Often works well for cross-lingual transfer
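The PEFT library also implements IA³; a hedged sketch follows. The target and feedforward module names are assumptions that depend on the architecture (the names below match OPT-style blocks) and would need to be adjusted for other models.

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')  # example model
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=['k_proj', 'v_proj', 'fc2'],  # Assumed module names for OPT-style blocks
    feedforward_modules=['fc2'],                 # Feed-forward modules scaled on their input
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()  # Typically well under 0.1% of the base model
```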
Practical Considerations for Fine-tuning
Selecting the Right Method
Decision Framework
Use this framework to select the appropriate fine-tuning method:
- When to use Full Fine-tuning:
  - Smaller models (<1B parameters)
  - Abundant computational resources
  - Need maximum performance
- When to use LoRA/Adapters:
  - Medium to large models (1B-13B parameters)
  - Limited but substantial resources
  - Need a balance of performance and efficiency
- When to use QLoRA:
  - Very large models (>7B parameters)
  - Highly constrained resources
  - Consumer-grade hardware
- When to use Prefix/Prompt Tuning:
  - Extremely large models
  - Minimal resources
  - Acceptable performance trade-off
Hyperparameter Considerations
Key hyperparameters for PEFT methods:
- LoRA-specific:
  - Rank (r): higher values give better performance but use more parameters
  - Alpha (α): scaling factor, typically set to 2r
  - Target modules: which layers to apply LoRA to
- Adapter-specific:
  - Bottleneck dimension: controls adapter size
  - Adapter placement: which layers to add adapters to
- General fine-tuning (see the illustrative configuration after this list):
  - Learning rate: typically lower for fine-tuning (1e-5 to 5e-5)
  - Weight decay: helps prevent overfitting (0.01 to 0.1)
  - Training epochs: often fewer for fine-tuning (2-5)
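As a concrete illustration of the general fine-tuning hyperparameters above, here is a minimal TrainingArguments sketch; the specific values are examples drawn from the ranges listed, not prescriptions.

```python
from transformers import TrainingArguments

# Illustrative values taken from the ranges above
training_args = TrainingArguments(
    output_dir='./finetune-output',
    learning_rate=2e-5,              # Within the 1e-5 to 5e-5 range
    weight_decay=0.01,               # Mild regularization against overfitting
    num_train_epochs=3,              # Fine-tuning usually needs only a few epochs
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
)
```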
Avoiding Catastrophic Forgetting
Strategies to preserve general capabilities:
- Use lower learning rates
- Implement early stopping
- Apply regularization techniques
- Balance task-specific data with general data
- Consider multi-task fine-tuning
Advanced Topics in Fine-tuning
Domain Adaptation vs. Task Adaptation
- Domain Adaptation:
  - Adapts to a specific domain (e.g., medical, legal)
  - Preserves general capabilities
  - Often requires continued pre-training
- Task Adaptation:
  - Focuses on specific tasks (e.g., classification, summarization)
  - May specialize at the expense of generality
  - Typically uses supervised fine-tuning
Instruction Tuning
Fine-tuning models on instruction-following data:
- Input format: Typically uses a template like "Instruction: {instruction}\nInput: {input}\nOutput:"
- Dataset composition: Mix of different task types and formats
- Evaluation: Measures ability to follow diverse instructions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset
import torch

# Load model and tokenizer
model_name = 'facebook/opt-1.3b'  # Using a 1.3B parameter model as example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
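Continuing the example, a simple formatting helper applies the instruction template described above before tokenization; the field names (instruction, input, output) are assumptions about the dataset schema and may differ for a specific dataset.

```python
def format_example(example):
    # Assumes the dataset provides 'instruction', 'input', and 'output' fields
    prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput:"
    return {'text': prompt + ' ' + example['output']}

# Hypothetical usage with a Hugging Face dataset:
# dataset = load_dataset('some_instruction_dataset', split='train').map(format_example)
```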
Multi-task Fine-tuning
Training on multiple tasks simultaneously:
- Benefits:
  - Improves generalization
  - Prevents overfitting to a single task
  - Reduces catastrophic forgetting
- Implementation (see the dataset-mixing sketch after this list):
  - Collect datasets for multiple tasks
  - Balance task representation
  - Add task-specific identifiers or prompts
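One way to implement the dataset mixing described above is with the datasets library's interleaving utilities. This is a minimal sketch with toy in-memory datasets; the task prefixes and examples are hypothetical.

```python
from datasets import Dataset, interleave_datasets

# Hypothetical toy datasets, one per task, already converted to a shared 'text' field
summarization = Dataset.from_dict(
    {'text': ['summarize: Long article ... => Short summary.',
              'summarize: Another article ... => Its summary.']})
sentiment = Dataset.from_dict(
    {'text': ['sentiment: I loved this film. => positive',
              'sentiment: Terrible plot. => negative']})

# Balance task representation by sampling each task with an explicit probability
multi_task = interleave_datasets([summarization, sentiment],
                                 probabilities=[0.5, 0.5], seed=42)
```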
Continual Learning and Sequential Fine-tuning
Strategies for learning new tasks without forgetting:
- Elastic Weight Consolidation (EWC):
  - Identifies important parameters for previous tasks
  - Penalizes changes to these parameters when learning new tasks (see the penalty sketch after this list)
- Knowledge Distillation:
  - Uses the original model as a teacher
  - Prevents the new model from diverging too far
- Replay Methods:
  - Maintains a buffer of examples from previous tasks
  - Intermixes these with new-task examples during training
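To make the EWC idea concrete, here is a minimal sketch of the quadratic penalty it adds to the training loss; fisher and old_params are assumed to have been computed on the previous task, and the penalty weight is illustrative.

```python
def ewc_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty that keeps parameters close to their previous-task values,
    weighted by a (diagonal) Fisher information estimate of their importance."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam / 2.0 * penalty

# During training on a new task (sketch):
# loss = task_loss + ewc_penalty(model, old_params, fisher)
```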
Practical Exercises
Exercise 1: LoRA Fine-tuning
Implement LoRA fine-tuning for a sentiment classification task:
- Load a pre-trained model (e.g., BERT or RoBERTa)
- Configure LoRA adapters
- Fine-tune on a sentiment dataset (e.g., SST-2 or IMDB)
- Evaluate performance and parameter efficiency
Exercise 2: QLoRA for Large Models
Use QLoRA to fine-tune a large language model (>7B parameters) on a single GPU:
- Set up 4-bit quantization
- Configure LoRA adapters
- Fine-tune on an instruction dataset
- Compare performance before and after fine-tuning
Exercise 3: Method Comparison
Compare different PEFT methods on the same task:
- Implement Full Fine-tuning, LoRA, Adapters, and Prefix Tuning
- Train each method with the same dataset and hyperparameters
- Analyze performance, memory usage, and training time
- Recommend the best method for different scenarios
Conclusion
Parameter-efficient fine-tuning methods have democratized access to large language models, making it possible to adapt billion-parameter models with limited resources. These techniques not only reduce computational requirements but often provide comparable performance to full fine-tuning.
As models continue to grow, PEFT methods will become increasingly important. The rapid pace of innovation in this area—from adapters to LoRA to QLoRA—suggests that even more efficient techniques may emerge in the future, further lowering the barrier to working with advanced language models.
In our next lesson, we will explore distributed training infrastructure, enabling you to work with even larger models across multiple devices or machines.
Additional Resources
Papers
- "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- "Parameter-Efficient Transfer Learning for NLP" (Houlsby et al., 2019, Adapters)
- "The Power of Scale for Parameter-Efficient Prompt Tuning" (Lester et al., 2021)
Libraries and Tools
- PEFT Library by Hugging Face
- Adapter-Transformers
- bitsandbytes for quantization