Fine-tuning Techniques and Parameter-Efficient Methods

Overview

In our previous lessons, we've explored how to train language models from scratch and how to monitor training and engineer datasets. However, training models from scratch is resource-intensive and often unnecessary. Fine-tuning existing pre-trained models is a more efficient approach for most applications.

This lesson focuses on fine-tuning techniques for large language models, with special emphasis on parameter-efficient methods. As models grow to billions of parameters, traditional fine-tuning becomes prohibitively expensive. We'll explore how methods like LoRA, QLoRA, and other PEFT (Parameter-Efficient Fine-Tuning) approaches make it possible to adapt these massive models with limited computational resources.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the differences between pre-training and fine-tuning
  • Implement full fine-tuning for smaller models
  • Apply parameter-efficient fine-tuning techniques like LoRA and adapters
  • Select appropriate fine-tuning strategies based on available resources
  • Diagnose and fix common fine-tuning issues
  • Evaluate fine-tuned models effectively

From Pre-training to Fine-tuning

The Two-phase Learning Paradigm

Modern NLP follows a two-phase approach:

  1. Pre-training: Learning general language patterns from vast amounts of data
  2. Fine-tuning: Adapting the pre-trained model to specific tasks or domains

Analogy: Fine-tuning as Specialized Education

Think of pre-training and fine-tuning as education stages:

  • Pre-training: General education that builds foundational knowledge (like K-12 and undergraduate studies)
  • Fine-tuning: Specialized training for specific professions (like medical school, law school, or vocational training)

Just as a medical student builds upon general knowledge to develop specialized skills, fine-tuning builds upon a pre-trained model's general language understanding to develop task-specific capabilities.

Why Fine-tune?

Fine-tuning is usually preferable to training from scratch because:

  • It reuses the general language knowledge the model already acquired during pre-training
  • It needs far less data and compute than pre-training a model of comparable size
  • It lets you adapt a general-purpose model to the specific task or domain you care about

Full Fine-tuning: The Traditional Approach

How Full Fine-tuning Works

Full fine-tuning updates all parameters of a pre-trained model on a downstream task:

  1. Initialize with pre-trained weights
  2. Add task-specific head if needed (e.g., classification layer)
  3. Train on task-specific data with a lower learning rate
  4. Update all parameters throughout the network

Implementing Full Fine-tuning

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare dataset (example: IMDB sentiment analysis)
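
The listing above stops at dataset preparation. A minimal continuation might look like the following sketch; the dataset handling and hyperparameter values are illustrative and not part of the original example:

python
# Illustrative continuation: load and tokenize IMDB, then train with the Trainer
dataset = load_dataset('imdb')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir='./bert-imdb',
    learning_rate=2e-5,              # Lower learning rate, as is typical for fine-tuning
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
)
trainer.train()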

Challenges with Full Fine-tuning

As models grow larger, full fine-tuning faces significant challenges:

  1. Memory Requirements:

    • A 7B parameter model in FP16 requires ~14GB just to store
    • Backpropagation requires additional memory for gradients and optimizer states
    • A rule of thumb: need 3-4x model size in GPU memory
  2. Computational Cost:

    • Training cost scales linearly with parameter count
    • Fine-tuning 175B parameter models can cost thousands of dollars
  3. Catastrophic Forgetting:

    • Aggressive fine-tuning can cause the model to "forget" general capabilities
    • Finding the right balance is challenging

Parameter-Efficient Fine-tuning (PEFT)

The PEFT Revolution

Parameter-Efficient Fine-Tuning methods fine-tune only a small subset of parameters while keeping most of the pre-trained model frozen.

Analogy: PEFT as Adding Specialized Tools

Think of PEFT as adding specialized tools to a well-equipped workshop:

  • The workshop (pre-trained model) already has general-purpose tools
  • Instead of rebuilding the entire workshop, you add a few specialized tools (trainable parameters)
  • These specialized tools enable specific tasks while leveraging the existing equipment

Core PEFT Methods

Several families of parameter-efficient methods are in wide use: adapter modules, low-rank adaptation (LoRA) and its quantized variant QLoRA, and prompt-based approaches such as prefix tuning, prompt tuning, and IA³. The following sections examine each in turn.

Adapter-based Methods

How Adapters Work

Adapters are small neural network modules inserted between layers of a pre-trained model:

  1. Freeze the pre-trained model parameters
  2. Insert adapter modules after certain layers (typically attention or feed-forward)
  3. Train only the adapter parameters
  4. Adapters typically use bottleneck architecture to limit parameter count

Adapter Architecture

Adapters typically use a bottleneck architecture:

  1. Down-project to a small dimension (e.g., 64)
  2. Apply non-linearity (e.g., ReLU or GELU)
  3. Up-project back to original dimension
  4. Add a residual connection
python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64):
        super().__init__()
        self.down_project = nn.Linear(input_dim, bottleneck_dim)
        self.activation = nn.GELU()
        self.up_project = nn.Linear(bottleneck_dim, input_dim)
        self.layer_norm = nn.LayerNorm(input_dim)

    def forward(self, x):
        # Bottleneck transformation with a residual connection back to the input
        residual = x
        x = self.down_project(x)
        x = self.activation(x)
        x = self.up_project(x)
        return self.layer_norm(x + residual)

Implementing Adapters with Transformers

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.adapters import AdapterConfig, PfeifferConfig
from datasets import load_dataset

# Load pre-trained model
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add and activate adapters
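
The listing ends before the adapter is actually added. Assuming the adapter-transformers fork (which is what provides the transformers.adapters module imported above), the continuation would look roughly like this:

python
# Add a bottleneck adapter (Pfeiffer configuration) and train only its parameters
model.add_adapter('imdb_sentiment', config=PfeifferConfig())
model.train_adapter('imdb_sentiment')        # Freezes the base model, unfreezes the adapter
model.set_active_adapters('imdb_sentiment')  # Routes the forward pass through the adapter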

Low-Rank Adaptation (LoRA)

The LoRA Principle

LoRA is based on a key insight: the updates to pre-trained weights during fine-tuning often have a low "intrinsic rank".

Analogy: LoRA as Efficient Communication

Think of LoRA like compressing a high-resolution image:

  • Instead of sending the full image (all parameter updates), you send a compressed version
  • The compression works by capturing the most important patterns
  • You can reconstruct a close approximation to the original image with much less data

How LoRA Works

  1. Freeze the pre-trained model weights
  2. For selected weight matrices, learn low-rank update matrices
  3. The original operation Y = WX becomes Y = WX + ∆WX where:
    • W is the frozen pre-trained weight
    • ∆W = BA is the low-rank update (rank r)
    • B is a matrix of shape [original_dim, r]
    • A is a matrix of shape [r, original_dim]

Implementing LoRA

python
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=32):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Low-rank factors: delta_W = B @ A, with A of shape [rank, in] and B of shape [out, rank]
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # A gets a random init, B starts at zero so the update is zero at the start of training
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        # Returns only the scaled low-rank update; it is added to the frozen layer's output
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

LoRA with PEFT Library

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load pre-trained model
model_name = 'facebook/opt-1.3b'  # Using a 1.3B parameter model as example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA configuration
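
The listing stops at the LoRA configuration. A minimal continuation with the peft library is sketched below; the rank, alpha, and target modules are illustrative choices rather than values from the original:

python
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # Rank of the low-rank update (illustrative)
    lora_alpha=32,                        # Scaling factor alpha
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # Attention projections in OPT models
)

# Wrap the frozen base model with trainable LoRA modules
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Reports how few parameters are actually trainable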

Quantized LoRA (QLoRA)

Combining Quantization and LoRA

QLoRA combines two powerful techniques:

  1. Quantization: Reduces the precision of model weights (e.g., from FP16 to 4-bit)
  2. LoRA: Adds trainable low-rank adapters

Why QLoRA Works

  1. Memory Efficiency:

    • 4-bit quantization reduces memory footprint by 4x compared to FP16
    • Only small LoRA modules are kept in higher precision for training
  2. Minimal Performance Loss:

    • QLoRA's quantization scheme (4-bit NormalFloat weights together with Double Quantization of the quantization constants) keeps the precision loss small
    • LoRA updates compensate for any quantization artifacts

Implementing QLoRA

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
import torch
from datasets import load_dataset

# Load pre-trained model in 4-bit quantization
model_name = 'meta-llama/Llama-2-7b-hf'  # Example with a 7B parameter model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,   # Load model in 4-bit precision
    device_map='auto',   # Place layers automatically on the available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
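
The listing is cut off after loading the quantized model. The remaining QLoRA-specific steps with the peft library would look roughly like the following sketch (hyperparameters are illustrative):

python
# Prepare the 4-bit model for training (gradient checkpointing, upcasting of norm layers, etc.)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; only these small matrices are trained, in higher precision
lora_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # Llama attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()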

Other PEFT Methods

Prefix Tuning

Prefix tuning prepends trainable vectors (virtual tokens) to the input of each transformer layer:

  1. Freeze the pre-trained model
  2. Add trainable prefix tokens to each layer
  3. These prefix tokens influence the model's behavior through attention

Prompt Tuning and P-Tuning

  • Prompt Tuning: Adds trainable tokens only to the input layer
  • P-Tuning: Uses a small neural network to generate soft prompts

IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

A highly parameter-efficient method that scales activations with learned vectors:

  • Requires minimal additional parameters (often <0.1%)
  • Simple element-wise multiplication operation
  • Often works well for cross-lingual transfer
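
These alternatives are also exposed through the peft library. The sketch below shows how their configuration objects can be constructed (the class names come from peft; the hyperparameter values are illustrative):

python
from peft import PrefixTuningConfig, PromptTuningConfig, PromptEncoderConfig, IA3Config, TaskType

# Prefix tuning: trainable prefix vectors injected at every transformer layer
prefix_cfg = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# Prompt tuning: trainable soft tokens added only at the input layer
prompt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# P-tuning: a small prompt-encoder network generates the soft prompts
ptuning_cfg = PromptEncoderConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# IA³: learned scaling vectors applied to inner activations (very few parameters)
ia3_cfg = IA3Config(task_type=TaskType.CAUSAL_LM)

# Any of these configurations can then be applied with get_peft_model(model, config)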

Practical Considerations for Fine-tuning

Selecting the Right Method

Decision Framework

Use this framework to select the appropriate fine-tuning method:

  1. When to use Full Fine-tuning:

    • Smaller models (<1B parameters)
    • Abundant computational resources
    • Need maximum performance
  2. When to use LoRA/Adapters:

    • Medium to large models (1B-13B parameters)
    • Limited but substantial resources
    • Need balance of performance and efficiency
  3. When to use QLoRA:

    • Very large models (>7B parameters)
    • Highly constrained resources
    • Consumer-grade hardware
  4. When to use Prefix/Prompt Tuning:

    • Extremely large models
    • Minimal resources
    • Acceptable performance trade-off

Hyperparameter Considerations

Key hyperparameters for PEFT methods:

  1. LoRA-specific:

    • Rank (r): Higher values give better performance but use more parameters
    • Alpha (α): Scaling factor, typically set to 2r
    • Target modules: Which layers to apply LoRA to
  2. Adapter-specific:

    • Bottleneck dimension: Controls adapter size
    • Adapter placement: Which layers to add adapters to
  3. General fine-tuning:

    • Learning rate: Typically lower for fine-tuning (1e-5 to 5e-5)
    • Weight decay: Helps prevent overfitting (0.01 to 0.1)
    • Training epochs: Often fewer for fine-tuning (2-5)
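
As a compact illustration, the general fine-tuning settings above could be expressed through TrainingArguments like this (the values are the typical starting points quoted in the list, not tuned recommendations):

python
from transformers import TrainingArguments

# Conservative defaults for fine-tuning: low learning rate, light weight decay, few epochs
training_args = TrainingArguments(
    output_dir='./finetune-output',
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
)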

Avoiding Catastrophic Forgetting

Strategies to preserve general capabilities:

  1. Use lower learning rates
  2. Implement early stopping
  3. Apply regularization techniques
  4. Balance task-specific data with general data
  5. Consider multi-task fine-tuning

Advanced Topics in Fine-tuning

Domain Adaptation vs. Task Adaptation

  1. Domain Adaptation:

    • Adapts to a specific domain (e.g., medical, legal)
    • Preserves general capabilities
    • Often requires continued pre-training
  2. Task Adaptation:

    • Focuses on specific tasks (e.g., classification, summarization)
    • May specialize at the expense of generality
    • Typically uses supervised fine-tuning

Instruction Tuning

Fine-tuning models on instruction-following data:

  1. Input format: Typically uses a template like "Instruction: {instruction}\nInput: {input}\nOutput:"
  2. Dataset composition: Mix of different task types and formats
  3. Evaluation: Measures ability to follow diverse instructions
python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset
import torch

# Load model and tokenizer
model_name = 'facebook/opt-1.3b'  # Using a 1.3B parameter model as example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
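
As a sketch of the formatting step described above, each training example can be rendered with the instruction template before tokenization. The field names below are assumptions about the dataset schema, and the dataset name is a placeholder:

python
# Render one example with the instruction template (hypothetical field names)
def format_example(example):
    prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput:"
    # Append the target text so the model learns to produce it after the template
    return {'text': prompt + ' ' + example['output']}

# dataset = load_dataset('your/instruction-dataset')   # Placeholder dataset name
# dataset = dataset.map(format_example)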

Multi-task Fine-tuning

Training on multiple tasks simultaneously:

  1. Benefits:

    • Improves generalization
    • Prevents overfitting to a single task
    • Reduces catastrophic forgetting
  2. Implementation:

    • Collect datasets for multiple tasks
    • Balance task representation
    • Add task-specific identifiers or prompts
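
A minimal sketch of this mixing step with the datasets library is shown below; the dataset names, task prefixes, and sampling probabilities are illustrative:

python
from datasets import load_dataset, interleave_datasets

# Reduce each task dataset to a single 'text' column prefixed with a task identifier
sentiment = load_dataset('imdb', split='train')
sentiment = sentiment.map(lambda ex: {'text': 'Task: sentiment\n' + ex['text']},
                          remove_columns=sentiment.column_names)

summaries = load_dataset('cnn_dailymail', '3.0.0', split='train')
summaries = summaries.map(lambda ex: {'text': 'Task: summarize\n' + ex['article']},
                          remove_columns=summaries.column_names)

# Balance task representation by sampling from each dataset with explicit probabilities
mixed = interleave_datasets([sentiment, summaries], probabilities=[0.5, 0.5], seed=42)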

Continual Learning and Sequential Fine-tuning

Strategies for learning new tasks without forgetting:

  1. Elastic Weight Consolidation (EWC):

    • Identifies important parameters for previous tasks
    • Penalizes changes to these parameters when learning new tasks
  2. Knowledge Distillation:

    • Uses original model as teacher
    • Prevents new model from diverging too far
  3. Replay Methods:

    • Maintains a buffer of examples from previous tasks
    • Intermixes these with new task examples during training
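
As an illustration of the first strategy, the EWC penalty can be written as a quadratic regularizer that pulls parameters toward their values after the previous task, weighted by a per-parameter importance estimate. This is a simplified sketch, not a complete EWC implementation; old_params and fisher are assumed to be dictionaries captured after training on the previous task:

python
import torch

def ewc_penalty(model, old_params, fisher, lam=0.4):
    # Quadratic penalty discouraging changes to parameters that were important
    # for the previous task (importance approximated by the diagonal Fisher information)
    loss = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam / 2 * loss

# During training on the new task (sketch):
# total_loss = task_loss + ewc_penalty(model, old_params, fisher)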

Practical Exercises

Exercise 1: LoRA Fine-tuning

Implement LoRA fine-tuning for a sentiment classification task:

  1. Load a pre-trained model (e.g., BERT or RoBERTa)
  2. Configure LoRA adapters
  3. Fine-tune on a sentiment dataset (e.g., SST-2 or IMDB)
  4. Evaluate performance and parameter efficiency

Exercise 2: QLoRA for Large Models

Use QLoRA to fine-tune a large language model (>7B parameters) on a single GPU:

  1. Set up 4-bit quantization
  2. Configure LoRA adapters
  3. Fine-tune on an instruction dataset
  4. Compare performance before and after fine-tuning

Exercise 3: Method Comparison

Compare different PEFT methods on the same task:

  1. Implement Full Fine-tuning, LoRA, Adapters, and Prefix Tuning
  2. Train each method with the same dataset and hyperparameters
  3. Analyze performance, memory usage, and training time
  4. Recommend the best method for different scenarios

Conclusion

Parameter-efficient fine-tuning methods have democratized access to large language models, making it possible to adapt billion-parameter models with limited resources. These techniques not only reduce computational requirements but often provide comparable performance to full fine-tuning.

As models continue to grow, PEFT methods will become increasingly important. The rapid pace of innovation in this area—from adapters to LoRA to QLoRA—suggests that even more efficient techniques may emerge in the future, further lowering the barrier to working with advanced language models.

In our next lesson, we will explore distributed training infrastructure, enabling you to work with even larger models across multiple devices or machines.

Additional Resources

Papers

  • "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
  • "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
  • "Parameter-Efficient Transfer Learning for NLP" (Houlsby et al., 2019, Adapters)
  • "The Power of Scale for Parameter-Efficient Prompt Tuning" (Lester et al., 2021)

Libraries and Tools

  • Hugging Face Transformers
  • Hugging Face PEFT
  • bitsandbytes (4-bit and 8-bit quantization)
  • AdapterHub (adapter-transformers)
  • Hugging Face Datasets

Blog Posts and Tutorials