Overview
In our previous lessons, we explored the transformer architecture and various sampling techniques for text generation. Now, we'll trace the foundational evolutionary journey of transformer models that revolutionized NLP from 2018 to 2023.
This lesson examines how the original encoder-decoder transformer architecture branched into specialized variants—encoder-only, decoder-only, and encoder-decoder approaches—each optimized for different tasks. We'll analyze milestone models like BERT, GPT, T5, and understand the key insights that drove this foundational evolution leading up to the modern era.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the architectural differences between encoder-only, decoder-only, and encoder-decoder models
- Explain the innovations and key contributions of foundational models (BERT, GPT-3, T5, etc.)
- Compare the strengths and weaknesses of different transformer variants
- Recognize the relationship between model architecture and NLP task suitability
- Identify key trends in the foundational evolution of transformer models
- Apply this knowledge to understand the principles behind architectural choices
The Transformer Family Tree
From General to Specialized Architectures
The original transformer model (Vaswani et al., 2017) introduced a general encoder-decoder architecture for sequence-to-sequence tasks. Since then, transformer models have evolved along three main branches:
- Encoder-only models (e.g., BERT, RoBERTa): Specialize in understanding language
- Decoder-only models (e.g., GPT, GPT-3): Focus on generating language
- Encoder-decoder models (e.g., T5, BART): Maintain the full architecture for sequence transformation
Figure: The transformer family tree, showing how transformer-based models evolved over time; key models are highlighted and connections indicate architectural lineage.
Analogy: Specialized Tools vs. Swiss Army Knife
Think of the evolution of transformer models like the evolution of tools:
- The original transformer was like a Swiss Army knife: versatile, but not optimized for any specific task
- Encoder-only models are like specialized reading glasses: excellent for understanding text but poor at creating it
- Decoder-only models are like high-quality pens: designed primarily for creating content
- Encoder-decoder models are like advanced translation devices: optimized for converting one form of text to another
Just as a professional craftsperson selects specific tools for different jobs, NLP systems select transformer variants optimized for particular tasks.
Encoder-Only Models: Understanding Language
BERT: Bidirectional Encoder Representations from Transformers
BERT, introduced by Google in 2018, was a breakthrough that fundamentally changed NLP. It uses only the encoder portion of the transformer architecture but adds two innovative pre-training tasks.
Key Innovations in BERT
- Bidirectional attention: Unlike previous models that processed text left-to-right or right-to-left, BERT attends to the entire context simultaneously
- Masked Language Modeling (MLM): Randomly masks 15% of tokens and trains the model to predict them
- Next Sentence Prediction (NSP): Trains the model to determine if two sentences follow each other in the original text
Figure: BERT pre-training, visualizing masked language modeling (MLM) and next sentence prediction (NSP).
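To make masked language modeling concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline with a bert-base-uncased checkpoint (the example sentence is arbitrary):

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Print the top three candidate fillers and their probabilities
for pred in fill_mask("The capital of France is [MASK].")[:3]:
    print(f"{pred['token_str']:>10}  p={pred['score']:.3f}")
```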
BERT Architecture Variants
- BERT-base: 12 transformer layers, 12 attention heads, 768 hidden dimensions (110M parameters)
- BERT-large: 24 transformer layers, 16 attention heads, 1024 hidden dimensions (340M parameters)
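The parameter counts quoted above can be checked directly by loading a checkpoint and summing tensor sizes; a quick sketch (the exact count lands close to, not exactly on, the rounded figure):

```python
from transformers import BertModel

# Count all weights in the embedding and encoder layers of BERT-base
model = BertModel.from_pretrained("bert-base-uncased")
num_params = sum(p.numel() for p in model.parameters())
print(f"bert-base-uncased: {num_params / 1e6:.0f}M parameters")  # roughly 110M
```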
BERT's Impact and Applications
BERT excels in a wide range of understanding tasks:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis
- Natural language inference
The Fine-tuning Paradigm
BERT introduced a new two-step approach that has become standard:
- Pre-training on vast amounts of unlabeled text using self-supervised objectives
- Fine-tuning the pre-trained model on specific downstream tasks with labeled data
This approach dramatically reduced the amount of task-specific labeled data needed.
RoBERTa: Robustly Optimized BERT Approach
RoBERTa, introduced by Facebook AI in 2019, showed that BERT was significantly undertrained. It maintains BERT's architecture but introduces several training improvements.
RoBERTa's Improvements Over BERT
- More data and longer training: Using 10 times more data and computing power
- Larger batches: 8K vs. 256 examples per batch
- Dynamic masking: Generating new masked patterns every time a sequence is encountered
- Removing NSP: Focusing only on the masked language modeling task
- Longer sequences: Training on sequences of up to 512 tokens
These seemingly minor changes led to significantly better performance, highlighting the importance of training methodology.
Aspect | BERT | RoBERTa |
---|---|---|
Training Data | 16GB (BookCorpus + Wikipedia) | 160GB (Including CC-News, OpenWebText, Stories) |
Batch Size | 256 sequences | 8,000 sequences |
Training Steps | 1,000,000 steps | 500,000 steps (but larger batches) |
Masking Strategy | Static (masked once during preprocessing) | Dynamic (masked differently each epoch) |
Pre-training Tasks | MLM + NSP | MLM only |
Max Sequence Length | 512 tokens (but often 128) | 512 tokens throughout training |
GLUE Benchmark | 82.2% | 88.5% |
Other Notable Encoder-Only Innovations
- ALBERT: Parameter reduction techniques (shared layers, factorized embedding)
- DistilBERT: Knowledge distillation for a smaller, faster model
- DeBERTa: Disentangled attention mechanism and enhanced mask decoder
- ELECTRA: Replaced MLM with a more efficient token detection objective
Decoder-Only Models: Generating Language
GPT: Generative Pre-trained Transformer
The GPT family, starting with the original GPT in 2018 by OpenAI, showcased the power of the transformer decoder for text generation.
Key Characteristics of GPT Models
- Autoregressive generation: Models the probability of a token given previous tokens
- Unidirectional (causal) attention: Each token can attend only to itself and earlier tokens, enforced by the lower-triangular mask sketched below
- Generative capabilities: Optimized for producing coherent, fluent text
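A minimal sketch of that causal mask, assuming PyTorch and a toy 5-token sequence (the scores are random placeholders):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular mask: position i may attend to positions 0..i only
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(5, 5)                                   # raw attention scores for 5 tokens
masked = scores.masked_fill(~causal_mask(5), float("-inf"))  # block attention to future positions
weights = torch.softmax(masked, dim=-1)                      # future positions receive zero weight
print(weights)
```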
The GPT Evolution: Demonstrating Scaling Laws
GPT-2 showed that scaling up the model (from 117M to 1.5B parameters) and training data led to surprising emergent abilities:
- Better long-range coherence
- Improved factual knowledge
- Ability to perform simple reasoning
GPT-3: Emergence of Few-Shot Learning
GPT-3 (175B parameters) demonstrated a remarkable new capability: few-shot learning through in-context examples.
Input Example | Expected Output | Model Response |
---|---|---|
I loved this movie, it was fantastic! | Positive | Positive (94% confidence) |
Terrible service and the food was cold. | Negative | Negative (97% confidence) |
The experience was neither good nor bad. | Neutral | Neutral (88% confidence) |
The concert exceeded all my expectations, what a night! | ? | Positive (96% confidence) |
Note: In this few-shot demonstration, the model is shown examples 1-3 in its prompt and then predicts the sentiment of example 4 without any parameter updates.
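Below is a hedged sketch of how such an in-context prompt can be assembled from the labeled examples in the table. The prompt template is an assumption, and gpt2 is used only as a locally runnable stand-in; models of GPT-3 scale are needed for reliable few-shot behavior.

```python
from transformers import pipeline

# Build an in-context (few-shot) prompt from labeled examples
examples = [
    ("I loved this movie, it was fantastic!", "Positive"),
    ("Terrible service and the food was cold.", "Negative"),
    ("The experience was neither good nor bad.", "Neutral"),
]
query = "The concert exceeded all my expectations, what a night!"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"

# Any autoregressive LM can consume this prompt; here a small local model for illustration
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=2, do_sample=False)[0]["generated_text"])
```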
The Impact of Scaling Laws
Research by Kaplan et al. (2020) revealed predictable scaling laws in language models that fundamentally changed how we think about model development:
- Power Law Relationship: Test loss falls as a power law in model size, so each 10x increase in parameters yields a consistent relative improvement whose absolute size keeps shrinking
- Measurable Improvements: Language model loss decreases from 2.5 (1M parameters) to 1.1 (1T parameters), a 56% relative improvement
- Predictable Scaling: This relationship allows researchers to predict performance gains from increasing model size
This discovery enabled researchers to make strategic trade-offs between model size, dataset size, and compute resources, leading to the rapid evolution of increasingly capable language models.
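Concretely, Kaplan et al. fit validation loss as a power law in the number of non-embedding parameters N. The exponent below is the approximate value reported in the paper; the loss figures quoted above are illustrative and come from a different fit:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
$$

so every 10x increase in N multiplies the loss by roughly 10^(-0.076) ≈ 0.84, a steady relative gain that diminishes in absolute terms.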
Figure: Foundational model scaling, 2018-2023, with panels for model size evolution (some parameter counts estimated) and performance evolution.
Key Insights
- Early transformer models established the architectural foundation
- Parameter scaling showed dramatic improvements in capabilities
- GPT-3 demonstrated emergent few-shot learning abilities
- Context length increased from 512 to 32k+ tokens
The figure above shows how transformer models dramatically scaled in size from GPT to GPT-4, with corresponding improvements in performance following predictable scaling laws.
Encoder-Decoder Models: Transforming Language
T5: Text-to-Text Transfer Transformer
T5, introduced by Google in late 2019, returned to the full encoder-decoder architecture, but with a crucial insight: all NLP tasks can be framed as text-to-text problems.
The Text-to-Text Framework
T5 reformulates every NLP task into the same format:
- Input: Task-specific prefix + original text
- Output: Target text
Figure: T5's text-to-text framing demonstrated on different NLP tasks, such as translation and summarization.
T5 Variants and Training
The T5 work ran extensive ablations to find the best training setup:
- T5-Small to T5-11B: A range of model sizes from 60M to 11B parameters
- Extensive pre-training: On the large C4 dataset (Colossal Clean Crawled Corpus)
- Multiple objectives tested: Vanilla language modeling, corrupted span prediction, etc.
The final T5 approach used a form of span corruption where randomly selected spans of text were replaced with sentinel tokens that the model had to reconstruct.
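A minimal sketch of what such a corrupted training pair looks like. The sentence and span choices are hand-picked for illustration, and the sentinel names follow the Hugging Face convention (<extra_id_N>):

```python
# Toy illustration of T5-style span corruption on a whitespace-tokenized sentence
tokens = "Thank you for inviting me to your party last week".split()

spans = [(2, 4), (8, 9)]          # token ranges to drop: "for inviting" and "last"
inputs, targets = [], []
prev_end, sentinel = 0, 0
for start, end in spans:
    inputs += tokens[prev_end:start] + [f"<extra_id_{sentinel}>"]
    targets += [f"<extra_id_{sentinel}>"] + tokens[start:end]
    prev_end, sentinel = end, sentinel + 1
inputs += tokens[prev_end:]
targets += [f"<extra_id_{sentinel}>"]

print("input: ", " ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week
print("target:", " ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model reads the corrupted input and must emit the target, so a single denoising objective exercises both understanding and generation.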
BART: Bidirectional and Auto-Regressive Transformers
BART, introduced by Facebook AI in 2019, combines the bidirectional encoding of BERT with the autoregressive decoding of GPT.
BART's Innovative Pre-training
BART is pre-trained by:
- Corrupting documents with an arbitrary noising function
- Learning to reconstruct the original document
This allowed BART to explore various noising approaches:
- Token masking (like BERT)
- Token deletion
- Text infilling (multiple tokens replaced with a single mask)
- Sentence permutation
- Document rotation
BART's Flexibility
BART excels at a diverse set of tasks:
- Sequence classification
- Token classification
- Sequence generation
- Machine translation
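As a quick illustration of BART on one of these tasks, here is a minimal summarization sketch. It assumes the publicly available facebook/bart-large-cnn checkpoint (a BART model fine-tuned on news summarization); the generation settings are placeholders:

```python
from transformers import pipeline

# BART fine-tuned for summarization; the pipeline handles tokenization and generation
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = ("BART is pre-trained by corrupting documents with a noising function and "
        "learning to reconstruct the original text, which makes it a strong starting "
        "point for sequence generation tasks such as summarization.")
print(summarizer(text, max_length=30, min_length=10, do_sample=False)[0]["summary_text"])
```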
Comparing the Three Paradigms
Architecture | Pre-training Objective | Strengths | Weaknesses | Exemplar Models | Best For |
---|---|---|---|---|---|
Encoder-Only | Masked Language Modeling | Strong understanding of context and relationships | Limited generation capability | BERT, RoBERTa, DeBERTa | Classification, NER, Sentiment Analysis |
Decoder-Only | Autoregressive Language Modeling | Excellent text generation, emergent abilities at scale | Less effective for understanding context, inefficient for seq2seq tasks | GPT, GPT-2, GPT-3 | Open-ended generation, dialogue, creative writing |
Encoder-Decoder | Span corruption, denoising | Versatile, strong at sequence transformation tasks | More complex architecture, higher computational requirements | T5, BART, UL2 | Translation, Summarization, Question Answering |
Foundational Innovations Beyond the Basics
Parameter Efficiency Techniques
As models grew larger, researchers developed methods to make them more efficient:
- Parameter Sharing: ALBERT reduced parameters by sharing weights across layers
- Low-Rank Approximations: Compressing weight matrices with matrix factorization
- Knowledge Distillation: Training smaller "student" models to mimic larger "teacher" models (see the loss sketch after this list)
- Quantization: Reducing numerical precision without sacrificing significant performance
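As one example, here is a minimal sketch of the soft-target loss at the heart of knowledge distillation. The temperature, tensor shapes, and random logits are placeholders; DistilBERT combines this term with additional objectives:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened student and teacher distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(4, 2)   # e.g., a small student model, 2 classes, batch of 4
teacher_logits = torch.randn(4, 2)   # e.g., a BERT-base teacher
print(distillation_loss(student_logits, teacher_logits))
```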
Attention Mechanism Improvements
The core attention mechanism also evolved during this foundational period:
- Sparse Attention (Longformer, BigBird): Attending to select tokens rather than all
- Linear Attention (Linformer, Performer): Reducing complexity from O(n²) to O(n)
- Local+Global Attention (Longformer, BigBird): Combining a sliding local window with a small set of global tokens (a minimal local-window mask is sketched below)
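A minimal sketch of the sliding-window (local) attention mask behind Longformer-style sparse attention, assuming PyTorch; the sequence length and window size are arbitrary:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Local attention mask: token i may attend only to tokens within `window` positions."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

print(sliding_window_mask(seq_len=8, window=2).int())
# Full attention costs O(n^2); a fixed window keeps the cost at O(n * window)
```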
Figure: Common attention patterns in transformer models, including self-attention and cross-attention.
Extending Context Length
Early attempts to extend context windows included:
- Recurrence Mechanisms (Transformer-XL): Using memory of previous segments
- Linear Attention Biases (ALiBi): Replacing positional embeddings with a distance-proportional penalty on attention scores, which extrapolates better to longer sequences (sketched below)
- Efficient Attention (Longformer, Performer): Making attention practical for longer sequences
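A hedged sketch of the ALiBi idea: a bias proportional to how far back each key position lies is added to the causal attention scores. The real method uses a fixed geometric schedule of per-head slopes; the single slope here is a placeholder:

```python
import torch

def alibi_bias(seq_len: int, slope: float = 0.25) -> torch.Tensor:
    """ALiBi-style bias: linearly penalize attention to more distant (earlier) positions."""
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # query index minus key index
    return -slope * distance.float()

# Added to raw attention scores before softmax, together with the causal mask
print(alibi_bias(5))
```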
Specialized Adaptations
Multilingual Models
- mBERT: Trained on Wikipedia in 104 languages
- XLM-R: Large multilingual model with improved cross-lingual transfer
- mT5: Multilingual version of T5 covering 101 languages
Domain-Specific Models
- BioBERT, ClinicalBERT: Specialized for biomedical text
- SciBERT: Targeted at scientific publications
- FinBERT: Optimized for financial text
- LegalBERT: Focused on legal documents
Implementation: Working with Foundational Models
Fine-tuning BERT for Classification
```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from datasets import load_dataset

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load dataset (e.g., IMDB sentiment analysis) and tokenize it
dataset = load_dataset('imdb')
tokenized = dataset.map(lambda b: tokenizer(b['text'], padding='max_length', truncation=True, max_length=256), batched=True)

# Minimal fine-tuning setup (hyperparameters are illustrative, not tuned)
args = TrainingArguments(output_dir='./results', num_train_epochs=1, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()
```
Text Generation with GPT-2
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate text with the model
prompt = "Artificial intelligence will transform society by"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Sample a continuation (sampling settings are illustrative)
output_ids = model.generate(input_ids, max_length=60, do_sample=True, top_p=0.9,
                            temperature=0.8, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Sequence-to-Sequence Tasks with T5
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example: Summarization
article = """
Researchers have developed a new machine learning model that can predict protein
folding with unprecedented accuracy.
"""

# T5 expects a task prefix; "summarize:" selects the summarization behavior
input_ids = tokenizer("summarize: " + article, return_tensors="pt", truncation=True).input_ids
summary_ids = model.generate(input_ids, max_length=50, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
Task-to-Architecture Matching
Foundational Principles
| Task | Encoder-Only | Decoder-Only | Encoder-Decoder | Preferred Architecture |
|---|---|---|---|---|
| Text Classification | ✓ | ✓ | | Encoder-Only |
| Named Entity Recognition | ✓ | | | Encoder-Only |
| Text Generation | | ✓ | ✓ | Decoder-Only |
| Machine Translation | | | ✓ | Encoder-Decoder |
| Summarization | | ✓ | ✓ | Encoder-Decoder |
| Question Answering | ✓ | | ✓ | Depends on Type |
| Dialog Systems | | ✓ | ✓ | Decoder-Only |
Practical Considerations for Architecture Choice
When choosing a foundational model, consider:
- Computational resources: Training and inference costs
- Data availability: Amount of labeled data for fine-tuning
- Latency requirements: Real-time vs. batch processing
- Task specificity: Understanding vs. generation vs. transformation
- Pre-training alignment: How well the pre-training objective matches your task
Summary
In this lesson, we've covered:
- The foundational evolution of transformer architectures into encoder-only, decoder-only, and encoder-decoder variants
- Key milestone models including BERT, GPT-3, T5, and their innovations
- Scaling laws and the principles that guided early model development
- Architectural trade-offs and how they align with different NLP tasks
- Implementation approaches for working with foundational model types
- Design principles that continue to influence modern architecture choices
Understanding this foundational evolution provides the context needed to appreciate modern innovations and make informed decisions about architecture selection. These core principles continue to guide transformer development even as new innovations emerge.
Practice Exercises
- Comparative Analysis:
  - Fine-tune BERT, GPT-2, and T5 on the same classification task
  - Compare performance, training time, and resource requirements
  - Analyze which aspects of each architecture contribute to differences in performance
- Architecture Adaptation:
  - Implement a parameter-efficient fine-tuning approach (adapters, etc.)
  - Compare it to full fine-tuning on a downstream task
  - Measure the trade-offs in performance vs. efficiency
- Task Reformulation with T5:
  - Take an NLP task and reformulate it as a text-to-text problem
  - Implement a solution using T5's framework
  - Compare with a traditional approach using separate models
- Scaling Law Exploration:
  - Train models of different sizes on the same task
  - Plot performance vs. parameter count
  - Analyze how well the results match theoretical scaling laws
Additional Resources
- BERT Paper: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT-3 Paper: Language Models are Few-Shot Learners
- T5 Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- BART Paper: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Scaling Laws for Neural Language Models
- The Illustrated Transformer by Jay Alammar
- Parameter-Efficient Transfer Learning for NLP
- Hugging Face Transformers Library