Evolution of Transformer Models: From BERT to GPT-4

Overview

In our previous lessons, we explored the transformer architecture and various sampling techniques for text generation. Now, we'll trace the foundational evolutionary journey of transformer models that revolutionized NLP from 2018 to 2023.

This lesson examines how the original encoder-decoder transformer architecture branched into specialized variants—encoder-only, decoder-only, and encoder-decoder approaches—each optimized for different tasks. We'll analyze milestone models like BERT, GPT, T5, and understand the key insights that drove this foundational evolution leading up to the modern era.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the architectural differences between encoder-only, decoder-only, and encoder-decoder models
  • Explain the innovations and key contributions of foundational models (BERT, GPT-3, T5, etc.)
  • Compare the strengths and weaknesses of different transformer variants
  • Recognize the relationship between model architecture and NLP task suitability
  • Identify key trends in the foundational evolution of transformer models
  • Apply this knowledge to understand the principles behind architectural choices

The Transformer Family Tree

From General to Specialized Architectures

The original transformer model (Vaswani et al., 2017) introduced a general encoder-decoder architecture for sequence-to-sequence tasks. Since then, transformer models evolved along three main branches:

  1. Encoder-only models (e.g., BERT, RoBERTa): Specialize in understanding language
  2. Decoder-only models (e.g., GPT, GPT-3): Focus on generating language
  3. Encoder-decoder models (e.g., T5, BART): Maintain the full architecture for sequence transformation

Transformer Family Tree

[Figure: Evolution of transformer-based models over time. The original Transformer (2017) branches into three lines: encoder-only (BERT, RoBERTa, DeBERTa, DistilBERT), decoder-only (GPT, GPT-2, GPT-3, GPT-4), and encoder-decoder (T5, BART, Pegasus, mT5). Key models have highlighted borders; connections show architectural evolution over time.]

Analogy: Specialized Tools vs. Swiss Army Knife

Think of the evolution of transformer models like the evolution of tools:

  • The original transformer was like a Swiss Army knife: versatile, but not optimized for any specific task
  • Encoder-only models are like specialized reading glasses: excellent for understanding text but poor at creating it
  • Decoder-only models are like high-quality pens: designed primarily for creating content
  • Encoder-decoder models are like advanced translation devices: optimized for converting one form of text to another

Just as a professional craftsperson selects specific tools for different jobs, NLP systems select transformer variants optimized for particular tasks.

Encoder-Only Models: Understanding Language

BERT: Bidirectional Encoder Representations from Transformers

BERT, introduced by Google in 2018, was a breakthrough that fundamentally changed NLP. It uses only the encoder portion of the transformer architecture but adds two innovative pre-training tasks.

Key Innovations in BERT

  1. Bidirectional attention: Unlike previous models that processed text left-to-right or right-to-left, BERT attends to the entire context simultaneously
  2. Masked Language Modeling (MLM): Randomly masks 15% of tokens and trains the model to predict them
  3. Next Sentence Prediction (NSP): Trains the model to determine if two sentences follow each other in the original text

BERT Pretraining Visualizer

[Figure: Visualization of BERT's masked language modeling and next sentence prediction.]

Masked Language Modeling (MLM) example:

  • Input: "The [MASK] is a language model"
  • Prediction for [MASK]: transformer (92%)

Next Sentence Prediction (NSP) example:

  • Sentence A: The cat sat on the mat.
  • Sentence B: It was comfortable there.
  • Prediction: IsNextSentence (87%)
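To make the MLM idea concrete, a fill-mask prediction like the one above can be reproduced with the Hugging Face pipeline API; this is a minimal illustrative sketch, and the exact tokens and scores will differ from the figure:

```python
from transformers import pipeline

# Fill-mask pipeline backed by pre-trained BERT (includes the MLM head)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model proposes candidates for it
for pred in fill_mask("The [MASK] is a language model."):
    print(f"{pred['token_str']}: {pred['score']:.2%}")
```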

BERT Architecture Variants

  • BERT-base: 12 transformer layers, 12 attention heads, 768 hidden dimensions (110M parameters)
  • BERT-large: 24 transformer layers, 16 attention heads, 1024 hidden dimensions (340M parameters)
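As a quick check, these parameter counts can be verified by loading each checkpoint and calling num_parameters(); a small sketch (it downloads the weights, and the exact totals differ slightly from the rounded figures above):

```python
from transformers import BertModel

# Report the parameter count of each BERT variant
for checkpoint in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(checkpoint)
    print(f"{checkpoint}: {model.num_parameters() / 1e6:.0f}M parameters")
```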

BERT's Impact and Applications

BERT excels in a wide range of understanding tasks:

  • Text classification
  • Named entity recognition
  • Question answering
  • Sentiment analysis
  • Natural language inference

The Fine-tuning Paradigm

BERT introduced a new two-step approach that has become standard:

  1. Pre-training on vast amounts of unlabeled text using self-supervised objectives
  2. Fine-tuning the pre-trained model on specific downstream tasks with labeled data

This approach dramatically reduced the amount of task-specific labeled data needed.

RoBERTa: Robustly Optimized BERT Approach

RoBERTa, introduced by Facebook AI in 2019, showed that BERT was significantly undertrained. It maintains BERT's architecture but introduces several training improvements.

RoBERTa's Improvements Over BERT

  1. More data and longer training: Using 10 times more data and computing power
  2. Larger batches: 8K vs. 256 examples per batch
  3. Dynamic masking: Generating new masked patterns every time a sequence is encountered
  4. Removing NSP: Focusing only on the masked language modeling task
  5. Longer sequences: Training on sequences of up to 512 tokens

These seemingly minor changes led to significantly better performance, highlighting the importance of training methodology.
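In the Hugging Face ecosystem, RoBERTa-style dynamic masking corresponds to masking inside the data collator at batch time rather than once during preprocessing; a minimal sketch, assuming the standard 15% masking probability:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Masks are sampled when each batch is built, so the same sentence
# receives a different mask pattern every time it is drawn.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("Dynamic masking changes every epoch.", return_tensors="pt")
features = [{k: v[0] for k, v in encoded.items()}]
print(collator(features)["input_ids"])  # masked positions usually differ
print(collator(features)["input_ids"])  # between these two calls
```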

| Aspect | BERT | RoBERTa |
| --- | --- | --- |
| Training Data | 16GB (BookCorpus + Wikipedia) | 160GB (including CC-News, OpenWebText, Stories) |
| Batch Size | 256 sequences | 8,000 sequences |
| Training Steps | 1,000,000 steps | 500,000 steps (but larger batches) |
| Masking Strategy | Static (masked once during preprocessing) | Dynamic (masked differently each epoch) |
| Pre-training Tasks | MLM + NSP | MLM only |
| Max Sequence Length | 512 tokens (but often 128) | 512 tokens throughout training |
| GLUE Benchmark | 82.2% | 88.5% |

Other Notable Encoder-Only Innovations

  • ALBERT: Parameter reduction techniques (shared layers, factorized embedding)
  • DistilBERT: Knowledge distillation for a smaller, faster model
  • DeBERTa: Disentangled attention mechanism and enhanced mask decoder
  • ELECTRA: Replaced MLM with a more efficient token detection objective

Decoder-Only Models: Generating Language

GPT: Generative Pre-trained Transformer

The GPT family, starting with the original GPT in 2018 by OpenAI, showcased the power of the transformer decoder for text generation.

Key Characteristics of GPT Models

  1. Autoregressive generation: Models the probability of a token given previous tokens
  2. Unidirectional attention: Each token can only attend to previous tokens (causal attention; see the mask sketch after this list)
  3. Generative capabilities: Optimized for producing coherent, fluent text
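A minimal sketch of the causal mask behind point 2 above, using plain PyTorch (illustrative only, not tied to any particular GPT implementation):

```python
import torch

seq_len = 5

# Lower-triangular mask: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Disallowed positions are set to -inf before the softmax
scores = torch.randn(seq_len, seq_len)                 # raw attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                # rows sum to 1 over allowed positions
print(weights)
```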

The GPT Evolution: Demonstrating Scaling Laws

GPT-2 showed that scaling up the model (from 117M to 1.5B parameters) and training data led to surprising emergent abilities:

  • Better long-range coherence
  • Improved factual knowledge
  • Ability to perform simple reasoning

GPT-3: Emergence of Few-Shot Learning

GPT-3 (175B parameters) demonstrated a remarkable new capability: few-shot learning through in-context examples.

| Input Example | Expected Output | Model Response |
| --- | --- | --- |
| I loved this movie, it was fantastic! | Positive | Positive (94% confidence) |
| Terrible service and the food was cold. | Negative | Negative (97% confidence) |
| The experience was neither good nor bad. | Neutral | Neutral (88% confidence) |
| The concert exceeded all my expectations, what a night! | ? | Positive (96% confidence) |

Note: Few-shot learning demonstration: The model is shown examples 1-3 and then predicts the sentiment of example 4 without explicit training.
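In practice, few-shot prompting simply packs the labeled examples into the prompt itself. The sketch below uses a small local GPT-2 model as a stand-in generator, since GPT-3 is only available via API; GPT-2 is far weaker, so treat its output as illustrative only:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# In-context prompt: three labeled examples followed by the query
prompt = (
    "Review: I loved this movie, it was fantastic!\nSentiment: Positive\n"
    "Review: Terrible service and the food was cold.\nSentiment: Negative\n"
    "Review: The experience was neither good nor bad.\nSentiment: Neutral\n"
    "Review: The concert exceeded all my expectations, what a night!\nSentiment:"
)

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=2, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))  # generated label only
```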

The Impact of Scaling Laws

Research by Kaplan et al. (2020) revealed predictable scaling laws in language models that fundamentally changed how we think about model development:

  • Power Law Relationship: As model size increases by 10x, performance improves at a consistent but diminishing rate
  • Measurable Improvements: Language model loss decreases from 2.5 (1M parameters) to 1.1 (1T parameters), a 56% relative improvement
  • Predictable Scaling: This relationship allows researchers to predict performance gains from increasing model size

This discovery enabled researchers to make strategic trade-offs between model size, dataset size, and compute resources, leading to the rapid evolution of increasingly capable language models.
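The parameter-count scaling law from Kaplan et al. is often written as L(N) ≈ (N_c / N)^α_N. The sketch below plugs in the paper's approximate fitted constants (α_N ≈ 0.076, N_c ≈ 8.8 × 10^13); these values and the resulting losses should be read as rough illustrations, not exact predictions:

```python
# Approximate parameter-count scaling law, L(N) ~ (N_c / N) ** alpha_N,
# with the (approximate) constants reported by Kaplan et al. (2020).
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(num_params: float) -> float:
    """Predicted language-model loss for a model with num_params parameters."""
    return (N_C / num_params) ** ALPHA_N

for name, n in [("GPT (117M)", 117e6), ("GPT-2 (1.5B)", 1.5e9),
                ("GPT-3 (175B)", 175e9), ("1T (est.)", 1e12)]:
    print(f"{name:>12}: predicted loss ≈ {predicted_loss(n):.2f}")
```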

Foundational Model Scaling

Evolution of foundational transformer models 2018-2023

Model Size Evolution

  • GPT (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • GPT-4 (2023): ~1T parameters (estimated)

Performance Evolution

[Figure: benchmark performance (0-100 scale) rising steadily from GPT through GPT-4.]

Key Insights

  • Early transformer models established the architectural foundation
  • Parameter scaling showed dramatic improvements in capabilities
  • GPT-3 demonstrated emergent few-shot learning abilities
  • Context length increased from 512 to 32k+ tokens

The figure above shows how transformer models dramatically scaled in size from GPT to GPT-4, with corresponding improvements in performance following predictable scaling laws.

Encoder-Decoder Models: Transforming Language

T5: Text-to-Text Transfer Transformer

T5, introduced by Google in 2020, returned to the full encoder-decoder architecture, but with a crucial insight: all NLP tasks can be framed as text-to-text problems.

The Text-to-Text Framework

T5 reformulates every NLP task into the same format:

  • Input: Task-specific prefix + original text
  • Output: Target text

T5 Task Demonstrator

Demonstration of T5's text-to-text approach for different NLP tasks

Translation Task

Input:
translate English to German: The house is blue.
Output:
Das Haus ist blau.

Summarization Task

Input:
summarize: The transformer model was introduced in 2017. It uses self-attention mechanism instead of recurrence. This allows for more parallelization.
Output:
Transformer model (2017) replaces recurrence with self-attention for better parallelization.

T5 Variants and Training

T5 was extensively ablated to find optimal training procedures:

  • T5-Small to T5-11B: A range of model sizes from 60M to 11B parameters
  • Extensive pre-training: On the large C4 (Colossal Clean Crawled Corpus)
  • Multiple objectives tested: Vanilla language modeling, corrupted span prediction, etc.

The final T5 approach used a form of span corruption where randomly selected spans of text were replaced with sentinel tokens that the model had to reconstruct.
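The sentinel-token format is easy to show concretely; the example below is hand-constructed for illustration (T5's actual training code samples spans randomly), using the reserved <extra_id_*> tokens from the T5 vocabulary:

```python
# Hand-built illustration of T5-style span corruption
original = "The transformer model uses self-attention instead of recurrence."

# Encoder input: selected spans are replaced by sentinel tokens
corrupted_input = "The <extra_id_0> uses self-attention instead of <extra_id_1>"

# Decoder target: each sentinel followed by the text it replaced,
# terminated by a final sentinel
target = "<extra_id_0> transformer model <extra_id_1> recurrence. <extra_id_2>"

print(corrupted_input)
print(target)
```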

BART: Bidirectional and Auto-Regressive Transformers

BART, introduced by Facebook AI in 2019, combines the bidirectional encoding of BERT with the autoregressive decoding of GPT.

BART's Innovative Pre-training

BART is pre-trained by:

  1. Corrupting documents with an arbitrary noising function
  2. Learning to reconstruct the original document

This allowed BART to explore various noising approaches (two of them are sketched after this list):

  • Token masking (like BERT)
  • Token deletion
  • Text infilling (multiple tokens replaced with a single mask)
  • Sentence permutation
  • Document rotation
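Two of these corruptions are simple enough to mimic on a token list; the helper functions below are illustrative stand-ins, not BART's actual preprocessing code:

```python
import random

def token_deletion(tokens: list[str], p: float = 0.15) -> list[str]:
    """Randomly drop tokens; the model must infer where text is missing."""
    return [t for t in tokens if random.random() > p]

def text_infilling(tokens: list[str], start: int, length: int) -> list[str]:
    """Replace a multi-token span with a single <mask> token."""
    return tokens[:start] + ["<mask>"] + tokens[start + length:]

tokens = "The transformer model was introduced in 2017".split()
print(token_deletion(tokens))
print(text_infilling(tokens, start=1, length=2))
```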

BART's Flexibility

BART excels at a diverse set of tasks:

  • Sequence classification
  • Token classification
  • Sequence generation
  • Machine translation

Comparing the Three Paradigms

| Architecture | Pre-training Objective | Strengths | Weaknesses | Exemplar Models | Best For |
| --- | --- | --- | --- | --- | --- |
| Encoder-Only | Masked Language Modeling | Strong understanding of context and relationships | Limited generation capability | BERT, RoBERTa, DeBERTa | Classification, NER, Sentiment Analysis |
| Decoder-Only | Autoregressive Language Modeling | Excellent text generation, emergent abilities at scale | Less effective for understanding context, inefficient for seq2seq tasks | GPT, GPT-2, GPT-3 | Open-ended generation, dialogue, creative writing |
| Encoder-Decoder | Span corruption, denoising | Versatile, strong at sequence transformation tasks | More complex architecture, higher computational requirements | T5, BART, UL2 | Translation, Summarization, Question Answering |

Foundational Innovations Beyond the Basics

Parameter Efficiency Techniques

As models grew larger, researchers developed methods to make them more efficient (a quantization sketch follows this list):

  1. Parameter Sharing: ALBERT reduced parameters by sharing weights across layers
  2. Low-Rank Approximations: Compressing weight matrices with matrix factorization
  3. Knowledge Distillation: Training smaller "student" models to mimic larger "teacher" models
  4. Quantization: Reducing numerical precision without sacrificing significant performance
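As a concrete example of technique 4, PyTorch's dynamic quantization converts a model's linear layers to int8 at load time; a minimal sketch, assuming an already fine-tuned BERT classifier (the accuracy impact should be measured on your own task):

```python
import io
import torch
from transformers import BertForSequenceClassification

# Load a (hypothetically fine-tuned) BERT classifier in full precision
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Replace nn.Linear layers with int8 dynamically quantized versions
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_size_mb(m: torch.nn.Module) -> float:
    """Approximate model size by serializing its state dict to memory."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"original:  {serialized_size_mb(model):.0f} MB")
print(f"quantized: {serialized_size_mb(quantized):.0f} MB")
```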

Attention Mechanism Improvements

The core attention mechanism also evolved during this foundational period:

  1. Sparse Attention (Longformer, BigBird): Attending to select tokens rather than all
  2. Linear Attention (Linformer, Performer): Reducing complexity from O(n²) to O(n)
  3. Local+Global Attention (Longformer, BigBird): Combining local context with global tokens

Attention Pattern Visualizer

[Figure: Common attention patterns in transformer models. The self-attention panel shows each token of "The cat sat on the mat" attending most strongly to itself and nearby preceding tokens; the cross-attention panel shows the German output "Das Haus ist sehr blau" attending most strongly to the aligned English words in "The house is very blue".]
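To make the sparse-attention idea from the list above concrete, here is a minimal sketch of a Longformer-style sliding-window mask; the window size is arbitrary, and real implementations add global tokens and custom kernels:

```python
import torch

seq_len, window = 8, 2  # each token attends to itself and 2 neighbors per side

positions = torch.arange(seq_len)
# True where |i - j| <= window: a banded (sparse) attention pattern
band_mask = (positions[None, :] - positions[:, None]).abs() <= window
print(band_mask.int())

# Full attention scores grow as seq_len**2; the band grows only linearly
print("full entries:", seq_len ** 2, "band entries:", int(band_mask.sum()))
```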

Extending Context Length

Early attempts to extend context windows included:

  1. Recurrence Mechanisms (Transformer-XL): Using memory of previous segments
  2. Linear position biases (ALiBi): Penalizing attention scores in proportion to token distance so models extrapolate to sequences longer than those seen in training
  3. Efficient Attention (Longformer, Performer): Making attention practical for longer sequences

Specialized Adaptations

Multilingual Models

  • mBERT: Trained on Wikipedia in 104 languages
  • XLM-R: Large multilingual model with improved cross-lingual transfer
  • mT5: Multilingual version of T5 covering 101 languages

Domain-Specific Models

  • BioBERT, ClinicalBERT: Specialized for biomedical text
  • SciBERT: Targeted at scientific publications
  • FinBERT: Optimized for financial text
  • LegalBERT: Focused on legal documents

Implementation: Working with Foundational Models

Fine-tuning BERT for Classification

```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load dataset (e.g., IMDB sentiment analysis) and tokenize it
dataset = load_dataset('imdb')
tokenized = dataset.map(
    lambda batch: tokenizer(batch['text'], padding='max_length', truncation=True),
    batched=True)

# Fine-tune with the Trainer API (settings here are illustrative)
args = TrainingArguments(output_dir='./results', num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()
```

Text Generation with GPT-2

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate text with the model
prompt = "Artificial intelligence will transform society by"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Sample a continuation (sampling settings are illustrative)
output = model.generate(input_ids, max_length=60, do_sample=True, top_p=0.95,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Sequence-to-Sequence Tasks with T5

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example: Summarization
article = """Researchers have developed a new machine learning model that can
predict protein folding with unprecedented accuracy."""

# T5 expects a task prefix; "summarize:" selects summarization behavior
input_ids = tokenizer("summarize: " + article, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Task-to-Architecture Matching

Foundational Principles

| Task | Preferred Architecture |
| --- | --- |
| Text Classification | Encoder-Only |
| Named Entity Recognition | Encoder-Only |
| Text Generation | Decoder-Only |
| Machine Translation | Encoder-Decoder |
| Summarization | Encoder-Decoder |
| Question Answering | Depends on Type |
| Dialog Systems | Decoder-Only |

Practical Considerations for Architecture Choice

When choosing a foundational model, consider:

  1. Computational resources: Training and inference costs
  2. Data availability: Amount of labeled data for fine-tuning
  3. Latency requirements: Real-time vs. batch processing
  4. Task specificity: Understanding vs. generation vs. transformation
  5. Pre-training alignment: How well the pre-training objective matches your task

Summary

In this lesson, we've covered:

  1. The foundational evolution of transformer architectures into encoder-only, decoder-only, and encoder-decoder variants
  2. Key milestone models including BERT, GPT-3, T5, and their innovations
  3. Scaling laws and the principles that guided early model development
  4. Architectural trade-offs and how they align with different NLP tasks
  5. Implementation approaches for working with foundational model types
  6. Design principles that continue to influence modern architecture choices

Understanding this foundational evolution provides the context needed to appreciate modern innovations and make informed decisions about architecture selection. These core principles continue to guide transformer development even as new innovations emerge.

Practice Exercises

  1. Comparative Analysis:

    • Fine-tune BERT, GPT-2, and T5 on the same classification task
    • Compare performance, training time, and resource requirements
    • Analyze which aspects of each architecture contribute to differences in performance
  2. Architecture Adaptation:

    • Implement a parameter-efficient fine-tuning approach (adapters, etc.)
    • Compare it to full fine-tuning on a downstream task
    • Measure the trade-offs in performance vs. efficiency
  3. Task Reformulation with T5:

    • Take an NLP task and reformulate it as a text-to-text problem
    • Implement a solution using T5's framework
    • Compare with a traditional approach using separate models
  4. Scaling Law Exploration:

    • Train models of different sizes on the same task
    • Plot performance vs. parameter count
    • Analyze how well the results match theoretical scaling laws

Additional Resources