Evolution of Transformer Models: From BERT to GPT-4

Overview

In our previous lessons, we explored the transformer architecture and various sampling techniques for text generation. Now, we'll trace the foundational evolutionary journey of transformer models that revolutionized NLP from 2018 to 2023.

This lesson examines how the original encoder-decoder transformer architecture branched into specialized variants—encoder-only, decoder-only, and encoder-decoder approaches—each optimized for different tasks. We'll analyze milestone models like BERT, GPT, T5, and understand the key insights that drove this foundational evolution leading up to the modern era.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the architectural differences between encoder-only, decoder-only, and encoder-decoder models
  • Explain the innovations and key contributions of foundational models (BERT, GPT-3, T5, etc.)
  • Compare the strengths and weaknesses of different transformer variants
  • Recognize the relationship between model architecture and NLP task suitability
  • Identify key trends in the foundational evolution of transformer models
  • Apply this knowledge to understand the principles behind architectural choices

The Transformer Family Tree

From General to Specialized Architectures

The original transformer model (Vaswani et al., 2017) introduced a general encoder-decoder architecture for sequence-to-sequence tasks. Since then, transformer models evolved along three main branches:

  1. Encoder-only models (e.g., BERT, RoBERTa): Specialize in understanding language
  2. Decoder-only models (e.g., GPT, GPT-3): Focus on generating language
  3. Encoder-decoder models (e.g., T5, BART): Maintain the full architecture for sequence transformation

Transformer Family Tree

[Figure: Evolution of transformer-based models over time. The original Transformer (2017) branches into three lines: encoder-only (BERT, RoBERTa, DeBERTa, DistilBERT), decoder-only (GPT, GPT-2, GPT-3, GPT-4), and encoder-decoder (T5, BART, Pegasus, mT5). Key models have highlighted borders; connections show architectural evolution over time.]

Analogy: Specialized Tools vs. Swiss Army Knife

Think of the evolution of transformer models like the evolution of tools:

  • The original transformer was like a Swiss Army knife: versatile, but not optimized for any specific task
  • Encoder-only models are like specialized reading glasses: excellent for understanding text but poor at creating it
  • Decoder-only models are like high-quality pens: designed primarily for creating content
  • Encoder-decoder models are like advanced translation devices: optimized for converting one form of text to another

Just as a professional craftsperson selects specific tools for different jobs, NLP systems select transformer variants optimized for particular tasks.

Encoder-Only Models: Understanding Language

BERT: Bidirectional Encoder Representations from Transformers

BERT, introduced by Google in 2018, was a breakthrough that fundamentally changed NLP. It uses only the encoder portion of the transformer architecture but adds two innovative pre-training tasks.

Key Innovations in BERT

  1. Bidirectional attention: Unlike previous models that processed text left-to-right or right-to-left, BERT attends to the entire context simultaneously
  2. Masked Language Modeling (MLM): Randomly masks 15% of tokens and trains the model to predict them
  3. Next Sentence Prediction (NSP): Trains the model to determine if two sentences follow each other in the original text

BERT Pretraining Visualizer

[Figure: Visualization of BERT's masked language modeling and next sentence prediction.]

Masked Language Modeling (MLM) example:

  • Input: "The [MASK] is a language model"
  • Prediction for [MASK]: transformer (92%)

Next Sentence Prediction (NSP) example:

  • Sentence A: The cat sat on the mat.
  • Sentence B: It was comfortable there.
  • Prediction: IsNextSentence (87%)
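To make the MLM idea concrete, a fill-mask prediction like the one above can be reproduced with the Hugging Face pipeline API; this is a minimal illustrative sketch, and the exact tokens and scores will differ from the figure:

```python
from transformers import pipeline

# Fill-mask pipeline backed by pre-trained BERT (includes the MLM head)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model proposes candidates for it
for pred in fill_mask("The [MASK] is a language model."):
    print(f"{pred['token_str']}: {pred['score']:.2%}")
```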

BERT Architecture Variants

  • BERT-base: 12 transformer layers, 12 attention heads, 768 hidden dimensions (110M parameters)
  • BERT-large: 24 transformer layers, 16 attention heads, 1024 hidden dimensions (340M parameters)
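As a quick check, these parameter counts can be verified by loading each checkpoint and calling num_parameters(); a small sketch (it downloads the weights, and the exact totals differ slightly from the rounded figures above):

```python
from transformers import BertModel

# Report the parameter count of each BERT variant
for checkpoint in ["bert-base-uncased", "bert-large-uncased"]:
    model = BertModel.from_pretrained(checkpoint)
    print(f"{checkpoint}: {model.num_parameters() / 1e6:.0f}M parameters")
```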

BERT's Impact and Applications

BERT excels in a wide range of understanding tasks:

  • Text classification
  • Named entity recognition
  • Question answering
  • Sentiment analysis
  • Natural language inference

The Fine-tuning Paradigm

BERT introduced a new two-step approach that has become standard:

  1. Pre-training on vast amounts of unlabeled text using self-supervised objectives
  2. Fine-tuning the pre-trained model on specific downstream tasks with labeled data

This approach dramatically reduced the amount of task-specific labeled data needed.

RoBERTa: Robustly Optimized BERT Approach

RoBERTa, introduced by Facebook AI in 2019, showed that BERT was significantly undertrained. It maintains BERT's architecture but introduces several training improvements.

RoBERTa's Improvements Over BERT

  1. More data and longer training: Using 10 times more data and computing power
  2. Larger batches: 8K vs. 256 examples per batch
  3. Dynamic masking: Generating new masked patterns every time a sequence is encountered
  4. Removing NSP: Focusing only on the masked language modeling task
  5. Longer sequences: Training on sequences of up to 512 tokens

These seemingly minor changes led to significantly better performance, highlighting the importance of training methodology.
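In the Hugging Face ecosystem, RoBERTa-style dynamic masking corresponds to masking inside the data collator at batch time rather than once during preprocessing; a minimal sketch, assuming the standard 15% masking probability:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Masks are sampled when each batch is built, so the same sentence
# receives a different mask pattern every time it is drawn.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("Dynamic masking changes every epoch.", return_tensors="pt")
features = [{k: v[0] for k, v in encoded.items()}]
print(collator(features)["input_ids"])  # masked positions usually differ
print(collator(features)["input_ids"])  # between these two calls
```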

| Aspect | BERT | RoBERTa |
| --- | --- | --- |
| Training Data | 16GB (BookCorpus + Wikipedia) | 160GB (including CC-News, OpenWebText, Stories) |
| Batch Size | 256 sequences | 8,000 sequences |
| Training Steps | 1,000,000 steps | 500,000 steps (but larger batches) |
| Masking Strategy | Static (masked once during preprocessing) | Dynamic (masked differently each epoch) |
| Pre-training Tasks | MLM + NSP | MLM only |
| Max Sequence Length | 512 tokens (but often 128) | 512 tokens throughout training |
| GLUE Benchmark | 82.2% | 88.5% |

Other Notable Encoder-Only Innovations

  • ALBERT: Parameter reduction techniques (shared layers, factorized embedding)
  • DistilBERT: Knowledge distillation for a smaller, faster model
  • DeBERTa: Disentangled attention mechanism and enhanced mask decoder
  • ELECTRA: Replaced MLM with a more efficient token detection objective

Decoder-Only Models: Generating Language

GPT: Generative Pre-trained Transformer

The GPT family, starting with the original GPT in 2018 by OpenAI, showcased the power of the transformer decoder for text generation.

Key Characteristics of GPT Models

  1. Autoregressive generation: Models the probability of a token given previous tokens
  2. Unidirectional attention: Each token can only attend to previous tokens (causal attention; see the mask sketch after this list)
  3. Generative capabilities: Optimized for producing coherent, fluent text
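A minimal sketch of the causal mask behind point 2 above, using plain PyTorch (illustrative only, not tied to any particular GPT implementation):

```python
import torch

seq_len = 5

# Lower-triangular mask: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Disallowed positions are set to -inf before the softmax
scores = torch.randn(seq_len, seq_len)                 # raw attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                # rows sum to 1 over allowed positions
print(weights)
```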

The GPT Evolution: Demonstrating Scaling Laws

GPT-2 showed that scaling up the model (from 117M to 1.5B parameters) and training data led to surprising emergent abilities:

  • Better long-range coherence
  • Improved factual knowledge
  • Ability to perform simple reasoning

GPT-3: Emergence of Few-Shot Learning

GPT-3 (175B parameters) demonstrated a remarkable new capability: few-shot learning through in-context examples.

| Input Example | Expected Output | Model Response |
| --- | --- | --- |
| I loved this movie, it was fantastic! | Positive | Positive (94% confidence) |
| Terrible service and the food was cold. | Negative | Negative (97% confidence) |
| The experience was neither good nor bad. | Neutral | Neutral (88% confidence) |
| The concert exceeded all my expectations, what a night! | ? | Positive (96% confidence) |

Note: Few-shot learning demonstration: The model is shown examples 1-3 and then predicts the sentiment of example 4 without explicit training.
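In practice, few-shot prompting simply packs the labeled examples into the prompt itself. The sketch below uses a small local GPT-2 model as a stand-in generator, since GPT-3 is only available via API; GPT-2 is far weaker, so treat its output as illustrative only:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# In-context prompt: three labeled examples followed by the query
prompt = (
    "Review: I loved this movie, it was fantastic!\nSentiment: Positive\n"
    "Review: Terrible service and the food was cold.\nSentiment: Negative\n"
    "Review: The experience was neither good nor bad.\nSentiment: Neutral\n"
    "Review: The concert exceeded all my expectations, what a night!\nSentiment:"
)

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=2, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))  # generated label only
```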

The Impact of Scaling Laws

Research by Kaplan et al. (2020) revealed predictable scaling laws in language models that fundamentally changed how we think about model development:

  • Power Law Relationship: As model size increases by 10x, performance improves at a consistent but diminishing rate
  • Measurable Improvements: Language model loss decreases from 2.5 (1M parameters) to 1.1 (1T parameters), a 56% relative improvement
  • Predictable Scaling: This relationship allows researchers to predict performance gains from increasing model size

This discovery enabled researchers to make strategic trade-offs between model size, dataset size, and compute resources, leading to the rapid evolution of increasingly capable language models.
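The parameter-count scaling law from Kaplan et al. is often written as L(N) ≈ (N_c / N)^α_N. The sketch below plugs in the paper's approximate fitted constants (α_N ≈ 0.076, N_c ≈ 8.8 × 10^13); these values and the resulting losses should be read as rough illustrations, not exact predictions:

```python
# Approximate parameter-count scaling law, L(N) ~ (N_c / N) ** alpha_N,
# with the (approximate) constants reported by Kaplan et al. (2020).
ALPHA_N = 0.076
N_C = 8.8e13

def predicted_loss(num_params: float) -> float:
    """Predicted language-model loss for a model with num_params parameters."""
    return (N_C / num_params) ** ALPHA_N

for name, n in [("GPT (117M)", 117e6), ("GPT-2 (1.5B)", 1.5e9),
                ("GPT-3 (175B)", 175e9), ("1T (est.)", 1e12)]:
    print(f"{name:>12}: predicted loss ≈ {predicted_loss(n):.2f}")
```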

Foundational Model Scaling

Evolution of foundational transformer models 2018-2023

Model Size Evolution

  • GPT (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • GPT-4 (2023): ~1T parameters (estimated)

Performance Evolution

[Figure: benchmark performance (0-100 scale) rising steadily from GPT through GPT-4.]

Key Insights

  • Early transformer models established the architectural foundation
  • Parameter scaling showed dramatic improvements in capabilities
  • GPT-3 demonstrated emergent few-shot learning abilities
  • Context length increased from 512 to 32k+ tokens

The figure above shows how transformer models dramatically scaled in size from GPT to GPT-4, with corresponding improvements in performance following predictable scaling laws.

Encoder-Decoder Models: Transforming Language

T5: Text-to-Text Transfer Transformer

T5, introduced by Google in 2020, returned to the full encoder-decoder architecture, but with a crucial insight: all NLP tasks can be framed as text-to-text problems.

The Text-to-Text Framework

T5 reformulates every NLP task into the same format:

  • Input: Task-specific prefix + original text
  • Output: Target text

T5 Task Demonstrator

Demonstration of T5's text-to-text approach for different NLP tasks

Translation Task

Input:
translate English to German: The house is blue.
Output:
Das Haus ist blau.

Summarization Task

Input:
summarize: The transformer model was introduced in 2017. It uses self-attention mechanism instead of recurrence. This allows for more parallelization.
Output:
Transformer model (2017) replaces recurrence with self-attention for better parallelization.

T5 Variants and Training

T5 was extensively ablated to find optimal training procedures:

  • T5-Small to T5-11B: A range of model sizes from 60M to 11B parameters
  • Extensive pre-training: On the large C4 (Colossal Clean Crawled Corpus)
  • Multiple objectives tested: Vanilla language modeling, corrupted span prediction, etc.

The final T5 approach used a form of span corruption where randomly selected spans of text were replaced with sentinel tokens that the model had to reconstruct.
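The sentinel-token format is easy to show concretely; the example below is hand-constructed for illustration (T5's actual training code samples spans randomly), using the reserved <extra_id_*> tokens from the T5 vocabulary:

```python
# Hand-built illustration of T5-style span corruption
original = "The transformer model uses self-attention instead of recurrence."

# Encoder input: selected spans are replaced by sentinel tokens
corrupted_input = "The <extra_id_0> uses self-attention instead of <extra_id_1>"

# Decoder target: each sentinel followed by the text it replaced,
# terminated by a final sentinel
target = "<extra_id_0> transformer model <extra_id_1> recurrence. <extra_id_2>"

print(corrupted_input)
print(target)
```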

BART: Bidirectional and Auto-Regressive Transformers

BART, introduced by Facebook AI in 2019, combines the bidirectional encoding of BERT with the autoregressive decoding of GPT.

BART's Innovative Pre-training

BART is pre-trained by:

  1. Corrupting documents with an arbitrary noising function
  2. Learning to reconstruct the original document

This allowed BART to explore various noising approaches (two of them are sketched after this list):

  • Token masking (like BERT)
  • Token deletion
  • Text infilling (multiple tokens replaced with a single mask)
  • Sentence permutation
  • Document rotation
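Two of these corruptions are simple enough to mimic on a token list; the helper functions below are illustrative stand-ins, not BART's actual preprocessing code:

```python
import random

def token_deletion(tokens: list[str], p: float = 0.15) -> list[str]:
    """Randomly drop tokens; the model must infer where text is missing."""
    return [t for t in tokens if random.random() > p]

def text_infilling(tokens: list[str], start: int, length: int) -> list[str]:
    """Replace a multi-token span with a single <mask> token."""
    return tokens[:start] + ["<mask>"] + tokens[start + length:]

tokens = "The transformer model was introduced in 2017".split()
print(token_deletion(tokens))
print(text_infilling(tokens, start=1, length=2))
```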

BART's Flexibility

BART excels at a diverse set of tasks:

  • Sequence classification
  • Token classification
  • Sequence generation
  • Machine translation

Comparing the Three Paradigms

| Architecture | Pre-training Objective | Strengths | Weaknesses | Exemplar Models | Best For |
| --- | --- | --- | --- | --- | --- |
| Encoder-Only | Masked Language Modeling | Strong understanding of context and relationships | Limited generation capability | BERT, RoBERTa, DeBERTa | Classification, NER, Sentiment Analysis |
| Decoder-Only | Autoregressive Language Modeling | Excellent text generation, emergent abilities at scale | Less effective for understanding context, inefficient for seq2seq tasks | GPT, GPT-2, GPT-3 | Open-ended generation, dialogue, creative writing |
| Encoder-Decoder | Span corruption, denoising | Versatile, strong at sequence transformation tasks | More complex architecture, higher computational requirements | T5, BART, UL2 | Translation, Summarization, Question Answering |

Foundational Innovations Beyond the Basics

Parameter Efficiency Techniques

As models grew larger, researchers developed methods to make them more efficient (a quantization sketch follows this list):

  1. Parameter Sharing: ALBERT reduced parameters by sharing weights across layers
  2. Low-Rank Approximations: Compressing weight matrices with matrix factorization
  3. Knowledge Distillation: Training smaller "student" models to mimic larger "teacher" models
  4. Quantization: Reducing numerical precision without sacrificing significant performance
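As a concrete example of technique 4, PyTorch's dynamic quantization converts a model's linear layers to int8 at load time; a minimal sketch, assuming an already fine-tuned BERT classifier (the accuracy impact should be measured on your own task):

```python
import io
import torch
from transformers import BertForSequenceClassification

# Load a (hypothetically fine-tuned) BERT classifier in full precision
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Replace nn.Linear layers with int8 dynamically quantized versions
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_size_mb(m: torch.nn.Module) -> float:
    """Approximate model size by serializing its state dict to memory."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"original:  {serialized_size_mb(model):.0f} MB")
print(f"quantized: {serialized_size_mb(quantized):.0f} MB")
```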

Attention Mechanism Improvements

The core attention mechanism also evolved during this foundational period:

  1. Sparse Attention (Longformer, BigBird): Attending to select tokens rather than all
  2. Linear Attention (Linformer, Performer): Reducing complexity from O(n²) to O(n)
  3. Local+Global Attention (Longformer, BigBird): Combining local context with global tokens

Attention Pattern Visualizer

[Figure: Common attention patterns in transformer models. The self-attention panel shows each token of "The cat sat on the mat" attending most strongly to itself and nearby preceding tokens; the cross-attention panel shows the German output "Das Haus ist sehr blau" attending most strongly to the aligned English words in "The house is very blue".]
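To make the sparse-attention idea from the list above concrete, here is a minimal sketch of a Longformer-style sliding-window mask; the window size is arbitrary, and real implementations add global tokens and custom kernels:

```python
import torch

seq_len, window = 8, 2  # each token attends to itself and 2 neighbors per side

positions = torch.arange(seq_len)
# True where |i - j| <= window: a banded (sparse) attention pattern
band_mask = (positions[None, :] - positions[:, None]).abs() <= window
print(band_mask.int())

# Full attention scores grow as seq_len**2; the band grows only linearly
print("full entries:", seq_len ** 2, "band entries:", int(band_mask.sum()))
```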

Extending Context Length

Early attempts to extend context windows included:

  1. Recurrence Mechanisms (Transformer-XL): Using memory of previous segments
  2. Linear position biases (ALiBi): Penalizing attention scores in proportion to token distance so models extrapolate to sequences longer than those seen in training
  3. Efficient Attention (Longformer, Performer): Making attention practical for longer sequences

Specialized Adaptations

Multilingual Models

  • mBERT: Trained on Wikipedia in 104 languages
  • XLM-R: Large multilingual model with improved cross-lingual transfer
  • mT5: Multilingual version of T5 covering 101 languages

Domain-Specific Models

  • BioBERT, ClinicalBERT: Specialized for biomedical text
  • SciBERT: Targeted at scientific publications
  • FinBERT: Optimized for financial text
  • LegalBERT: Focused on legal documents

Implementation: Working with Foundational Models

Fine-tuning BERT for Classification

```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load dataset (e.g., IMDB sentiment analysis) and tokenize it
dataset = load_dataset('imdb')
tokenized = dataset.map(
    lambda batch: tokenizer(batch['text'], padding='max_length', truncation=True),
    batched=True)

# Fine-tune with the Trainer API (settings here are illustrative)
args = TrainingArguments(output_dir='./results', num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'], eval_dataset=tokenized['test'])
trainer.train()
```

Text Generation with GPT-2

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate text with the model
prompt = "Artificial intelligence will transform society by"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Sample a continuation (sampling settings are illustrative)
output = model.generate(input_ids, max_length=60, do_sample=True, top_p=0.95,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Sequence-to-Sequence Tasks with T5

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example: Summarization
article = """Researchers have developed a new machine learning model that can
predict protein folding with unprecedented accuracy."""

# T5 expects a task prefix; "summarize:" selects summarization behavior
input_ids = tokenizer("summarize: " + article, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Task-to-Architecture Matching

Foundational Principles

| Task | Preferred Architecture |
| --- | --- |
| Text Classification | Encoder-Only |
| Named Entity Recognition | Encoder-Only |
| Text Generation | Decoder-Only |
| Machine Translation | Encoder-Decoder |
| Summarization | Encoder-Decoder |
| Question Answering | Depends on Type |
| Dialog Systems | Decoder-Only |

Practical Considerations for Architecture Choice

When choosing a foundational model, consider:

  1. Computational resources: Training and inference costs
  2. Data availability: Amount of labeled data for fine-tuning
  3. Latency requirements: Real-time vs. batch processing
  4. Task specificity: Understanding vs. generation vs. transformation
  5. Pre-training alignment: How well the pre-training objective matches your task

Summary

In this lesson, we've covered:

  1. The foundational evolution of transformer architectures into encoder-only, decoder-only, and encoder-decoder variants
  2. Key milestone models including BERT, GPT-3, T5, and their innovations
  3. Scaling laws and the principles that guided early model development
  4. Architectural trade-offs and how they align with different NLP tasks
  5. Implementation approaches for working with foundational model types
  6. Design principles that continue to influence modern architecture choices

Understanding this foundational evolution provides the context needed to appreciate modern innovations and make informed decisions about architecture selection. These core principles continue to guide transformer development even as new innovations emerge.

Practice Exercises

  1. Comparative Analysis:

    • Fine-tune BERT, GPT-2, and T5 on the same classification task
    • Compare performance, training time, and resource requirements
    • Analyze which aspects of each architecture contribute to differences in performance
  2. Architecture Adaptation:

    • Implement a parameter-efficient fine-tuning approach (adapters, etc.)
    • Compare it to full fine-tuning on a downstream task
    • Measure the trade-offs in performance vs. efficiency
  3. Task Reformulation with T5:

    • Take an NLP task and reformulate it as a text-to-text problem
    • Implement a solution using T5's framework
    • Compare with a traditional approach using separate models
  4. Scaling Law Exploration:

    • Train models of different sizes on the same task
    • Plot performance vs. parameter count
    • Analyze how well the results match theoretical scaling laws

Additional Resources