Overview
In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. While these approaches revolutionized NLP, they share a fundamental limitation: they assign the same vector to a word regardless of its context.
Consider: "I'll bank the money" vs. "I'll bank the fire" vs. "I sat by the river bank"
Traditional embeddings give the word "bank" identical representations in all three sentences, despite completely different meanings.
This lesson introduces contextual embeddings - dynamic representations that change based on surrounding context, enabling machines to understand nuanced word usage and dramatically improving performance across NLP tasks.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the limitations of static word embeddings
- Explain how contextual embedding models like ELMo and BERT work
- Recognize the architectural innovations that enable context-sensitivity
- Compare different contextual embedding approaches
- Understand multimodal embeddings like CLIP
- Apply contextual embeddings to practical NLP tasks
The Need for Context
Why Static Embeddings Fall Short
The core limitation isn't just polysemy - it's that meaning is contextual. Consider these examples:
"Bank" - Multiple Meanings:
- Financial: "The bank approved my loan"
- Geographic: "We sat on the river bank"
- Action: "Please bank the fire before leaving"
"Light" - Context Matters:
- Weight: "This box is light"
- Illumination: "Turn on the light"
- Color: "I prefer light blue"
Traditional embeddings collapse all these meanings into a single vector, losing crucial contextual nuances that humans naturally understand.
The Breakthrough: Dynamic Representations
Contextual embeddings solve this by creating different vectors for the same word in different contexts. It's like having a chameleon word that adapts its representation to its semantic environment.
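To see this concretely, here is a minimal sketch, using the Hugging Face transformers library and the bert-base-uncased checkpoint (chosen purely for illustration), that extracts the contextual vector for "bank" from two different sentences and compares them:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The bank approved my loan application.",
    "We sat on the river bank and watched the sunset.",
]

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1, v2 = (bank_vector(s) for s in sentences)
# The two "bank" vectors differ because each reflects its surrounding context
print(F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)).item())
```

With a static embedding such as Word2Vec, the same comparison would always return 1.0, because "bank" maps to a single fixed vector regardless of context.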
From Static to Dynamic Representations
The Evolution of Word Representations
The journey from static to contextual embeddings represents one of the most significant advances in NLP:
Year | Model | Key Innovation | Impact |
---|---|---|---|
2003 | Neural Language Models | Distributed word representations | Foundation for modern embeddings |
2013 | Word2Vec | Efficient skip-gram and CBOW training | Democratized word embeddings |
2014 | GloVe | Global co-occurrence statistics | Improved word analogies |
2016 | FastText | Subword information | Better handling of rare words |
2018 | ELMo | Bidirectional LSTM contexts | First major contextual embeddings |
2018 | BERT | Bidirectional transformer pre-training | Revolutionary breakthrough |
2019 | RoBERTa, XLNet | Improved training strategies | Refined contextual understanding |
2020 | DeBERTa | Disentangled attention mechanism | Enhanced BERT architecture |
2021 | CLIP | Text-image joint embeddings | Multimodal understanding |
2023 | E5, BGE | Advanced contrastive learning | Current SOTA on MTEB |
Models such as Word2Vec, ELMo, BERT, and CLIP represent major paradigm shifts that fundamentally changed how we approach word representation.
ELMo: Embeddings from Language Models
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major contextual embedding model to gain widespread use.
Key Innovation
ELMo uses a bidirectional LSTM trained on a language modeling objective. The embeddings are derived from all internal states of the LSTM, not just the final layer.
Architecture
- Character-level convolutional neural network to handle out-of-vocabulary words
- Multiple layers of bidirectional LSTMs
- Weighted combination of representations from different layers
Mathematical Formulation
For word $k$ in context, ELMo creates a representation as a weighted sum of all layer states:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \, \mathbf{h}_{k,j}$$

Where:
- $\mathbf{h}_{k,j}$ is the contextual representation of word $k$ from the $j$-th layer
- $s_j$ are softmax-normalized weights
- $\gamma$ is a scaling parameter
- $L$ is the number of layers
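As a rough illustration of this weighted layer combination (not the actual ELMo implementation), the following NumPy sketch mixes hypothetical layer states with softmax-normalized weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 layers and 1024-dimensional states for a single token
num_layers, dim = 3, 1024
h = rng.normal(size=(num_layers, dim))    # h[j] = layer j's representation of the token

# Task-specific parameters that would be learned by a downstream model
raw_weights = rng.normal(size=num_layers)
s = np.exp(raw_weights) / np.exp(raw_weights).sum()   # softmax-normalized layer weights
gamma = 1.0                                           # scaling parameter

# ELMo-style representation: gamma * sum_j s_j * h_j
elmo_vector = gamma * (s[:, None] * h).sum(axis=0)
print(elmo_vector.shape)   # (1024,)
```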
Layer Specialization
Different layers capture different types of information:
- Lower layers capture syntactic information (part of speech, word structure)
- Higher layers capture semantic information (word sense, context-specific meaning)
Visualizing ELMo's Contextual Representations
The following visualization shows how ELMo represents the word "bank" differently in various contexts:
BERT: Bidirectional Encoder Representations from Transformers
BERT, introduced by Devlin et al. in 2018, represented a major leap forward by using transformer architecture instead of LSTMs.
Key Innovations
- Bidirectional attention: Words attend to both left and right context simultaneously
- Masked language modeling: Predicts randomly masked tokens using bidirectional context
- Next sentence prediction: Models relationship between sentence pairs
- Transfer learning: Pre-train once, fine-tune for various tasks
Architecture
BERT uses the transformer encoder architecture with:
- Input embeddings (token + position + segment)
- Multiple layers of self-attention and feed-forward networks
- Layer normalization and residual connections
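The input representation listed above (token + position + segment embeddings) can be sketched in a few lines of PyTorch. The dimensions mirror BERT-base, but the embedding tables here are randomly initialized toy stand-ins, and the token ids are illustrative:

```python
import torch
import torch.nn as nn

# Toy configuration mirroring BERT-base dimensions
vocab_size, max_len, num_segments, dim = 30522, 512, 2, 768

token_emb = nn.Embedding(vocab_size, dim)
position_emb = nn.Embedding(max_len, dim)
segment_emb = nn.Embedding(num_segments, dim)
layer_norm = nn.LayerNorm(dim)

token_ids = torch.tensor([[101, 7592, 2088, 102]])   # illustrative ids for "[CLS] hello world [SEP]"
segment_ids = torch.zeros_like(token_ids)            # all tokens belong to sentence A
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# BERT-style input: sum the three embeddings, then apply LayerNorm
inputs = layer_norm(token_emb(token_ids) + position_emb(positions) + segment_emb(segment_ids))
print(inputs.shape)   # (1, 4, 768)
```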
Pre-training Tasks
- Masked Language Model (MLM): Randomly mask 15% of tokens and predict them
- Next Sentence Prediction (NSP): Given two sentences, predict if the second follows the first
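A quick way to see the MLM objective in action is the fill-mask pipeline from Hugging Face transformers; the example sentence below is arbitrary:

```python
from transformers import pipeline

# BERT's MLM head can be queried directly through the fill-mask pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] is BERT's mask token; the model predicts it from both left and right context
for prediction in fill_mask("The river [MASK] was covered in wildflowers."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```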
BERT Variants
- BERT-base: 12 layers, 768 hidden units, 12 attention heads (110M parameters)
- BERT-large: 24 layers, 1024 hidden units, 16 attention heads (340M parameters)
- Multilingual BERT: Trained on 104 languages
- Domain-specific BERTs: BioBERT (biomedical), SciBERT (scientific), FinBERT (financial)
Visualizing BERT Attention
This visualization shows how BERT's attention mechanism works with an example sentence:
Example: "The cat sat on the mat because it was comfortable"
When BERT processes the word "it", it uses bidirectional attention to understand that "it" refers to "cat" (65% attention weight) rather than "mat" (15% attention weight). This coreference resolution capability demonstrates BERT's contextual understanding.
This attention pattern shows how BERT simultaneously considers:
- Left context: "The cat sat on the mat because"
- Right context: "was comfortable"
- Target word: "it"
The model correctly identifies that "it" refers to "cat" based on semantic relationships and grammatical patterns learned during pre-training.
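If you want to inspect attention patterns like this yourself, the sketch below pulls the last layer's attention weights from bert-base-uncased. The exact numbers vary by layer and head, so don't expect to reproduce the illustrative percentages above exactly:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The cat sat on the mat because it was comfortable"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
it_index = tokens.index("it")

# Average attention from "it" over all heads in the last layer
attn = outputs.attentions[-1][0].mean(dim=0)[it_index]
for token, weight in sorted(zip(tokens, attn.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{token:>12}  {weight:.3f}")
```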
RoBERTa and Improvements on BERT
RoBERTa (Robustly Optimized BERT Pretraining Approach) improved BERT by:
- Training longer with more data
- Removing the Next Sentence Prediction objective
- Using dynamic masking patterns
- Using larger batches
- Using a larger byte-level BPE vocabulary
These changes led to significant performance improvements, showing that BERT was undertrained rather than fundamentally limited.
The Embedding Benchmarking Revolution
MTEB: Massive Text Embedding Benchmark
The MTEB evaluates embedding models across:
- Retrieval tasks: Finding relevant documents for a query
- Classification tasks: Assigning texts to categories
- Clustering tasks: Grouping similar texts
- Similarity tasks: Measuring semantic similarity
- Reranking tasks: Reordering retrieved documents by relevance
- Summarization tasks: Scoring machine-generated summaries against human-written references
- Pair classification tasks: Determining relationships between text pairs
MTEB Leaderboard Performance
The leaderboard shows clear performance differences between embedding models:
Model | MTEB Average Score | Key Strengths |
---|---|---|
E5-large | 65.3 | Advanced contrastive learning |
BGE-Large | 64.5 | Hard negative mining |
GTE-Large | 63.7 | Curriculum learning approach |
CLIP-ViT-L-14 | 62.1 | Multimodal understanding |
MPNet | 59.3 | Permuted language modeling |
SBERT | 58.9 | Sentence-level optimization |
BERT-Large | 54.2 | General contextual embeddings |
RoBERTa | 52.8 | Robust BERT training |
Higher scores indicate better performance across diverse embedding tasks including retrieval, classification, clustering, and similarity.
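For reference, evaluating your own model on MTEB tasks looks roughly like the sketch below, assuming the mteb package is installed; the two task names are just examples:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method works; this checkpoint is just an example
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pick a small subset of MTEB tasks (one classification, one similarity task)
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```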
Sentence-BERT: Efficient Sentence Embeddings
Sentence-BERT (SBERT) modified the BERT architecture to efficiently generate sentence embeddings that can be compared using cosine similarity.
Key Innovations
- Siamese and triplet network structures for training
- Mean pooling over token embeddings
- Contrastive learning objectives
Practical Applications
- Semantic search
- Clustering
- Semantic textual similarity
- Information retrieval
Code Example: Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted to a vector.",
    "Sentences with similar meanings have similar vectors.",
]

# Encode sentences into dense vectors
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between all sentence pairs
print(util.cos_sim(embeddings, embeddings))
```
Beyond Text: CLIP and Multimodal Embeddings
CLIP (Contrastive Language-Image Pre-training) by OpenAI represents a breakthrough in connecting text and images in the same embedding space.
How CLIP Works
- Train two encoders: one for images (ViT or ResNet) and one for text (Transformer)
- Learn to maximize similarity between correct image-text pairs
- Minimize similarity for incorrect pairs
Contrastive Pre-training
CLIP uses a contrastive objective with large batch sizes (32,768 image-text pairs). For a batch of $N$ pairs, the image-to-text direction of the loss is:

$$\mathcal{L}_{\text{img}\to\text{text}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\text{sim}(I_i, T_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\text{sim}(I_i, T_j)/\tau\big)}$$

and the full objective averages this with the symmetric text-to-image term.

Where:
- $\text{sim}(I_i, T_j)$ is the cosine similarity between image and text embeddings
- $\tau$ is a temperature parameter
- $N$ is the batch size
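A minimal PyTorch sketch of this symmetric contrastive objective is shown below; the temperature value and embedding dimensions are illustrative, not CLIP's trained values:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings."""
    # Normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0))   # matching pairs lie on the diagonal

    # Cross-entropy in both directions, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 pairs with 512-dimensional embeddings
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)))
```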
CLIP Applications
- Zero-shot image classification
- Cross-modal retrieval (find images from text, text from images)
- Visual question answering
- Image generation guidance (DALL-E, Stable Diffusion)
Visualizing CLIP's Joint Embedding Space
CLIP creates a revolutionary joint embedding space where text descriptions and their corresponding images are mapped to similar vectors:
Key Concept: Related text-image pairs cluster together in the same embedding space, enabling:
- Cross-modal matching: Find images from text descriptions
- Zero-shot classification: Classify images using only text labels
- Semantic search: Search images using natural language
- Creative applications: Guide image generation models
Example relationships in CLIP space:
- Text: "A dog running in a park" ↔ Image: Photo of a golden retriever in grass
- Text: "Mountain landscape at sunset" ↔ Image: Photo of peaks with orange sky
- Text: "Cartoon character" ↔ Image: Animated drawing
This joint embedding enables unprecedented cross-modal understanding and has powered applications like DALL-E, Stable Diffusion, and multimodal search engines.
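In practice, zero-shot classification with CLIP takes only a few lines with Hugging Face transformers; the image path below is a placeholder for any local RGB image:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "dog.jpg" is a placeholder path; substitute any image you have on disk
image = Image.open("dog.jpg")
labels = ["a dog running in a park", "a mountain landscape at sunset", "a cartoon character"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{p:.2%}  {label}")
```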
State-of-the-Art Embedding Models
E5 Family (Microsoft)
The E5 (EmbEddings from bidirEctional Encoder rEpresentations) models rank near the top of the MTEB leaderboard thanks to several innovations:
- Weakly supervised contrastive pre-training on large-scale text pairs
- Self-teaching with hard negative mining
- Multi-stage training process
BGE (BAAI)
The BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence feature:
- Custom hard negative mining strategy
- Diverse training data selection
- Adversarial training techniques
GTE (Alibaba)
The GTE (General Text Embeddings) models feature:
- Curriculum learning approach
- Multi-stage contrastive learning
- Domain-specific fine-tuning
Why Are Contextual Embeddings Better?
Contextual embeddings outperform static embeddings in most NLP tasks for several key reasons:
1. Word Sense Disambiguation
Example Sentence | Word2Vec Representation | BERT Representation |
---|---|---|
"The bank approved my loan application." | Single vector for 'bank' | Financial institution context |
"I sat on the bank of the river." | Same vector as above | River edge context |
"Please bank the fire before leaving." | Same vector as above | Verb 'to cover' context |
This table illustrates a key advantage of contextual embeddings: while traditional models like Word2Vec assign the same vector regardless of usage, models like BERT create distinct representations based on context.
2. Handling Polysemy and Homonyms
Contextual models can distinguish different meanings of the same word form:
- "I used a bat to hit the ball" vs. "The bat flew into the cave"
- "The bass guitar needs tuning" vs. "I caught a bass in the lake"
3. Capturing Syntactic Roles
The same word can serve different syntactic functions, which contextual models capture:
- "Time flies like an arrow" ('flies' as verb)
- "Fruit flies like a banana" ('flies' as noun)
4. Handling Co-reference
Contextual models excel at understanding what pronouns refer to:
- "The trophy didn't fit in the suitcase because it was too large" (what was large?)
5. Incorporating World Knowledge
Pre-training on massive text corpora imbues contextual models with factual knowledge:
- Capital cities, famous people, historical events
- Common sense relationships and properties
Practical Applications of Contextual Embeddings
Semantic Search
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Function to get embeddings (mean pooling over non-padding tokens, then L2-normalize)
def get_embeddings(text_list):
    encoded = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    return F.normalize((token_embeddings * mask).sum(dim=1) / mask.sum(dim=1), p=2, dim=1)

# Rank candidate documents against a query by cosine similarity
docs = ["BERT produces contextual embeddings.", "The weather is sunny today."]
print(get_embeddings(["What are contextual embeddings?"]) @ get_embeddings(docs).T)
```
Zero-shot Classification
```python
from transformers import pipeline

# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example text
text = "The restaurant food was absolutely wonderful and the service was excellent."

# Candidate labels
candidate_labels = ["positive", "negative", "neutral"]

# The pipeline scores each label against the text without any fine-tuning
result = classifier(text, candidate_labels)
print(result["labels"], result["scores"])
```
Limitations and Challenges
Despite their power, contextual embeddings still face several challenges:
- Computational Cost: Larger models require significant resources
- Tokenization Limitations: Suboptimal handling of rare words and code-switching
- Context Window Size: Limited ability to capture very long-range dependencies
- Bias and Fairness: Models can inherit and amplify biases from training data
- Interpretability: Black-box nature makes it hard to understand why models make certain predictions
The Future of Embeddings
Emerging Trends
- Model Compression: Distillation and pruning to create smaller, faster models
- Multimodal Embeddings: Beyond text-image to include audio, video, and structured data
- Long-Context Models: Extending context windows to handle book-length content
- Task-Specific Adaptations: Specialized embeddings for specific domains and applications
- Unified Representations: Single models handling multiple modalities and tasks
The Efficiency Revolution
Recent advances focus on creating more efficient embeddings:
- FID (Fast Intent Detection): 120x faster inference with minimal quality loss
- E5-Small: Competitive performance with much smaller model size
- Embedding models with quantization for mobile and edge devices
Summary
In this lesson, we've covered:
- The evolution from static to contextual embeddings
- Key models: ELMo, BERT, RoBERTa, and their variants
- Multimodal embeddings with CLIP
- Evaluation benchmarks like MTEB
- Practical applications of contextual embeddings
- Future directions in embedding research
Contextual embeddings have dramatically transformed NLP by capturing the nuanced, context-dependent nature of language. While traditional embeddings opened the door to modern NLP, contextual models have pushed capabilities far beyond what was previously possible.
In our next lesson, we'll explore pre-transformer models like RNNs, LSTMs, and GRUs, which were the state-of-the-art before the transformer revolution that enabled today's contextual embeddings.
Practice Exercises
Contextual Analysis:
- Compare how BERT and Word2Vec handle ambiguous words in different contexts
- Visualize the difference using dimensionality reduction techniques

Embedding-Based Semantic Search:
- Build a simple semantic search engine using Sentence-BERT
- Compare its performance with keyword-based search

Zero-Shot Classification:
- Implement a zero-shot classifier using pre-trained embeddings
- Evaluate its performance on a dataset without fine-tuning

Cross-Lingual Embeddings:
- Explore how multilingual models handle translation and cross-lingual tasks
- Test semantic similarity across languages
Additional Resources
- ELMo: Deep Contextualized Word Representations
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- MTEB: Massive Text Embedding Benchmark
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Hugging Face Transformers Documentation
- The Illustrated BERT