Contextual Embeddings and Modern Representations

Overview

In our previous lesson, we explored traditional word embeddings like Word2Vec, GloVe, and FastText. While these approaches revolutionized NLP, they share a fundamental limitation: they assign the same vector to a word regardless of its context.

Consider: "I'll bank the money" vs. "I'll bank the fire" vs. "I sat by the river bank"

Traditional embeddings give the word "bank" identical representations in all three sentences, despite completely different meanings.

This lesson introduces contextual embeddings - dynamic representations that change based on surrounding context, enabling machines to understand nuanced word usage and dramatically improving performance across NLP tasks.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of static word embeddings
  • Explain how contextual embedding models like ELMo and BERT work
  • Recognize the architectural innovations that enable context-sensitivity
  • Compare different contextual embedding approaches
  • Understand multimodal embeddings like CLIP
  • Apply contextual embeddings to practical NLP tasks

The Need for Context

Why Static Embeddings Fall Short

The core limitation isn't just polysemy - it's that meaning is contextual. Consider these examples:

"Bank" - Multiple Meanings:

  • Financial: "The bank approved my loan"
  • Geographic: "We sat on the river bank"
  • Action: "Please bank the fire before leaving"

"Light" - Context Matters:

  • Weight: "This box is light"
  • Illumination: "Turn on the light"
  • Color: "I prefer light blue"

Traditional embeddings collapse all these meanings into a single vector, losing crucial contextual nuances that humans naturally understand.

The Breakthrough: Dynamic Representations

Contextual embeddings solve this by creating different vectors for the same word in different contexts. It's like having a chameleon word that adapts its representation to its semantic environment.

From Static to Dynamic Representations

The Evolution of Word Representations

The journey from static to contextual embeddings represents one of the most significant advances in NLP:

| Year | Model | Key Innovation | Impact |
|------|-------|----------------|--------|
| 2003 | Neural Language Models | Distributed word representations | Foundation for modern embeddings |
| 2013 | Word2Vec | Efficient skip-gram and CBOW training | Democratized word embeddings |
| 2014 | GloVe | Global co-occurrence statistics | Improved word analogies |
| 2016 | FastText | Subword information | Better handling of rare words |
| 2018 | ELMo | Bidirectional LSTM contexts | First major contextual embeddings |
| 2018 | BERT | Bidirectional transformer pre-training | Revolutionary breakthrough |
| 2019 | RoBERTa, XLNet | Improved training strategies | Refined contextual understanding |
| 2020 | DeBERTa | Disentangled attention mechanism | Enhanced BERT architecture |
| 2021 | CLIP | Text-image joint embeddings | Multimodal understanding |
| 2023 | E5, BGE | Advanced contrastive learning | Current SOTA on MTEB |

Word2Vec, ELMo, BERT, and CLIP in particular represent major paradigm shifts that fundamentally changed how we approach word representation.

ELMo: Embeddings from Language Models

ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, was the first major contextual embedding model to gain widespread use.

Key Innovation

ELMo uses a bidirectional LSTM trained on a language modeling objective. The embeddings are derived from all internal states of the LSTM, not just the final layer.

Architecture

  1. Character-level convolutional neural network to handle out-of-vocabulary words
  2. Multiple layers of bidirectional LSTMs
  3. Weighted combination of representations from different layers

Mathematical Formulation

For a word $w_k$ in context, ELMo creates a representation:

$$\text{ELMo}_k = \gamma \sum_{j=0}^{L} s_j \mathbf{h}_{k,j}^{LM}$$

Where:

  • $\mathbf{h}_{k,j}^{LM}$ is the contextual representation of token $k$ from the $j$-th layer
  • $s_j$ are softmax-normalized layer weights
  • $\gamma$ is a scaling parameter
  • $L$ is the number of layers
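To make the weighted combination concrete, here is a minimal PyTorch sketch that mixes per-layer representations exactly as in the formula above. The layer tensors are random stand-ins for the outputs of a pretrained bidirectional language model, and `gamma` and `s` would normally be learned for a downstream task.

```python
import torch

# Stand-ins for the outputs of a 2-layer biLM plus the token layer (L = 2):
# each tensor is (sequence_length, hidden_dim) for one sentence.
seq_len, hidden_dim, num_layers = 6, 512, 2
layer_outputs = [torch.randn(seq_len, hidden_dim) for _ in range(num_layers + 1)]

# Task-specific parameters (learned during fine-tuning in practice).
s = torch.nn.Parameter(torch.zeros(num_layers + 1))  # raw layer weights
gamma = torch.nn.Parameter(torch.ones(1))            # global scaling factor

# ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}
weights = torch.softmax(s, dim=0)                    # s_j, softmax-normalized
stacked = torch.stack(layer_outputs, dim=0)          # (L+1, seq_len, hidden_dim)
elmo = gamma * (weights[:, None, None] * stacked).sum(dim=0)

print(elmo.shape)  # torch.Size([6, 512])
```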

Layer Specialization

Different layers capture different types of information:

  • Lower layers capture syntactic information (part of speech, word structure)
  • Higher layers capture semantic information (word sense, context-specific meaning)

Visualizing ELMo's Contextual Representations

The following visualization shows how ELMo represents the word "bank" differently in various contexts:

Word Sense Disambiguation

This visualization shows how contextual embeddings position the same word differently based on its meaning in context.

Embedding Space Visualization

[2D projection of contextual embeddings for "bank": the financial, river, and verb senses form three separate clusters.]

Example Contexts

bank (financial)

I deposited money into my bank account yesterday.

bank (river)

We sat on the bank of the river watching boats pass by.

bank (verb)

The pilot had to bank the aircraft sharply to avoid the mountain.

Contextual vs. Static Embeddings

Traditional word embeddings like Word2Vec assign the same vector to a word regardless of context. Contextual embeddings like ELMo and BERT create different vectors based on the surrounding words, allowing them to distinguish between different meanings of the same word.

Other Ambiguous Words

  • bank: financial institution • river edge • to tilt
  • light: not heavy • brightness • to ignite
  • run: to move quickly • to operate • a series
  • spring: season • coiled metal • water source
  • bear: animal • to endure • stock market term

BERT: Bidirectional Encoder Representations from Transformers

BERT, introduced by Devlin et al. in 2018, represented a major leap forward by using transformer architecture instead of LSTMs.

Key Innovations

  1. Bidirectional attention: Words attend to both left and right context simultaneously
  2. Masked language modeling: Predicts randomly masked tokens using bidirectional context
  3. Next sentence prediction: Models relationship between sentence pairs
  4. Transfer learning: Pre-train once, fine-tune for various tasks

Architecture

BERT uses the transformer encoder architecture with:

  • Input embeddings (token + position + segment)
  • Multiple layers of self-attention and feed-forward networks
  • Layer normalization and residual connections

Pre-training Tasks

  1. Masked Language Model (MLM): Randomly mask 15% of tokens and predict them from bidirectional context (see the sketch after this list)
  2. Next Sentence Prediction (NSP): Given two sentences, predict if the second follows the first
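As a quick illustration of the MLM objective, the sketch below uses the Hugging Face `fill-mask` pipeline with a standard BERT checkpoint to predict a masked token from its surrounding context (the example sentence is an arbitrary choice):

```python
from transformers import pipeline

# BERT was pre-trained to predict [MASK] tokens using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I deposited money into my bank [MASK] yesterday."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```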

BERT Variants

  • BERT-base: 12 layers, 768 hidden units, 12 attention heads (110M parameters)
  • BERT-large: 24 layers, 1024 hidden units, 16 attention heads (340M parameters)
  • Multilingual BERT: Trained on 104 languages
  • Domain-specific BERTs: BioBERT (biomedical), SciBERT (scientific), FinBERT (financial)

Visualizing BERT Attention

This visualization shows how BERT's attention mechanism works with an example sentence:

Example: "The cat sat on the mat because it was comfortable"

When BERT processes the word "it", it uses bidirectional attention to understand that "it" refers to "cat" (65% attention weight) rather than "mat" (15% attention weight). This coreference resolution capability demonstrates BERT's contextual understanding.

This attention pattern shows how BERT simultaneously considers:

  • Left context: "The cat sat on the mat because"
  • Right context: "was comfortable"
  • Target word: "it"

The model correctly identifies that "it" refers to "cat" based on semantic relationships and grammatical patterns learned during pre-training.
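Attention patterns like this can be inspected directly. Below is a minimal sketch using Hugging Face `transformers` to pull the attention tensors out of a BERT model; the choice of layer and the head-averaging are arbitrary, and real attention values will differ from the illustrative percentages above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The cat sat on the mat because it was comfortable"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
it_index = tokens.index("it")

# Attention from "it" to every token, averaged over heads in the last layer.
attn = outputs.attentions[-1][0].mean(dim=0)[it_index]
for token, weight in zip(tokens, attn):
    print(f"{token:>12}  {weight:.3f}")
```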

RoBERTa and Improvements on BERT

RoBERTa (Robustly Optimized BERT Approach) improved BERT by:

  1. Training longer with more data
  2. Removing the Next Sentence Prediction objective
  3. Using dynamic masking patterns
  4. Using larger batches
  5. Using a larger byte-level BPE vocabulary

These changes led to significant performance improvements, showing that the original BERT was undertrained rather than fundamentally limited.

The Embedding Benchmarking Revolution

MTEB: Massive Text Embedding Benchmark

The MTEB evaluates embedding models across:

  1. Retrieval tasks: Finding relevant documents for a query
  2. Classification tasks: Assigning texts to categories
  3. Clustering tasks: Grouping similar texts
  4. Similarity tasks: Measuring semantic similarity
  5. Reranking tasks: Reordering retrieved documents by relevance
  6. Summarization tasks: Scoring machine-generated summaries against human-written references
  7. Pair classification tasks: Determining relationships between text pairs

MTEB Leaderboard Performance

The MTEB (Massive Text Embedding Benchmark) shows clear performance differences between embedding models:

| Model | MTEB Average Score | Key Strengths |
|-------|--------------------|---------------|
| E5-large | 65.3 | Advanced contrastive learning |
| BGE-Large | 64.5 | Hard negative mining |
| GTE-Large | 63.7 | Curriculum learning approach |
| CLIP-ViT-L-14 | 62.1 | Multimodal understanding |
| MPNet | 59.3 | Permuted language modeling |
| SBERT | 58.9 | Sentence-level optimization |
| BERT-Large | 54.2 | General contextual embeddings |
| RoBERTa | 52.8 | Robust BERT training |

Higher scores indicate better performance across diverse embedding tasks including retrieval, classification, clustering, and similarity.
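Scores like these can be reproduced for a handful of tasks with the open-source `mteb` package. The following is a minimal sketch of that workflow; the task name, model choice, and output folder are illustrative, and the package's API may vary between versions.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode(list_of_texts) -> embeddings method can be evaluated.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on a single, small classification task as a smoke test.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```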

Sentence-BERT: Efficient Sentence Embeddings

Sentence-BERT (SBERT) modified the BERT architecture to efficiently generate sentence embeddings that can be compared using cosine similarity.

Key Innovations

  1. Siamese and triplet network structures for training
  2. Mean pooling over token embeddings
  3. Contrastive learning objectives

Practical Applications

  • Semantic search
  • Clustering
  • Semantic textual similarity
  • Information retrieval

Code Example: Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load a compact pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted to a vector.",
    "Sentences with similar meanings have similar vectors.",
]

# Encode sentences into fixed-size embeddings
embeddings = model.encode(sentences)

# Pairwise cosine similarities between all sentence embeddings
cosine_scores = util.cos_sim(embeddings, embeddings)
print(cosine_scores)
```

Beyond Text: CLIP and Multimodal Embeddings

CLIP (Contrastive Language-Image Pre-training) by OpenAI represents a breakthrough in connecting text and images in the same embedding space.

How CLIP Works

  1. Train two encoders: one for images (ViT or ResNet) and one for text (Transformer)
  2. Learn to maximize similarity between correct image-text pairs
  3. Minimize similarity for incorrect pairs

Contrastive Pre-training

CLIP uses a contrastive objective with very large batch sizes (32,768 image-text pairs). For a batch of $N$ image-text pairs, the image-to-text loss is:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)}$$

Where:

  • $\text{sim}(I, T)$ is the cosine similarity between image and text embeddings
  • $\tau$ is a temperature parameter
  • $N$ is the batch size

The full CLIP objective averages this term with the symmetric text-to-image term; a sketch of that symmetric loss follows.
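Here is a minimal PyTorch sketch of the symmetric contrastive loss, computed over normalized image and text embeddings; the tensors are random placeholders for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarities between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature          # (N, N)
    targets = torch.arange(logits.size(0))                   # matching pairs lie on the diagonal

    loss_img_to_txt = F.cross_entropy(logits, targets)       # rows: one image vs. all texts
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)   # columns: one text vs. all images
    return (loss_img_to_txt + loss_txt_to_img) / 2

# Random stand-ins for encoder outputs: batch of 8 pairs, 512-dim embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```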

CLIP Applications

  • Zero-shot image classification
  • Cross-modal retrieval (find images from text, text from images)
  • Visual question answering
  • Image generation guidance (DALL-E, Stable Diffusion)

Visualizing CLIP's Joint Embedding Space

CLIP creates a revolutionary joint embedding space where text descriptions and their corresponding images are mapped to similar vectors:

Key Concept: Related text-image pairs cluster together in the same embedding space, enabling:

  • Cross-modal matching: Find images from text descriptions
  • Zero-shot classification: Classify images using only text labels
  • Semantic search: Search images using natural language
  • Creative applications: Guide image generation models

Example relationships in CLIP space:

  • Text: "A dog running in a park" ↔ Image: Photo of a golden retriever in grass
  • Text: "Mountain landscape at sunset" ↔ Image: Photo of peaks with orange sky
  • Text: "Cartoon character" ↔ Image: Animated drawing

This joint embedding enables unprecedented cross-modal understanding and has powered applications like DALL-E, Stable Diffusion, and multimodal search engines.
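As a concrete example of zero-shot classification in this joint space, the sketch below uses the CLIP classes from Hugging Face `transformers` with the public openai/clip-vit-base-patch32 checkpoint; the image path and candidate labels are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_in_park.jpg")  # placeholder path
labels = ["a dog running in a park", "a mountain landscape at sunset", "a cartoon character"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, prob in zip(labels, probs):
    print(f"{label:<40} {prob:.3f}")
```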

Embedding Models Comparison

This section compares how different embedding models represent words and their relationships, and how their results differ across approaches.

The table below summarizes the key capabilities of each model family:

| Model | OOV Handling | Subword Information | Contextual |
|-------|--------------|---------------------|------------|
| Word2Vec (2013) | No | No | No |
| GloVe (2014) | No | No | No |
| FastText (2016) | Yes | Yes | No |
| ELMo (2018) | Limited | Yes | Yes |
| BERT (2018) | Limited | Yes | Yes |

Model Features

Out-of-Vocabulary (OOV) Handling

The ability to generate embeddings for words not seen during training.

Subword Information

Utilizing character n-grams or other subword features to build word representations.

Contextual Awareness

Whether the model generates different representations for the same word in different contexts.

Similar Words by Model

For an ambiguous query word such as "bank", the nearest neighbors returned by each model illustrate their different priorities:

| Model | Top Similar Words |
|-------|-------------------|
| Word2Vec (2013) | account, money, loan, financial, credit |
| FastText (2016) | banks, banking, banker, money, financial |

Key Differences

  • Word2Vec and GloVe use whole-word vectors, making them struggle with rare words.
  • FastText adds subword information, improving handling of morphologically rich languages and typos.
  • ELMo and BERT create contextualized embeddings that change based on surrounding words.
  • Notice how models prioritize different relationships (semantic vs. syntactic) in their similar words.

State-of-the-Art Embedding Models

E5 Family (Microsoft)

The E5 (EmbEddings from bidirEctional Encoder rEpresentations) models top the MTEB leaderboard with several innovations:

  • Weakly supervised contrastive pre-training on large-scale text pairs
  • Self-teaching with hard negative mining
  • Multi-stage training process

BGE (BAAI)

The BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence:

  • Custom hard negative mining strategy
  • Diverse training data selection
  • Adversarial training techniques

GTE (Alibaba)

The GTE (General Text Embeddings) models feature:

  • Curriculum learning approach
  • Multi-stage contrastive learning
  • Domain-specific fine-tuning

Why Are Contextual Embeddings Better?

Contextual embeddings outperform static embeddings in most NLP tasks for several key reasons:

1. Word Sense Disambiguation

| Example Sentence | Word2Vec Representation | BERT Representation |
|------------------|-------------------------|---------------------|
| "The bank approved my loan application." | Single vector for 'bank' | Financial institution context |
| "I sat on the bank of the river." | Same vector as above | River edge context |
| "Please bank the fire before leaving." | Same vector as above | Verb 'to cover' context |

This table illustrates a key advantage of contextual embeddings: while traditional models like Word2Vec assign the same vector regardless of usage, models like BERT create distinct representations based on context.
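This difference can be checked empirically. The sketch below extracts the contextual vector for "bank" from a standard BERT checkpoint in two different sentences and compares them with cosine similarity; a static model would give a similarity of exactly 1.0, while BERT typically yields a noticeably lower value for different senses.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word, sentence):
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    index = tokens.index(word)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    return hidden[index]

financial = embedding_of("bank", "The bank approved my loan application.")
river = embedding_of("bank", "I sat on the bank of the river.")

similarity = torch.cosine_similarity(financial, river, dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity:.3f}")
```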

2. Handling Polysemy and Homonyms

Contextual models can distinguish different meanings of the same word form:

  • "I used a bat to hit the ball" vs. "The bat flew into the cave"
  • "The bass guitar needs tuning" vs. "I caught a bass in the lake"

3. Capturing Syntactic Roles

The same word can serve different syntactic functions, which contextual models capture:

  • "Time flies like an arrow" ('flies' as verb)
  • "Fruit flies like a banana" ('flies' as noun)

4. Handling Co-reference

Contextual models excel at understanding what pronouns refer to:

  • "The trophy didn't fit in the suitcase because it was too large" (what was large?)

5. Incorporating World Knowledge

Pre-training on massive text corpora imbues contextual models with factual knowledge:

  • Capital cities, famous people, historical events
  • Common sense relationships and properties

Practical Applications of Contextual Embeddings

Semantic Search

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Function to get mean-pooled, normalized sentence embeddings
def get_embeddings(text_list):
    encoded = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    # Mean pooling over token embeddings, ignoring padding positions
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    summed = (output.last_hidden_state * mask).sum(dim=1)
    return F.normalize(summed / mask.sum(dim=1).clamp(min=1e-9), p=2, dim=1)
```

Zero-shot Classification

```python
from transformers import pipeline

# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Example text
text = "The restaurant food was absolutely wonderful and the service was excellent."

# Candidate labels
candidate_labels = ["positive", "negative", "neutral"]

# Classify without any task-specific fine-tuning
result = classifier(text, candidate_labels)
print(result["labels"], result["scores"])
```

Limitations and Challenges

Despite their power, contextual embeddings still face several challenges:

  1. Computational Cost: Larger models require significant resources
  2. Tokenization Limitations: Suboptimal handling of rare words and code-switching
  3. Context Window Size: Limited ability to capture very long-range dependencies
  4. Bias and Fairness: Models can inherit and amplify biases from training data
  5. Interpretability: Black-box nature makes it hard to understand why models make certain predictions

The Future of Embeddings

Emerging Trends

  1. Model Compression: Distillation and pruning to create smaller, faster models
  2. Multimodal Embeddings: Beyond text-image to include audio, video, and structured data
  3. Long-Context Models: Extending context windows to handle book-length content
  4. Task-Specific Adaptations: Specialized embeddings for specific domains and applications
  5. Unified Representations: Single models handling multiple modalities and tasks

The Efficiency Revolution

Recent advances focus on creating more efficient embeddings:

  • FID (Fast Intent Detection): 120x faster inference with minimal quality loss
  • E5-Small: Competitive performance with much smaller model size
  • Embedding models with quantization for mobile and edge devices

Summary

In this lesson, we've covered:

  1. The evolution from static to contextual embeddings
  2. Key models: ELMo, BERT, RoBERTa, and their variants
  3. Multimodal embeddings with CLIP
  4. Evaluation benchmarks like MTEB
  5. Practical applications of contextual embeddings
  6. Future directions in embedding research

Contextual embeddings have dramatically transformed NLP by capturing the nuanced, context-dependent nature of language. While traditional embeddings opened the door to modern NLP, contextual models have pushed capabilities far beyond what was previously possible.

In our next lesson, we'll explore pre-transformer models like RNNs, LSTMs, and GRUs, which were the state-of-the-art before the transformer revolution that enabled today's contextual embeddings.

Practice Exercises

  1. Contextual Analysis:

    • Compare how BERT and Word2Vec handle ambiguous words in different contexts
    • Visualize the difference using dimensionality reduction techniques
  2. Embedding-Based Semantic Search:

    • Build a simple semantic search engine using Sentence-BERT
    • Compare its performance with keyword-based search
  3. Zero-Shot Classification:

    • Implement a zero-shot classifier using pre-trained embeddings
    • Evaluate its performance on a dataset without fine-tuning
  4. Cross-Lingual Embeddings:

    • Explore how multilingual models handle translation and cross-lingual tasks
    • Test semantic similarity across languages

Additional Resources