Word Embeddings: From Word2Vec to FastText

Overview

In our previous lessons, we explored how to preprocess text and tokenize it into meaningful units. While these are crucial steps, they still don't solve a fundamental challenge in NLP: how do we represent words in a way that captures their meaning and relationships?

This lesson introduces word embeddings - dense vector representations that encode semantic relationships between words. These representations revolutionized NLP by enabling machines to understand semantic similarity, analogies, and other relationships between words that were previously difficult to capture.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of traditional one-hot encoding for word representation
  • Explain the intuition and theory behind word embeddings
  • Differentiate between Word2Vec approaches (CBOW and Skip-gram)
  • Understand how GloVe captures global statistics
  • Recognize how FastText handles subword information
  • Implement and use pre-trained word embeddings in practical applications

The Challenge of Word Representation

One-Hot Encoding: A Starting Point

Before embeddings, the standard approach to represent words was one-hot encoding:

"cat" → [0, 0, 1, 0, 0, ... 0] "dog" → [0, 0, 0, 1, 0, ... 0]

In a one-hot encoding, each word gets a unique position in a very high-dimensional vector (the size of your vocabulary). Only one element is "hot" (set to 1), and all others are 0.
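To make this concrete, here is a minimal NumPy sketch using a tiny hypothetical vocabulary; it shows that one-hot vectors for any two distinct words are orthogonal, so they carry no similarity information:

```python
import numpy as np

# A tiny hypothetical vocabulary for illustration
vocab = ["cat", "dog", "kitten", "spacecraft"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector with a 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Every pair of distinct words has dot product 0 - no notion of relatedness
cat, kitten, spacecraft = one_hot("cat"), one_hot("kitten"), one_hot("spacecraft")
print(np.dot(cat, kitten))      # 0.0
print(np.dot(cat, spacecraft))  # 0.0
```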

Limitations of One-Hot Encoding

  1. Dimensionality: For a vocabulary of 50,000 words, each vector has 50,000 dimensions but only contains a single piece of information.

  2. No Semantic Information: "cat" and "kitten" are as different as "cat" and "spacecraft" - all word pairs are equidistant.

  3. No Generalization: A model can't transfer knowledge between similar words.

Analogy: Library with No Organization

Imagine a library where books are simply assigned arbitrary shelf numbers without any organizing principle. Similar books might be placed on opposite ends of the building. Finding related content would require memorizing each book's exact location, with no way to guess where related titles might be.

Word embeddings are like organizing this library topically, where similar books are placed near each other, allowing you to browse naturally based on subject matter.

Distributional Semantics: The Foundation

The theoretical foundation for word embeddings comes from distributional semantics, captured in J.R. Firth's famous quote:

"You shall know a word by the company it keeps."

This idea suggests that words appearing in similar contexts likely have similar meanings. For example, "cat" and "dog" often appear near words like "pet," "animal," "fur," etc.
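As a toy illustration of this idea (the mini-corpus below is invented), we can count which words co-occur within a small window; words used in similar contexts, like "cat" and "dog" here, end up with similar co-occurrence counts, while unrelated words do not:

```python
from collections import Counter, defaultdict

# Invented mini-corpus for illustration
corpus = [
    "my pet cat sleeps all day",
    "my pet dog barks all day",
    "the spacecraft orbits mars",
]

window = 2  # how many words to the left/right count as context
cooccurrence = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooccurrence[word][tokens[j]] += 1

# "cat" and "dog" share context words ("my", "pet", "all"); "spacecraft" shares none
print(cooccurrence["cat"])
print(cooccurrence["dog"])
print(cooccurrence["spacecraft"])
```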

Visualizing the Distributional Hypothesis

Word Context Explorer (interactive tool)

This tool visualizes how a word appears in different contexts, demonstrating the distributional hypothesis: "You shall know a word by the company it keeps." Example contexts for "bank":

  • Financial: "I need to go to the bank to deposit my paycheck."
  • Geographical: "We sat on the bank of the river watching boats go by."
  • Action (verb): "The pilot had to bank the aircraft sharply to avoid the mountain."
  • Financial: "The bank approved my mortgage application yesterday."
  • Geographical: "The bank of the river had eroded after the heavy rain."

Notice how the word "bank" has a different meaning in each context. Word embeddings capture these contextual patterns by analyzing millions of examples, placing words that appear in similar contexts closer together in the vector space.

Word2Vec: Making Words Computable

In 2013, Tomas Mikolov and colleagues at Google introduced Word2Vec, a groundbreaking approach to learning word representations from large text corpora.

The Word2Vec Intuition

Word2Vec transforms words into dense vectors (typically 100-300 dimensions) where:

  1. Similar words are close together in vector space
  2. Relationships between words are preserved as vector operations
  3. Different aspects of meaning are captured in different dimensions

Two Architecture Variants

Word2Vec comes in two flavors:

  1. Continuous Bag of Words (CBOW): Predicts a target word from its context words
  2. Skip-gram: Predicts context words from a target word

Continuous Bag of Words (CBOW)

CBOW predicts a target word given its surrounding context words.

Architecture

  1. Context words are one-hot encoded
  2. These encodings are projected through a shared weight matrix
  3. The projections are averaged
  4. The result passes through an output layer to predict the target word

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: one-hot vectors $\mathbf{x}_{t-n}, \ldots, \mathbf{x}_{t-1}, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+n}$
  2. Hidden layer: $\mathbf{h} = \frac{1}{2n}\mathbf{W}^T(\mathbf{x}_{t-n} + \ldots + \mathbf{x}_{t-1} + \mathbf{x}_{t+1} + \ldots + \mathbf{x}_{t+n})$
  3. Output layer: $u_j = {\mathbf{v}'_j}^T \mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: $p(w_j \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$

Where $\mathbf{W}$ and $\mathbf{V}'$ are the input-to-hidden and hidden-to-output weight matrices.
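The following is a minimal NumPy sketch of a single CBOW forward pass under these equations. The vocabulary size, dimensions, word indices, and random weights are made up for illustration; a real implementation would also include backpropagation and the training optimizations discussed later:

```python
import numpy as np

V, d = 10, 8                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))      # input-to-hidden weights (one row per word)
W_out = rng.normal(size=(d, V))  # hidden-to-output weights

context_ids = [1, 2, 4, 5]       # indices of w(t-2), w(t-1), w(t+1), w(t+2)

# Hidden layer: average of the context word vectors (rows of W)
h = W[context_ids].mean(axis=0)

# Output scores and softmax over the vocabulary
u = h @ W_out
p = np.exp(u - u.max())
p /= p.sum()

predicted_target = int(np.argmax(p))
print(predicted_target, p[predicted_target])
```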

Skip-gram

Skip-gram is the reverse of CBOW: it predicts context words given a target word.

Architecture

  1. Target word is one-hot encoded
  2. This encoding is projected through a weight matrix
  3. The result is used to predict each context word independently

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: one-hot vector $\mathbf{x}_t$
  2. Hidden layer: $\mathbf{h} = \mathbf{W}^T\mathbf{x}_t$
  3. Output layer: $u_j = {\mathbf{v}'_j}^T \mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: for each position $i$ in the context window, $p(w_{t+i} \mid w_t) = \frac{\exp(u_{w_{t+i}})}{\sum_{j=1}^{V} \exp(u_j)}$
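A matching NumPy sketch for Skip-gram, with the same toy setup as the CBOW example above: the hidden layer is simply the target word's vector, and one softmax distribution over the vocabulary scores every position in the context window.

```python
import numpy as np

V, d = 10, 8                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))      # input-to-hidden weights
W_out = rng.normal(size=(d, V))  # hidden-to-output weights

target_id = 3                    # index of w(t)
context_ids = [1, 2, 4, 5]       # indices of the surrounding context words

# Hidden layer: the target word's vector
h = W[target_id]

# One softmax over the vocabulary, reused for every context position
u = h @ W_out
p = np.exp(u - u.max())
p /= p.sum()

# Training would maximize the probability of each true context word
print([float(p[c]) for c in context_ids])
```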

Visual Comparison: CBOW vs Skip-gram

Word2Vec Architecture Explorer (interactive visualization)

This visualization compares the two Word2Vec architecture variants and shows how they differ in structure, training process, and applications.

  • CBOW: the context words w(t-2), w(t-1), w(t+1), w(t+2) are projected through shared weights, their vectors are averaged in the projection layer, and the output layer predicts the target word w(t). The objective is to maximize the probability of the target word given the context words.
  • Skip-gram: the target word w(t) is projected through the weight matrix, and the output layer predicts each surrounding context word. The objective is to maximize the probability of the context words given the target word.

This visualization highlights the architectural differences between CBOW and Skip-gram, helping you understand when to use each approach.

Training Optimizations

Computing the full softmax for large vocabularies (e.g., millions of words) is computationally expensive. Two main optimization techniques are used:

  1. Hierarchical Softmax: Uses a binary tree structure to reduce complexity from O(V) to O(log V)
  2. Negative Sampling: Updates only a small subset of weights in each iteration

Negative Sampling Explained

Instead of updating all output neurons, negative sampling:

  1. Updates the weights for the correct output
  2. Updates weights for a few randomly chosen "negative" outputs
  3. Significantly speeds up training

The objective function becomes:

$$\log \sigma(v_{w_O}^T v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-v_{w_i}^T v_{w_I})\right]$$

Where:

  • $v_{w_I}$ is the input vector for the target word
  • $v_{w_O}$ is the output vector for the context word
  • $w_i$ are the negative samples, drawn from a noise distribution $P_n(w)$
  • $\sigma$ is the sigmoid function
  • $k$ is the number of negative samples (typically 5-20)
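Here is a minimal NumPy sketch of this objective for one (target, context) pair. The vectors, indices, and uniformly drawn negatives are made up for illustration; a real trainer samples negatives from a smoothed unigram distribution and updates the vectors by gradient ascent on this quantity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 1000, 100, 5                             # toy vocabulary size, dimension, negatives per pair

in_vectors = rng.normal(scale=0.1, size=(V, d))    # "input" vectors v_wI
out_vectors = rng.normal(scale=0.1, size=(V, d))   # "output" vectors v_wO

target, context = 42, 7                            # a (target, context) pair from the corpus
negatives = rng.integers(0, V, size=k)             # noise words (uniform here for simplicity)

v_in = in_vectors[target]
pos_score = np.log(sigmoid(out_vectors[context] @ v_in))
neg_score = np.log(sigmoid(-out_vectors[negatives] @ v_in)).sum()

objective = pos_score + neg_score                  # quantity to maximize for this training pair
print(objective)
```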

CBOW vs Skip-gram: When to Use Each

| Feature | CBOW | Skip-gram |
|---|---|---|
| Training speed | Faster | Slower |
| Performance on frequent words | Better | Good |
| Performance on rare words | Worse | Better |
| Small training corpus | Better | Worse |
| Large training corpus | Good | Better |
| Captures multiple word senses | Limited | Better |

Note: Skip-gram generally produces better quality embeddings but is more computationally expensive.

Interactive Word2Vec Explorer

Word2Vec Explorer (interactive tool)

This playground lets you explore word embeddings and their relationships in vector space: select a word to see its nearest neighbors, and perform vector arithmetic such as king - man + woman ≈ queen.

Word Analogies: Vector Arithmetic

One of the most fascinating properties of word embeddings is their ability to capture linguistic regularities through vector arithmetic.

The Famous Example

vec("king")vec("man")+vec("woman")vec("queen")\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"})

This shows how the model captures gender relationships between words.
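With a pre-trained model loaded in Gensim (as in the implementation section below), this relationship can be checked directly with `most_similar`. The exact neighbors and scores depend on which pre-trained vectors you load, and the model is a sizeable download on first use:

```python
import gensim.downloader as api

# Load pre-trained vectors (large download on first use)
model = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
for word, score in result:
    print(f"{word}: {score:.3f}")
# "queen" is typically the top result with these vectors
```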

Other Analogies

Interactive Analogy Explorer

Word Analogies Explorer (interactive tool)

This tool lets you explore how word analogies work in embedding space. You choose a first pair (e.g., king is to queen) and a second pair (e.g., man is to woman); the tool computes queen - king + man and reports its similarity to woman (99.7% in this example). You can also create your own analogies to test the famous "king - man + woman ≈ queen" relationship and other analogical relationships captured by word embeddings.

GloVe: Global Vectors for Word Representation

While Word2Vec learns from local context windows, GloVe (Global Vectors) incorporates global statistics about word co-occurrences across the entire corpus.

GloVe's Approach

GloVe combines the advantages of two paradigms:

  1. Matrix factorization methods like LSA (captures global statistics)
  2. Local context window methods like Word2Vec (captures local context)

GloVe's Mathematical Foundation

GloVe trains on global word-word co-occurrence statistics from a corpus. The objective function is:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

Where:

  • $X_{ij}$ is the number of times word $j$ appears in the context of word $i$
  • $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the word and context word vectors
  • $b_i$ and $\tilde{b}_j$ are bias terms
  • $f(X_{ij})$ is a weighting function that gives less weight to rare co-occurrences
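Below is a minimal NumPy sketch that evaluates this objective once for a toy co-occurrence matrix. The weighting function follows the commonly cited form $f(x) = (x/x_{\max})^{\alpha}$ capped at 1 (with $x_{\max} = 100$ and $\alpha = 0.75$); the matrix, dimensions, and random parameters are invented for illustration, and no gradient updates are performed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                          # toy vocabulary size and embedding dimension
X = rng.integers(0, 50, size=(V, V)).astype(float)   # toy co-occurrence counts

x_max, alpha = 100.0, 0.75

def weight(x):
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Randomly initialized parameters (training would update these by gradient descent)
W = rng.normal(scale=0.1, size=(V, d))        # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_j
b = np.zeros(V)                               # biases b_i
b_tilde = np.zeros(V)                         # biases b~_j

mask = X > 0                                  # log X_ij is only defined for observed pairs
residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(np.where(mask, X, 1.0))
J = np.sum(weight(X)[mask] * residual[mask] ** 2)
print(J)
```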

GloVe vs Word2Vec

| Feature | Word2Vec | GloVe |
|---|---|---|
| Learning mechanism | Predictive (neural network) | Count-based (matrix factorization) |
| Training context | Local sliding window | Global co-occurrence statistics |
| Training efficiency | Requires many passes | Converges faster |
| Parallelizable | Less parallelizable | Highly parallelizable |
| Performance on analogies | Good | Slightly better |
| Captures rare co-occurrences | May miss them | Captures global patterns |

FastText: Improving with Subword Information

FastText, developed by Facebook AI Research, extends Word2Vec by incorporating subword information, addressing a major limitation of previous models: handling out-of-vocabulary and rare words.

The Subword Approach

While Word2Vec and GloVe treat each word as an atomic unit, FastText represents each word as a bag of character n-grams plus the whole word.

For example, the word "where" with n-grams of length 3-6 would be represented as:

  • Whole word: <where>
  • 3-grams: <wh, whe, her, ere, re>
  • 4-grams: <whe, wher, here, ere>
  • 5-grams: <wher, where, here>
  • 6-grams: <where, where>

(Note: < and > are special boundary symbols marking the start and end of the word.)
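A small sketch of this n-gram extraction; the boundary symbols and the 3-6 length range mirror the description above, but this is an illustrative reimplementation rather than FastText's actual code:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with < and > as boundary symbols."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', ...]
```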

Mathematical Formulation

In FastText, a word's embedding is the sum of its character n-gram embeddings:

$$\mathbf{v}_w = \sum_{g \in G_w} \mathbf{z}_g$$

Where:

  • $G_w$ is the set of n-grams appearing in word $w$ (plus the whole word itself)
  • $\mathbf{z}_g$ is the vector representation of n-gram $g$

Benefits of FastText

  1. Handles out-of-vocabulary words: Can generate embeddings for words never seen during training
  2. Better for morphologically rich languages: Captures prefixes, suffixes, and roots
  3. Robust to misspellings: Similar spellings result in similar embeddings
  4. Smaller models: Can represent larger vocabularies efficiently

Interactive FastText vs Word2Vec Comparison

Embedding Models Comparison (interactive tool)

This tool compares how different embedding models represent words and their relationships; you select models and words to see how the results differ across approaches.

| Model | OOV Handling | Subword Information | Contextual Awareness |
|---|---|---|---|
| Word2Vec (2013) | No | No | No |
| GloVe (2014) | No | No | No |
| FastText (2016) | Yes | Yes | No |
| ELMo (2018) | Limited | Yes | Yes |
| BERT (2018) | Limited | Yes | Yes |

Model features:

  • Out-of-Vocabulary (OOV) Handling: the ability to generate embeddings for words not seen during training.
  • Subword Information: using character n-grams or other subword features to build word representations.
  • Contextual Awareness: whether the model generates different representations for the same word in different contexts.

Example output (top similar words returned by each model for the same query word):

| Model | Top Similar Words |
|---|---|
| Word2Vec (2013) | account, money, loan, financial, credit |
| FastText (2016) | banks, banking, banker, money, financial |

Key differences:

  • Word2Vec and GloVe use whole-word vectors, so they struggle with rare and unseen words.
  • FastText adds subword information, improving handling of morphologically rich languages and typos.
  • ELMo and BERT create contextualized embeddings that change based on surrounding words.
  • Notice how the models prioritize different relationships (semantic vs. syntactic) in their similar words.

This comparison highlights how FastText handles out-of-vocabulary words and subword information versus traditional Word2Vec approaches.

Analogy: Character-Based Recognition

Think of how humans recognize related words. If you've never seen the word "unhappiness" but know "happy," "unhappy," and "happiness," you can deduce its meaning from its parts. FastText follows a similar principle, building word meaning from component parts.

Practical Implementation

Using Word2Vec with Gensim

```python
import gensim.downloader as api

# Load a pre-trained model (downloaded on first use)
word2vec_model = api.load('word2vec-google-news-300')

# Find similar words
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, score in similar_words:
    print(f"  {word}: {score:.3f}")
```

Using GloVe with Python

```python
import os
import urllib.request
import zipfile

from gensim.models import KeyedVectors

# Download and extract the GloVe vectors (the 6B archive is roughly 800 MB)
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_path = "glove.6B.zip"

if not os.path.exists(glove_path):
    urllib.request.urlretrieve(glove_url, glove_path)
with zipfile.ZipFile(glove_path) as archive:
    archive.extract("glove.6B.100d.txt")

# Load the 100-dimensional vectors; no_header=True handles GloVe's plain-text
# format directly (requires gensim >= 4.0)
glove_model = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt", binary=False, no_header=True
)
print(glove_model.most_similar("computer", topn=5))
```

Using FastText

```python
import fasttext
import fasttext.util

# Download the pre-trained English FastText model (large download)
fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft_model = fasttext.load_model('cc.en.300.bin')

# Reduce model dimensions for faster processing (optional)
fasttext.util.reduce_model(ft_model, 100)

# FastText can build vectors even for out-of-vocabulary or misspelled words
print(ft_model.get_nearest_neighbors('computre'))
```

Evaluating Word Embeddings

Intrinsic Evaluation

  1. Word Similarity: How well do embedding distances correlate with human judgments?

    • WordSim-353, SimLex-999, MEN datasets
  2. Word Analogies: How well do embeddings capture relationships?

    • Google analogy dataset (semantic and syntactic analogies)
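As a sketch of intrinsic evaluation, the snippet below computes the Spearman correlation between human similarity judgments and embedding similarities. The tiny word-pair list is invented for illustration (real evaluations use datasets like WordSim-353 or SimLex-999), and the choice of pre-trained vectors is just an example:

```python
from scipy.stats import spearmanr
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors works here

# A few hand-picked (word1, word2, human_score) triples for illustration only
pairs = [
    ("cat", "kitten", 8.5),
    ("car", "automobile", 9.0),
    ("cat", "spacecraft", 0.5),
    ("coast", "shore", 8.0),
]

human_scores = [score for _, _, score in pairs]
model_scores = [model.similarity(w1, w2) for w1, w2, _ in pairs]

correlation, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")
```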

Extrinsic Evaluation

Test performance on downstream tasks:

  • Named Entity Recognition
  • Sentiment Analysis
  • Part-of-Speech Tagging

Visualization of Evaluation Metrics

Embedding Performance Comparison:

  • Word2Vec: Good performance across most metrics
  • GloVe: Slightly better on analogies and word similarity
  • FastText: Best performance on rare words and morphologically rich tasks

Limitations of Traditional Word Embeddings

Despite their revolutionary impact, traditional word embeddings have several limitations:

  1. Static Word Representations: Each word has a single vector, regardless of context

    • "bank" has the same representation in "river bank" and "bank account"
  2. Limited Compositional Understanding: Poor at representing phrases and sentences

  3. Bias and Fairness Issues: Embeddings learn and amplify biases in training data

    • Example: "man : doctor :: woman : nurse"
  4. Requires Large Corpora: Need substantial training data for good quality

Visualizing Contextual Ambiguity

Word Sense Disambiguation (interactive visualization)

This visualization shows how contextual embeddings position the same word differently based on its meaning in context, plotting senses such as bank (financial), bank (river), and bank (verb) at different points in a 2D embedding space.

Example contexts:

  • bank (financial): "I deposited money into my bank account yesterday."
  • bank (river): "We sat on the bank of the river watching boats pass by."
  • bank (verb): "The pilot had to bank the aircraft sharply to avoid the mountain."

Contextual vs. Static Embeddings

Traditional word embeddings like Word2Vec assign the same vector to a word regardless of context. Contextual embeddings like ELMo and BERT create different vectors based on the surrounding words, allowing them to distinguish between different meanings of the same word.

Other ambiguous words:

  • bank: financial institution • river edge • to tilt
  • light: not heavy • brightness • to ignite
  • run: to move quickly • to operate • a series
  • spring: season • coiled metal • water source
  • bear: animal • to endure • stock market term

This tool illustrates how traditional embeddings assign the same vector to words like "bank" regardless of whether it means a financial institution or a river bank.

Summary

In this lesson, we've covered:

  1. The evolution from sparse to dense word representations
  2. Word2Vec approaches: CBOW and Skip-gram
  3. GloVe's incorporation of global statistics
  4. FastText's handling of subword information
  5. Practical implementations of word embedding models
  6. Limitations of traditional embedding approaches

These foundational models revolutionized NLP by transforming words into rich, meaningful vector spaces. However, they represent just the beginning of the embedding journey.

In our next lesson, we'll explore contextual embeddings from models like ELMo, BERT, and modern language models, which address many limitations of the traditional approaches we've covered here.

Practice Exercises

  1. Word Embedding Exploration:

    • Download pre-trained Word2Vec, GloVe, and FastText models
    • Compare their performance on a set of word analogies
    • Visualize word clusters in 2D using dimensionality reduction
  2. Training Custom Embeddings:

    • Train Word2Vec and FastText embeddings on a domain-specific corpus
    • Compare their performance against general pre-trained models
    • Analyze how domain focus affects quality
  3. Word Similarity Application:

    • Build a simple document similarity system using word embeddings (a starter sketch follows after this list)
    • Create an average-of-embeddings representation for sentences
    • Compute distances between documents
  4. Embedding Bias Analysis:

    • Investigate gender, racial, or other biases in pre-trained embeddings
    • Implement a simple debiasing approach
    • Measure the impact of debiasing on analogy tasks
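For exercise 3, a common starting point is to average the word vectors in each document and compare documents with cosine similarity. Here is a minimal sketch; the example documents and the choice of pre-trained vectors are placeholders:

```python
import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # placeholder: any pre-trained KeyedVectors

def doc_vector(text):
    """Average the embeddings of all in-vocabulary words in the document."""
    words = [w for w in text.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = doc_vector("the cat sat on the mat")
doc2 = doc_vector("a kitten rested on the rug")
doc3 = doc_vector("the rocket launched into orbit")

print(cosine(doc1, doc2))   # expected to be higher
print(cosine(doc1, doc3))   # expected to be lower
```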

Additional Resources