Word Embeddings: From Word2Vec to FastText

Overview

In our previous lessons, we explored how to preprocess text and tokenize it into meaningful units. While these are crucial steps, they still don't solve a fundamental challenge in NLP: how do we represent words in a way that captures their meaning and relationships?

This lesson introduces word embeddings - dense vector representations that encode semantic relationships between words. These representations revolutionized NLP by enabling machines to understand semantic similarity, analogies, and other relationships between words that were previously difficult to capture.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of traditional one-hot encoding for word representation
  • Explain the intuition and theory behind word embeddings
  • Differentiate between Word2Vec approaches (CBOW and Skip-gram)
  • Understand how GloVe captures global statistics
  • Recognize how FastText handles subword information
  • Implement and use pre-trained word embeddings in practical applications

The Challenge of Word Representation

One-Hot Encoding: A Starting Point

Before embeddings, the standard approach to represent words was one-hot encoding:

"cat" → [0, 0, 1, 0, 0, ... 0] "dog" → [0, 0, 0, 1, 0, ... 0]

In a one-hot encoding, each word gets a unique position in a very high-dimensional vector (the size of your vocabulary). Only one element is "hot" (set to 1), and all others are 0.
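To make this concrete, here is a minimal NumPy sketch using a tiny hypothetical vocabulary; it shows that one-hot vectors for any two distinct words are orthogonal, so they carry no similarity information:

```python
import numpy as np

# A tiny hypothetical vocabulary for illustration
vocab = ["cat", "dog", "kitten", "spacecraft"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector with a 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Every pair of distinct words has dot product 0 - no notion of relatedness
cat, kitten, spacecraft = one_hot("cat"), one_hot("kitten"), one_hot("spacecraft")
print(np.dot(cat, kitten))      # 0.0
print(np.dot(cat, spacecraft))  # 0.0
```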

Limitations of One-Hot Encoding

  1. Dimensionality: For a vocabulary of 50,000 words, each vector has 50,000 dimensions but only contains a single piece of information.

  2. No Semantic Information: "cat" and "kitten" are as different as "cat" and "spacecraft" - all word pairs are equidistant.

  3. No Generalization: A model can't transfer knowledge between similar words.

Analogy: Library with No Organization

Imagine a library where books are simply assigned arbitrary shelf numbers without any organizing principle. Similar books might be placed on opposite ends of the building. Finding related content would require memorizing each book's exact location, with no way to guess where related titles might be.

Word embeddings are like organizing this library topically, where similar books are placed near each other, allowing you to browse naturally based on subject matter.

Distributional Semantics: The Foundation

The theoretical foundation for word embeddings comes from distributional semantics, captured in J.R. Firth's famous quote:

"You shall know a word by the company it keeps."

This idea suggests that words appearing in similar contexts likely have similar meanings. For example, "cat" and "dog" often appear near words like "pet," "animal," "fur," etc.
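As a toy illustration of this idea (the mini-corpus below is invented), we can count which words co-occur within a small window; words used in similar contexts, like "cat" and "dog" here, end up with similar co-occurrence counts, while unrelated words do not:

```python
from collections import Counter, defaultdict

# Invented mini-corpus for illustration
corpus = [
    "my pet cat sleeps all day",
    "my pet dog barks all day",
    "the spacecraft orbits mars",
]

window = 2  # how many words to the left/right count as context
cooccurrence = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooccurrence[word][tokens[j]] += 1

# "cat" and "dog" share context words ("my", "pet", "all"); "spacecraft" shares none
print(cooccurrence["cat"])
print(cooccurrence["dog"])
print(cooccurrence["spacecraft"])
```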

Visualizing the Distributional Hypothesis

Word Context Explorer (interactive tool)

This tool visualizes how a word appears in different contexts, demonstrating the distributional hypothesis: "You shall know a word by the company it keeps." Example contexts for "bank":

  • Financial: "I need to go to the bank to deposit my paycheck."
  • Geographical: "We sat on the bank of the river watching boats go by."
  • Action (verb): "The pilot had to bank the aircraft sharply to avoid the mountain."
  • Financial: "The bank approved my mortgage application yesterday."
  • Geographical: "The bank of the river had eroded after the heavy rain."

Notice how the word "bank" has a different meaning in each context. Word embeddings capture these contextual patterns by analyzing millions of examples, placing words that appear in similar contexts closer together in the vector space.

Word2Vec: Making Words Computable

In 2013, Tomas Mikolov and colleagues at Google introduced Word2Vec, a groundbreaking approach to learning word representations from large text corpora.

The Word2Vec Intuition

Word2Vec transforms words into dense vectors (typically 100-300 dimensions) where:

  1. Similar words are close together in vector space
  2. Relationships between words are preserved as vector operations
  3. Different aspects of meaning are captured in different dimensions

Two Architecture Variants

Word2Vec comes in two flavors:

  1. Continuous Bag of Words (CBOW): Predicts a target word from its context words
  2. Skip-gram: Predicts context words from a target word

Continuous Bag of Words (CBOW)

CBOW predicts a target word given its surrounding context words.

Architecture

  1. Context words are one-hot encoded
  2. These encodings are projected through a shared weight matrix
  3. The projections are averaged
  4. The result passes through an output layer to predict the target word

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: one-hot vectors $\mathbf{x}_{t-n}, \ldots, \mathbf{x}_{t-1}, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+n}$
  2. Hidden layer: $\mathbf{h} = \frac{1}{2n}\mathbf{W}^T(\mathbf{x}_{t-n} + \ldots + \mathbf{x}_{t-1} + \mathbf{x}_{t+1} + \ldots + \mathbf{x}_{t+n})$
  3. Output layer: $u_j = {\mathbf{v}'_j}^T \mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: $p(w_j \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$

Where $\mathbf{W}$ and $\mathbf{V}'$ are the input-to-hidden and hidden-to-output weight matrices.
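The following is a minimal NumPy sketch of a single CBOW forward pass under these equations. The vocabulary size, dimensions, word indices, and random weights are made up for illustration; a real implementation would also include backpropagation and the training optimizations discussed later:

```python
import numpy as np

V, d = 10, 8                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))      # input-to-hidden weights (one row per word)
W_out = rng.normal(size=(d, V))  # hidden-to-output weights

context_ids = [1, 2, 4, 5]       # indices of w(t-2), w(t-1), w(t+1), w(t+2)

# Hidden layer: average of the context word vectors (rows of W)
h = W[context_ids].mean(axis=0)

# Output scores and softmax over the vocabulary
u = h @ W_out
p = np.exp(u - u.max())
p /= p.sum()

predicted_target = int(np.argmax(p))
print(predicted_target, p[predicted_target])
```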

Skip-gram

Skip-gram is the reverse of CBOW: it predicts context words given a target word.

Architecture

  1. Target word is one-hot encoded
  2. This encoding is projected through a weight matrix
  3. The result is used to predict each context word independently

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: one-hot vector $\mathbf{x}_t$
  2. Hidden layer: $\mathbf{h} = \mathbf{W}^T\mathbf{x}_t$
  3. Output layer: $u_j = {\mathbf{v}'_j}^T \mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: for each position $i$ in the context window, $p(w_{t+i} \mid w_t) = \frac{\exp(u_{w_{t+i}})}{\sum_{j=1}^{V} \exp(u_j)}$
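A matching NumPy sketch for Skip-gram, with the same toy setup as the CBOW example above: the hidden layer is simply the target word's vector, and one softmax distribution over the vocabulary scores every position in the context window.

```python
import numpy as np

V, d = 10, 8                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))      # input-to-hidden weights
W_out = rng.normal(size=(d, V))  # hidden-to-output weights

target_id = 3                    # index of w(t)
context_ids = [1, 2, 4, 5]       # indices of the surrounding context words

# Hidden layer: the target word's vector
h = W[target_id]

# One softmax over the vocabulary, reused for every context position
u = h @ W_out
p = np.exp(u - u.max())
p /= p.sum()

# Training would maximize the probability of each true context word
print([float(p[c]) for c in context_ids])
```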

Visual Comparison: CBOW vs Skip-gram

Word2Vec Architecture Explorer (interactive visualization)

This visualization compares the two Word2Vec architecture variants and shows how they differ in structure, training process, and applications.

  • CBOW: the context words w(t-2), w(t-1), w(t+1), w(t+2) are projected through shared weights, their vectors are averaged in the projection layer, and the output layer predicts the target word w(t). The objective is to maximize the probability of the target word given the context words.
  • Skip-gram: the target word w(t) is projected through the weight matrix, and the output layer predicts each surrounding context word. The objective is to maximize the probability of the context words given the target word.

This visualization highlights the architectural differences between CBOW and Skip-gram, helping you understand when to use each approach.

Training Optimizations

Computing the full softmax for large vocabularies (e.g., millions of words) is computationally expensive. Two main optimization techniques are used:

  1. Hierarchical Softmax: Uses a binary tree structure to reduce complexity from O(V) to O(log V)
  2. Negative Sampling: Updates only a small subset of weights in each iteration

Negative Sampling Explained

Instead of updating all output neurons, negative sampling:

  1. Updates the weights for the correct output
  2. Updates weights for a few randomly chosen "negative" outputs
  3. Significantly speeds up training

The objective function becomes:

$$\log \sigma(v_{w_O}^T v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-v_{w_i}^T v_{w_I})\right]$$

Where:

  • $v_{w_I}$ is the input vector for the target word
  • $v_{w_O}$ is the output vector for the context word
  • $w_i$ are the negative samples, drawn from a noise distribution $P_n(w)$
  • $\sigma$ is the sigmoid function
  • $k$ is the number of negative samples (typically 5-20)
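Here is a minimal NumPy sketch of this objective for one (target, context) pair. The vectors, indices, and uniformly drawn negatives are made up for illustration; a real trainer samples negatives from a smoothed unigram distribution and updates the vectors by gradient ascent on this quantity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 1000, 100, 5                             # toy vocabulary size, dimension, negatives per pair

in_vectors = rng.normal(scale=0.1, size=(V, d))    # "input" vectors v_wI
out_vectors = rng.normal(scale=0.1, size=(V, d))   # "output" vectors v_wO

target, context = 42, 7                            # a (target, context) pair from the corpus
negatives = rng.integers(0, V, size=k)             # noise words (uniform here for simplicity)

v_in = in_vectors[target]
pos_score = np.log(sigmoid(out_vectors[context] @ v_in))
neg_score = np.log(sigmoid(-out_vectors[negatives] @ v_in)).sum()

objective = pos_score + neg_score                  # quantity to maximize for this training pair
print(objective)
```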

CBOW vs Skip-gram: When to Use Each

| Feature | CBOW | Skip-gram |
|---|---|---|
| Training speed | Faster | Slower |
| Performance on frequent words | Better | Good |
| Performance on rare words | Worse | Better |
| Small training corpus | Better | Worse |
| Large training corpus | Good | Better |
| Captures multiple word senses | Limited | Better |

Note: Skip-gram generally produces better quality embeddings but is more computationally expensive.

Interactive Word2Vec Explorer

Word2Vec Explorer (interactive tool)

This playground lets you explore word embeddings and their relationships in vector space: select a word to see its nearest neighbors, and perform vector arithmetic such as king - man + woman ≈ queen.

Word Analogies: Vector Arithmetic

One of the most fascinating properties of word embeddings is their ability to capture linguistic regularities through vector arithmetic.

The Famous Example

vec("king")vec("man")+vec("woman")vec("queen")\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"})

This shows how the model captures gender relationships between words.
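With a pre-trained model loaded in Gensim (as in the implementation section below), this relationship can be checked directly with `most_similar`. The exact neighbors and scores depend on which pre-trained vectors you load, and the model is a sizeable download on first use:

```python
import gensim.downloader as api

# Load pre-trained vectors (large download on first use)
model = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
for word, score in result:
    print(f"{word}: {score:.3f}")
# "queen" is typically the top result with these vectors
```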

Other Analogies

Interactive Analogy Explorer

Word Analogies Explorer (interactive tool)

This tool lets you explore how word analogies work in embedding space. You choose a first pair (e.g., king is to queen) and a second pair (e.g., man is to woman); the tool computes queen - king + man and reports its similarity to woman (99.7% in this example). You can also create your own analogies to test the famous "king - man + woman ≈ queen" relationship and other analogical relationships captured by word embeddings.

GloVe: Global Vectors for Word Representation

While Word2Vec learns from local context windows, GloVe (Global Vectors) incorporates global statistics about word co-occurrences across the entire corpus.

GloVe's Approach

GloVe combines the advantages of two paradigms:

  1. Matrix factorization methods like LSA (captures global statistics)
  2. Local context window methods like Word2Vec (captures local context)

GloVe's Mathematical Foundation

GloVe trains on global word-word co-occurrence statistics from a corpus. The objective function is:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

Where:

  • $X_{ij}$ is the number of times word $j$ appears in the context of word $i$
  • $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the word and context word vectors
  • $b_i$ and $\tilde{b}_j$ are bias terms
  • $f(X_{ij})$ is a weighting function that gives less weight to rare co-occurrences
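Below is a minimal NumPy sketch that evaluates this objective once for a toy co-occurrence matrix. The weighting function follows the commonly cited form $f(x) = (x/x_{\max})^{\alpha}$ capped at 1 (with $x_{\max} = 100$ and $\alpha = 0.75$); the matrix, dimensions, and random parameters are invented for illustration, and no gradient updates are performed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                                          # toy vocabulary size and embedding dimension
X = rng.integers(0, 50, size=(V, V)).astype(float)   # toy co-occurrence counts

x_max, alpha = 100.0, 0.75

def weight(x):
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Randomly initialized parameters (training would update these by gradient descent)
W = rng.normal(scale=0.1, size=(V, d))        # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_j
b = np.zeros(V)                               # biases b_i
b_tilde = np.zeros(V)                         # biases b~_j

mask = X > 0                                  # log X_ij is only defined for observed pairs
residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(np.where(mask, X, 1.0))
J = np.sum(weight(X)[mask] * residual[mask] ** 2)
print(J)
```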

GloVe vs Word2Vec

| Feature | Word2Vec | GloVe |
|---|---|---|
| Learning mechanism | Predictive (neural network) | Count-based (matrix factorization) |
| Training context | Local sliding window | Global co-occurrence statistics |
| Training efficiency | Requires many passes | Converges faster |
| Parallelizable | Less parallelizable | Highly parallelizable |
| Performance on analogies | Good | Slightly better |
| Captures rare co-occurrences | May miss them | Captures global patterns |

FastText: Improving with Subword Information

FastText, developed by Facebook AI Research, extends Word2Vec by incorporating subword information, addressing a major limitation of previous models: handling out-of-vocabulary and rare words.

The Subword Approach

While Word2Vec and GloVe treat each word as an atomic unit, FastText represents each word as a bag of character n-grams plus the whole word.

For example, the word "where" with n-grams of length 3-6 would be represented as:

  • Whole word: <where>
  • 3-grams: <wh, whe, her, ere, re>
  • 4-grams: <whe, wher, here, ere>
  • 5-grams: <wher, where, here>
  • 6-grams: <where, where>

(Note: < and > are special boundary symbols marking the start and end of the word.)
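A small sketch of this n-gram extraction; the boundary symbols and the 3-6 length range mirror the description above, but this is an illustrative reimplementation rather than FastText's actual code:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, with < and > as boundary symbols."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', ...]
```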

Mathematical Formulation

In FastText, a word's embedding is the sum of its character n-gram embeddings:

$$\mathbf{v}_w = \sum_{g \in G_w} \mathbf{z}_g$$

Where:

  • $G_w$ is the set of n-grams appearing in word $w$ (plus the whole word itself)
  • $\mathbf{z}_g$ is the vector representation of n-gram $g$

Benefits of FastText

  1. Handles out-of-vocabulary words: Can generate embeddings for words never seen during training
  2. Better for morphologically rich languages: Captures prefixes, suffixes, and roots
  3. Robust to misspellings: Similar spellings result in similar embeddings
  4. Smaller models: Can represent larger vocabularies efficiently

Interactive FastText vs Word2Vec Comparison

Embedding Models Comparison (interactive tool)

This tool compares how different embedding models represent words and their relationships; you select models and words to see how the results differ across approaches.

| Model | OOV Handling | Subword Information | Contextual Awareness |
|---|---|---|---|
| Word2Vec (2013) | No | No | No |
| GloVe (2014) | No | No | No |
| FastText (2016) | Yes | Yes | No |
| ELMo (2018) | Limited | Yes | Yes |
| BERT (2018) | Limited | Yes | Yes |

Model features:

  • Out-of-Vocabulary (OOV) Handling: the ability to generate embeddings for words not seen during training.
  • Subword Information: using character n-grams or other subword features to build word representations.
  • Contextual Awareness: whether the model generates different representations for the same word in different contexts.

Example output (top similar words returned by each model for the same query word):

| Model | Top Similar Words |
|---|---|
| Word2Vec (2013) | account, money, loan, financial, credit |
| FastText (2016) | banks, banking, banker, money, financial |

Key differences:

  • Word2Vec and GloVe use whole-word vectors, so they struggle with rare and unseen words.
  • FastText adds subword information, improving handling of morphologically rich languages and typos.
  • ELMo and BERT create contextualized embeddings that change based on surrounding words.
  • Notice how the models prioritize different relationships (semantic vs. syntactic) in their similar words.

This comparison highlights how FastText handles out-of-vocabulary words and subword information versus traditional Word2Vec approaches.

Analogy: Character-Based Recognition

Think of how humans recognize related words. If you've never seen the word "unhappiness" but know "happy," "unhappy," and "happiness," you can deduce its meaning from its parts. FastText follows a similar principle, building word meaning from component parts.

Practical Implementation

Using Word2Vec with Gensim

```python
import gensim.downloader as api

# Load a pre-trained model (downloaded on first use)
word2vec_model = api.load('word2vec-google-news-300')

# Find similar words
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, score in similar_words:
    print(f"  {word}: {score:.3f}")
```

Using GloVe with Python

```python
import os
import urllib.request
import zipfile

from gensim.models import KeyedVectors

# Download and extract the GloVe vectors (the 6B archive is roughly 800 MB)
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_path = "glove.6B.zip"

if not os.path.exists(glove_path):
    urllib.request.urlretrieve(glove_url, glove_path)
with zipfile.ZipFile(glove_path) as archive:
    archive.extract("glove.6B.100d.txt")

# Load the 100-dimensional vectors; no_header=True handles GloVe's plain-text
# format directly (requires gensim >= 4.0)
glove_model = KeyedVectors.load_word2vec_format(
    "glove.6B.100d.txt", binary=False, no_header=True
)
print(glove_model.most_similar("computer", topn=5))
```

Using FastText

```python
import fasttext
import fasttext.util

# Download the pre-trained English FastText model (large download)
fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft_model = fasttext.load_model('cc.en.300.bin')

# Reduce model dimensions for faster processing (optional)
fasttext.util.reduce_model(ft_model, 100)

# FastText can build vectors even for out-of-vocabulary or misspelled words
print(ft_model.get_nearest_neighbors('computre'))
```

Evaluating Word Embeddings

Intrinsic Evaluation

  1. Word Similarity: How well do embedding distances correlate with human judgments?

    • WordSim-353, SimLex-999, MEN datasets
  2. Word Analogies: How well do embeddings capture relationships?

    • Google analogy dataset (semantic and syntactic analogies)
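As a sketch of intrinsic evaluation, the snippet below computes the Spearman correlation between human similarity judgments and embedding similarities. The tiny word-pair list is invented for illustration (real evaluations use datasets like WordSim-353 or SimLex-999), and the choice of pre-trained vectors is just an example:

```python
from scipy.stats import spearmanr
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors works here

# A few hand-picked (word1, word2, human_score) triples for illustration only
pairs = [
    ("cat", "kitten", 8.5),
    ("car", "automobile", 9.0),
    ("cat", "spacecraft", 0.5),
    ("coast", "shore", 8.0),
]

human_scores = [score for _, _, score in pairs]
model_scores = [model.similarity(w1, w2) for w1, w2, _ in pairs]

correlation, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")
```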

Extrinsic Evaluation

Test performance on downstream tasks:

  • Named Entity Recognition
  • Sentiment Analysis
  • Part-of-Speech Tagging

Visualization of Evaluation Metrics

Embedding Performance Comparison:

  • Word2Vec: Good performance across most metrics
  • GloVe: Slightly better on analogies and word similarity
  • FastText: Best performance on rare words and morphologically rich tasks

Limitations of Traditional Word Embeddings

Despite their revolutionary impact, traditional word embeddings have several limitations:

  1. Static Word Representations: Each word has a single vector, regardless of context

    • "bank" has the same representation in "river bank" and "bank account"
  2. Limited Compositional Understanding: Poor at representing phrases and sentences

  3. Bias and Fairness Issues: Embeddings learn and amplify biases in training data

    • Example: "man : doctor :: woman : nurse"
  4. Requires Large Corpora: Need substantial training data for good quality

Visualizing Contextual Ambiguity

Word Sense Disambiguation (interactive visualization)

This visualization shows how contextual embeddings position the same word differently based on its meaning in context, plotting senses such as bank (financial), bank (river), and bank (verb) at different points in a 2D embedding space.

Example contexts:

  • bank (financial): "I deposited money into my bank account yesterday."
  • bank (river): "We sat on the bank of the river watching boats pass by."
  • bank (verb): "The pilot had to bank the aircraft sharply to avoid the mountain."

Contextual vs. Static Embeddings

Traditional word embeddings like Word2Vec assign the same vector to a word regardless of context. Contextual embeddings like ELMo and BERT create different vectors based on the surrounding words, allowing them to distinguish between different meanings of the same word.

Other ambiguous words:

  • bank: financial institution • river edge • to tilt
  • light: not heavy • brightness • to ignite
  • run: to move quickly • to operate • a series
  • spring: season • coiled metal • water source
  • bear: animal • to endure • stock market term

This tool illustrates how traditional embeddings assign the same vector to words like "bank" regardless of whether it means a financial institution or a river bank.

Summary

In this lesson, we've covered:

  1. The evolution from sparse to dense word representations
  2. Word2Vec approaches: CBOW and Skip-gram
  3. GloVe's incorporation of global statistics
  4. FastText's handling of subword information
  5. Practical implementations of word embedding models
  6. Limitations of traditional embedding approaches

These foundational models revolutionized NLP by transforming words into rich, meaningful vector spaces. However, they represent just the beginning of the embedding journey.

In our next lesson, we'll explore contextual embeddings from models like ELMo, BERT, and modern language models, which address many limitations of the traditional approaches we've covered here.

Practice Exercises

  1. Word Embedding Exploration:

    • Download pre-trained Word2Vec, GloVe, and FastText models
    • Compare their performance on a set of word analogies
    • Visualize word clusters in 2D using dimensionality reduction
  2. Training Custom Embeddings:

    • Train Word2Vec and FastText embeddings on a domain-specific corpus
    • Compare their performance against general pre-trained models
    • Analyze how domain focus affects quality
  3. Word Similarity Application:

    • Build a simple document similarity system using word embeddings (a starter sketch follows after this list)
    • Create an average-of-embeddings representation for sentences
    • Compute distances between documents
  4. Embedding Bias Analysis:

    • Investigate gender, racial, or other biases in pre-trained embeddings
    • Implement a simple debiasing approach
    • Measure the impact of debiasing on analogy tasks
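For exercise 3, a common starting point is to average the word vectors in each document and compare documents with cosine similarity. Here is a minimal sketch; the example documents and the choice of pre-trained vectors are placeholders:

```python
import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # placeholder: any pre-trained KeyedVectors

def doc_vector(text):
    """Average the embeddings of all in-vocabulary words in the document."""
    words = [w for w in text.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = doc_vector("the cat sat on the mat")
doc2 = doc_vector("a kitten rested on the rug")
doc3 = doc_vector("the rocket launched into orbit")

print(cosine(doc1, doc2))   # expected to be higher
print(cosine(doc1, doc3))   # expected to be lower
```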

Additional Resources