Overview
In our previous lesson, we introduced basic tokenization methods like word and character tokenization. While intuitive, these approaches have significant limitations when handling large vocabularies, out-of-vocabulary words, and morphologically rich languages.
Modern NLP models rely on sophisticated subword tokenization strategies that find an optimal balance between character-level and word-level representations. Today's leading models use three main approaches:
- SentencePiece (Unigram): Dominant in 2024 - used by LLaMA, PaLM, T5, and most multilingual models
- Byte-Pair Encoding (BPE): Powers GPT models and many encoder-decoder architectures
- WordPiece: Foundation of BERT-family models and Google's ecosystem
This lesson explores these subword tokenization techniques that have revolutionized NLP, with hands-on tools to understand how each algorithm works in practice.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the limitations of traditional tokenization approaches
- Explain how modern subword tokenization algorithms work
- Compare different subword tokenization methods (BPE, WordPiece, SentencePiece)
- Implement and use subword tokenizers in practice
- Select appropriate tokenization strategies for different NLP tasks
The Need for Subword Tokenization
Limitations of Word-Level Tokenization
Word tokenization seemed intuitive in our previous lesson, but it has several critical weaknesses:
- Vocabulary Explosion: Languages are productive — they can generate a virtually unlimited number of words through compounding, inflection, and derivation.
- Out-of-Vocabulary (OOV) Words: Any word not seen during training becomes an `<UNK>` (unknown) token, losing all semantic information.
- Morphological Blindness: The tokens "play", "playing", and "played" are treated as completely different words, even though they share the same root.
- Rare Words Problem: Infrequent words have sparse statistics, making it difficult for models to learn good representations.
Analogy: Word Construction as Lego Blocks
Think of words as structures built from smaller reusable pieces, like Lego blocks. Rather than trying to pre-manufacture every possible structure (word), we can provide the fundamental blocks and rules for combining them.
- In English: "un" + "break" + "able" = "unbreakable"
- In German: "Grund" + "gesetz" + "buch" = "Grundgesetzbuch" (constitution)
Visualization: Vocabulary Size vs. Coverage
The Vocabulary Size Problem
| Vocabulary Size | Word-level Coverage | Subword Coverage (BPE) | Difference |
|---|---|---|---|
| 10K tokens | 80.5% | 95.8% | +15.3% |
| 20K tokens | 85.2% | 97.9% | +12.7% |
| 30K tokens | 87.9% | 98.6% | +10.7% |
| 50K tokens | 90.5% | 99.2% | +8.7% |
| 100K tokens | 93.4% | 99.8% | +6.4% |
Key Insight: Subword tokenization achieves 95%+ coverage with just 10K tokens, while word-level tokenization needs 100K+ tokens to reach 93% coverage.
Why This Matters
- Memory Efficiency: Smaller vocabularies mean smaller embedding matrices (see the quick calculation below)
- Better Generalization: Higher coverage means fewer `<UNK>` tokens
- Computational Efficiency: A smaller vocabulary means faster training and inference
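To make the memory point concrete, here is a back-of-the-envelope sketch. The 768-dimensional embedding size is an assumption (a common choice in BERT-sized models), not a figure from this lesson:

```python
# Rough parameter counts for the embedding layer alone (768-dim embeddings assumed)
embedding_dim = 768
for vocab_size in (10_000, 50_000, 100_000):
    params = vocab_size * embedding_dim
    print(f"{vocab_size:>7,} tokens -> {params / 1e6:5.1f}M embedding parameters")
# 10,000 tokens -> 7.7M; 50,000 -> 38.4M; 100,000 -> 76.8M
```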
Byte-Pair Encoding (BPE)
BPE is one of the most widely used subword tokenization algorithms, employed by models like GPT (OpenAI) and BART (Facebook).
History and Origins
Originally developed as a data compression algorithm by Philip Gage in 1994, BPE was adapted for NLP by Sennrich et al. in 2016 for neural machine translation.
How BPE Works
BPE follows a simple yet effective procedure:
- Initialize vocabulary with individual characters
- Count all symbol pairs in the corpus
- Merge the most frequent pair
- Repeat steps 2-3 until desired vocabulary size or stopping criterion is reached
Step-by-Step BPE Algorithm Example
Let's trace through the BPE algorithm with the corpus: "low lower lowest"
Step 1: Initialize with characters + end-of-word marker
```
Text:   low lower lowest
Tokens: l o w </w>   l o w e r </w>   l o w e s t </w>
```
Step 2: Count all adjacent pairs
```
Pair counts:
('l', 'o'): 3      # appears in all three words
('o', 'w'): 3      # appears in all three words
('w', '</w>'): 1   # only in "low"
('w', 'e'): 2      # in "lower" and "lowest"
('e', 'r'): 1      # only in "lower"
('r', '</w>'): 1   # only in "lower"
('e', 's'): 1      # only in "lowest"
('s', 't'): 1      # only in "lowest"
('t', '</w>'): 1   # only in "lowest"
```
Step 3: Merge most frequent pair → ('l', 'o') becomes 'lo'
```
Text:   low lower lowest
Tokens: lo w </w>   lo w e r </w>   lo w e s t </w>
```
Step 4: Recount pairs
```
Pair counts:
('lo', 'w'): 3     # now most frequent
('w', '</w>'): 1
('w', 'e'): 2
('e', 'r'): 1
... (other pairs)
```
Step 5: Merge ('lo', 'w') → 'low'
```
Text:   low lower lowest
Tokens: low </w>   low e r </w>   low e s t </w>
```
Step 6: Continue merging...
Strictly by frequency, the next merge on this tiny corpus would be ('low', 'e'); on a larger corpus, frequent suffix pairs also get merged. Skipping ahead to show how suffix pieces emerge:
```
After merging ('e', 'r'):
Tokens: low </w>   low er </w>   low e s t </w>

After merging ('e', 's'):
Tokens: low </w>   low er </w>   low es t </w>
```
Final vocabulary: {l, o, w, e, r, s, t, </w>, lo, low, er, es, ...}
Key Insights from this Example
- Frequency drives merging: Most common character pairs get merged first
- Hierarchical building: Simple subwords become building blocks for complex ones
- Shared subwords: "low" appears in all variants, maximizing reuse
- Morphology awareness: Suffixes like "er", "es", "est" emerge naturally
Python Implementation
Here's a simplified implementation of BPE training:
```python
from collections import Counter
import re

def get_stats(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs
```
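To complete the picture, a merge step and a small training loop might look like the following sketch. The `merge_vocab` helper and the toy corpus are our additions for illustration (the corpus reuses the "low lower lowest" example above):

```python
def merge_vocab(pair, vocab):
    """Replace every occurrence of `pair` (as adjacent symbols) with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: space-separated symbols with an end-of-word marker, plus word frequencies
vocab = {'l o w </w>': 1, 'l o w e r </w>': 1, 'l o w e s t </w>': 1}

num_merges = 5
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print('merged', best, '->', ''.join(best))
# With this toy corpus the first merges are ('l', 'o') and ('lo', 'w'),
# matching the step-by-step walkthrough above.
```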
Applications of BPE
- OpenAI's GPT models (GPT-2, GPT-3, GPT-4)
- Facebook's BART and RoBERTa
- Hugging Face's Tokenizers library
BPE Training Algorithm Summary
The complete BPE training process:
- Corpus Preparation: Split text into words, then into characters
- Iterative Merging:
  - Count all adjacent symbol pairs
  - Merge the most frequent pair
  - Update the corpus with merged symbols
  - Repeat until desired vocabulary size
- Vocabulary Creation: Final set contains original characters + learned merges
BPE vs. Word Tokenization: Realistic Comparison
Scenario: 30K vocabulary trained on general text, encountering new domain-specific text
| Input Text | Word Tokenization | BPE Tokenization | Why BPE Helps |
|---|---|---|---|
| "COVID-19" | ["<UNK>"] | ["COVID", "-", "19"] | New compound word |
| "unbreakable" | ["<UNK>"] | ["un", "break", "able"] | Morphology preserved |
| "machine" | ["machine"] ✅ | ["machine"] ✅ | Common word in both |
| "BiLSTM" | ["<UNK>"] | ["Bi", "LSTM"] | Technical abbreviation |
| "fine-tuning" | ["fine", "-", "tuning"] ✅ | ["fine", "-", "tun", "ing"] | Hyphenated words vary |
| "transformers" | ["transformers"] ✅ | ["transform", "ers"] ✅ | Popular word, both work |
Key Insight: BPE shines with new/rare words, while both approaches work well for common vocabulary. The advantage isn't "always better" but "better generalization to unseen text".
WordPiece Tokenization
WordPiece is another subword algorithm, famously used in Google's BERT and related models.
How WordPiece Differs from BPE
Unlike BPE, which selects pairs based on frequency, WordPiece uses a likelihood-based approach:
- Initialize vocabulary with individual characters
- Calculate the likelihood increase for each possible merge
- Perform the merge that maximizes likelihood
- Repeat until desired vocabulary size
The merge score measures how much the likelihood of the training data under the tokenizer's language model increases when two symbols are merged.
WordPiece Algorithm
Given a language model $P$, the likelihood increase for merging symbols $x$ and $y$ is:

$$\text{score}(x, y) = \log P(xy) - \big(\log P(x) + \log P(y)\big) = \log \frac{P(xy)}{P(x)\,P(y)}$$
Intuitively, this prioritizes merges that create meaningful subwords over just frequent ones.
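A minimal sketch of how such a score can be computed, using pair frequency normalized by the unigram frequencies of its parts (the function name and the toy corpus are ours for illustration, not Google's implementation):

```python
from collections import Counter

def wordpiece_scores(corpus_tokens):
    """Score each adjacent pair by freq(xy) / (freq(x) * freq(y)),
    proportional to the likelihood gain P(xy) / (P(x) * P(y))."""
    unigram_counts = Counter()
    pair_counts = Counter()
    for tokens in corpus_tokens:
        unigram_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    return {
        pair: count / (unigram_counts[pair[0]] * unigram_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

corpus = [list("low"), list("lower"), list("lowest")]
scores = wordpiece_scores(corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # ('s', 't') 1.0
# Note: frequency alone (BPE's criterion) would pick ('l', 'o'); the likelihood-style
# score prefers ('s', 't'), a pair whose parts almost never occur apart.
```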
Unique Characteristics
- Prefix Marking: WordPiece marks subword units with '##' prefix (except for the first piece)
- Out-of-Vocabulary Handling: Unknown words are broken into smaller subwords or individual characters using greedy longest-match-first segmentation (sketched below)
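Here is a minimal sketch of that greedy longest-match procedure, assuming a toy vocabulary (the function name and vocabulary are illustrative, not BERT's actual implementation):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first segmentation; continuation pieces get the '##' prefix."""
    if len(word) > max_chars:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        current = None
        # Find the longest vocabulary entry matching from `start`
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the '##' prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: fall back to the unknown token
        tokens.append(current)
        start = end
    return tokens

vocab = {"un", "##break", "##able"}   # toy vocabulary
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```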
Comparison with BPE
| Example | BPE Tokenization | WordPiece Tokenization |
|---|---|---|
| playing | ['play', 'ing'] | ['play', '##ing'] |
| unbreakable | ['un', 'break', 'able'] | ['un', '##break', '##able'] |
| huggingface | ['hugging', 'face'] | ['hugging', '##face'] |
| transformers | ['transform', 'ers'] | ['transform', '##ers'] |
| tokenization | ['token', 'ization'] | ['token', '##ization'] |
Note: Notice how WordPiece uses '##' to mark subword pieces that are not at the beginning of a word.
Implementation Example
While Google has not released its original WordPiece training implementation, we can use the WordPiece support in Hugging Face's Tokenizers library:
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a WordPiece tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Pre-tokenize on whitespace
tokenizer.pre_tokenizer = Whitespace()
```
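Continuing the sketch above, training and encoding might look like this (the vocabulary size, special tokens, and the 'data.txt' path are placeholder choices, not values from this lesson):

```python
# Train on a local corpus and encode a sentence
trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["data.txt"], trainer=trainer)

encoding = tokenizer.encode("Tokenization splits text into subword units!")
print(encoding.tokens)  # e.g. ['token', '##ization', 'splits', ...] depending on the corpus
```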
Applications of WordPiece
- Google's BERT
- Google's DistilBERT, ALBERT, and ELECTRA
- Many multilingual models
SentencePiece
SentencePiece, developed by Google, is a language-agnostic tokenizer that treats the input as a raw stream of Unicode characters.
Key Features
- Language Agnostic: Works with any language without language-specific preprocessing
- Whitespace Preservation: Treats spaces as normal characters
- Direct Raw Text Processing: No need for pre-tokenization
- Reversible Tokenization: Can perfectly recover the original text
How SentencePiece Works
SentencePiece combines principles from both BPE and Unigram language models:
- BPE Mode: Similar to standard BPE, but operates on raw text
- Unigram Mode: Uses a unigram language model to find the most likely segmentation
SentencePiece Unigram Model
The Unigram model defines the probability of a tokenized sequence $\mathbf{x} = (x_1, \dots, x_M)$ as:

$$P(\mathbf{x}) = \prod_{i=1}^{M} p(x_i)$$

where $x_i$ is a subword token and $p(x_i)$ is its probability.
It starts with a large vocabulary and iteratively removes tokens to maximize the likelihood on the training data.
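A minimal sketch of how the most likely segmentation can be found with dynamic programming, assuming a toy vocabulary with made-up probabilities (the real library also learns these probabilities with EM and prunes the vocabulary iteratively):

```python
import math

# Toy vocabulary with illustrative (made-up) log-probabilities
vocab_logp = {
    "▁hug": math.log(0.10), "▁h": math.log(0.05), "ug": math.log(0.04),
    "u": math.log(0.02), "g": math.log(0.02), "s": math.log(0.03),
    "▁hugs": math.log(0.001),
}

def best_segmentation(text, vocab_logp):
    """Return the highest-probability segmentation of `text` under a unigram model."""
    n = len(text)
    # best[i] = (log-prob of best segmentation of text[:i], tokens of that segmentation)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab_logp and best[j][0] + vocab_logp[piece] > best[i][0]:
                best[i] = (best[j][0] + vocab_logp[piece], best[j][1] + [piece])
    return best[n]

logp, tokens = best_segmentation("▁hugs", vocab_logp)
print(tokens, round(logp, 2))  # ['▁hug', 's'] -5.81 with these toy probabilities
```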
Key SentencePiece features:
- Whitespace Preservation: ▁ marks word boundaries (spaces)
- Language Agnostic: No pre-tokenization required
- Subword Tokenization: Breaks unknown words into known pieces
- Reversible: Can perfectly reconstruct original text
Implementation
```python
import sentencepiece as spm

# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='sentencepiece',
    vocab_size=8000,
    model_type='unigram',  # or 'bpe'
    character_coverage=0.9995,
    normalization_rule_name='nmt_nfkc',
)
```
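A short usage sketch, assuming the training call above produced `sentencepiece.model`:

```python
# Load the trained model, then encode and losslessly decode a sentence
sp = spm.SentencePieceProcessor(model_file='sentencepiece.model')

pieces = sp.encode('Tokenization splits text into subword units!', out_type=str)
print(pieces)             # e.g. ['▁Token', 'ization', '▁splits', ...] depending on the corpus
print(sp.decode(pieces))  # reconstructs the original text, including whitespace
```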
Applications of SentencePiece
- Google's T5 and PaLM models
- Meta AI's LLaMA models
- XLNet and many multilingual models
- Particularly popular for non-English and multilingual models
Comparison of Tokenization Methods
Performance Across Languages
| Language | Word-Level | BPE | WordPiece | SentencePiece | Winner |
|---|---|---|---|---|---|
| English | 92% | 95% | 94% | 95% | Tie |
| Chinese | 45% | 80% | 82% | 94% | SentencePiece |
| Japanese | 50% | 85% | 83% | 95% | SentencePiece |
| German | 85% | 92% | 91% | 94% | SentencePiece |
| Arabic | 80% | 88% | 89% | 93% | SentencePiece |
| Russian | 75% | 90% | 88% | 93% | SentencePiece |
Key Insight: SentencePiece consistently achieves the highest coverage across languages, explaining its dominance in multilingual and modern LLMs.
Feature Comparison
| Feature | BPE | WordPiece | SentencePiece |
|---|---|---|---|
| Current Popularity | High | Medium | Highest |
| 2024 Usage | GPT, RoBERTa | BERT family | LLaMA, T5, PaLM |
| Merge criterion | Frequency | Likelihood | Frequency or likelihood |
| Pre-tokenization | Required | Required | Not required |
| Language support | Partially agnostic | Partially agnostic | Fully agnostic |
| Whitespace handling | Removed | Removed | Preserved |
| Subword marking | None | '##' prefix | '▁' prefix |
| Vocabulary size | 10k-50k | 10k-30k | 8k-32k |
| Out-of-vocabulary | Character fallback | Character fallback | Character fallback |
| Reversibility | Partial | Partial | Complete |
Which Tokenizer Should You Use in 2024?
Quick recommendations:
- New projects: SentencePiece (Unigram) is the best overall choice, especially for multilingual work
- Fine-tuning existing models: Use the original model's tokenizer
- English-only BERT tasks: WordPiece is fine
- GPT-style generation: BPE or byte-level BPE
Decision Tree:
- Fine-tuning a pre-trained model? → Use its original tokenizer
- Building from scratch + multilingual? → SentencePiece
- Building from scratch + English-only? → SentencePiece or BPE
- Need perfect reversibility? → SentencePiece
Advanced Topics
Tokenization Implications for Model Performance
The choice of tokenization strategy has profound effects on:
- Model Size: Vocabulary size directly impacts embedding layer parameters
- Training Efficiency: Better tokenization means more efficient training
- Language Support: Some tokenizers handle certain languages better
- Model Generalization: Good subword tokenization improves generalization to new words
Tokenization Challenges
- Language Boundaries: Not all languages use spaces or have clear word boundaries
- Morphologically Rich Languages: Languages like Finnish or Turkish have complex word structures
- Code-Switching: Handling text that mixes multiple languages
- Non-linguistic Content: Emojis, URLs, hashtags, code snippets
Beyond Subword Tokenization
Research continues to improve tokenization:
- Character-level Transformers: Bypass tokenization entirely
- Byte-level BPE: GPT-2/3/4 use byte-level BPE to handle any Unicode character (see the short illustration after this list)
- Dynamic Tokenization: Adapt tokenization based on the input
- Tokenization-free Models: Some experimental approaches try to work directly with raw text
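For example, byte-level BPE sidesteps unknown tokens entirely: every character decomposes into UTF-8 bytes, and all 256 possible byte values are in the base vocabulary. A quick illustration:

```python
# Any Unicode string maps to a sequence of byte values in the range 0-255,
# so a byte-level vocabulary can always represent it without <UNK>.
text = "naïve café 🚀"
byte_values = list(text.encode("utf-8"))
print(byte_values)       # every value is between 0 and 255
print(len(byte_values))  # 17 bytes for 12 characters (multi-byte characters expand)
```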
Practical Implementation
Choosing the Right Tokenizer
Guidelines for selecting a tokenizer:
- Task Alignment: Match your tokenizer with your downstream task
- Model Compatibility: If fine-tuning, use the original model's tokenizer
- Language Support: Consider language-specific needs
- Vocabulary Size: Balance between coverage and computational efficiency
Tokenization in the Hugging Face Ecosystem
The Hugging Face ecosystem (the Transformers and Tokenizers libraries) provides fast implementations of all major tokenization algorithms:
```python
from transformers import AutoTokenizer

# Load pre-trained tokenizers
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Example text
text = "Tokenization splits text into subword units!"
```
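Continuing the example, a quick comparison of the three tokenizers on the same text (the exact pieces depend on each model's vocabulary):

```python
# Compare how each pre-trained tokenizer splits the same sentence
for name, tok in [("BERT (WordPiece)", bert_tokenizer),
                  ("GPT-2 (byte-level BPE)", gpt2_tokenizer),
                  ("T5 (SentencePiece)", t5_tokenizer)]:
    print(name, tok.tokenize(text))
# Expected pattern: BERT marks continuations with '##', GPT-2 marks leading
# spaces with 'Ġ', and T5 marks word boundaries with '▁'.
```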
Summary
Key Takeaways:
- SentencePiece (Unigram) dominates in 2024 - used by LLaMA, T5, PaLM, and most new models
- BPE remains important for GPT-family models and established architectures
- WordPiece still powers BERT and Google's model ecosystem
- For new projects: Choose SentencePiece unless you have specific requirements
- For fine-tuning: Always use the original model's tokenizer
The Evolution:
- 2018-2020: WordPiece (BERT era) and BPE (GPT era) dominated
- 2021-2024: SentencePiece became the de facto standard for new LLMs
- Future: Trend toward language-agnostic, reversible tokenization
Practical Reality: Most practitioners use pre-trained tokenizers rather than training from scratch. Understanding these algorithms helps you make informed choices about model selection and fine-tuning strategies.
In our next lesson, we'll explore word embeddings, starting from traditional approaches like Word2Vec and GloVe, before moving to the contextual representations that power today's most advanced models.
Practice Exercises
- Implement a simple BPE tokenizer from scratch and train it on a small corpus.
- Compare tokenization results from different algorithms on texts from various languages and domains.
- Experiment with vocabulary size to see how it affects tokenization granularity.
- Fine-tune a pretrained model using a different tokenizer and evaluate the performance impact.
Additional Resources
- Hugging Face Tokenizers Documentation
- SentencePiece: A simple and language independent subword tokenizer and detokenizer
- Neural Machine Translation of Rare Words with Subword Units (Original BPE for NLP paper)
- Google's Neural Machine Translation System (WordPiece paper)
- Comparing Different Tokenizers for BERT