Advanced Tokenization Techniques

Overview

In our previous lesson, we introduced basic tokenization methods like word and character tokenization. While intuitive, these approaches have significant limitations when handling large vocabularies, out-of-vocabulary words, and morphologically rich languages.

Modern NLP models rely on sophisticated subword tokenization strategies that find an optimal balance between character-level and word-level representations. Today's leading models use three main approaches:

  • SentencePiece (Unigram): Dominant in 2024 - used by LLaMA, PaLM, T5, and most multilingual models
  • Byte-Pair Encoding (BPE): Powers GPT models and many encoder-decoder architectures
  • WordPiece: Foundation of BERT-family models and Google's ecosystem

This lesson explores these subword tokenization techniques that have revolutionized NLP, with hands-on tools to understand how each algorithm works in practice.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of traditional tokenization approaches
  • Explain how modern subword tokenization algorithms work
  • Compare different subword tokenization methods (BPE, WordPiece, SentencePiece)
  • Implement and use subword tokenizers in practice
  • Select appropriate tokenization strategies for different NLP tasks

The Need for Subword Tokenization

Limitations of Word-Level Tokenization

Word tokenization seemed intuitive in our previous lesson, but it has several critical weaknesses:

  1. Vocabulary Explosion: Languages are productive — they can generate a virtually unlimited number of words through compounding, inflection, and derivation.

  2. Out-of-Vocabulary (OOV) Words: Any word not seen during training becomes an <UNK> (unknown) token, losing all semantic information.

  3. Morphological Blindness: The tokens "play", "playing", and "played" are treated as completely different words, even though they share the same root.

  4. Rare Words Problem: Infrequent words have sparse statistics, making it difficult for models to learn good representations.

Analogy: Word Construction as Lego Blocks

Think of words as structures built from smaller reusable pieces, like Lego blocks. Rather than trying to pre-manufacture every possible structure (word), we can provide the fundamental blocks and rules for combining them.

  • In English: "un" + "break" + "able" = "unbreakable"
  • In German: "Grund" + "gesetz" + "buch" = "Grundgesetzbuch" (constitution)

Visualization: Vocabulary Size vs. Coverage

The Vocabulary Size Problem

Vocabulary Size | Word-level Coverage | Subword Coverage (BPE) | Difference
10K tokens      | 80.5%               | 95.8%                  | +15.3%
20K tokens      | 85.2%               | 97.9%                  | +12.7%
30K tokens      | 87.9%               | 98.6%                  | +10.7%
50K tokens      | 90.5%               | 99.2%                  | +8.7%
100K tokens     | 93.4%               | 99.8%                  | +6.4%

Key Insight: Subword tokenization achieves 95%+ coverage with just 10K tokens, while word-level tokenization needs 100K+ tokens to reach 93% coverage.

Why This Matters

  • Memory Efficiency: Smaller vocabularies = smaller embedding matrices
  • Better Generalization: Higher coverage means fewer <UNK> tokens
  • Computational Efficiency: Less vocabulary means faster training and inference
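
As a rough illustration of the memory point, assume an embedding dimension of 768 (a common choice, not specified above): a 100K-token word-level vocabulary needs a 100,000 × 768 ≈ 76.8M-parameter embedding matrix, while a 10K-token subword vocabulary needs only about 7.7M parameters — an order-of-magnitude saving before any other layer is counted.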

Byte-Pair Encoding (BPE)

BPE is one of the most widely used subword tokenization algorithms, employed by models like GPT (OpenAI) and BART (Facebook).

History and Origins

Originally developed as a data compression algorithm by Philip Gage in 1994, BPE was adapted for NLP by Rico Sennrich in 2016 for neural machine translation.

How BPE Works

BPE follows a simple yet effective procedure:

  1. Initialize vocabulary with individual characters
  2. Count all symbol pairs in the corpus
  3. Merge the most frequent pair
  4. Repeat steps 2-3 until desired vocabulary size or stopping criterion is reached

Step-by-Step BPE Algorithm Example

Let's trace through the BPE algorithm with the corpus: "low lower lowest"

Step 1: Initialize with characters + end-of-word marker

Text:   low lower lowest
Tokens: l o w </w>   l o w e r </w>   l o w e s t </w>

Step 2: Count all adjacent pairs

Pair counts:
('l', 'o'): 3      # appears in all three words
('o', 'w'): 3      # appears in all three words
('w', '</w>'): 1   # only in "low"
('w', 'e'): 2      # in "lower" and "lowest"
('e', 'r'): 1      # only in "lower"
('r', '</w>'): 1   # only in "lower"
('e', 's'): 1      # only in "lowest"
('s', 't'): 1      # only in "lowest"
('t', '</w>'): 1   # only in "lowest"

Step 3: Merge most frequent pair → ('l', 'o') becomes 'lo'

Text:   low lower lowest
Tokens: lo w </w>   lo w e r </w>   lo w e s t </w>

Step 4: Recount pairs

Pair counts:
('lo', 'w'): 3     # now most frequent
('w', '</w>'): 1
('w', 'e'): 2
('e', 'r'): 1
... (other pairs)

Step 5: Merge ('lo', 'w') → 'low'

Text:   low lower lowest
Tokens: low </w>   low e r </w>   low e s t </w>

Step 6: Continue merging

The most frequent remaining pair is now ('low', 'e'), which appears twice (in "lower" and "lowest"):

After merging ('low', 'e'):
Tokens: low </w>   lowe r </w>   lowe s t </w>

All remaining pairs occur only once, so further merges are tie-breaks. On a realistic corpus, shared suffixes such as "er", "es", and "est" appear across many different words and therefore emerge as tokens of their own.

Final vocabulary: {l, o, w, e, r, s, t, </w>, lo, low, lowe, ...}

Key Insights from this Example

  1. Frequency drives merging: Most common character pairs get merged first
  2. Hierarchical building: Simple subwords become building blocks for complex ones
  3. Shared subwords: "low" appears in all variants, maximizing reuse
  4. Morphology awareness: On larger corpora, suffixes like "er", "es", and "est" emerge naturally as tokens because they recur across many words


Python Implementation

Here's a simplified implementation of BPE training:

python
from collections import Counter
import re

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs
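
The get_stats helper above only counts pairs. To complete training you also need a merge step and a loop over merges; here is a minimal sketch of that remaining logic, assuming the corpus is represented as a dictionary mapping space-separated symbol sequences to word frequencies (the toy corpus and num_merges below are illustrative choices, not part of the original snippet):

python
def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen symbol pair with its merged form."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {
    'l o w </w>': 1,
    'l o w e r </w>': 1,
    'l o w e s t </w>': 1,
}

num_merges = 5  # illustrative; real vocabularies use tens of thousands of merges
for step in range(num_merges):
    pairs = get_stats(vocab)          # defined above
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(f"Merge {step + 1}: {best}")

On the "low lower lowest" corpus this reproduces the merge order traced above: ('l', 'o'), then ('lo', 'w'), then ('low', 'e'); after that, all remaining pairs are ties.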

Applications of BPE

  • OpenAI's GPT models (GPT-2, GPT-3, GPT-4)
  • Facebook's BART and RoBERTa
  • Hugging Face's Tokenizers library

BPE Training Algorithm Summary

The complete BPE training process:

  1. Corpus Preparation: Split text into words, then into characters
  2. Iterative Merging:
    • Count all adjacent symbol pairs
    • Merge the most frequent pair
    • Update the corpus with merged symbols
    • Repeat until desired vocabulary size
  3. Vocabulary Creation: Final set contains original characters + learned merges

BPE vs. Word Tokenization: Realistic Comparison

Scenario: 30K vocabulary trained on general text, encountering new domain-specific text

Input Text      | Word Tokenization        | BPE Tokenization             | Why BPE Helps
"COVID-19"      | ["<UNK>"]                | ["COVID", "-", "19"]         | New compound word
"unbreakable"   | ["<UNK>"]                | ["un", "break", "able"]      | Morphology preserved
"machine"       | ["machine"]              | ["machine"]                  | Common word in both
"BiLSTM"        | ["<UNK>"]                | ["Bi", "LSTM"]               | Technical abbreviation
"fine-tuning"   | ["fine", "-", "tuning"]  | ["fine", "-", "tun", "ing"]  | Hyphenated words vary
"transformers"  | ["transformers"]         | ["transform", "ers"]         | Popular word, both work

Key Insight: BPE shines with new/rare words, while both approaches work well for common vocabulary. The advantage isn't "always better" but "better generalization to unseen text".

WordPiece Tokenization

WordPiece is another subword algorithm, famously used in Google's BERT and related models.

How WordPiece Differs from BPE

Unlike BPE, which selects pairs based on frequency, WordPiece uses a likelihood-based approach:

  1. Initialize vocabulary with individual characters
  2. Calculate the likelihood increase for each possible merge
  3. Perform the merge that maximizes likelihood
  4. Repeat until desired vocabulary size

The likelihood is based on the probability increase for the language model when two symbols are merged.

WordPiece Algorithm

Given a language model $p(w_1, \dots, w_n)$, the likelihood increase for merging symbols $a$ and $b$ into the new symbol $ab$ is:

$$\frac{p(ab)}{p(a) \cdot p(b)}$$

Intuitively, this prioritizes merges that create meaningful subwords over just frequent ones.
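
To make the criterion concrete, here is a minimal sketch (not Google's actual implementation, which was never released) that approximates the likelihood ratio with corpus counts, score(a, b) ≈ count(ab) / (count(a) · count(b)); all numbers are invented for illustration:

python
from collections import Counter

# Invented symbol and pair counts, purely for illustration.
symbol_counts = Counter({'play': 40, 'ing': 30, 'e': 500, 'r': 450})
pair_counts = Counter({('play', 'ing'): 25, ('e', 'r'): 60})

def wordpiece_score(pair):
    """Score a candidate merge: frequent together relative to frequent apart."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

for pair in pair_counts:
    print(pair, round(wordpiece_score(pair), 6))
# ('play', 'ing') wins despite being the rarer pair, because 'e' and 'r' are each
# very common on their own -- the opposite of what pure frequency (BPE) would pick.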

Unique Characteristics

  1. Prefix Marking: WordPiece marks subword units with '##' prefix (except for the first piece)
  2. Out-of-Vocabulary Handling: Unknown words are broken into smaller subwords or individual characters

Comparison with BPE

Example       | BPE Tokenization          | WordPiece Tokenization
playing       | ['play', 'ing']           | ['play', '##ing']
unbreakable   | ['un', 'break', 'able']   | ['un', 'break', '##able']
huggingface   | ['hugging', 'face']       | ['hugging', '##face']
transformers  | ['transform', 'ers']      | ['transform', '##ers']
tokenization  | ['token', 'ization']      | ['token', '##ization']

Note: Notice how WordPiece uses '##' to mark subword pieces that are not at the beginning of a word.

Implementation Example

While Google never open-sourced its original WordPiece training implementation, we can use Hugging Face's Tokenizers library, which provides a compatible reimplementation:

python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Pre-tokenize on whitespace
tokenizer.pre_tokenizer = Whitespace()
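
The snippet above only sets up the model; the imported WordPieceTrainer still has to be run. A minimal continuation might look like the following (the corpus file corpus.txt, vocabulary size, and special-token list are illustrative assumptions):

python
# Hypothetical corpus file and settings -- adjust to your own data.
trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("Tokenization splits text into subword units!")
print(encoding.tokens)  # subword pieces, with '##' marking word-internal pieces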

Applications of WordPiece

  • Google's BERT
  • Google's DistilBERT, ALBERT, and ELECTRA
  • Many multilingual models

SentencePiece

SentencePiece, developed by Google, is a language-agnostic tokenizer that treats the input as a raw stream of Unicode characters.

Key Features

  1. Language Agnostic: Works with any language without language-specific preprocessing
  2. Whitespace Preservation: Treats spaces as normal characters
  3. Direct Raw Text Processing: No need for pre-tokenization
  4. Reversible Tokenization: Can perfectly recover the original text

How SentencePiece Works

SentencePiece combines principles from both BPE and Unigram language models:

  1. BPE Mode: Similar to standard BPE, but operates on raw text
  2. Unigram Mode: Uses a unigram language model to find the most likely segmentation

SentencePiece Unigram Model

The Unigram model defines the probability of a sequence as:

$$P(x) = \prod_{i=1}^{m} p(x_i)$$

where $x_i$ is a subword token in the segmentation $x = (x_1, \dots, x_m)$ and $p(x_i)$ is its probability.

It starts with a large vocabulary and iteratively removes tokens to maximize the likelihood on the training data.
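
A tiny numerical sketch of that objective shows how the model compares alternative segmentations of the same word; the token probabilities below are invented for illustration:

python
import math

# Invented unigram probabilities, purely for illustration.
probs = {'un': 0.02, 'break': 0.01, 'able': 0.03,
         'u': 0.05, 'n': 0.06, 'breakable': 0.0001}

def segmentation_log_prob(pieces):
    """log P(x) = sum of log p(x_i) under the unigram model."""
    return sum(math.log(probs[p]) for p in pieces)

candidates = [
    ['un', 'break', 'able'],
    ['un', 'breakable'],
    ['u', 'n', 'break', 'able'],
]
best = max(candidates, key=segmentation_log_prob)
print(best)  # the segmentation with the highest unigram likelihood

The real implementation finds the best segmentation with dynamic programming and prunes the vocabulary by repeatedly removing the tokens whose removal least reduces the overall corpus likelihood, as described above.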

Example: Whitespace Markers

When SentencePiece tokenizes the sentence "SentencePiece is an unsupervised text tokenizer.", every original space is encoded with the ▁ marker, so the pieces concatenate back to "SentencePiece▁is▁an▁unsupervised▁text▁tokenizer." and the exact input can be recovered.

SentencePiece Features Demonstrated:

  • Whitespace Preservation: ▁ marks word boundaries (spaces)
  • Language Agnostic: No pre-tokenization required
  • Subword Tokenization: Breaks unknown words into known pieces
  • Reversible: Can perfectly reconstruct original text

Implementation

python
import sentencepiece as spm

# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='sentencepiece',
    vocab_size=8000,
    model_type='unigram',            # or 'bpe'
    character_coverage=0.9995,
    normalization_rule_name='nmt_nfkc',
)
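
Once training finishes, the resulting sentencepiece.model file can be loaded for encoding and decoding; a minimal usage sketch (the sample sentence is arbitrary):

python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='sentencepiece.model')

text = "SentencePiece is an unsupervised text tokenizer."
pieces = sp.encode(text, out_type=str)  # subword pieces with ▁ word-boundary markers
ids = sp.encode(text, out_type=int)     # corresponding vocabulary ids

print(pieces)
print(sp.decode(pieces) == text)        # True for this input: tokenization is reversible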

Applications of SentencePiece

  • Google's T5 and PaLM models
  • Meta AI's LLaMA models
  • XLNet and many multilingual models
  • Particularly popular for non-English and multilingual models

Comparison of Tokenization Methods

Performance Across Languages

Language | Word-Level | BPE | WordPiece | SentencePiece | Winner
English  | 92%        | 95% | 94%       | 95%           | Tie
Chinese  | 45%        | 80% | 82%       | 94%           | SentencePiece
Japanese | 50%        | 85% | 83%       | 95%           | SentencePiece
German   | 85%        | 92% | 91%       | 94%           | SentencePiece
Arabic   | 80%        | 88% | 89%       | 93%           | SentencePiece
Russian  | 75%        | 90% | 88%       | 93%           | SentencePiece

Key Insight: SentencePiece consistently achieves the highest coverage across languages, explaining its dominance in multilingual and modern LLMs.

Feature Comparison

Feature              | BPE                | WordPiece          | SentencePiece
Current popularity   | High               | Medium             | Highest
2024 usage           | GPT, RoBERTa       | BERT family        | LLaMA, T5, PaLM
Merge criterion      | Frequency          | Likelihood         | Frequency or likelihood
Pre-tokenization     | Required           | Required           | Not required
Language support     | Partially agnostic | Partially agnostic | Fully agnostic
Whitespace handling  | Removed            | Removed            | Preserved
Subword marking      | None               | ## prefix          | ▁ prefix
Typical vocab size   | 10k-50k            | 10k-30k            | 8k-32k
Out-of-vocabulary    | Character fallback | Character fallback | Character fallback
Reversibility        | Partial            | Partial            | Complete

Which Tokenizer Should You Use in 2024?

Quick Recommendations:

  • New projects: SentencePiece (Unigram) is the best overall choice, especially for multilingual work
  • Fine-tuning existing models: Use the original model's tokenizer
  • English-only BERT-style tasks: WordPiece is fine
  • GPT-style generation: BPE or byte-level BPE

Decision Tree:

  1. Fine-tuning a pre-trained model? → Use its original tokenizer
  2. Building from scratch + multilingual?SentencePiece
  3. Building from scratch + English-only? → SentencePiece or BPE
  4. Need perfect reversibility?SentencePiece

Advanced Topics

Tokenization Implications for Model Performance

The choice of tokenization strategy has profound effects on:

  1. Model Size: Vocabulary size directly impacts embedding layer parameters
  2. Training Efficiency: Better tokenization means more efficient training
  3. Language Support: Some tokenizers handle certain languages better
  4. Model Generalization: Good subword tokenization improves generalization to new words

Tokenization Challenges

  1. Language Boundaries: Not all languages use spaces or have clear word boundaries
  2. Morphologically Rich Languages: Languages like Finnish or Turkish have complex word structures
  3. Code-Switching: Handling text that mixes multiple languages
  4. Non-linguistic Content: Emojis, URLs, hashtags, code snippets

Beyond Subword Tokenization

Research continues to improve tokenization:

  1. Character-level Transformers: Bypass tokenization entirely
  2. Byte-level BPE: GPT-2/3/4 use byte-level BPE to handle any Unicode character (see the sketch after this list)
  3. Dynamic Tokenization: Adapt tokenization based on the input
  4. Tokenization-free Models: Some experimental approaches try to work directly with raw text
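
As a quick check of the byte-level BPE point above, the GPT-2 tokenizer (loaded via Hugging Face transformers) can encode arbitrary Unicode, including emoji, without ever producing an unknown token; a minimal sketch (the sample text is arbitrary, and the exact pieces depend on the pretrained vocabulary):

python
from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE

text = "Tokenizers ❤️ emoji"
tokens = gpt2_tokenizer.tokenize(text)
ids = gpt2_tokenizer.encode(text)

print(tokens)                      # unfamiliar characters become byte-level pieces, never <UNK>
print(gpt2_tokenizer.decode(ids))  # decodes back to the original string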

Practical Implementation

Choosing the Right Tokenizer

Guidelines for selecting a tokenizer:

  1. Task Alignment: Match your tokenizer with your downstream task
  2. Model Compatibility: If fine-tuning, use the original model's tokenizer
  3. Language Support: Consider language-specific needs
  4. Vocabulary Size: Balance between coverage and computational efficiency

Tokenization in the Hugging Face Ecosystem

The Hugging Face Tokenizers library provides fast implementations of all major tokenization algorithms:

python
from transformers import AutoTokenizer

# Load pre-trained tokenizers
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Example text
text = "Tokenization splits text into subword units!"
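
The snippet above stops before actually tokenizing anything. A small continuation that prints each tokenizer's output side by side might look like this (the exact pieces you see depend on the pretrained vocabularies):

python
for name, tok in [("BERT (WordPiece)", bert_tokenizer),
                  ("GPT-2 (byte-level BPE)", gpt2_tokenizer),
                  ("T5 (SentencePiece)", t5_tokenizer)]:
    tokens = tok.tokenize(text)
    print(f"{name:24s} {len(tokens):2d} tokens: {tokens}")
# Expect '##' pieces from BERT, 'Ġ'-prefixed pieces from GPT-2, and '▁'-prefixed
# pieces from T5 -- three different conventions for marking word boundaries.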


Summary

Key Takeaways:

  1. SentencePiece (Unigram) dominates in 2024 - used by LLaMA, T5, PaLM, and most new models
  2. BPE remains important for GPT-family models and established architectures
  3. WordPiece still powers BERT and Google's model ecosystem
  4. For new projects: Choose SentencePiece unless you have specific requirements
  5. For fine-tuning: Always use the original model's tokenizer

The Evolution:

  • 2018-2020: WordPiece (BERT era) and BPE (GPT era) dominated
  • 2021-2024: SentencePiece became the de facto standard for new LLMs
  • Future: Trend toward language-agnostic, reversible tokenization

Practical Reality: Most practitioners use pre-trained tokenizers rather than training from scratch. Understanding these algorithms helps you make informed choices about model selection and fine-tuning strategies.

In our next lesson, we'll explore word embeddings, starting from traditional approaches like Word2Vec and GloVe, before moving to the contextual representations that power today's most advanced models.

Practice Exercises

  1. Implement a simple BPE tokenizer from scratch and train it on a small corpus.
  2. Compare tokenization results from different algorithms on texts from various languages and domains.
  3. Experiment with vocabulary size to see how it affects tokenization granularity.
  4. Fine-tune a pretrained model using a different tokenizer and evaluate the performance impact.

Additional Resources