Overview
When language models predict the next token, they output probability distributions over the entire vocabulary. But how do we convert these probabilities into actual text? This lesson explores deterministic generation methods—approaches that make predictable, reproducible choices about which tokens to select from transformer-based language models.
Understanding these foundational methods is crucial before we explore more creative sampling techniques. Think of this as learning to walk before we run: mastering predictable text generation gives us the foundation to understand when and why we might want to introduce randomness in language model outputs.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the fundamental challenge of converting probabilities to text
- Implement and explain greedy search generation
- Implement and explain beam search generation
- Compare the trade-offs between different deterministic approaches
- Choose the right method for specific use cases
- Debug common issues in deterministic generation
The Core Challenge: From Probabilities to Words
The Decision Point
Every time a language model generates text, it faces the same fundamental challenge at each step:
```
Model output:  [0.4, 0.3, 0.15, 0.1, 0.05, ...]
Vocabulary:    ["the", "a", "an", "this", "that", ...]
Decision:      Which word do we actually choose?
```
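To make the decision concrete, here is a minimal sketch using the toy distribution above (the five-word vocabulary and probabilities are made up for illustration). Greedy selection reduces to taking the argmax of the distribution:

```python
# A toy distribution over a tiny, hypothetical vocabulary
probs = [0.4, 0.3, 0.15, 0.1, 0.05]
vocab = ["the", "a", "an", "this", "that"]

# Greedy choice: take the index of the highest probability
best_index = max(range(len(probs)), key=lambda i: probs[i])
print(vocab[best_index])  # -> "the"
```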
Mental Model: The Path Through a Forest
Imagine text generation as walking through a dense forest where every step forward reveals multiple possible paths:
- Starting point: Your prompt (the trailhead)
- Each step: Choosing the next word
- Path options: All possible next words, each with a "difficulty score" (inverse of probability)
- Goal: Reach a meaningful destination (complete text)
Different generation strategies represent different hiking philosophies:
- Greedy: Always take the easiest visible path
- Beam Search: Scout ahead on multiple promising paths, then choose the best overall route
Greedy Search: The Simplest Approach
Core Concept
Greedy search always chooses the most probable next token. It's called "greedy" because it makes locally optimal choices without considering future consequences.
Algorithm in Plain English:
- Look at all possible next words
- Pick the one with the highest probability
- Add it to your text
- Repeat until you decide to stop
Visualization: Greedy Path Selection
Decoding Path Visualization
The decoding path shows token probabilities at each step and the selected token (highlighted). This visualization demonstrates how the model makes decisions during text generation.
Python Implementation
```python
import torch

def greedy_search(model, tokenizer, prompt, max_length=50):
    """
    Generate text using greedy search (always pick most likely token).

    Args:
        model: The language model
        tokenizer: Tokenizer for encoding/decoding
        prompt: Starting text
        max_length: Maximum tokens to generate
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_length):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1, :]
        # Locally optimal choice: the single most probable next token
        next_token = torch.argmax(logits).view(1, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        # Stop once the model emits its end-of-sequence token
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
When Greedy Search Works Well
✅ Excellent for:
- Factual question answering: "What is the capital of France?" → "Paris"
- Short completions: Where there's usually one best answer
- Translation of common phrases: Well-established translations
- Code completion: Where syntax correctness matters most
Example outputs where greedy excels:
Input: "The chemical symbol for water is" Greedy: "H2O, consisting of two hydrogen atoms and one oxygen atom." ✅ Perfect! Factual and correct. Input: "To install numpy, run" Greedy: "pip install numpy" ✅ Exactly what we want!
When Greedy Search Fails
❌ Problems arise with:
- Creative writing: Produces boring, predictable text
- Long text generation: Gets stuck in repetitive loops
- Open-ended questions: Gives generic responses
Example where greedy fails:
Input: "Once upon a time, in a magical forest" Greedy: "there was a little girl who was walking through the forest. She was walking through the forest when she saw a little girl walking through the forest..." ❌ Repetitive and boring!
The Repetition Problem
Greedy search often gets trapped in loops because:
- Common words have high probability
- Once generated, they increase probability of similar patterns
- No mechanism to avoid repetition
Demonstration:
```python
# This often happens with greedy search
prompt = "I think that"
# Greedy might produce: "I think that I think that I think that..."
```
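Because nothing in the decoding loop discourages reuse, a simple check for repeated n-grams is often enough to flag this failure mode. Here is a minimal sketch; the function name and the choice of n=3 are illustrative, not part of any library:

```python
def has_repeated_ngram(text, n=3):
    """Return True if any n-gram of words appears more than once in the text."""
    words = text.split()
    seen = set()
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram("I think that I think that I think that"))  # True
```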
Beam Search: Exploring Multiple Paths
Core Concept
Beam search improves on greedy search by considering multiple possible sequences simultaneously. Instead of committing to one path, it keeps track of the top-K most promising sequences (called "beams").
Key Innovation: Look ahead before making final decisions.
The Beam Width Parameter
- Beam width = 1: Equivalent to greedy search
- Beam width = 3: Keep track of 3 best sequences
- Beam width = 5: Keep track of 5 best sequences
- Higher beam width: More exploration, more computation
Visualization: Beam Search Tree
Beam Search Visualization
Beam search maintains multiple candidate sequences (beams) at each step, choosing the most likely continuations based on cumulative probability.
Algorithm Walkthrough
Let's trace through beam search step by step:
Step 1: Start with prompt
Prompt: "The future of AI" Beams: ["The future of AI"]
Step 2: Generate first token
```
Top candidates: "is", "will", "depends"
Beams: ["The future of AI is",
        "The future of AI will",
        "The future of AI depends"]
```
Step 3: Generate second token (from each beam)
From "...is": "bright", "uncertain", "promising" From "...will": "be", "depend", "involve" From "...depends": "on", "heavily", "largely" Keep top 3 overall: 1. "The future of AI is bright" 2. "The future of AI will be" 3. "The future of AI depends on"
Python Implementation
```python
import torch

def beam_search(model, tokenizer, prompt, beam_width=5, max_length=50):
    """
    Generate text using beam search.

    Args:
        beam_width: Number of sequences to track simultaneously
    """
    # Encode the starting prompt; each beam is (token_ids, cumulative log-probability)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    beams = [(input_ids, 0.0)]
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                log_probs = torch.log_softmax(model(seq).logits[0, -1, :], dim=-1)
            # Expand each beam with its beam_width most likely continuations
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_lp, top_ids):
                new_seq = torch.cat([seq, tok.view(1, 1)], dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the beam_width best sequences by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Return the single highest-scoring sequence
    return tokenizer.decode(beams[0][0][0], skip_special_tokens=True)
```
Beam Search Advantages
✅ Better than greedy because:
- Avoids local optima: Considers multiple paths before deciding
- Higher quality output: Often more coherent and fluent
- Still deterministic: Same input always gives same output
- Configurable exploration: Adjust beam width for your needs
Comparison example:
Prompt: "The solution to climate change requires" Greedy: "a global effort to reduce carbon emissions." Beam (width=5): "coordinated international cooperation, technological innovation, and fundamental changes in how we produce and consume energy."
Beam Search Limitations
❌ Still has issues:
- Computational cost: Roughly beam_width times more expensive than greedy (5x with beam width 5)
- Generic output: Tends toward "safe" but boring completions
- Length bias: Favors shorter sequences, because each additional token multiplies in another probability below 1 (a length-normalization sketch follows this list)
- Still deterministic: No creativity or surprise
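The standard mitigation for length bias is to rank beams by a length-normalized score rather than the raw sum of log-probabilities. Below is a minimal sketch of that re-ranking step; the beams, scores, and alpha value are invented for illustration (Hugging Face's generate exposes a related length_penalty argument):

```python
def length_normalized_score(cumulative_log_prob, num_tokens, alpha=0.7):
    """Divide the cumulative log-probability by length**alpha so longer
    sequences are not automatically penalized (alpha=0 recovers raw scores)."""
    return cumulative_log_prob / (num_tokens ** alpha)

# Two hypothetical finished beams: (tokens, cumulative log-probability)
beams = [(["The", "future", "is", "bright"], -4.2),
         (["The", "future", "of", "AI", "will", "be", "bright"], -5.8)]

# Raw scores favor the shorter beam; normalized scores pick the longer one
best_raw = max(beams, key=lambda b: b[1])
best_norm = max(beams, key=lambda b: length_normalized_score(b[1], len(b[0])))
print(" ".join(best_raw[0]))   # The future is bright
print(" ".join(best_norm[0]))  # The future of AI will be bright
```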
Beam Width Selection Guide
| Task Type | Recommended Beam Width | Reasoning |
|---|---|---|
| Translation | 4-6 | Balance quality and computation |
| Summarization | 3-5 | Want coherent but not too generic |
| Question Answering | 1-3 | Usually one correct answer |
| Creative Writing | Not recommended | Too conservative |
| Code Generation | 2-4 | Syntax correctness important |
Direct Comparison: Greedy vs. Beam Search
Side-by-Side Examples
Sampling Methods Comparison
Different methods produce different outputs from the same prompt. The optimal sampling strategy depends on your specific application and requirements for creativity vs. predictability.
Quality Metrics Comparison
| Aspect | Greedy Search | Beam Search (width=5) |
|---|---|---|
| Fluency | Good | Very Good |
| Coherence | Good | Very Good |
| Creativity | Poor | Poor |
| Consistency | High | High |
| Speed | Fast | 5x Slower |
| Memory Usage | Low | 5x Higher |
When to Choose Each Method
Choose Greedy Search when:
- Speed is critical
- Memory is limited
- Task has clear "correct" answers
- Generating short completions
- Prototyping or debugging
Choose Beam Search when:
- Quality is more important than speed
- Task benefits from planning ahead (translation, summarization)
- You have computational resources
- Output length is moderate (20-100 tokens)
- Serving fewer requests but need higher quality
Practical Implementation with Hugging Face
Modern libraries make these methods easy to use:
```python
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

# Setup
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

prompt = "The future of AI"

# Greedy search (do_sample=False, num_beams=1)
greedy_output = generator(prompt, max_length=50, do_sample=False, num_beams=1)

# Beam search (do_sample=False, num_beams=5)
beam_output = generator(prompt, max_length=50, do_sample=False, num_beams=5,
                        early_stopping=True)

print(greedy_output[0]['generated_text'])
print(beam_output[0]['generated_text'])
```
Key Parameters Explained
- do_sample=False: Use deterministic methods (greedy/beam)
- num_beams=1: Greedy search
- num_beams>1: Beam search with the specified width
- early_stopping=True: End beam search as soon as enough complete candidates (ending in EOS) are found
- num_return_sequences: How many different outputs to return (demonstrated in the sketch below)
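To see these parameters working together, here is a short sketch reusing the generator built above. It asks beam search to return its three highest-scoring beams via num_return_sequences; the prompt is just an example:

```python
# Beam search with width 5, returning the 3 highest-scoring beams
outputs = generator(
    "The solution to climate change requires",
    max_length=40,
    do_sample=False,
    num_beams=5,
    num_return_sequences=3,  # must be <= num_beams for beam search
    early_stopping=True,
)
for i, out in enumerate(outputs, 1):
    print(f"Beam {i}: {out['generated_text']}")
```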
Common Issues and Debugging
Problem 1: Repetitive Output
python# Symptoms: "I think that I think that I think..." # Solutions: - Try beam search instead of greedy - Add length constraints - Use repetition penalty (covered in next lesson)
Problem 2: Truncated Output
```python
# Symptoms: Output stops too early
# Solutions:
#   - Increase max_length parameter
#   - Check for unexpected EOS tokens
#   - Verify tokenizer settings
```
Problem 3: Generic Output
```python
# Symptoms: Boring, predictable text
# Solutions:
#   - This is expected with deterministic methods
#   - Consider probabilistic sampling (next lesson)
#   - Adjust beam width
```
Problem 4: High Memory Usage
```python
# Symptoms: Out of memory errors with beam search
# Solutions:
#   - Reduce beam width
#   - Process shorter sequences
#   - Use a smaller model or smaller batches
#   - Switch to greedy for prototyping
```
Summary and Next Steps
What We've Learned
- Greedy Search: Simple, fast, but can get stuck in loops
- Beam Search: Better quality through exploration, but more expensive
- Trade-offs: Speed vs. quality, determinism vs. creativity
- Use Cases: When to choose each method
The Limitation of Deterministic Methods
Both greedy and beam search share a fundamental limitation: they're too conservative. They always choose the "safe" options, leading to:
- Predictable, sometimes boring text
- Lack of creativity and surprise
- Difficulty with open-ended creative tasks
What's Next
In our next lesson, we'll explore probabilistic sampling techniques that introduce controlled randomness:
- Temperature sampling: Adding creativity while maintaining quality
- Top-k sampling: Limiting choices to reasonable options
- Nucleus (top-p) sampling: Dynamically adjusting choice sets
- Combining techniques: Building production-ready systems
These methods will give us the tools to generate more interesting, creative, and diverse text while still maintaining quality and coherence.
Practice Exercises
Exercise 1: Implementation Challenge
Implement both greedy and beam search from scratch (without using Hugging Face's built-in methods). Compare your results with the library versions.
Exercise 2: Parameter Exploration
Test beam search with different beam widths (1, 3, 5, 10) on the same prompt (a timing harness sketch follows this exercise). Analyze how beam width affects:
- Output quality
- Generation time
- Memory usage
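A minimal timing harness for this exercise might look like the sketch below; the model, tokenizer, and prompt are assumed to be set up as in the Hugging Face example above:

```python
import time
import torch

prompt = "The future of AI"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

for width in [1, 3, 5, 10]:
    start = time.time()
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=60,
            do_sample=False,
            num_beams=width,
        )
    elapsed = time.time() - start
    print(f"beam width {width:>2}: {elapsed:.2f}s")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```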
Exercise 3: Use Case Analysis
For each scenario below, decide whether to use greedy search or beam search and justify your choice:
- Real-time chatbot responses
- Academic paper summarization
- Code completion in an IDE
- News article generation
- Poetry writing assistant
Exercise 4: Debugging Practice
Given these problematic outputs, identify the likely issue and propose solutions:
- "The the the the the the..."
- Output that stops after just 3 tokens
- Very generic, dictionary-like responses
- Out of memory errors
Additional Resources
- Hugging Face Text Generation Guide
- The Illustrated GPT-2 - Great visualization of the generation process
- Beam Search Implementation Tutorial
- Google's Neural Machine Translation System - Where beam search shines
Common Issues and Solutions
Repetition Problems:
- Use repetition penalty
- Consider probabilistic sampling techniques (covered in our next lesson)
Quality vs. Speed Trade-offs:
- Beam search: Better quality, slower
- Greedy search: Faster, potentially lower quality
- Consider probabilistic sampling for creative applications (next lesson)
When to Use Deterministic Methods
Ideal for:
- Factual question answering where accuracy is paramount
- Machine translation where consistency matters
- Summarization tasks requiring faithful representation
- Any application where deterministic, reproducible outputs are required
Less ideal for:
- Creative writing where diversity and surprise are valued
- Conversational AI where natural variation is important
- Brainstorming applications requiring multiple diverse ideas
In our next lesson, we'll explore probabilistic sampling techniques that introduce controlled randomness to address these limitations.