Text Generation: Probabilistic Sampling

Overview

In our previous lesson, we mastered deterministic generation methods—greedy search and beam search. These techniques are excellent for tasks requiring consistency and correctness, but they share a fundamental limitation when generating text from language models: they're too conservative.

When we want creative, diverse, or surprising text generation from transformer models, we need to introduce controlled randomness. This lesson explores probabilistic sampling techniques that balance creativity with quality, giving language models the ability to produce varied, interesting outputs while maintaining coherence.

Think of this as the difference between a conversation with a very knowledgeable but predictable expert versus one with a creative, thoughtful friend who surprises you with interesting perspectives.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why randomness improves text generation
  • Implement and tune temperature sampling for creativity control
  • Use top-k sampling to limit choice sets intelligently
  • Apply nucleus (top-p) sampling for dynamic token selection
  • Combine multiple techniques for production-ready systems
  • Debug and optimize sampling parameters for different use cases
  • Handle common issues like repetition and incoherence

The Case for Controlled Randomness

Why Perfect Predictions Aren't Perfect

Deterministic methods optimize for likelihood—they choose what's most probable given the training data. But the most probable text isn't always the most:

  • Interesting: "The weather is nice" vs. "The crimson sunset painted the horizon"
  • Useful: Generic responses vs. specific, tailored answers
  • Human-like: Robotic predictability vs. natural variation

The Exploration-Exploitation Balance

Every text generation step involves a fundamental trade-off:

Consider the prompt "Scientists discovered." Every sampling strategy balances two competing goals:

Exploitation
  • Use the highest-probability tokens
  • Maximize coherence and fluency
  • Follow predictable patterns
  • Good for factual, precise tasks
  • Lower chance of errors

Exploration
  • Consider lower-probability tokens
  • Increase diversity and creativity
  • Discover novel combinations
  • Good for creative, open-ended tasks
  • Higher chance of interesting insights

Sampling parameter effects:
  • More exploitation: lower temperature, lower top-k, lower top-p
  • More exploration: higher temperature, higher top-k, higher top-p

Example continuations of "Scientists discovered":
  • Low exploration (temperature ≈ 0.3): "Scientists discovered a surprising correlation between solar flares and quantum computing errors."
  • Medium exploration (temperature ≈ 0.7): "Scientists discovered an unusual phenomenon that contradicted existing theories."
  • High exploration (temperature ≈ 1.2): "Scientists discovered that reality itself might be a holographic projection of quantum information encoded on the universe's boundary."

The balance between exploration and exploitation is a fundamental trade-off in text generation. Finding the right balance depends on your specific needs for creativity versus reliability.

Real-world analogy: Choosing a restaurant

  • Exploitation: Always go to your proven favorite
  • Exploration: Try completely random new places
  • Smart sampling: Try highly-rated new places in genres you like

Temperature Sampling: The Creativity Dial

Core Concept

Temperature sampling modifies the probability distribution before sampling, controlling how "sharp" or "flat" the distribution becomes.

Mathematical formulation: p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

Where:

  • z_i = original logit for token i
  • T = temperature parameter
  • Lower T → more focused (sharper distribution)
  • Higher T → more random (flatter distribution)
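
To see what the formula does numerically, here is a quick sketch (using NumPy, with made-up logits for four candidate tokens):

python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # p_i = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()

logits = [4.0, 3.0, 2.0, 0.5]  # hypothetical logits for four candidate tokens

for t in [0.3, 0.7, 1.2]:
    print(t, softmax_with_temperature(logits, t).round(3))
# Low T concentrates probability on the top token; high T flattens the distribution.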

Temperature Effects Visualization

[Visualization: for the prompt "In the future, AI will", sampled completions at temperature 0.3 repeatedly pick the top-ranked tokens, at 0.7 they spread across a few likely tokens, and at 1.2 they reach well down into the ranked list.]

Temperature controls randomness in sampling. Lower values (near 0) make the model more deterministic, while higher values increase diversity but potentially reduce coherence.

Understanding Temperature Values

| Temperature | Effect | Use Cases | Example Output Style |
|---|---|---|---|
| 0.1-0.3 | Very focused, almost deterministic | Factual Q&A, technical writing | "Solar panels convert sunlight into electricity through photovoltaic cells." |
| 0.5-0.8 | Balanced creativity and coherence | General content, articles | "Solar technology represents a paradigm shift toward sustainable energy solutions." |
| 0.9-1.2 | Creative and diverse | Creative writing, brainstorming | "Sunlight dances across crystalline surfaces, awakening electrons in their silicon dreams." |
| 1.5+ | Highly creative, potentially incoherent | Experimental art, poetry | "Quantum photons whisper secrets to semiconducting consciousness, birthing energy..." |

Python Implementation

python
import torch
import torch.nn.functional as F

def temperature_sampling(model, tokenizer, prompt, temperature=0.7, max_length=50):
    """
    Generate text using temperature sampling.

    Args:
        temperature: Controls randomness (lower = more focused, higher = more random)
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        with torch.no_grad():
            logits = model(torch.tensor([generated])).logits[0, -1, :]
        # Scale logits by temperature, then sample from the softmax distribution
        probs = F.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).item()
        generated.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

Temperature Tuning Guidelines

For different content types:

python
# Recommended temperature ranges
TEMPERATURE_GUIDES = {
    "factual_qa": 0.1,        # Want precise, correct answers
    "technical_docs": 0.3,    # Clear, accurate explanations
    "news_articles": 0.5,     # Professional but not robotic
    "blog_posts": 0.7,        # Engaging and personable
    "creative_writing": 0.9,  # Original and surprising
    "poetry": 1.2,            # Highly creative and artistic
    "brainstorming": 1.5,     # Maximum idea diversity
}

Top-K Sampling: Intelligent Choice Limitation

Core Concept

Top-K sampling addresses a key problem with temperature sampling: even with low temperature, there's still a small chance of selecting very inappropriate tokens. Top-K limits the choice to only the K most likely tokens.

Algorithm:

  1. Get probability distribution from model
  2. Select only the top-K most likely tokens
  3. Renormalize probabilities among these K tokens
  4. Sample from this reduced distribution (optionally with temperature)
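
Before the full implementation below, here is the core of steps 2-4 on a hypothetical five-token distribution (a minimal sketch):

python
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # hypothetical next-token probabilities
k = 3

# Steps 2-3: keep the top-k tokens and renormalize among them
top_indices = np.argsort(probs)[::-1][:k]                   # indices of the k most likely tokens
top_probs = probs[top_indices] / probs[top_indices].sum()

# Step 4: sample from the reduced distribution
next_token = np.random.choice(top_indices, p=top_probs)
print(top_indices, top_probs.round(3), next_token)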

Visualization: Top-K Filtering Effect

[Visualization: for the prompt "The key to solving this problem is" with K = 40, the full distribution (100 candidate tokens shown) is truncated to the 40 most likely tokens before sampling.]

Key insight: Top-K sampling restricts the sampling pool to only the K most likely tokens, preventing the selection of highly improbable tokens while maintaining diversity.

Top-K sampling helps prevent low-quality or nonsensical outputs by restricting the sampling pool to only the K most likely next tokens.

Python Implementation

python
import torch
import torch.nn.functional as F

def top_k_sampling(model, tokenizer, prompt, k=50, temperature=1.0, max_length=50):
    """
    Generate text using top-k sampling.

    Args:
        k: Number of top tokens to consider
        temperature: Temperature scaling (applied after top-k filtering)
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        with torch.no_grad():
            logits = model(torch.tensor([generated])).logits[0, -1, :]
        # Keep only the k highest-scoring tokens, then sample among them
        top_values, top_indices = torch.topk(logits, k)
        probs = F.softmax(top_values / temperature, dim=-1)
        next_token = top_indices[torch.multinomial(probs, num_samples=1)].item()
        generated.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

Choosing K Values

| K Value | Effect | Best For | Reasoning |
|---|---|---|---|
| 10-20 | Very constrained | Technical writing, Q&A | Only most confident predictions |
| 30-50 | Balanced filtering | General content creation | Good quality-diversity balance |
| 80-100 | Light filtering | Creative writing | Removes only clearly bad options |
| 200+ | Minimal effect | When you trust the model | Mostly preserves original distribution |

Top-K vs. Temperature Trade-offs

python
# Comparison of different approaches
examples = [
    {
        "method": "Pure temperature",
        "params": {"temperature": 0.8},
        "pros": ["Simple", "Smooth control"],
        "cons": ["Can select very low-probability tokens"],
    },
    {
        "method": "Pure top-k",
        "params": {"k": 50, "temperature": 1.0},
        "pros": ["Prevents bad tokens", "Consistent quality"],
        "cons": ["Hard cutoff can be arbitrary"],
    },
]

Nucleus (Top-P) Sampling: Dynamic Choice Sets

Core Concept

Nucleus sampling (also called top-p sampling) addresses a key limitation of top-k: different contexts require different numbers of reasonable choices.

Key insight: Instead of a fixed number of tokens, select the smallest set of tokens whose cumulative probability exceeds threshold p.

Algorithm:

  1. Sort tokens by probability (descending)
  2. Find the smallest set where cumulative probability ≥ p
  3. Renormalize probabilities within this "nucleus"
  4. Sample from the nucleus
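
The same hypothetical five-token distribution, filtered with p = 0.9 instead of a fixed k (a minimal sketch of the steps above):

python
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # hypothetical next-token probabilities
p = 0.9

# Steps 1-2: sort descending and find the smallest prefix with cumulative probability >= p
order = np.argsort(probs)[::-1]
cumulative = np.cumsum(probs[order])
nucleus_size = int(np.searchsorted(cumulative, p)) + 1  # here: 4 tokens (0.40 + 0.25 + 0.15 + 0.12 = 0.92)

# Steps 3-4: renormalize within the nucleus and sample from it
nucleus = order[:nucleus_size]
nucleus_probs = probs[nucleus] / probs[nucleus].sum()
next_token = np.random.choice(nucleus, p=nucleus_probs)
print(nucleus, nucleus_probs.round(3), next_token)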

Why Nucleus Sampling is Revolutionary

Context-adaptive selection:

  • Confident predictions: Nucleus might contain only 5-10 tokens
  • Uncertain predictions: Nucleus might contain 100+ tokens
  • Self-adjusting: Model's confidence determines choice set size

Visualization: Nucleus Formation

[Visualization: for the prompt "When I consider the implications of" with p = 0.9, the nucleus contains 72 of the 100 candidate tokens shown; the cumulative probability curve marks where the p = 0.9 cutoff falls between higher- and lower-probability tokens.]

Key insight: Nucleus sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds p. This adapts to the confidence of the model at each position.

Nucleus sampling (top-p) is adaptive to the confidence of the model, using more tokens when the distribution is more uniform and fewer when it's more peaked.

Python Implementation

python
import torch
import torch.nn.functional as F

def nucleus_sampling(model, tokenizer, prompt, p=0.9, temperature=1.0, max_length=50):
    """
    Generate text using nucleus (top-p) sampling.

    Args:
        p: Cumulative probability threshold (0.0 to 1.0)
        temperature: Temperature scaling
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()

    for _ in range(max_length):
        with torch.no_grad():
            logits = model(torch.tensor([generated])).logits[0, -1, :]
        probs = F.softmax(logits / temperature, dim=-1)
        # Sort descending and keep the smallest set with cumulative probability >= p
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cutoff = int((torch.cumsum(sorted_probs, dim=-1) < p).sum().item()) + 1
        nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
        next_token = sorted_indices[torch.multinomial(nucleus_probs, 1)].item()
        generated.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated)

Choosing P Values

| P Value | Effect | Nucleus Size | Best For |
|---|---|---|---|
| 0.5-0.7 | Conservative | Small, focused | Technical content, Q&A |
| 0.8-0.9 | Balanced | Medium, adaptive | General content, articles |
| 0.92-0.95 | Creative | Larger, diverse | Creative writing, storytelling |
| 0.98+ | Very creative | Very large | Experimental, artistic content |

Nucleus vs. Top-K Comparison

[Visualization: the same prompt, "The most interesting aspect of machine learning is", is continued by six decoding strategies: greedy (defaults), beam search (num_beams=5), temperature sampling (temperature=0.7), top-k (k=50, temperature=1), nucleus (p=0.9, temperature=1), and a combined configuration (p=0.9, k=50, temperature=0.7).]

Method Characteristics:
  • Greedy: Deterministic, fluent but limited diversity
  • Beam Search: More comprehensive exploration, still deterministic
  • Temperature: Controls randomness, higher = more diverse
  • Top-K: Prevents low-probability selections
  • Nucleus: Adaptively selects token pool
  • Combined: Balanced quality and diversity

Different methods produce different outputs from the same prompt. The optimal sampling strategy depends on your specific application and requirements for creativity vs. predictability.

Advanced Techniques and Combinations

The Production Recipe: Combined Sampling

Most production systems combine multiple techniques for optimal results:

python
def production_sampling(model, tokenizer, prompt, top_k=50, top_p=0.9,
                        temperature=0.7, repetition_penalty=1.2, max_length=100):
    """
    Production-ready sampling combining multiple techniques.
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids[0].tolist()
    past_tokens = set(generated)  # For repetition penalty
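
The generation loop itself is omitted above. As a minimal sketch of what one step could do (the helper name combined_filter_step is illustrative, and the filter order shown is one common choice), given the model's last-position logits tensor and the past_tokens set:

python
import torch

def combined_filter_step(logits, past_tokens, top_k=50, top_p=0.9,
                         temperature=0.7, repetition_penalty=1.2):
    # 1. Repetition penalty on tokens that were already generated
    for token_id in past_tokens:
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty
    # 2. Temperature scaling
    logits = logits / temperature
    # 3. Top-k: mask everything below the k-th highest logit
    kth_value = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_value] = float("-inf")
    # 4. Top-p: zero out the tail once cumulative probability exceeds top_p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = int((torch.cumsum(sorted_probs, dim=-1) < top_p).sum().item()) + 1
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[:keep]] = sorted_probs[:keep]
    # 5. Renormalize and sample one token id
    filtered = filtered / filtered.sum()
    return torch.multinomial(filtered, num_samples=1).item()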

Handling Repetition

Repetition is a common issue in probabilistic sampling. Several techniques help:

Repetition Penalty

Reduce probability of recently used tokens:

Repetition Penalty Visualization

Repetition penalty reduces the probability of tokens that have already appeared in the generated text, helping prevent repetitive loops and encouraging more diverse output.

[Visualization: for the prompt "The cat sat on the mat. The cat", increasing the penalty shrinks the probability of the token "cat" (roughly 36% at a penalty of 1.2 and 31% at 1.5 in this demo, versus a high repetition probability with no penalty at 1.0).]

Higher penalty values more aggressively reduce the probability of repeated tokens. A value of 1.0 means no penalty is applied, while values above 1.0 increasingly penalize repetition.

python
# Simple repetition penalty implementation
def apply_repetition_penalty(logits, past_tokens, penalty=1.2):
    for token_id in past_tokens:
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

Frequency and Presence Penalties

  • Frequency penalty: Penalize based on how often a token appears
  • Presence penalty: Penalize any token that has appeared at all
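
A minimal sketch of one common formulation (assuming logits can be indexed by token id, e.g. a 1-D tensor or array; the function name and default values are illustrative):

python
from collections import Counter

def apply_frequency_presence_penalties(logits, generated_tokens,
                                        frequency_penalty=0.5, presence_penalty=0.3):
    # Count how many times each token has already been generated
    counts = Counter(generated_tokens)
    for token_id, count in counts.items():
        # Frequency penalty grows with the number of occurrences;
        # presence penalty is a flat cost for any token that has appeared at all
        logits[token_id] -= count * frequency_penalty + presence_penalty
    return logits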

Parameter Recommendations by Use Case

| Use Case | Temperature | Top-K | Top-P | Repetition Penalty | Notes |
|---|---|---|---|---|---|
| Chat Assistant | 0.7 | 50 | 0.9 | 1.1 | Balanced and helpful |
| Creative Writing | 0.9 | 100 | 0.95 | 1.2 | Encourage creativity |
| Technical Docs | 0.3 | 30 | 0.8 | 1.0 | Prioritize accuracy |
| News Articles | 0.6 | 40 | 0.85 | 1.15 | Professional tone |
| Code Generation | 0.2 | 20 | 0.7 | 1.0 | Syntax correctness |
| Poetry | 1.1 | 150 | 0.97 | 1.3 | Maximum creativity |

Practical Implementation with Hugging Face

The Transformers library makes advanced sampling easy:

python
from transformers import pipeline, set_seed

# Set up the pipeline
generator = pipeline('text-generation', model='gpt2-medium')
set_seed(42)  # For reproducible examples

prompt = "The future of artificial intelligence will"

# Temperature sampling
temp_output = generator(
    prompt,
    do_sample=True,
    temperature=0.7,
    max_length=50,
    num_return_sequences=3,
)

Key Hugging Face Parameters

  • do_sample=True: Enable probabilistic sampling
  • temperature: Control randomness (0.1-2.0)
  • top_k: Limit to top-k tokens (0 = disabled)
  • top_p: Nucleus sampling threshold (0.0-1.0)
  • repetition_penalty: Penalize repeated tokens (1.0-2.0)
  • num_return_sequences: Generate multiple outputs
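
As a sketch of how these fit together, reusing the generator pipeline and prompt defined above (exact outputs will vary with the model and seed):

python
# Combined sampling with the pipeline defined above
combined_output = generator(
    prompt,
    do_sample=True,          # enable probabilistic sampling
    temperature=0.7,         # overall creativity dial
    top_k=50,                # keep only the 50 most likely tokens
    top_p=0.9,               # nucleus threshold
    repetition_penalty=1.2,  # discourage repeated tokens
    num_return_sequences=3,  # generate several candidates to choose from
    max_length=60,
)

for i, candidate in enumerate(combined_output):
    print(f"--- Candidate {i + 1} ---")
    print(candidate["generated_text"])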

Troubleshooting and Debugging

Common Issues and Solutions

Issue 1: Outputs Too Random/Incoherent

python
# Symptoms: Nonsensical text, grammar errors
# Solutions:
#   - Lower temperature (try 0.5-0.7)
#   - Reduce top_p (try 0.8-0.9)
#   - Reduce top_k (try 30-50)
#   - Check if model is appropriate for task

Issue 2: Outputs Too Boring/Repetitive

python
# Symptoms: Generic responses, repeated phrases
# Solutions:
#   - Increase temperature (try 0.8-1.0)
#   - Increase top_p (try 0.9-0.95)
#   - Increase top_k (try 80-150)
#   - Add repetition penalty (try 1.1-1.3)

Issue 3: Inconsistent Quality

python
# Symptoms: Some outputs great, others terrible
# Solutions:
#   - Generate multiple samples and filter
#   - Use more conservative parameters
#   - Add post-processing validation
#   - Consider fine-tuning for your domain

Issue 4: Repetition Despite Penalties

python
# Symptoms: Still getting repetitive text
# Solutions:
#   - Increase repetition penalty (try 1.3-1.5)
#   - Implement frequency penalty
#   - Use presence penalty
#   - Check for training data artifacts

Parameter Tuning Process

  1. Start with defaults: temperature=0.7, top_p=0.9, top_k=50
  2. Adjust temperature first: Control overall creativity level
  3. Fine-tune filtering: Adjust top_p/top_k for quality
  4. Add repetition handling: If needed for your use case
  5. Test extensively: Use diverse prompts and evaluate outputs
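
For step 2, a simple sweep makes the effect of each value easy to compare (a minimal sketch reusing the generator pipeline from earlier; the prompt is a made-up example):

python
# Sweep temperature while holding the other parameters fixed
for temp in [0.3, 0.7, 1.0, 1.3]:
    outputs = generator(
        "Write a short product description for a smart water bottle:",
        do_sample=True,
        temperature=temp,
        top_k=50,
        top_p=0.9,
        max_length=60,
        num_return_sequences=2,
    )
    print(f"\n=== temperature = {temp} ===")
    for out in outputs:
        print("-", out["generated_text"])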

Evaluating Sampling Quality

Automated Metrics

python
# Simple quality metrics you can implement
def evaluate_generation_quality(texts):
    metrics = {}
    # Diversity: unique n-grams across all generated texts
    all_bigrams = set()
    all_trigrams = set()
    total_bigrams = 0
    total_trigrams = 0
    for text in texts:
        words = text.split()
        bigrams = list(zip(words, words[1:]))
        trigrams = list(zip(words, words[1:], words[2:]))
        all_bigrams.update(bigrams)
        all_trigrams.update(trigrams)
        total_bigrams += len(bigrams)
        total_trigrams += len(trigrams)
    # Distinct-n: share of n-grams that are unique (higher = more diverse)
    metrics["distinct_2"] = len(all_bigrams) / max(total_bigrams, 1)
    metrics["distinct_3"] = len(all_trigrams) / max(total_trigrams, 1)
    # Average length as a rough verbosity signal
    metrics["avg_length"] = sum(len(t.split()) for t in texts) / max(len(texts), 1)
    return metrics

Human Evaluation Framework

  1. Fluency: Is the text grammatically correct?
  2. Coherence: Does it make logical sense?
  3. Relevance: Does it address the prompt appropriately?
  4. Creativity: Is it interesting and non-generic?
  5. Appropriateness: Is it suitable for the intended use?

Summary

What We've Learned

  1. Temperature sampling: Control creativity with a single parameter
  2. Top-k sampling: Limit choices to reasonable options
  3. Nucleus sampling: Adaptive, context-aware token selection
  4. Combined approaches: Production-ready systems using multiple techniques
  5. Parameter tuning: Guidelines for different use cases
  6. Common issues: How to debug and fix sampling problems

The Complete Sampling Toolkit

You now have the complete toolkit for text generation:

Deterministic Methods (previous lesson):

  • Greedy search: Fast, reliable, predictable
  • Beam search: Higher quality, still deterministic

Probabilistic Methods (this lesson):

  • Temperature: Creativity dial
  • Top-k: Smart choice limitation
  • Nucleus: Adaptive selection
  • Combined: Production-ready systems

When to Use What

| Scenario | Recommended Approach | Key Parameters |
|---|---|---|
| Factual Q&A | Low temperature | temp=0.2, top_p=0.8 |
| Creative Writing | Nucleus sampling | temp=0.9, top_p=0.95 |
| Chat Assistant | Balanced combination | temp=0.7, top_k=50, top_p=0.9 |
| Code Generation | Conservative sampling | temp=0.3, top_k=30 |
| Brainstorming | High creativity | temp=1.1, top_p=0.97 |
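
These recommendations translate directly into code. A small configuration dictionary keeps them in one place (a sketch; the name SAMPLING_PRESETS and exact keys are illustrative):

python
# Illustrative presets matching the table above
SAMPLING_PRESETS = {
    "factual_qa":       {"do_sample": True, "temperature": 0.2, "top_p": 0.8},
    "creative_writing": {"do_sample": True, "temperature": 0.9, "top_p": 0.95},
    "chat_assistant":   {"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.9},
    "code_generation":  {"do_sample": True, "temperature": 0.3, "top_k": 30},
    "brainstorming":    {"do_sample": True, "temperature": 1.1, "top_p": 0.97},
}

# Usage with the Hugging Face pipeline from earlier:
# generator(prompt, max_length=60, **SAMPLING_PRESETS["chat_assistant"])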

Practice Exercises

Exercise 1: Parameter Exploration

Create a simple interface that lets you adjust temperature, top-k, and top-p parameters in real-time. Generate text with the same prompt using different settings and analyze the differences.

Exercise 2: Use Case Optimization

Choose a specific use case (e.g., writing product descriptions, generating study notes, creating story outlines) and systematically tune parameters to optimize for that task.

Exercise 3: Quality Evaluation

Implement automated metrics to evaluate generation quality. Compare different sampling methods on dimensions like diversity, fluency, and relevance.

Exercise 4: Repetition Handling

Experiment with different repetition penalty values and strategies. Create examples where repetition is problematic and show how to fix it.

Exercise 5: Production System

Build a complete text generation system that:

  • Takes user prompts
  • Allows parameter adjustment
  • Generates multiple candidates
  • Includes basic quality filtering
  • Handles edge cases gracefully

Additional Resources