Production RAG Systems

Overview

While Large Language Models (LLMs) have revolutionized natural language processing with their ability to generate coherent text and reason across domains, they face fundamental limitations. LLMs can only access knowledge encoded in their parameters during training, which leads to hallucinations, outdated information, and an inability to access domain-specific knowledge.

Retrieval-Augmented Generation (RAG) addresses these limitations by combining the generative power of LLMs with the ability to retrieve and leverage external knowledge sources. By dynamically accessing relevant information at inference time, RAG systems give model outputs an accuracy, currency, and verifiability that pure LLMs cannot achieve alone.

This lesson explores the foundations of RAG, its components, implementation approaches, and practical applications. We'll build intuitive understanding through analogies and visualizations, then gradually introduce more technical depth and hands-on implementation.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the motivation and principles behind Retrieval-Augmented Generation
  • Describe the core components of RAG systems: embedding generation, chunking, vector storage, retrieval, and generation
  • Implement a basic RAG system using popular libraries and tools
  • Evaluate and improve RAG performance through rerankers and other optimization techniques
  • Apply RAG to specific use cases and domains
  • Compare different RAG architectures and understand their trade-offs

Why RAG? Understanding the Need for External Knowledge

The Knowledge Access Problem

Large Language Models face several key limitations regarding knowledge:

  1. Static Knowledge: LLMs only "know" what they learned during training
  2. Knowledge Cutoff: Information after the training cutoff is inaccessible
  3. Hallucinations: Models may generate plausible but factually incorrect information
  4. Lack of Citations: Difficult to verify the source of generated information
  5. Domain Knowledge Gaps: Limited expertise in specialized domains

Analogy: The Expert Consultant with a Library

Think of an LLM as an expert consultant who has read many books but:

  • Cannot access any books published after their education ended
  • Must rely solely on memory for all facts and details
  • Has no way to verify their recollection against original sources
  • Cannot easily expand knowledge into new specialized domains

RAG transforms this consultant by providing:

  • A vast, current library that can be instantly searched
  • The ability to read specific sources before responding
  • Citations to verify information
  • Domain-specific resources that can be added on demand

From Memory-Only to Memory+Retrieval

| Aspect | LLM Only | LLM + RAG |
| --- | --- | --- |
| Knowledge Source | Parameters (frozen at training) | Parameters + external documents |
| Information Currency | Training cutoff date | As current as the knowledge base |
| Factual Accuracy | Varies, prone to hallucination | Higher, based on retrieved context |
| Verifiability | Low, no citations | High, can cite sources |
| Domain Adaptation | Requires fine-tuning | Add domain documents to knowledge base |
| Computation | Lower (generation only) | Higher (retrieval + generation) |
| Memory Usage | Fixed model size | Model + vector database |

The RAG Architecture: A High-Level View

Core Components

RAG systems consist of two main phases:

  1. Indexing Phase: Prepare documents for efficient retrieval
  2. Query Phase: Retrieve relevant information and augment LLM generation

Document Processing and Embedding Generation

Document Chunking: The Art of Segmentation

Effective RAG requires breaking down documents into appropriately sized pieces (chunks) that:

  • Are small enough to be processed efficiently
  • Are large enough to retain meaningful context
  • Preserve semantic coherence of the content

Common Chunking Strategies

  1. Fixed-Size Chunking: Split by character or token count

    • Simple but may break semantic units
  2. Semantic Chunking: Split based on document structure

    • Paragraphs, sections, or headings
    • Preserves natural document organization
  3. Recursive Chunking: Split hierarchically

    • Preserve relationships between chunks
    • Handle nested document structures
  4. Sliding Window Chunking: Create overlapping chunks

    • Ensures context is preserved across chunk boundaries
    • Increases storage requirements
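
A minimal sketch of the first and fourth strategies in plain Python (chunk sizes and overlap below are illustrative):

python
def fixed_size_chunks(text, chunk_size=500):
    # Strategy 1: split purely by character count; may cut sentences mid-way
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def sliding_window_chunks(text, chunk_size=500, overlap=100):
    # Strategy 4: overlapping windows preserve context across chunk boundaries
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "RAG systems retrieve external documents at query time. " * 100
print(len(fixed_size_chunks(doc)), len(sliding_window_chunks(doc)))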

Embedding Generation: Turning Text into Vectors

Embeddings are numerical representations of text in a high-dimensional vector space, where semantic similarity is captured by vector proximity.
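
A short sketch with the sentence-transformers library makes this concrete (the model name is one common open-source choice, not a requirement):

python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors
sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Best pizza toppings for a party",
]
emb = model.encode(sentences, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity
print(np.dot(emb[0], emb[1]))  # semantically related -> higher score
print(np.dot(emb[0], emb[2]))  # unrelated -> lower score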

Choosing the Right Embedding Model

| Model | Dimensions | Context Length | Performance | Speed | Use Case |
| --- | --- | --- | --- | --- | --- |
| OpenAI ada-002 | 1536 | 8192 | High | Medium | General purpose |
| BERT | 768 | 512 | Medium | Fast | Domain-specific |
| E5-large | 1024 | 512 | High | Medium | Retrieval-optimized |
| Sentence-T5 | 768 | 512 | High | Fast | Multilingual |
| GTE-large | 1024 | 512 | Very High | Medium | MTEB leader |
| INSTRUCTOR | 768 | 512 | High | Medium | Instruction-tuned |
| BGE | 1024 | 512 | Very High | Medium | Chinese + English |

Analogy: Library Catalog System

Think of embeddings like a modern library catalog system:

  • Each document is assigned coordinates in a multidimensional space
  • Similar documents are placed near each other
  • When someone asks a question, the system finds documents at coordinates similar to the question
  • This allows quick retrieval without having to read through all documents

Vector Storage and Indexing

Vector databases store and index embeddings for efficient similarity search:

  1. Exact Nearest Neighbor Search:

    • Computes distances between query and all vectors
    • Accurate but slow for large collections
  2. Approximate Nearest Neighbor (ANN) Search:

    • Uses algorithms like HNSW, IVF, or LSH
    • Trades perfect accuracy for speed
    • Enables scalable similarity search
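
The trade-off is easy to see with FAISS, which offers both kinds of index (dimensions and data here are synthetic):

python
import numpy as np
import faiss

d, n = 128, 10_000
xb = np.random.rand(n, d).astype("float32")  # document vectors
xq = np.random.rand(1, d).astype("float32")  # query vector

exact = faiss.IndexFlatL2(d)      # exact: scans every stored vector
ann = faiss.IndexHNSWFlat(d, 32)  # approximate: HNSW graph, 32 = connectivity
exact.add(xb)
ann.add(xb)

for index in (exact, ann):
    distances, ids = index.search(xq, 5)  # top-5 nearest neighbors
    print(ids[0])  # the ANN ids usually, but not always, match the exact ids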

Common Vector Database Options

| Database | Type | ANN Algorithms | Hosting Options | Features | Use Case |
| --- | --- | --- | --- | --- | --- |
| Pinecone | Managed | HNSW | Cloud-only | Metadata filtering, namespaces | Production-ready |
| Weaviate | Full-featured | HNSW | Self-host/Cloud | Multi-modal, classes, schema | Complex data models |
| Chroma | Lightweight | HNSW | Self-host/Embedded | Simple API, Python-native | Development |
| FAISS | Library | Multiple | Self-host | High performance, customizable | Research |
| Qdrant | Full-featured | HNSW | Self-host/Cloud | Payload filtering, clustering | Production |
| Milvus | Full-featured | Multiple | Self-host/Cloud | Hybrid search, sharding | Large scale |
| pgvector | Database extension | IVF | Self-host | PostgreSQL integration | Existing PostgreSQL users |

Retrieval Mechanisms: Finding the Right Context

Vector Search: Similarity Metrics

Different distance measures for finding similar vectors:

  1. Cosine Similarity:

    • Measures the angle between vectors
    • Scale-invariant
    • Most common for text embeddings
    • Formula: \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}|\,|\mathbf{B}|}
  2. Euclidean Distance:

    • Measures straight-line distance between vector endpoints
    • Affected by vector magnitude
    • Formula: d(\mathbf{A}, \mathbf{B}) = \sqrt{\sum_i (A_i - B_i)^2}
  3. Dot Product:

    • Sum of element-wise products
    • Not normalized; grows with vector magnitude
    • Formula: \mathbf{A} \cdot \mathbf{B} = \sum_i A_i B_i
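
The three metrics side by side in NumPy:

python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])  # same direction as A, twice the magnitude

cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
euclidean = np.linalg.norm(A - B)
dot = A @ B

print(cosine)     # 1.0   -> identical direction; scale is ignored
print(euclidean)  # 3.74  -> penalizes the magnitude difference
print(dot)        # 28.0  -> grows with magnitude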

Beyond Simple Retrieval: Advanced Techniques

1. Hybrid Search

Combines semantic search with keyword-based (sparse) search:

  • Semantic search captures meaning
  • Keyword search captures specific terms
  • Combined for better precision and recall
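
One common way to merge the two ranked lists is Reciprocal Rank Fusion (RRF); a minimal sketch (document IDs are illustrative):

python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each list is ranked best-first; k dampens the dominance of top ranks
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]   # semantic search results
sparse = ["doc1", "doc9", "doc3"]  # keyword (BM25) results
print(reciprocal_rank_fusion([dense, sparse]))  # docs on both lists rise to the top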

2. Reranking

Reranking applies a second, more computationally intensive model to improve retrieval quality:

  1. Initial retrieval fetches candidate documents (often 20-100)
  2. Reranker evaluates each candidate more thoroughly
  3. Documents are reordered based on relevance scores

Popular rerankers:

  • Cohere Rerank
  • BGE Reranker
  • uniCOIL
  • MonoT5
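
A sketch of the over-retrieve-then-rerank pattern with a cross-encoder from sentence-transformers (the model name is one publicly available reranker; the candidate passages are illustrative):

python
from sentence_transformers import CrossEncoder

query = "How do I rotate API keys?"
candidates = [  # in practice, the top 20-100 hits from vector search
    "API keys can be rotated from the security settings page.",
    "Our office rotates staff between teams every quarter.",
    "Rotating a key immediately invalidates the old credential.",
]

# A cross-encoder reads query and passage together, giving sharper scores
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in reranked[:2]:  # keep only the best few for the prompt
    print(f"{score:.2f}  {passage}")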

3. Query Transformation

Techniques to improve the query before retrieval:

  1. Query Expansion:

    • Add related terms to the query
    • Example: "car" → "car automobile vehicle"
  2. HyDE (Hypothetical Document Embeddings):

    • Use LLM to generate a hypothetical perfect document
    • Embed this document as the query
  3. Multi-Query Retrieval:

    • Generate multiple perspectives on the query
    • Combine retrieval results
    • Increases recall at the cost of more processing
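
A sketch of multi-query retrieval; llm and vector_search are hypothetical stand-ins for your model call and retriever:

python
def multi_query_retrieve(query, llm, vector_search, n_variants=3, k=5):
    # Ask the LLM for alternative phrasings of the same information need
    prompt = (f"Rewrite the following question in {n_variants} different ways, "
              f"one per line:\n{query}")
    variants = [query] + llm(prompt).strip().splitlines()

    # Retrieve for every phrasing; deduplicate while preserving best rank
    seen, merged = set(), []
    for q in variants:
        for doc in vector_search(q, k=k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged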

Prompt Engineering for RAG

Constructing Effective Prompts

The prompt structure for RAG typically includes:

  1. System Instructions: Define the role and behavior of the assistant
  2. Retrieved Context: External knowledge from vector search
  3. User Query: The original question or instruction
  4. Response Format: Structure for the model's output

Example RAG Prompt Template

python
def create_rag_prompt(query, context_docs, system_instruction=None):
    """
    Create a RAG prompt with retrieved context.

    Args:
        query: User's query
        context_docs: Retrieved documents/passages
        system_instruction: Optional system instruction

    Returns:
        A prompt string combining instructions, context, and the query.
    """
    if system_instruction is None:
        system_instruction = ("You are a helpful assistant. Answer using only "
                              "the provided context; say so if it is insufficient.")

    # Number the passages so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))

    # One reasonable layout; adapt the section order to your model
    return f"{system_instruction}\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"

Implementing a Basic RAG System

Setting Up a RAG Pipeline

Let's implement a minimal but complete RAG pipeline using LangChain and Chroma (paths, chunk sizes, and the sample query below are illustrative):

python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.llms import OpenAI

# Set up environment
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"  # Replace with your API key

# Load and chunk documents ("./docs" is a placeholder path)
docs = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed chunks into a local vector store, then wire up the QA chain
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What are the main limitations of LLM-only systems?"))

More Sophisticated RAG Implementation

Here's a more advanced sketch that over-retrieves and then reranks; the wiring after the imports assumes a persisted Chroma store and a Cohere API key:

python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers.multi_query import MultiQueryRetriever

# Dense retriever deliberately over-fetches candidates for the reranker
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=OpenAIEmbeddings())
base_retriever = vectordb.as_retriever(search_kwargs={"k": 20})

# The reranker rescores the 20 candidates and keeps only the best few
retriever = ContextualCompressionRetriever(
    base_compressor=CohereRerank(top_n=4),  # requires COHERE_API_KEY
    base_retriever=base_retriever,
)

# Optionally widen recall further with LLM-generated query variants
retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=ChatOpenAI())

RAG Evaluation and Optimization

Evaluating RAG System Performance

Effective RAG evaluation should consider multiple dimensions:

  1. Retrieval Metrics:

    • Precision: Are retrieved documents relevant?
    • Recall: Are all relevant documents retrieved?
    • Mean Average Precision (MAP): Ranking quality
  2. Generation Quality Metrics:

    • Faithfulness: Does output align with retrieved information?
    • Answer Relevance: Does output address the query?
    • Groundedness: Is the output supported by evidence?
  3. End-to-End Metrics:

    • Correctness: Is the final answer factually correct?
    • Helpfulness: Does it solve the user's problem?
    • Latency: Is retrieval + generation time acceptable?
The RAGAS library packages several of these metrics; here is a sketch of its evaluation setup (the record uses RAGAS's expected fields, with illustrative values):

python
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
)
from ragas.langchain import RagasEvaluatorChain
from datasets import Dataset

# Example evaluation data
eval_data = [
    {
        "question": "What is retrieval-augmented generation?",
        "answer": "RAG augments an LLM with documents retrieved at query time.",
        "contexts": ["Retrieval-Augmented Generation (RAG) combines retrieval "
                     "from an external knowledge base with LLM generation."],
        "ground_truths": ["RAG combines external retrieval with generation."],
    },
]
dataset = Dataset.from_list(eval_data)

Optimizing RAG Performance

Chunking Strategy Optimization

Chunk size and overlap directly shape retrieval quality: smaller chunks make matching more precise but carry less context, while larger chunks preserve context at the cost of noisier matches. It is worth sweeping a few configurations (e.g., 500, 1000, and 2000 characters, as in Exercise 2 below) against a fixed evaluation set before settling on one.

Advanced Optimization Techniques

  1. Metadata Filtering:

    • Add metadata to chunks (source, date, category)
    • Filter retrieval based on relevant metadata
    • Increases precision by limiting search scope
  2. Ensemble Retrieval:

    • Combine results from multiple retrieval methods
    • Different embedding models
    • Different chunking strategies
    • Weighted combination of search results
  3. Query-focused Chunking:

    • Dynamically adjust chunk size based on query complexity
    • Focus on semantic units like paragraphs for factual queries
    • Use larger chunks for conceptual or summary queries
  4. Contextual Compression:

    • Extract only relevant parts of retrieved chunks
    • Reduces noise in the context
    • Allows for more retrieved documents within context window
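
As an example of the first technique, metadata filtering with Chroma's native client (collection name, fields, and documents are illustrative):

python
import chromadb

client = chromadb.Client()
collection = client.create_collection("kb")
collection.add(
    ids=["1", "2"],
    documents=["2023 revenue grew 12%.", "2021 revenue grew 5%."],
    metadatas=[{"year": 2023}, {"year": 2021}],
)

# Restrict the similarity search to sufficiently recent documents
results = collection.query(
    query_texts=["revenue growth"],
    n_results=1,
    where={"year": {"$gte": 2023}},
)
print(results["documents"])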

Advanced RAG Architectures

Beyond Basic RAG: Architectural Variations

  1. Multi-stage Retrieval:

    • Coarse retrieval → Fine-grained retrieval
    • Reduces computation while maintaining quality
  2. Recursive Retrieval:

    • Initial answer generates follow-up queries
    • Iteratively refine results with new retrievals
  3. Agent-based RAG:

    • System decides when to retrieve information
    • Multiple retrievers for different knowledge sources
    • Strategic decisions about what to retrieve
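
A compact sketch of the recursive pattern (item 2 above); llm and retrieve are hypothetical stand-ins for your model call and retriever:

python
def recursive_rag(question, llm, retrieve, max_rounds=3):
    context, search_query = [], question
    for _ in range(max_rounds):
        context.extend(retrieve(search_query))
        context_text = "\n".join(context)
        answer = llm(f"Context:\n{context_text}\n\nQuestion: {question}\nAnswer:")

        # Let the model name missing information, or declare itself done
        followup = llm(f"Draft answer:\n{answer}\n\nReply DONE if the answer is "
                       "complete, or give one follow-up search query.")
        if followup.strip() == "DONE":
            break
        search_query = followup  # retrieve again with the refined query
    return answer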

RAG Variants

| Architecture | Description | Benefits | Limitations | Use Cases |
| --- | --- | --- | --- | --- |
| Standard RAG | Basic retrieve-then-generate flow | Simple, effective for many cases | Fixed retrieval approach | General Q&A, document assistants |
| Adaptive RAG | Dynamically adjusts retrieval strategy | Better performance across query types | Higher complexity | Diverse query handling |
| Self-RAG | Model decides when to retrieve | Reduced hallucination, more efficient | Requires specialized training | Factual domains, scientific applications |
| FLARE | Forward-Looking Active REtrieval | Identifies knowledge gaps during generation | Increased latency | Complex reasoning tasks |
| RARR | Retrofit Attribution using Research and Revision | Improved attribution of generated answers | Multi-stage complexity | Legal analysis, medical diagnosis |
| SILO RAG | Context segmentation and specialized models | Better handling of long contexts | Higher resource usage | Document analysis, complex reports |

Domain-Specific RAG Adaptations

Customizing RAG for Different Domains

Different domains require specific RAG adaptations:

  1. Medical RAG:

    • Specialized medical embeddings
    • Entity-centric chunking (diseases, treatments)
    • Complex medical reasoning
  2. Legal RAG:

    • Citation-aware retrieval
    • Hierarchical document structure
    • Precedent-based reasoning
  3. Technical Documentation RAG:

    • Code-aware chunking
    • API documentation structure
    • Query reformulation for technical terms
  4. Academic Research RAG:

    • Citation graph awareness
    • Cross-paper connections
    • Scientific terminology handling

Practical Exercises

Exercise 1: Building a Basic RAG System

Implement a RAG system for a collection of Wikipedia articles:

  1. Load and chunk articles
  2. Create embeddings and store in a vector database
  3. Implement query processing and retrieval
  4. Connect to an LLM for generation
  5. Test with various questions

Exercise 2: Chunking Strategy Comparison

Compare different chunking strategies:

  1. Fixed-size chunking (500, 1000, 2000 characters)
  2. Semantic chunking (paragraphs, sections)
  3. Sliding window with different overlap percentages
  4. Evaluate retrieval quality for each approach

Exercise 3: Optimizing RAG with Reranking

Enhance a basic RAG system with rerankers:

  1. Start with a basic vector retrieval system
  2. Implement over-retrieval (fetch 10-20 documents)
  3. Add a reranker to prioritize the most relevant chunks
  4. Compare performance with and without reranking

Exercise 4: Multi-Query RAG

Implement a multi-query RAG system:

  1. Use an LLM to generate multiple query formulations
  2. Retrieve results for each formulation
  3. Combine results through reranking or ensemble methods
  4. Compare to single-query baseline

Summary

In this lesson, we've explored Retrieval-Augmented Generation (RAG) systems, which enhance LLM capabilities by connecting them to external knowledge sources. We've covered:

  1. The motivation and principles behind RAG:

    • Overcoming LLM knowledge limitations
    • Enhancing factuality and reducing hallucinations
    • Enabling domain specialization without full retraining
  2. Core RAG components and processes:

    • Document chunking strategies
    • Embedding generation and vector storage
    • Retrieval mechanisms and similarity metrics
    • Prompt engineering for effective augmentation
  3. Implementation approaches:

    • Building RAG pipelines with popular libraries
    • Advanced techniques like reranking and query transformation
    • Evaluation methods for RAG systems
  4. Advanced architectures and optimizations:

    • Beyond basic RAG: adaptive and recursive approaches
    • Domain-specific adaptations
    • Performance tuning and enhancement

RAG represents a significant advancement in the practical application of LLMs, enabling more accurate, current, and verifiable AI systems. By understanding the principles and techniques covered in this lesson, you're well-equipped to build RAG systems that leverage the strengths of both retrieval and generation approaches.

Additional Resources

Libraries and Tools

  • LangChain - Framework for building RAG applications
  • FAISS - Library for efficient similarity search
  • LlamaIndex - Data framework for RAG applications
  • Weaviate - Vector database
  • RAGAS - Evaluation framework for RAG
