Comprehensive overview of RAG indexing, retrieval, generation, vectors, and agentic AI.
Updated: April 2026
Version: 1.0
Category: RAG
Reading Time: ~11 min
Author: Michaël Bettan
01
Definitions & Architecture
What is RAG?
Retrieval-Augmented Generation (RAG) is one of the two dominant architectural patterns for context construction in AI applications (alongside Agentic AI). It functions analogously to feature engineering in classical ML, dynamically injecting external, task-specific information into the Large Language Model's context window at inference time. This combines the LLM's reasoning capabilities with external, proprietary data stores to generate accurate, factually grounded, and contextually relevant responses.
What RAG Solves
RAG addresses the limitations of static LLMs: training data cutoffs, lack of access to private enterprise data, and hallucinations.
Critically, fine-tuning teaches a model new behavioral styles and task formats, but does not reliably inject dynamic factual knowledge and cannot prevent hallucination on data that changes after training. RAG is the architecturally correct solution for factual grounding at inference time — injecting verifiable, source-attributed context directly into the prompt without retraining. It also improves token efficiency by supplying only the most relevant chunks per query.
High-Level RAG Architecture Flow
Data Sources (PDFs, DBs)
Chunking & Embedding
Vector Database (Index)
Semantic Search (Retrieval)
User Question
Context Assembly (Prompt)
Large Language Model (Generation)
Final Answer
02
Stage 1: RAG Indexing (Offline Pre-processing)
Loaders
Extract data from unstructured sources (PDFs, HTML, Word, JSON) using tools like LangChain Document Loaders.
Splitters (Chunking)
Break large texts into digestible chunks. RecursiveCharacterTextSplitter is preferred as it preserves semantic coherence by splitting on paragraphs, then sentences, then words.
Chunking Strategy: Overlapping
Adjacent chunks share a defined number of tokens at boundaries. Prevents the loss of critical context that sits at a chunk boundary; ensures complete thoughts split across chunks are captured.
Chunking Strategy: Semantic
Splits on logical/topical boundaries (paragraph shifts, section headers) rather than token count — ensures one complete thought per chunk.
Chunking Strategy: Hierarchical
Small "child" chunks are embedded/indexed for precise matching; metadata links retrieve the larger "parent" chunk for LLM context — decouples retrieval granularity from generation context size.
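The overlapping strategy above can be sketched as a sliding window. A minimal illustration, using whitespace-split words as stand-ins for tokens:

```python
def chunk_with_overlap(text, chunk_size=100, overlap=20):
    """Split text into word-based chunks; adjacent chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance by the non-shared portion
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window already covers the tail
            break
    return chunks
```

Production splitters operate on model tokens and recursive separators rather than words, but the boundary-sharing logic is the same.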
Embedding
Convert text chunks into high-dimensional mathematical representations (vectors) using models like OpenAI Embeddings or SentenceTransformers.
Vector Store
Store vectors and metadata in specialized databases (e.g., Chroma, FAISS, pgvector, Pinecone).
Contextual Retrieval
Each chunk is augmented with metadata at indexing time (document title, section summary, tags). This gives the embedding model richer signal during retrieval, improving precision and reducing hallucination.
Chunk Size Trade-off
Smaller chunks → higher retrieval precision, more diverse results — but greater computational overhead and risk of losing surrounding context.
Larger chunks → richer context per chunk, fewer retrieval calls — but lower precision and higher token cost injected into the LLM.
Hard constraint: chunk size must not exceed the maximum context length of either the embedding model or the generative model — whichever is smaller.
Overlapping chunks trade index size for boundary loss prevention.
03
Stage 2 & 3: Retrieval & Generation
Retrieval (Real-time): The user query is vectorized using the same embedding model. A similarity search compares the query vector to stored vectors to find the Top-K most relevant chunks.
1. Query Transformation (Pre-Retrieval)
Acts as a hallucination prevention layer:
Query Rewriting: An AI model reformulates an ambiguous user query into a clearer, self-contained version before retrieval — directly reduces retrieval failure from poorly phrased inputs.
HyDE (Hypothetical Document Embeddings): The LLM generates a hypothetical ideal answer; that answer is embedded and used for retrieval instead of the raw query — bridges the vocabulary gap.
Multi-Query Retrieval: The LLM generates N alternative rephrasings of the query, runs parallel retrievals, and unions results — reduces single-query retrieval failure.
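Multi-Query Retrieval reduces to a de-duplicated union over per-variant result lists. A sketch in which `rephrase` stands in for the LLM call and `retrieve` for the vector-store search (both are hypothetical stand-ins):

```python
def multi_query_retrieve(query, rephrase, retrieve, n_variants=3, top_k=5):
    """Retrieve over the original query plus N LLM-generated rephrasings,
    then union the results while preserving first-seen order."""
    variants = [query] + rephrase(query, n_variants)  # rephrase: LLM stand-in
    seen, fused = set(), []
    for q in variants:
        for doc_id in retrieve(q, top_k):  # retrieve: vector-store stand-in
            if doc_id not in seen:
                seen.add(doc_id)
                fused.append(doc_id)
    return fused
```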
2. Reranking (Post-Retrieval)
Top-K retrieved chunks are scored a second time by a Cross-Encoder model (e.g., Cohere Rerank, BGE-Reranker) that evaluates the (query, chunk) pair jointly.
Unlike bi-encoders that embed independently, cross-encoders attend to both simultaneously, producing significantly more precise relevance scores.
Only the Top-N chunks (N < K) pass to the LLM. Can also rerank by recency for time-sensitive queries. Directly reduces hallucination by filtering topically similar but factually irrelevant chunks.
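The reranking step reduces to score-and-truncate. A sketch in which `cross_encoder_score` is a stand-in for a real cross-encoder model scoring each (query, chunk) pair jointly:

```python
def rerank(query, chunks, cross_encoder_score, top_n=3):
    """Re-score retrieved chunks with a cross-encoder and keep only the Top-N."""
    scored = [(cross_encoder_score(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return [c for _, c in scored[:top_n]]
```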
Stage 3: RAG Generation
The retrieved chunks are formatted and injected into a Prompt Template alongside the user query ("hydrating" the prompt). The LLM processes this context to generate a final, grounded answer.
Source Citations: Production RAG systems append source citations (document title, chunk ID, page number) to generated responses — enabling human verification. This auditability is the key enterprise differentiator: RAG outputs are falsifiable, whereas bare LLM outputs are not.
Context Ordering (Lost-in-the-Middle)
LLMs exhibit positional bias — they attend most strongly to content at the beginning and end of the context window and systematically underutilize middle content. Best practice: place highest-relevance chunks first and last. Failure to order context correctly causes hallucination despite successful retrieval — the correct answer exists in the context but the model ignores it.
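The reordering heuristic can be sketched as follows, taking chunks already sorted by descending relevance and interleaving them toward the edges of the context window:

```python
def order_for_llm(chunks_by_relevance):
    """Place the highest-relevance chunks at the edges of the context window:
    rank 1 first, rank 2 last, rank 3 second, and so on, pushing the
    least relevant chunks into the underutilized middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```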
04
RAG vs Fine-tuning
RAG (Best for: Facts / Information)
Accessing up-to-date data, proprietary knowledge, grounding responses in retrieved evidence.
Fine-tuning (Best for: Form / Style)
Output format (JSON/YAML), brand voice, specialized jargon, instruction-following patterns.
Both combined (Best for: Production systems)
RAG supplies the facts; fine-tuning ensures the correct format and tone.
05
Vectors & Similarity Search
Vector Space (Latent Space): A mathematical construct where semantically similar concepts are located closer together. Distance between vectors determines relevance.
Euclidean (L2)
Shortest straight-line distance. Lower = more similar.
Dot Product (Inner Product)
Measures magnitude of projection. Higher positive = more similar.
Cosine Similarity
Measures the cosine of the angle between two vectors, ignoring magnitude. Ranges from −1 to 1; higher = more similar. Default metric in most vector stores for semantic search.
Cosine Distance
Calculated as 1 − Cosine Similarity. Ranges from 0 to 2; lower = more similar. Used only when a distance metric is explicitly required by the algorithm.
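The four metrics in pure Python (no external dependencies):

```python
import math

def dot(a, b):
    """Inner product: higher positive = more similar (magnitude-sensitive)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """L2 distance: lower = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]: higher = more similar, magnitude ignored."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, in [0, 2]: lower = more similar."""
    return 1.0 - cosine_similarity(a, b)
```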
06
Search Paradigms & Algorithms
Search Paradigms
Dense (Semantic) Search: Uses vector embeddings to match underlying meaning and context. Struggles with exact names, IDs, or acronyms and can return false positives.
Sparse (Keyword) Search: Uses BM25, which evaluates exact term matches by combining: (1) Term Frequency (TF) with saturation; (2) Inverse Document Frequency (IDF) to boost rare terms; (3) Document length normalization. Highly effective for proper nouns and IDs; blind to synonyms.
Hybrid Search: Fuses Dense and Sparse results to maximize both recall and precision.
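BM25's three components (TF saturation, IDF, length normalization) fit in a few lines. A minimal sketch over pre-tokenized documents, using the common k1/b parameterization:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of terms) against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # IDF boosts rare terms across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            # TF saturation (k1) plus document-length normalization (b)
            s += idf[t] * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```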
Reciprocal Rank Fusion (RRF)
RRF(d) = Σ_m 1/(k + rank_m(d)), summed over each retrieval method m
Fuses results by rank position rather than raw scores; this is necessary because BM25 scores are unbounded while cosine similarity is bounded in [−1, 1]. The constant k = 60 damps the influence of outlier rankings.
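A minimal RRF fusion over ranked lists of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists by rank position (Reciprocal Rank Fusion).
    rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```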
k-NN (k-Nearest Neighbors)
Brute-force, exact distance calculation against every stored vector. Highly accurate but computationally expensive; scales linearly as O(n·d) for n vectors of dimension d.
ANN (Approximate Nearest Neighbors)
Sacrifices slight accuracy for massive speed gains using indexing structures.
HNSW (HNSWlib)
Multi-layer proximity graph — upper layers for fast coarse navigation, lower layers for dense local precision. Best speed-accuracy tradeoff. O(log n) complexity.
FAISS
Industry-standard ANN library from Meta (Facebook) AI Research, supporting IVF, PQ, HNSW, and other index types; widely used to back production vector search deployments.
IVF (Inverted File Index)
K-means clustering organizes vectors into clusters; search targets only nearest clusters — dramatically reduces search space without scanning all vectors.
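A toy IVF search, assuming the k-means step has already been run so the centroids and per-cluster vector lists are given:

```python
def ivf_search(query, centroids, clusters, dist, n_probe=1):
    """Scan only the vectors inside the n_probe clusters whose centroids are
    nearest the query, instead of scanning the full index."""
    nearest = sorted(range(len(centroids)),
                     key=lambda i: dist(query, centroids[i]))[:n_probe]
    candidates = [v for i in nearest for v in clusters[i]]
    return min(candidates, key=lambda v: dist(query, v))
```

Raising `n_probe` trades speed for recall: probing more clusters recovers vectors that fell just across a cluster boundary.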
LSH (Locality-Sensitive Hashing)
Hashes similar vectors into the same buckets for fast candidate retrieval.
PQ (Product Quantization)
Compresses vectors into compact codes by quantizing subvectors independently; reduces memory footprint and speeds distance computations.
Annoy & ScaNN
Annoy builds multiple binary trees. ScaNN is optimized for high-dimensional semantic search at production scale.
07
Agentic AI & LangGraph
Agents transform LLMs from stateless responders to autonomous, reasoning systems capable of executing multi-step workflows (e.g., ReAct: Reason + Act). LangGraph supports both DAGs and cyclical graphs — cycles enable iterative reasoning loops (e.g., retrieve → grade → re-query).
LangGraph Concepts
State: The agent's "scratchpad" (memory/message history) shared across all steps.
Nodes: Functions or tools the agent executes (e.g., search_web, retrieve_docs).
Edges: The control flow between nodes.
Conditional Edges: Decision points where the LLM routes the flow dynamically.
Tools & Toolkits: Executable functions the agent can invoke. LLMs are given Tool Schemas (inputs) and Descriptions to decide when/how to use them.
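The retrieve → grade → re-query cycle can be sketched as a minimal state graph in plain Python. This illustrates the control flow only; it is not the LangGraph API, and the node and edge functions below are hypothetical stand-ins:

```python
def run_graph(state, nodes, edges, start, end, max_steps=10):
    """Execute a tiny state graph: each node updates the shared state dict;
    each edge (possibly conditional on state) names the next node. Cycles allowed."""
    current = start
    for _ in range(max_steps):
        if current == end:
            return state
        state = nodes[current](state)
        current = edges[current](state)  # conditional edge: routes on state
    return state

# Hypothetical retrieve -> grade -> (rewrite | generate) loop:
nodes = {
    "retrieve": lambda s: {**s, "docs": ["doc for " + s["query"]]},
    "grade":    lambda s: {**s, "ok": "refined" in s["query"]},
    "rewrite":  lambda s: {**s, "query": "refined " + s["query"]},
    "generate": lambda s: {**s, "answer": "grounded in " + s["docs"][0]},
}
edges = {
    "retrieve": lambda s: "grade",
    "grade":    lambda s: "generate" if s["ok"] else "rewrite",
    "rewrite":  lambda s: "retrieve",  # cycle back for another retrieval pass
    "generate": lambda s: "END",
}
result = run_graph({"query": "q"}, nodes, edges, start="retrieve", end="END")
```

The conditional edge on `grade` is the decision point: a failing grade routes back through `rewrite` and `retrieve` instead of proceeding to generation.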
CoALA Framework Memory
The Cognitive Architectures for Language Agents (CoALA) framework defines four memory types:
Working Memory (Short-Term): The active workspace; holds immediate context.
Episodic Memory (Events): Case-based reasoning. Stores specific past experiences.
Semantic Memory (Facts): Factual knowledge base. Generally read-only to prevent hallucination-driven overwriting.
Procedural Memory (Skills): Encodes operational workflows/rules. Updated via Reflection (meta-prompting).
Memory Scopes
Personal/User-Specific: Strictly isolated data tailored to individual preferences.
Community/Public: Anonymized, aggregated insights shared across all users (e.g., discovering a universal troubleshooting fix).
08
Knowledge Graphs (Graph-Based RAG)
Surpasses flat vector stores by providing structured, navigable context built on directed property graphs.
Ontology
Formal representation of domain knowledge. Defined using languages like RDFS (basic taxonomies) and OWL (complex logical constraints and reasoning). Built via tools like Protégé.
Graph Components
Nodes (entities like AAPL, SEC), Edges (relationships like issuedBy, isRegulatedBy), and Properties.
Hybrid Embeddings
Flattens graph topology + text into rich descriptions (e.g., "Stock AAPL issued by Apple Inc. regulated by SEC") before vectorization.
Value Add
Enables multi-hop reasoning (connecting disparate facts A→B→C), guarantees traceability/explainability, and supports exact, schema-constrained filtering via graph query languages like Cypher.
09
Performance & Evaluation
Semantic Caches (Interceptor Layer)
Intercepts repeat queries and serves cached answers instantly, bypassing the expensive LLM reasoning and generation steps. Reduces latency and inference costs.
Entity Masking: Replaces specific values with variables (e.g., "AAPL price" → "[TICKER] price") so one cache entry covers variations.
Cross-Encoder Verification: Secondary check to ensure cached query and new query have exact same intent.
Adaptive Thresholds: Strict similarity thresholds (e.g., 0.90) for high-stakes queries; looser (e.g., 0.75) for exploratory ones.
Eviction Policies: Uses TTL and LRU with semantic decay to prevent Memory Bloat and Sclerosis.
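The Entity Masking step above can be sketched with a couple of regex rules. The rules themselves are hypothetical; a production cache would derive them from its domain's entity types:

```python
import re

# Hypothetical masking rules: ticker symbols and dollar amounts become variables
MASK_RULES = [
    (re.compile(r"\b[A-Z]{2,5}\b"), "[TICKER]"),
    (re.compile(r"\$\d+(?:\.\d+)?"), "[AMOUNT]"),
]

def mask_entities(query):
    """Normalize a query so one cache entry covers many surface variations."""
    for pattern, variable in MASK_RULES:
        query = pattern.sub(variable, query)
    return query
```

Two queries that differ only in the masked entities now produce the same cache key.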
Evaluation Frameworks (Ragas)
Systematic, objective measurement against ground-truth data (often synthetically generated).
Faithfulness = (# claims supported by context) ÷ (total # claims in answer)
A score of 1.0 = zero hallucination relative to the provided context. Identifies exactly how much the model 'drifted' beyond its grounded sources.
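The faithfulness formula as code. In Ragas the claim extraction and support checks are themselves done by an LLM judge; here the claim lists are simply given:

```python
def faithfulness(claims_in_answer, supported_claims):
    """Fraction of the answer's claims that are supported by the retrieved context."""
    if not claims_in_answer:
        return 1.0  # assumption: an answer making no claims cannot hallucinate
    supported = sum(1 for c in claims_in_answer if c in supported_claims)
    return supported / len(claims_in_answer)
```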
10
Advanced Concepts & Failure Modes
Multimodal RAG
CLIP embeddings encode both text and images into a shared vector space — enabling retrieval of relevant images, charts, or diagrams alongside text chunks.
Tabular RAG (Text-to-SQL)
Natural language query → LLM generates SQL → SQL executes against structured database → result injected into generation prompt. Answers over structured enterprise data without vectorization.
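The flow can be sketched end-to-end with the standard-library sqlite3 module. Here `fake_llm` is a canned stand-in for the LLM's SQL-generation step, and the `orders` table is invented for the demo:

```python
import sqlite3

def tabular_rag(question, generate_sql, conn):
    """Text-to-SQL flow: question -> SQL (via LLM stand-in) -> execute ->
    rows ready for injection into the generation prompt."""
    sql = generate_sql(question)
    rows = conn.execute(sql).fetchall()
    return {"question": question, "sql": sql, "rows": rows}

# Toy structured data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])

# Canned stand-in for the LLM's text-to-SQL step
fake_llm = lambda q: ("SELECT region, SUM(total) FROM orders "
                      "GROUP BY region ORDER BY region")
result = tabular_rag("Total sales by region?", fake_llm, conn)
```

In production the generated SQL must be validated (read-only, schema-checked) before execution.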
RAG vs. Long Context Windows
Why RAG remains architecturally superior for production even with 1M+ token windows:
The "Lost-in-the-Middle" Phenomenon: Massive documents drastically increase the likelihood the LLM will hallucinate or forget middle information. RAG front-loads only the most relevant chunks.
Token Efficiency and Cost: Sending a 500,000-token payload for every query is unscalable. RAG filters corpus down to a small, dense payload (e.g., 2,000 tokens).
Inference Latency: As context size grows, Time to First Token (TTFT) degrades due to attention mechanisms scaling quadratically. RAG ensures sub-second latency.
Dynamic Data & Tool Execution: Agentic RAG pulls in data on-demand; injecting entire database states exhausts the window.
Severed Context: Rigid token limits arbitrarily cut sentences in half. ↳ Fix: Semantic Chunking and Hierarchical Chunking.
Context Drift / Hallucination: The LLM uses the right context but still invents outside facts. ↳ Fix: Strict prompt guardrails and continuous evaluation (Ragas Faithfulness).
Multi-Hop Reasoning Failure: Standard vectors cannot structurally connect related concepts across multiple documents. ↳ Fix: GraphRAG / Knowledge Graphs (A → B → C).
Semantic Cache Errors: Serving a cached answer for a superficially similar but factually different query. ↳ Fix: Entity Masking and Cross-Encoder Verification.
Stage: Hallucination Cause → RAG Fix
Indexing: Missing context signal in chunks → Contextual Retrieval
Pre-Retrieval: Ambiguous or narrow user query → Query Rewriting / HyDE / Multi-Query
Retrieval: ANN returns approximate matches → Reranking (Cross-Encoder)
Context Assembly: Correct chunk in ignored middle position → Context Ordering
Generation: Model drifts beyond retrieved context → Faithfulness score < 1.0 / Guardrails
Knowledge Freshness: Static training data cutoff → Real-time external memory injection
Self-Assessment Questions
Q1. Why is RAG still necessary even with models having massive context windows (1M+ tokens)?
Because massive windows increase "Lost-in-the-Middle" hallucinations, are unscalably expensive for every query, and degrade inference latency. RAG provides a small, dense payload.
Q2. What is the difference between Dense (Semantic) Search and Sparse (Keyword) Search in RAG?
Dense search matches meaning and context using vectors but struggles with exact names/IDs. Sparse search (BM25) matches exact terms but is blind to synonyms.
Q3. What does "Contextual Retrieval" involve at indexing time?
Augmenting chunks with metadata like document title, section summary, and tags to give the embedding model richer signal, improving precision.