Study Notes — Certification Prep

Retrieval-Augmented Generation (RAG)
Study Guide

Comprehensive overview of RAG indexing, retrieval, generation, vectors, and agentic AI.

Updated: April 2026
Version: 1.0
Category: RAG
Reading Time: ~11 min
Author: Michaël Bettan
01. Definitions & Architecture

What is RAG?

Retrieval-Augmented Generation (RAG) is one of the two dominant architectural patterns for context construction in AI applications (alongside Agentic AI). Analogous to feature engineering in classical ML, it dynamically injects external, task-specific information into the Large Language Model's context window at inference time. This combines the LLM's reasoning capabilities with external, proprietary data stores to generate accurate, factually grounded, and contextually relevant responses.

What RAG Solves

RAG addresses the limitations of static LLMs: training-data cutoffs, lack of private enterprise data, and hallucinations. Critically, fine-tuning teaches a model new behavioral styles and task formats, but it does not reliably inject dynamic factual knowledge and cannot prevent hallucination on data that changes after training. RAG is the architecturally correct solution for factual grounding at inference time, injecting verifiable, source-attributed context directly into the prompt without retraining. It also improves token efficiency by supplying only the most relevant chunks per query.

High-Level RAG Architecture Flow

Indexing (offline): Data Sources (PDFs, DBs) → Chunking & Embedding → Vector Database (Index)
Query time (online): User Question → Semantic Search (Retrieval) → Context Assembly (Prompt) → Large Language Model (Generation) → Final Answer
02. Stage 1: RAG Indexing (Offline Pre-processing)

Loaders
Extract data from unstructured sources (PDFs, HTML, Word, JSON) using tools like LangChain Document Loaders.
Splitters (Chunking)
Break large texts into digestible chunks. RecursiveCharacterTextSplitter is preferred as it preserves semantic coherence by splitting on paragraphs, then sentences, then words.
Chunking Strategy: Overlapping
Adjacent chunks share a defined number of tokens at boundaries. Prevents the loss of critical context that sits at a chunk boundary; ensures complete thoughts split across chunks are captured.
Chunking Strategy: Semantic
Splits on logical/topical boundaries (paragraph shifts, section headers) rather than token count — ensures one complete thought per chunk.
Chunking Strategy: Hierarchical
Small "child" chunks are embedded/indexed for precise matching; metadata links retrieve the larger "parent" chunk for LLM context — decouples retrieval granularity from generation context size.
Embedding
Convert text chunks into high-dimensional mathematical representations (vectors) using models like OpenAI Embeddings or SentenceTransformers.
Vector Store
Store vectors and metadata in specialized databases (e.g., Chroma, FAISS, pgvector, Pinecone).
Contextual Retrieval
Chunk is augmented with metadata at indexing time (doc title, section summary, tags). Gives the embedding model richer signal during retrieval, dramatically improving precision and reducing hallucination.
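A minimal sketch of the overlapping strategy above, using a plain-Python token-level splitter (the sample text, chunk size, and overlap values are illustrative only; production systems use tools like RecursiveCharacterTextSplitter):

```python
def chunk_with_overlap(tokens, chunk_size=8, overlap=2):
    """Split a token list into chunks that share `overlap` tokens at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the shared region
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk already reached the end of the text
    return chunks

tokens = "the quick brown fox jumps over the lazy dog again and again".split()
chunks = chunk_with_overlap(tokens, chunk_size=6, overlap=2)
# Each adjacent pair of chunks shares its last/first 2 tokens, so a thought
# that straddles a boundary appears whole in at least one chunk.
```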

Chunk Size Trade-off

Small chunks produce precise, focused embeddings but risk splitting a complete thought across boundaries; large chunks preserve more context but dilute the embedding signal and consume more of the prompt budget.

03. Stage 2 & 3: Retrieval & Generation

Retrieval (Real-time): The user query is vectorized using the same embedding model. A similarity search compares the query vector to stored vectors to find the Top-K most relevant chunks.
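The retrieval step above can be sketched with a toy in-memory index (the 3-dimensional vectors and chunk IDs are invented for illustration; real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Rank stored (chunk_id, vector) pairs by similarity to the query vector."""
    scored = sorted(((cosine(query_vec, v), cid) for cid, v in index.items()),
                    reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy index: chunk IDs mapped to tiny "embeddings".
index = {"refunds": [0.9, 0.1, 0.0],
         "shipping": [0.1, 0.9, 0.0],
         "privacy": [0.0, 0.1, 0.9]}
query = [0.8, 0.2, 0.1]  # vectorized with the *same* embedding model as the index
results = top_k(query, index, k=2)
```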

1. Query Transformation (Pre-Retrieval)

Acts as a hallucination prevention layer:

  • Query Rewriting: An AI model reformulates an ambiguous user query into a clearer, self-contained version before retrieval — directly reduces retrieval failure from poorly phrased inputs.
  • HyDE (Hypothetical Document Embeddings): The LLM generates a hypothetical ideal answer; that answer is embedded and used for retrieval instead of the raw query — bridges the vocabulary gap.
  • Multi-Query Retrieval: The LLM generates N alternative rephrasings of the query, runs parallel retrievals, and unions results — reduces single-query retrieval failure.
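Multi-Query Retrieval reduces to "retrieve once per rephrasing, then union". A sketch with a stand-in keyword retriever (the documents and rephrasings are invented; in a real system an LLM generates the rephrasings and a vector store performs each retrieval):

```python
def keyword_retrieve(query, docs, k=2):
    """Stand-in retriever: rank docs by count of words shared with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def multi_query_retrieve(rephrasings, docs, k=2):
    """Run one retrieval per rephrasing and union results, preserving order."""
    seen, merged = set(), []
    for q in rephrasings:
        for doc in keyword_retrieve(q, docs, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

docs = ["reset your password via email",
        "billing happens monthly",
        "change account email address"]
# An LLM would generate these N rephrasings from one ambiguous user query.
rephrasings = ["how do I reset my password", "change the email on my account"]
results = multi_query_retrieve(rephrasings, docs, k=1)
# Each rephrasing surfaces a different relevant doc; the union covers both.
```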

2. Reranking (Post-Retrieval)

  • Top-K retrieved chunks are scored a second time by a Cross-Encoder model (e.g., Cohere Rerank, BGE-Reranker) that evaluates the (query, chunk) pair jointly.
  • Unlike bi-encoders that embed independently, cross-encoders attend to both simultaneously, producing significantly more precise relevance scores.
  • Only the Top-N chunks (N < K) pass to the LLM. Can also rerank by recency for time-sensitive queries. Directly reduces hallucination by filtering topically similar but factually irrelevant chunks.
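The rerank-then-truncate flow can be sketched as follows; `toy_score` is a stand-in for a real cross-encoder model, which would score each (query, chunk) pair jointly inside one forward pass:

```python
def rerank(query, candidates, score_fn, top_n=2):
    """Rescore retrieved candidates against the query; keep only Top-N (N < K)."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def toy_score(query, chunk):
    # Stand-in for a cross-encoder (e.g., Cohere Rerank, BGE-Reranker):
    # here, simply the number of shared words.
    return len(set(query.split()) & set(chunk.split()))

candidates = [  # Top-K output of the first-stage (bi-encoder) retrieval
    "pricing tiers for the enterprise plan",
    "enterprise plan security and compliance details",
    "blog post about our company picnic",
]
top_n = rerank("enterprise plan security", candidates, toy_score, top_n=2)
# The topically-adjacent but irrelevant picnic chunk is filtered out.
```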

Stage 3: RAG Generation

The retrieved chunks are formatted and injected into a Prompt Template alongside the user query ("hydrating" the prompt). The LLM processes this context to generate a final, grounded answer.

Context Ordering (Lost-in-the-Middle)

LLMs exhibit positional bias — they attend most strongly to content at the beginning and end of the context window and systematically underutilize middle content. Best practice: place highest-relevance chunks first and last. Failure to order context correctly causes hallucination despite successful retrieval — the correct answer exists in the context but the model ignores it.
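The "first and last" placement rule can be implemented by alternating chunks between the front and back of the context, so the least relevant chunks sink to the middle (chunk labels are illustrative):

```python
def order_for_llm(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context window;
    the least relevant end up in the underutilized middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = ["c1", "c2", "c3", "c4", "c5"]  # already sorted: c1 = most relevant
ordered = order_for_llm(chunks)
# The two highest-relevance chunks (c1, c2) occupy the first and last slots.
```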

04. RAG vs Fine-tuning

Approach → Best for → Mechanism

  • RAG → Facts / Information: accessing up-to-date data, proprietary knowledge, grounding responses in retrieved evidence.
  • Fine-tuning → Form / Style: output format (JSON/YAML), brand voice, specialized jargon, instruction-following patterns.
  • Both combined → Production systems: RAG supplies the facts; fine-tuning ensures the correct format and tone.
05. Vectors & Similarity Search

Vector Space (Latent Space): A mathematical construct where semantically similar concepts are located closer together. Distance between vectors determines relevance.

Euclidean (L2)
Shortest straight-line distance. Lower = more similar.
Dot Product (Inner Product)
Measures magnitude of projection. Higher positive = more similar.
Cosine Similarity
Measures the cosine of the angle between two vectors, ignoring magnitude. Ranges from −1 to 1; higher = more similar. Default metric in most vector stores for semantic search.
Cosine Distance
Calculated as 1 − Cosine Similarity. Ranges from 0 to 2; lower = more similar. Used only when a distance metric is explicitly required by the algorithm.
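A quick worked example of the metrics above. The vectors are chosen so that b points in the same direction as a but with twice the magnitude: cosine similarity is exactly 1 (direction only), while Euclidean distance is clearly nonzero:

```python
import math

a, b = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]  # b = 2*a: same direction, larger magnitude

dot = sum(x * y for x, y in zip(a, b))                          # inner product
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # L2 distance
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cos_sim = dot / (norm_a * norm_b)   # angle only: magnitude is ignored
cos_dist = 1 - cos_sim              # "zero distance" despite euclidean > 0
```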
06. Search Paradigms & Algorithms

Search Paradigms

Reciprocal Rank Fusion (RRF)
RRF(d) = Σ_m 1/(k + rank_m(d))
Fuses results by rank position, not raw scores (necessary because BM25 scores are unbounded while cosine scores range -1 to 1). Constant k=60 smooths outlier rankings.
k-NN (k-Nearest Neighbors)
Brute-force, exact distance calculation. Highly accurate but computationally expensive; scales poorly (O(n*d)).
ANN (Approximate Nearest Neighbors)
Sacrifices slight accuracy for massive speed gains using indexing structures.
HNSW (Hierarchical Navigable Small World; e.g., hnswlib)
Multi-layer proximity graph — upper layers for fast coarse navigation, lower layers for dense local precision. Best speed-accuracy tradeoff. O(log n) complexity.
FAISS
Industry-standard ANN library from Meta (Facebook) supporting IVF, PQ, HNSW, and other index types; widely used to back production vector search systems.
IVF (Inverted File Index)
K-means clustering organizes vectors into clusters; search targets only nearest clusters — dramatically reduces search space without scanning all vectors.
LSH (Locality-Sensitive Hashing)
Hashes similar vectors into the same buckets for fast candidate retrieval.
PQ (Product Quantization)
Compresses vectors into compact codes by quantizing sub-vectors independently — reduces memory footprint and speeds up distance computations.
Annoy & ScaNN
Annoy builds a forest of random-projection binary trees. ScaNN (Google) is optimized for high-dimensional semantic search at production scale.
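The RRF formula above takes only a few lines to implement (document IDs and rankings are invented for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: RRF(d) = Σ_m 1/(k + rank_m(d)).
    Fuses by rank position, so unbounded BM25 scores and bounded cosine
    scores combine without any score normalization."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]    # sparse / keyword results
vector_ranking = ["d1", "d2", "d3"]  # dense / semantic results
fused = rrf_fuse([bm25_ranking, vector_ranking])
# d1 wins: ranked near the top by both retrievers.
```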
07. Agentic AI & LangGraph

Agents transform LLMs from stateless responders to autonomous, reasoning systems capable of executing multi-step workflows (e.g., ReAct: Reason + Act). LangGraph supports both DAGs and cyclical graphs — cycles enable iterative reasoning loops (e.g., retrieve → grade → re-query).

LangGraph Concepts

  • State: The agent's "scratchpad" (memory/message history) shared across all steps.
  • Nodes: Functions or tools the agent executes (e.g., search_web, retrieve_docs).
  • Edges: The control flow between nodes.
  • Conditional Edges: Decision points where the LLM routes the flow dynamically.
  • Tools & Toolkits: Executable functions the agent can invoke. LLMs are given Tool Schemas (inputs) and Descriptions to decide when/how to use them.
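The concepts above can be sketched without LangGraph itself. This is a plain-Python stand-in for a cyclical graph (retrieve → grade → re-query), not the real LangGraph API; all the node functions and the toy corpus are invented:

```python
def run_agent(question, retrieve, grade, rewrite, generate, max_loops=3):
    """State dict = the shared scratchpad; each function = a node;
    the grade() check = a conditional edge that either ends or cycles back."""
    state = {"question": question, "docs": [], "loops": 0}
    while state["loops"] < max_loops:
        state["docs"] = retrieve(state["question"])   # retrieval node
        if grade(state["docs"]):                      # conditional edge
            return generate(state)                    # terminal generation node
        state["question"] = rewrite(state["question"])  # cycle: re-query
        state["loops"] += 1
    return "I could not find a grounded answer."

corpus = {"what is rrf": "RRF fuses rankings by reciprocal rank."}
answer = run_agent(
    "explain rrf",
    retrieve=lambda q: [corpus[q]] if q in corpus else [],
    grade=lambda docs: len(docs) > 0,
    rewrite=lambda q: "what is rrf",   # stand-in for an LLM query rewrite
    generate=lambda s: s["docs"][0],
)
# First retrieval fails, the query is rewritten, the second pass succeeds.
```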

CoALA Framework Memory

The CoALA framework (Cognitive Architectures for Language Agents) defines four memory types:

  • Working Memory (Short-Term): The active workspace; holds immediate context.
  • Episodic Memory (Events): Case-based reasoning. Stores specific past experiences.
  • Semantic Memory (Facts): Factual knowledge base. Generally read-only to prevent hallucination-driven overwriting.
  • Procedural Memory (Skills): Encodes operational workflows/rules. Updated via Reflection (meta-prompting).

Memory Scopes

Personal/User-Specific: Strictly isolated data tailored to individual preferences.
Community/Public: Anonymized, aggregated insights shared across all users (e.g., discovering a universal troubleshooting fix).

08. Knowledge Graphs (Graph-Based RAG)

Surpasses flat vector stores by providing structured, navigable context built on directed property graphs.

Ontology
Formal representation of domain knowledge. Defined using languages like RDFS (basic taxonomies) and OWL (complex logical constraints and reasoning). Built via tools like Protégé.
Graph Components
Nodes (entities like AAPL, SEC), Edges (relationships like issuedBy, isRegulatedBy), and Properties.
Hybrid Embeddings
Flattens graph topology + text into rich descriptions (e.g., "Stock AAPL issued by Apple Inc. regulated by SEC") before vectorization.
Value Add
Enables multi-hop reasoning (connecting disparate facts A→B→C), guarantees traceability/explainability, and perfectly filters queries using strict schema logic (Cypher).
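Multi-hop reasoning over a toy triple store, reusing the entities above (a real system would issue Cypher against a graph database; this list-of-triples representation is purely illustrative):

```python
# Toy property graph as (subject, relation, object) triples.
triples = [
    ("AAPL", "issuedBy", "Apple Inc."),
    ("Apple Inc.", "isRegulatedBy", "SEC"),
    ("TSLA", "issuedBy", "Tesla Inc."),
]

def hop(entity, relation):
    """Follow one typed edge from an entity to its neighbors."""
    return [o for s, r, o in triples if s == entity and r == relation]

# Multi-hop A -> B -> C: which regulator oversees the issuer of AAPL?
issuers = hop("AAPL", "issuedBy")
regulators = hop(issuers[0], "isRegulatedBy")
# Each hop is traceable, so the final answer is fully explainable.
```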
09. Performance & Evaluation

Semantic Caches (Interceptor Layer)

Catches and serves repeat queries instantly, bypassing expensive LLM reasoning and generation steps. Reduces latency and inference costs.

  • Entity Masking: Replaces specific values with variables (e.g., "AAPL price" → "[TICKER] price") so one cache entry covers variations.
  • Cross-Encoder Verification: Secondary check to ensure cached query and new query have exact same intent.
  • Adaptive Thresholds: Strict (0.90) for high-stakes; looser (0.75) for exploratory.
  • Eviction Policies: Uses TTL and LRU with semantic decay to prevent Memory Bloat and Sclerosis.
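Entity masking can be sketched with two regular expressions (the patterns are deliberately crude illustrations, not production-grade entity recognition):

```python
import re

def mask_entities(query):
    """Replace volatile specifics with placeholders so one cache entry
    covers many variants of the same question shape."""
    masked = re.sub(r"\b[A-Z]{2,5}\b", "[TICKER]", query)    # crude ticker pattern
    masked = re.sub(r"\b\d+(\.\d+)?\b", "[NUMBER]", masked)  # literal numbers
    return masked

cache = {}
for q in ["AAPL price today", "TSLA price today"]:
    key = mask_entities(q)
    cache[key] = cache.get(key, 0) + 1  # both queries map to the same entry
```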

Evaluation Frameworks (Ragas)

Systematic, objective measurement against Ground Truth data (synthetically generated).

  • Retrieval Metrics: Context Precision (signal-to-noise ratio), Context Recall.
  • Faithfulness: Proportion of claims directly entailed by retrieved context (primary anti-hallucination metric).
  • End-to-End: Factual Correctness, Semantic Similarity.
Ragas Faithfulness Score
Faithfulness = (# claims supported by context) ÷ (total # claims in answer)
A score of 1.0 = zero hallucination relative to the provided context. Identifies exactly how much the model 'drifted' beyond its grounded sources.
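The faithfulness score is a simple ratio (the claim counts below are invented; Ragas uses an LLM to extract claims from the answer and verify each one against the retrieved context):

```python
def faithfulness(claims_supported, claims_total):
    """Ragas-style faithfulness: supported claims / total claims in the answer."""
    if claims_total == 0:
        return 1.0  # convention assumed here: no claims, nothing to hallucinate
    return claims_supported / claims_total

# Answer made 4 claims; the retrieved context entails only 3 of them.
score = faithfulness(3, 4)  # 0.75 -> a quarter of the answer drifted off-source
```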
10. Advanced Concepts & Failure Modes

Multimodal RAG
CLIP embeddings encode both text and images into a shared vector space — enabling retrieval of relevant images, charts, or diagrams alongside text chunks.
Tabular RAG (Text-to-SQL)
Natural language query → LLM generates SQL → SQL executes against structured database → result injected into generation prompt. Answers over structured enterprise data without vectorization.
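A minimal Text-to-SQL sketch using SQLite; the `generated_sql` string stands in for the LLM's translation step, and the table schema and values are invented:

```python
import sqlite3

# In-memory stand-in for the structured enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])

# Stand-in for the LLM step: the model would translate the natural-language
# question "What is total EMEA revenue?" into SQL like this.
generated_sql = "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'"

(total,) = conn.execute(generated_sql).fetchone()
# `total` is then injected into the generation prompt as grounded context.
```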

RAG vs. Long Context Windows

Why RAG remains architecturally superior for production even with 1M+ token windows:

  • Positional bias: "Lost-in-the-Middle" degradation worsens as context length grows.
  • Cost: filling a million-token window on every query is prohibitively expensive.
  • Latency: inference slows with context length; RAG sends a small, dense, relevant payload instead.

When RAGs Can Fail (and How to Fix Them)

Stage → Hallucination Cause → RAG Fix

  • Indexing: missing context signal in chunks → Contextual Retrieval
  • Pre-Retrieval: ambiguous or narrow user query → Query Rewriting / HyDE / Multi-Query
  • Retrieval: ANN returns approximate matches → Reranking (Cross-Encoder)
  • Context Assembly: correct chunk sits in the ignored middle position → Context Ordering
  • Generation: model drifts beyond retrieved context → Faithfulness score < 1.0 / Guardrails
  • Knowledge Freshness: static training-data cutoff → Real-time external memory injection

Self-Assessment Questions

Q1. Why is RAG still necessary even with models having massive context windows (1M+ tokens)?

Because massive windows amplify "Lost-in-the-Middle" hallucinations, are prohibitively expensive to fill on every query, and degrade inference latency. RAG instead provides a small, dense, relevant payload.

Q2. What is the difference between Dense (Semantic) Search and Sparse (Keyword) Search in RAG?

Dense search matches meaning and context using vectors but struggles with exact names/IDs. Sparse search (BM25) matches exact terms but is blind to synonyms.

Q3. What does "Contextual Retrieval" involve at indexing time?

Augmenting chunks with metadata like document title, section summary, and tags to give the embedding model richer signal, improving precision.