Comprehensive overview of RAG indexing, retrieval, generation, vectors, and agentic AI.
Updated: April 2026
Version: 1.0
Category: RAG
Reading Time: ~11 min
Author: Michaël Bettan
01
Definitions & Architecture
What is RAG?
Retrieval-Augmented Generation (RAG) is one of the two dominant architectural patterns for context construction in AI applications (alongside Agentic AI). It functions analogously to feature engineering in classical ML, dynamically injecting external, task-specific information into the Large Language Model's context window at inference time. This combines the LLM's reasoning capabilities with external, proprietary data stores to generate accurate, factually grounded, and contextually relevant responses.
What RAG Solves
RAG addresses the limitations of static LLMs: training data cutoffs, lack of access to private enterprise data, and hallucinations.
Critically, fine-tuning teaches a model new behavioral styles and task formats, but does not reliably inject dynamic factual knowledge and cannot prevent hallucination on data that changes after training. RAG is the architecturally correct solution for factual grounding at inference time — injecting verifiable, source-attributed context directly into the prompt without retraining. It also improves token efficiency by supplying only the most relevant chunks per query.
High-Level RAG Architecture Flow
Data Sources (PDFs, DBs)
Chunking & Embedding
Vector Database (Index)
Semantic Search (Retrieval)
User Question
Context Assembly (Prompt)
Large Language Model (Generation)
Final Answer
02
Stage 1: RAG Indexing (Offline Pre-processing)
Loaders
Extract data from unstructured sources (PDFs, HTML, Word, JSON) using tools like LangChain Document Loaders.
Splitters (Chunking)
Break large texts into digestible chunks. RecursiveCharacterTextSplitter is preferred as it preserves semantic coherence by splitting on paragraphs, then sentences, then words.
Chunking Strategy: Overlapping
Adjacent chunks share a defined number of tokens at boundaries. Prevents the loss of critical context that sits at a chunk boundary; ensures complete thoughts split across chunks are captured.
Chunking Strategy: Semantic
Splits on logical/topical boundaries (paragraph shifts, section headers) rather than token count — ensures one complete thought per chunk.
Chunking Strategy: Hierarchical
Small "child" chunks are embedded/indexed for precise matching; metadata links retrieve the larger "parent" chunk for LLM context — decouples retrieval granularity from generation context size.
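The overlapping strategy above can be sketched as a sliding window. A minimal illustration, using whitespace-split words as stand-ins for tokens:

```python
def chunk_with_overlap(text, chunk_size=100, overlap=20):
    """Split text into word-based chunks; adjacent chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance by the non-shared portion
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window already covers the tail
            break
    return chunks
```

Production splitters operate on model tokens and recursive separators rather than words, but the boundary-sharing logic is the same.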
Embedding
Convert text chunks into high-dimensional mathematical representations (vectors) using models like OpenAI Embeddings or SentenceTransformers.
Vector Store
Store vectors and metadata in specialized databases (e.g., Chroma, FAISS, pgvector, Pinecone).
Contextual Retrieval
Each chunk is augmented with metadata at indexing time (document title, section summary, tags). This gives the embedding model richer signal during retrieval, improving precision and reducing hallucination.
Chunk Size Trade-off
Smaller chunks → higher retrieval precision, more diverse results — but greater computational overhead and risk of losing surrounding context.
Larger chunks → richer context per chunk, fewer retrieval calls — but lower precision and higher token cost injected into the LLM.
Hard constraint: chunk size must not exceed the maximum context length of either the embedding model or the generative model — whichever is smaller.
Overlapping chunks trade index size for boundary loss prevention.
03
Stage 2 & 3: Retrieval & Generation
Retrieval (Real-time): The user query is vectorized using the same embedding model. A similarity search compares the query vector to stored vectors to find the Top-K most relevant chunks.
1. Query Transformation (Pre-Retrieval)
Acts as a hallucination prevention layer:
Query Rewriting: An AI model reformulates an ambiguous user query into a clearer, self-contained version before retrieval — directly reduces retrieval failure from poorly phrased inputs.
HyDE (Hypothetical Document Embeddings): The LLM generates a hypothetical ideal answer; that answer is embedded and used for retrieval instead of the raw query — bridges the vocabulary gap.
Multi-Query Retrieval: The LLM generates N alternative rephrasings of the query, runs parallel retrievals, and unions results — reduces single-query retrieval failure.
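Multi-Query Retrieval reduces to a de-duplicated union over per-variant result lists. A sketch in which `rephrase` stands in for the LLM call and `retrieve` for the vector-store search (both are hypothetical stand-ins):

```python
def multi_query_retrieve(query, rephrase, retrieve, n_variants=3, top_k=5):
    """Retrieve over the original query plus N LLM-generated rephrasings,
    then union the results while preserving first-seen order."""
    variants = [query] + rephrase(query, n_variants)  # rephrase: LLM stand-in
    seen, fused = set(), []
    for q in variants:
        for doc_id in retrieve(q, top_k):  # retrieve: vector-store stand-in
            if doc_id not in seen:
                seen.add(doc_id)
                fused.append(doc_id)
    return fused
```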
2. Reranking (Post-Retrieval)
Top-K retrieved chunks are scored a second time by a Cross-Encoder model (e.g., Cohere Rerank, BGE-Reranker) that evaluates the (query, chunk) pair jointly.
Unlike bi-encoders that embed independently, cross-encoders attend to both simultaneously, producing significantly more precise relevance scores.
Only the Top-N chunks (N < K) pass to the LLM. Can also rerank by recency for time-sensitive queries. Directly reduces hallucination by filtering topically similar but factually irrelevant chunks.
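The reranking step reduces to score-and-truncate. A sketch in which `cross_encoder_score` is a stand-in for a real cross-encoder model scoring each (query, chunk) pair jointly:

```python
def rerank(query, chunks, cross_encoder_score, top_n=3):
    """Re-score retrieved chunks with a cross-encoder and keep only the Top-N."""
    scored = [(cross_encoder_score(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return [c for _, c in scored[:top_n]]
```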
Stage 3: RAG Generation
The retrieved chunks are formatted and injected into a Prompt Template alongside the user query ("hydrating" the prompt). The LLM processes this context to generate a final, grounded answer.
Source Citations: Production RAG systems append source citations (document title, chunk ID, page number) to generated responses — enabling human verification. This auditability is the key enterprise differentiator: RAG outputs are falsifiable, whereas bare LLM outputs are not.
Context Ordering (Lost-in-the-Middle)
LLMs exhibit positional bias — they attend most strongly to content at the beginning and end of the context window and systematically underutilize middle content. Best practice: place highest-relevance chunks first and last. Failure to order context correctly causes hallucination despite successful retrieval — the correct answer exists in the context but the model ignores it.
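The reordering heuristic can be sketched as follows, taking chunks already sorted by descending relevance and interleaving them toward the edges of the context window:

```python
def order_for_llm(chunks_by_relevance):
    """Place the highest-relevance chunks at the edges of the context window:
    rank 1 first, rank 2 last, rank 3 second, and so on, pushing the
    least relevant chunks into the underutilized middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```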
04
RAG vs Fine-tuning
RAG (Best for: Facts / Information)
Accessing up-to-date data, proprietary knowledge, grounding responses in retrieved evidence.
Fine-tuning (Best for: Form / Style)
Output format (JSON/YAML), brand voice, specialized jargon, instruction-following patterns.
Both combined (Best for: Production systems)
RAG supplies the facts; fine-tuning ensures the correct format and tone.
05
Vectors & Similarity Search
Vector Space (Latent Space): A mathematical construct where semantically similar concepts are located closer together. Distance between vectors determines relevance.
Euclidean (L2)
Shortest straight-line distance. Lower = more similar.
Dot Product (Inner Product)
Measures magnitude of projection. Higher positive = more similar.
Cosine Similarity
Measures the cosine of the angle between two vectors, ignoring magnitude. Ranges from −1 to 1; higher = more similar. Default metric in most vector stores for semantic search.
Cosine Distance
Calculated as 1 − Cosine Similarity. Ranges from 0 to 2; lower = more similar. Used only when a distance metric is explicitly required by the algorithm.
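The four metrics in pure Python (no external dependencies):

```python
import math

def dot(a, b):
    """Inner product: higher positive = more similar (magnitude-sensitive)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """L2 distance: lower = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]: higher = more similar, magnitude ignored."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, in [0, 2]: lower = more similar."""
    return 1.0 - cosine_similarity(a, b)
```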
06
Search Paradigms & Algorithms
Search Paradigms
Dense (Semantic) Search: Uses vector embeddings to match underlying meaning and context. Struggles with exact names, IDs, or acronyms and can return false positives.
Sparse (Keyword) Search: Uses BM25, which evaluates exact term matches by combining: (1) Term Frequency (TF) with saturation; (2) Inverse Document Frequency (IDF) to boost rare terms; (3) Document length normalization. Highly effective for proper nouns and IDs; blind to synonyms.
Hybrid Search: Fuses Dense and Sparse results to maximize both recall and precision.
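BM25's three components (TF saturation, IDF, length normalization) fit in a few lines. A minimal sketch over pre-tokenized documents, using the common k1/b parameterization:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of terms) against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # IDF boosts rare terms across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            # TF saturation (k1) plus document-length normalization (b)
            s += idf[t] * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```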
Reciprocal Rank Fusion (RRF)
RRF(d) = Σ_m 1/(k + rank_m(d)), summed over each retrieval method m
Fuses results by rank position rather than raw scores; this is necessary because BM25 scores are unbounded while cosine similarity is bounded in [−1, 1]. The constant k = 60 damps the influence of outlier rankings.
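A minimal RRF fusion over ranked lists of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists by rank position (Reciprocal Rank Fusion).
    rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```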
k-NN (k-Nearest Neighbors)
Brute-force, exact distance calculation against every stored vector. Highly accurate but computationally expensive; scales linearly as O(n·d) for n vectors of dimension d.
ANN (Approximate Nearest Neighbors)
Sacrifices slight accuracy for massive speed gains using indexing structures.
HNSW (HNSWlib)
Multi-layer proximity graph — upper layers for fast coarse navigation, lower layers for dense local precision. Best speed-accuracy tradeoff. O(log n) complexity.
FAISS
Industry-standard ANN library from Meta (Facebook) AI Research, supporting IVF, PQ, HNSW, and other index types; widely used to back production vector search deployments.
IVF (Inverted File Index)
K-means clustering organizes vectors into clusters; search targets only nearest clusters — dramatically reduces search space without scanning all vectors.
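A toy IVF search, assuming the k-means step has already been run so the centroids and per-cluster vector lists are given:

```python
def ivf_search(query, centroids, clusters, dist, n_probe=1):
    """Scan only the vectors inside the n_probe clusters whose centroids are
    nearest the query, instead of scanning the full index."""
    nearest = sorted(range(len(centroids)),
                     key=lambda i: dist(query, centroids[i]))[:n_probe]
    candidates = [v for i in nearest for v in clusters[i]]
    return min(candidates, key=lambda v: dist(query, v))
```

Raising `n_probe` trades speed for recall: probing more clusters recovers vectors that fell just across a cluster boundary.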
LSH (Locality-Sensitive Hashing)
Hashes similar vectors into the same buckets for fast candidate retrieval.
PQ (Product Quantization)
Compresses vectors into compact codes by quantizing subvectors independently; reduces memory footprint and speeds distance computations.
Annoy & ScaNN
Annoy builds multiple binary trees. ScaNN is optimized for high-dimensional semantic search at production scale.
07
Agentic AI & LangGraph
Agents transform LLMs from stateless responders to autonomous, reasoning systems capable of executing multi-step workflows (e.g., ReAct: Reason + Act). LangGraph supports both DAGs and cyclical graphs — cycles enable iterative reasoning loops (e.g., retrieve → grade → re-query).
LangGraph Concepts
State: The agent's "scratchpad" (memory/message history) shared across all steps.
Nodes: Functions or tools the agent executes (e.g., search_web, retrieve_docs).
Edges: The control flow between nodes.
Conditional Edges: Decision points where the LLM routes the flow dynamically.
Tools & Toolkits: Executable functions the agent can invoke. LLMs are given Tool Schemas (inputs) and Descriptions to decide when/how to use them.
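The retrieve → grade → re-query cycle can be sketched as a minimal state graph in plain Python. This illustrates the control flow only; it is not the LangGraph API, and the node and edge functions below are hypothetical stand-ins:

```python
def run_graph(state, nodes, edges, start, end, max_steps=10):
    """Execute a tiny state graph: each node updates the shared state dict;
    each edge (possibly conditional on state) names the next node. Cycles allowed."""
    current = start
    for _ in range(max_steps):
        if current == end:
            return state
        state = nodes[current](state)
        current = edges[current](state)  # conditional edge: routes on state
    return state

# Hypothetical retrieve -> grade -> (rewrite | generate) loop:
nodes = {
    "retrieve": lambda s: {**s, "docs": ["doc for " + s["query"]]},
    "grade":    lambda s: {**s, "ok": "refined" in s["query"]},
    "rewrite":  lambda s: {**s, "query": "refined " + s["query"]},
    "generate": lambda s: {**s, "answer": "grounded in " + s["docs"][0]},
}
edges = {
    "retrieve": lambda s: "grade",
    "grade":    lambda s: "generate" if s["ok"] else "rewrite",
    "rewrite":  lambda s: "retrieve",  # cycle back for another retrieval pass
    "generate": lambda s: "END",
}
result = run_graph({"query": "q"}, nodes, edges, start="retrieve", end="END")
```

The conditional edge on `grade` is the decision point: a failing grade routes back through `rewrite` and `retrieve` instead of proceeding to generation.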
CoALA Framework Memory
The Cognitive Architectures for Language Agents (CoALA) framework defines four memory types:
Working Memory (Short-Term): The active workspace; holds immediate context.
Episodic Memory (Events): Case-based reasoning. Stores specific past experiences.
Semantic Memory (Facts): Factual knowledge base. Generally read-only to prevent hallucination-driven overwriting.
Procedural Memory (Skills): Encodes operational workflows/rules. Updated via Reflection (meta-prompting).
Memory Scopes
Personal/User-Specific: Strictly isolated data tailored to individual preferences.
Community/Public: Anonymized, aggregated insights shared across all users (e.g., discovering a universal troubleshooting fix).
08
Knowledge Graphs (Graph-Based RAG)
Surpasses flat vector stores by providing structured, navigable context built on directed property graphs.
Ontology
Formal representation of domain knowledge. Defined using languages like RDFS (basic taxonomies) and OWL (complex logical constraints and reasoning). Built via tools like Protégé.
Graph Components
Nodes (entities like AAPL, SEC), Edges (relationships like issuedBy, isRegulatedBy), and Properties.
Hybrid Embeddings
Flattens graph topology + text into rich descriptions (e.g., "Stock AAPL issued by Apple Inc. regulated by SEC") before vectorization.
Value Add
Enables multi-hop reasoning (connecting disparate facts A→B→C), guarantees traceability/explainability, and supports exact, schema-constrained filtering via graph query languages like Cypher.
09
Performance & Evaluation
Semantic Caches (Interceptor Layer)
Intercepts repeat queries and serves cached answers instantly, bypassing the expensive LLM reasoning and generation steps. Reduces latency and inference costs.
Entity Masking: Replaces specific values with variables (e.g., "AAPL price" → "[TICKER] price") so one cache entry covers variations.
Cross-Encoder Verification: Secondary check to ensure cached query and new query have exact same intent.
Adaptive Thresholds: Strict similarity thresholds (e.g., 0.90) for high-stakes queries; looser (e.g., 0.75) for exploratory ones.
Eviction Policies: Uses TTL and LRU with semantic decay to prevent Memory Bloat and Sclerosis.
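The Entity Masking step above can be sketched with a couple of regex rules. The rules themselves are hypothetical; a production cache would derive them from its domain's entity types:

```python
import re

# Hypothetical masking rules: ticker symbols and dollar amounts become variables
MASK_RULES = [
    (re.compile(r"\b[A-Z]{2,5}\b"), "[TICKER]"),
    (re.compile(r"\$\d+(?:\.\d+)?"), "[AMOUNT]"),
]

def mask_entities(query):
    """Normalize a query so one cache entry covers many surface variations."""
    for pattern, variable in MASK_RULES:
        query = pattern.sub(variable, query)
    return query
```

Two queries that differ only in the masked entities now produce the same cache key.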
Evaluation Frameworks (Ragas)
Systematic, objective measurement against ground-truth data (often synthetically generated).
Faithfulness = (# claims supported by context) ÷ (total # claims in answer)
A score of 1.0 = zero hallucination relative to the provided context. Identifies exactly how much the model 'drifted' beyond its grounded sources.
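The faithfulness formula as code. In Ragas the claim extraction and support checks are themselves done by an LLM judge; here the claim lists are simply given:

```python
def faithfulness(claims_in_answer, supported_claims):
    """Fraction of the answer's claims that are supported by the retrieved context."""
    if not claims_in_answer:
        return 1.0  # assumption: an answer making no claims cannot hallucinate
    supported = sum(1 for c in claims_in_answer if c in supported_claims)
    return supported / len(claims_in_answer)
```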
10
Advanced Concepts & Failure Modes
Multimodal RAG
CLIP embeddings encode both text and images into a shared vector space — enabling retrieval of relevant images, charts, or diagrams alongside text chunks.
Tabular RAG (Text-to-SQL)
Natural language query → LLM generates SQL → SQL executes against structured database → result injected into generation prompt. Answers over structured enterprise data without vectorization.
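The flow can be sketched end-to-end with the standard-library sqlite3 module. Here `fake_llm` is a canned stand-in for the LLM's SQL-generation step, and the `orders` table is invented for the demo:

```python
import sqlite3

def tabular_rag(question, generate_sql, conn):
    """Text-to-SQL flow: question -> SQL (via LLM stand-in) -> execute ->
    rows ready for injection into the generation prompt."""
    sql = generate_sql(question)
    rows = conn.execute(sql).fetchall()
    return {"question": question, "sql": sql, "rows": rows}

# Toy structured data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])

# Canned stand-in for the LLM's text-to-SQL step
fake_llm = lambda q: ("SELECT region, SUM(total) FROM orders "
                      "GROUP BY region ORDER BY region")
result = tabular_rag("Total sales by region?", fake_llm, conn)
```

In production the generated SQL must be validated (read-only, schema-checked) before execution.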
RAG vs. Long Context Windows
Why RAG remains architecturally superior for production even with 1M+ token windows:
The "Lost-in-the-Middle" Phenomenon: Massive documents drastically increase the likelihood the LLM will hallucinate or forget middle information. RAG front-loads only the most relevant chunks.
Token Efficiency and Cost: Sending a 500,000-token payload for every query is unscalable. RAG filters corpus down to a small, dense payload (e.g., 2,000 tokens).
Inference Latency: As context size grows, Time to First Token (TTFT) degrades due to attention mechanisms scaling quadratically. RAG ensures sub-second latency.
Dynamic Data & Tool Execution: Agentic RAG pulls in data on-demand; injecting entire database states exhausts the window.
Severed Context: Rigid token limits arbitrarily cut sentences in half. ↳ Fix: Semantic Chunking and Hierarchical Chunking.
Context Drift / Hallucination: The LLM uses the right context but still invents outside facts. ↳ Fix: Strict prompt guardrails and continuous evaluation (Ragas Faithfulness).
Multi-Hop Reasoning Failure: Standard vectors cannot structurally connect related concepts across multiple documents. ↳ Fix: GraphRAG / Knowledge Graphs (A → B → C).
Semantic Cache Errors: Serving a cached answer for a superficially similar but factually different query. ↳ Fix: Entity Masking and Cross-Encoder Verification.
Stage: Hallucination Cause → RAG Fix
Indexing: Missing context signal in chunks → Contextual Retrieval
Pre-Retrieval: Ambiguous or narrow user query → Query Rewriting / HyDE / Multi-Query
Retrieval: ANN returns approximate matches → Reranking (Cross-Encoder)
Context Assembly: Correct chunk in ignored middle position → Context Ordering
Generation: Model drifts beyond retrieved context → Faithfulness score < 1.0 / Guardrails
Knowledge Freshness: Static training data cutoff → Real-time external memory injection
Self-Assessment Questions
Q1. Why is RAG still necessary even with models having massive context windows (1M+ tokens)?
Because massive windows increase "Lost-in-the-Middle" hallucinations, are unscalably expensive for every query, and degrade inference latency. RAG provides a small, dense payload.
Q2. What is the difference between Dense (Semantic) Search and Sparse (Keyword) Search in RAG?
Dense search matches meaning and context using vectors but struggles with exact names/IDs. Sparse search (BM25) matches exact terms but is blind to synonyms.
Q3. What does "Contextual Retrieval" involve at indexing time?
Augmenting chunks with metadata like document title, section summary, and tags to give the embedding model richer signal, improving precision.