Below is a cleaned, structured, and augmented version of my notes from the book “A Simple Guide to Retrieval Augmented Generation”. It’s an excellent book.
1. Why RAG Exists (Problem Framing)
What RAG Solves
RAG exists to compensate for structural limitations of LLMs:
| LLM Limitation | Operational Impact |
|---|---|
| Static training data | Cannot answer questions about private or recent data |
| No true memory | Cannot reliably reference large corpora |
| Hallucination risk | Generates plausible but incorrect answers |
| Context window limits | Cannot ingest full documents |
RAG introduces non-parametric memory by grounding generation in retrieved, external data.
2. Canonical RAG Architecture (Mental Model)
Two Mandatory Pipelines
1. Indexing Pipeline (Offline / Batch)
Purpose: Prepare knowledge so it can be retrieved accurately and cheaply
Raw Data → Clean → Chunk → Embed → Store
Key characteristics:
- Runs ahead of time
- Optimized for recall and structure
- Changes infrequently relative to queries
2. Generation Pipeline (Online / Real-Time)
Purpose: Answer a user question using retrieved context
User Query → Retrieve → Augment → Generate → Validate
Key characteristics:
- Runs per request
- Optimized for latency and precision
- Heavily influenced by prompt design
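A minimal, self-contained sketch of how the two pipelines relate. The “embedding” here is a bag-of-words counter and the “vector store” a plain list, both stand-ins for a real embedding model and vector database; function and variable names are illustrative, not from the book:
```python
import math
import re
from collections import Counter

# Stand-ins: a bag-of-words "embedding" and a list as the "vector store".
def embed(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

store: list[dict] = []

def indexing_pipeline(docs: list[dict]) -> None:
    """Offline: clean -> chunk -> embed -> store."""
    for doc in docs:
        for chunk in doc["text"].split("\n\n"):           # naive paragraph chunking
            store.append({"vec": embed(chunk), "text": chunk, "source": doc["source"]})

def generation_pipeline(query: str, top_k: int = 2) -> str:
    """Online: retrieve -> augment; the returned prompt is what goes to the LLM."""
    q_vec = embed(query)
    hits = sorted(store, key=lambda c: cosine(q_vec, c["vec"]), reverse=True)[:top_k]
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return (
        "Answer using only the provided context. If it is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

indexing_pipeline([{"source": "ospf.md", "text": "OSPF areas reduce LSA flooding.\n\nArea 0 is the backbone."}])
print(generation_pipeline("What is area 0?"))
```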
3. Indexing Pipeline (Where Most RAG Systems Fail)
3.1 Data Loading
Definition: Extracting usable text + metadata from source systems.
Sources (Relevant to You):
- Vendor documentation (PDF, HTML)
- Configuration files
- Markdown / notes (Obsidian)
- RFCs / standards
- Internal runbooks
Critical Insight (Book Often Understates This)
Bad ingestion cannot be fixed downstream by better embeddings or prompts.
Best Practices:
- Normalize formats early (Markdown or plain text)
- Preserve source identifiers (filename, section, version)
- Add technical metadata (platform, version, domain)
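A small loading sketch in that spirit (pure Python; the directory layout and metadata fields are assumptions, not prescriptions from the book):
```python
from pathlib import Path

def load_markdown_notes(root: str) -> list[dict]:
    """Walk a notes directory, normalize to plain text, and keep source metadata."""
    docs = []
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        docs.append({
            "text": text,
            "metadata": {
                "source": str(path),          # preserve the identifier for later citations
                "filename": path.name,
                "domain": path.parent.name,   # e.g. "bgp", "ntp" - assumed folder convention
            },
        })
    return docs
```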
3.2 Chunking (Context Engineering)
Chunking is not a mechanical step—it is information architecture.
Why Chunking Matters
- Determines what can be retrieved
- Controls noise vs recall
- Directly impacts hallucination rate
Practical Chunking Guidance
| Content Type | Recommended Strategy |
|---|---|
| Technical docs | Section-aware chunking |
| Config guides | Command-block + explanation |
| RFCs | Semantic + hierarchical |
| Notes | Logical topic blocks |
Key Parameters:
- Chunk size: 300–800 tokens (starting point)
- Overlap: 10–20%
- Metadata per chunk: mandatory
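A sketch of overlapping, section-aware chunking (whitespace splitting is a crude stand-in for a real tokenizer; 500 tokens and 15% overlap are just the starting-point ranges above):
```python
def chunk_section(section_title: str, text: str, max_tokens: int = 500, overlap: float = 0.15) -> list[dict]:
    """Split one document section into overlapping chunks, each carrying metadata."""
    words = text.split()                              # crude proxy for tokens
    step = max(1, int(max_tokens * (1 - overlap)))    # advance less than a full chunk to create overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + max_tokens])
        chunks.append({
            "text": piece,
            "metadata": {"section": section_title, "start_word": start},  # per-chunk metadata is mandatory
        })
        if start + max_tokens >= len(words):
            break
    return chunks
```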
3.3 Embeddings (Representation Layer)
What They Actually Do: Embeddings encode semantic meaning, not keywords.
Implication:
- “BGP route reflector” ≈ “iBGP control plane optimization”
- Syntax-only similarity fails
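A quick way to see this, assuming the sentence-transformers library and the general-purpose all-MiniLM-L6-v2 model (a domain-tuned model may separate these phrases differently):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
phrases = [
    "BGP route reflector",
    "iBGP control plane optimization",
    "configure an NTP server",
]
vecs = model.encode(phrases)

# Conceptually related phrases should score higher than unrelated ones,
# even though they share almost no keywords.
print(util.cos_sim(vecs[0], vecs[1]).item())  # related pair
print(util.cos_sim(vecs[0], vecs[2]).item())  # unrelated pair, expected lower
```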
Selection Criteria:
- Domain specificity (technical language matters)
- Dimensionality vs cost
- Stability (re-embedding cost is real)
Operational Rule
Do not change embedding models casually once indexed.
3.4 Vector Stores (Retrieval Substrate)
Vector DBs are indexes, not databases.
They optimize for:
- Approximate nearest neighbor search
- Low-latency similarity queries
Design Reality:
- Metadata filtering is as important as vector similarity
- Versioning your embeddings is mandatory for production
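A sketch using ChromaDB as one possible store; the collection name, metadata keys, and filter values are illustrative assumptions:
```python
import chromadb

client = chromadb.Client()                      # in-memory; use a persistent client in production
collection = client.create_collection("network_docs")

collection.add(
    ids=["ios-xe-bgp-1", "nxos-bgp-1"],
    documents=[
        "Route reflectors reduce the iBGP full-mesh requirement.",
        "On NX-OS, configure route-reflector-client under the neighbor.",
    ],
    metadatas=[
        {"platform": "ios-xe", "version": "17.9", "embedding_version": "v1"},
        {"platform": "nx-os", "version": "10.3", "embedding_version": "v1"},
    ],
)

# The metadata filter restricts results to matching documents before they are returned,
# which is where much of the precision comes from.
results = collection.query(
    query_texts=["how do route reflectors work"],
    n_results=2,
    where={"platform": "ios-xe"},
)
print(results["documents"])
```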
4. Retrieval (Information Retrieval, Not “Search”)
Retrieval Is a Ranking Problem
Given:
- Query vector
- Corpus vectors
Goal:
Return the most useful context, not the most similar text.
Retrieval Techniques (With Practical Guidance)
| Method | When to Use |
|---|---|
| TF-IDF / BM25 | Exact terms, logs, configs |
| Dense embeddings | Conceptual questions |
| Hybrid | Default for technical RAG |
| Cross-encoder | Re-ranking small candidate sets |
| Recursive | Multi-hop reasoning |
Production Insight
Most real systems use hybrid + re-ranking, not pure vector search.
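One common way to combine the keyword and dense signals is reciprocal rank fusion (RRF); a minimal sketch over two already-ranked ID lists (the k=60 constant is the commonly cited default, not a book recommendation):
```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings (e.g. BM25 and dense retrieval) into a single ordering."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["cfg-guide-12", "rfc4456-3", "runbook-7"]       # keyword ranking
dense_hits = ["rfc4456-3", "notes-bgp-2", "cfg-guide-12"]    # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# Documents ranked well by both methods (rfc4456-3, cfg-guide-12) rise to the top.
```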
5. Augmentation (Prompt Is the Control Plane)
Augmentation is where RAG becomes reliable—or not.
Prompt Construction Goals
- Constrain hallucinations
- Force grounding
- Preserve citation boundaries
High-Value Prompt Patterns
Context-Only Answering
“Answer using only the provided context. If insufficient, say so.”
Structured Output
Forces deterministic, parseable responses
Failure Acknowledgment
Reduces false confidence dramatically
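A sketch of a prompt builder combining all three patterns (the wording and JSON schema are illustrative, not the book's exact template):
```python
import json

def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Combine context-only answering, structured output, and failure acknowledgment."""
    context = "\n\n".join(f"[{c['metadata']['source']}]\n{c['text']}" for c in chunks)
    schema = {"answer": "string", "sources": ["source ids used"], "confidence": "high|medium|low"}
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the source id in brackets for every claim.\n"
        "If the context does not contain the answer, set \"answer\" to \"insufficient context\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Respond as JSON matching this schema: {json.dumps(schema)}"
    )
```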
6. Generation (LLM Selection Reality)
RAG Changes LLM Selection Criteria
You are not asking the model to know everything—only to:
- Read
- Synthesize
- Reason locally
Implication: Smaller models perform well if retrieval quality is high.
Fine-Tuning vs RAG
| Fine-Tuning | RAG |
|---|---|
| Changes model behavior | Supplies knowledge |
| Expensive to update | Cheap to update |
| Risk of catastrophic forgetting | Model weights untouched |
| Good for style | Good for facts |
Best Practice: Use RAG first. Fine-tune only for format, tone, or workflow behavior.
7. Evaluation (What Actually Matters)
RAG-Specific Quality Dimensions
- Context Relevance: Did retrieval return the right material?
- Faithfulness: Did the answer stay grounded in that material?
- Answer Relevance: Did it actually answer the question?
Capabilities You Must Test
| Capability | Why It Matters |
|---|---|
| Noise rejection | Prevents hallucinations |
| Negative answers | “I don’t know” correctness |
| Information synthesis | Multi-doc reasoning |
| Counterfactual robustness | Avoids confident errors |
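A crude, self-contained faithfulness proxy; real evaluation would use a judge model or a framework such as RAGAS, so the word-overlap heuristic here is only a placeholder assumption:
```python
import re

def grounding_ratio(answer: str, context: str) -> float:
    """Fraction of the answer's content words that also appear in the retrieved context.
    A low ratio is a cheap warning sign that the answer may not be grounded."""
    def words(s: str) -> set[str]:
        return set(re.findall(r"[a-z]{4,}", s.lower()))   # skip short stopwords
    answer_words = words(answer)
    return len(answer_words & words(context)) / len(answer_words) if answer_words else 1.0

context = "RFC 4456: a route reflector reflects routes learned from iBGP clients."
print(grounding_ratio("A route reflector re-advertises iBGP routes to clients.", context))      # high
print(grounding_ratio("Route reflectors require BGP confederations to be enabled.", context))   # lower
```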
8. Advanced RAG (When Naive RAG Breaks)
Why Naive RAG Fails
- Shallow retrieval
- Redundant chunks
- Fragmented context
- No query understanding
Advanced RAG Flow
Rewrite → Retrieve → Re-rank → Compress → Generate
Each step strips noise and redundancy from the context before generation.
High-Impact Advanced Techniques
| Technique | Benefit |
|---|---|
| Metadata filtering | Massive precision gains |
| Query rewriting | Better recall |
| Compression | Lower token cost |
| Re-ranking | Higher answer quality |
| Iterative retrieval | Multi-hop answers |
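A sketch of the re-ranking stage using a cross-encoder, assuming sentence-transformers and the ms-marco-MiniLM model; the query-rewriting and compression steps are omitted here and would typically be extra LLM calls before and after this:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Score (query, passage) pairs jointly and keep only the best few before generation."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]

candidates = [
    "Route reflectors relax the iBGP full-mesh rule (RFC 4456).",
    "NTP stratum indicates distance from the reference clock.",
    "Confederations split an AS into sub-ASes for iBGP scaling.",
]
print(rerank("how do I scale iBGP without a full mesh", candidates, top_k=2))
```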
9. RAGOps (Production Reality)
RAG systems fail operationally, not academically.
Critical Layers (Non-Optional)
- Data ingestion & versioning
- Model serving
- Orchestration
- Monitoring
- Security (prompt injection is real)
Essential Layers (You Will Need These)
- Prompt management
- Evaluation pipelines
- Retrieval observability
- Cost tracking
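A sketch of what retrieval observability plus basic cost tracking can look like; the log fields are assumptions, not a standard:
```python
import json
import time
import uuid

def log_retrieval(query: str, hits: list[dict], log_path: str = "retrieval_log.jsonl") -> None:
    """Append one structured record per request so retrieval quality and token cost can be audited later."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "hits": [{"source": h["metadata"]["source"], "score": h["score"]} for h in hits],
        "prompt_tokens_estimate": sum(len(h["text"].split()) for h in hits),  # crude cost proxy
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```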
10. RAG Variants (Use When Necessary)
| Variant | When It Makes Sense |
|---|---|
| Graph RAG | Relationship-heavy domains |
| Agentic RAG | Tool use, routing, automation |
| Corrective RAG | High-stakes accuracy |
| RAPTOR | Large corpora, thematic queries |
11. Mapping This to What You Are Trying to Do
Based on your prior conversations and lab goals:
Your Primary Objective
A local RAG system that can reason over technical documentation, configs, and notes, and validate best practices.
What Matters Most for Your Use Case
Tier 1 — Mandatory
- High-quality ingestion
- Section-aware chunking
- Metadata-rich embeddings
- Hybrid retrieval
- Strict grounding prompts
Tier 2 — Strongly Recommended
- Re-ranking
- Query rewriting
- Evaluation with faithfulness metrics
- Versioned indexes
Tier 3 — Optional (Later)
- Agentic RAG
- Graph RAG
- Speculative RAG
What You Can Safely De-Prioritize (For Now)
- Massive fine-tuning
- Multi-agent systems
- Long-context-only approaches
- Exotic benchmarks
Final Practitioner Takeaway
RAG is not an LLM problem. It is an information architecture problem.