Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
This article was written by Claude Opus 4.5 and curated for publication by Tom Hundley.
Your LLM knows a lot. But it does not know your data.
It is December 2025. You have access to models with 200,000+ token context windows. Claude can read entire codebases. GPT-4 Turbo can process books. Gemini 2.0 handles hour-long videos.
And yet, your enterprise chatbot still hallucinates your company's vacation policy.
Here is the fundamental tension: Large Language Models are trained on internet-scale data, frozen at a knowledge cutoff date, and have no awareness of your internal documents, databases, or proprietary knowledge. No matter how large the context window grows, you cannot paste your entire data warehouse into a prompt.
This is where Retrieval-Augmented Generation comes in. RAG is not just another AI buzzword. It is the architecture pattern that bridges the gap between what LLMs know and what your business needs them to know.
The simplest mental model for RAG is the Open Book Exam.
Traditional LLMs take a Closed Book Exam. They answer questions using only what they memorized during training. If the answer was not in the training data, or if the information has changed since the knowledge cutoff, the model either confesses ignorance or hallucinates confidently.
RAG transforms this into an Open Book Exam. Before the model answers, it retrieves relevant documents from your knowledge base, reads them, and then generates a response grounded in that retrieved context.
The result: An LLM that can answer questions about your Q4 2025 sales figures, your company's expense reimbursement policy, or your product's latest API documentation, even though none of that existed when the model was trained.
RAG consists of three stages that happen on every query:
Retrieval → Augmentation → Generation

1. Retrieval: Given a user query, find the most relevant documents or passages from your knowledge base. This typically uses semantic search powered by vector embeddings.
2. Augmentation: Take the retrieved documents and inject them into the LLM's prompt as context. This is the "augmentation" that gives RAG its name.
3. Generation: The LLM generates a response, using both its pretrained knowledge and the retrieved context to answer the user's question.
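In code, the whole pattern reduces to a composition of these three steps. The sketch below uses stub helpers standing in for a real retriever, prompt builder, and LLM call; the complete, runnable example later in this article fills them in with actual APIs.

```python
# Conceptual shape of every RAG query. The three helpers are stubs for
# illustration only; real implementations appear later in this article.
def retrieve(question: str) -> list[str]:
    return ["(relevant chunks from your knowledge base would appear here)"]

def build_prompt(question: str, docs: list[str]) -> str:
    return "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    return "(LLM response grounded in the context above)"

def answer(question: str) -> str:
    docs = retrieve(question)              # 1. Retrieval: search the knowledge base
    prompt = build_prompt(question, docs)  # 2. Augmentation: inject docs into the prompt
    return generate(prompt)                # 3. Generation: the LLM answers from that context
```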
RAG is powerful, but it is not always the right answer. Here is a decision framework for choosing between RAG and its alternatives.
This is the most common architectural decision, and it comes down to a simple distinction:
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Instant (update the document) | Slow (retrain the model) |
| Citations/Sources | Can cite specific documents | Cannot attribute sources |
| Access control | Filter retrieval by permissions | Knowledge is baked in |
| Cost per query | Higher (retrieval + LLM) | Lower (single inference) |
| Setup complexity | Vector DB infrastructure | Training pipeline |
| Best for | Dynamic, factual knowledge | Style, format, reasoning |
The Rule: If you need the model to know facts, use RAG. If you need the model to behave differently, consider fine-tuning. For a deeper exploration, see our article on RAG vs Fine-tuning.
With 200,000+ token context windows now available, why not just paste all your documents into the prompt?
| Criterion | RAG | Long Context |
|---|---|---|
| Document volume | Unlimited (stored externally) | Limited by context window |
| Cost | Retrieval is cheap; only relevant chunks go to LLM | Tokens are expensive; paying for irrelevant content |
| Latency | Fast (retrieve k documents) | Slow for very long contexts |
| Precision | High (semantic matching) | Model must "find the needle" |
| Knowledge updates | Instant | Must rebuild prompt |
The Nuance: Long context windows are excellent for documents you know the user will need, like "summarize this contract" or "answer questions about this codebase." RAG is better when the relevant documents are unknown upfront and must be discovered from a large corpus.
The two approaches are also complementary. Many production systems use RAG to retrieve documents, then pass them to a long-context model for deep analysis.
Traditional keyword search (BM25, Elasticsearch) still works. Why add the complexity of vectors?
| Criterion | RAG (Semantic) | Traditional Search |
|---|---|---|
| Query understanding | Conceptual matching | Exact keyword matching |
| Synonym handling | Automatic | Manual (synonyms list) |
| Answer generation | Synthesized, natural language | Document links/snippets |
| Setup complexity | Higher (embeddings, vector DB) | Lower (inverted index) |
The Truth: Production RAG systems use both. Hybrid search combining semantic vectors with keyword matching typically outperforms either approach alone, especially for technical content with exact-match requirements. We cover this in depth in our Hybrid Search article.
| Your Situation | Recommended Approach |
|---|---|
| Need to answer questions from dynamic documents | RAG |
| Need model to write in specific style/format | Fine-tuning |
| User provides the document at query time | Long Context |
| Searching for specific documents by keyword | Traditional Search |
| Large corpus, unknown relevance | RAG + Hybrid Search |
| High accuracy on technical content | RAG + Reranking |
A RAG system has two major pipelines: Indexing (offline, runs when documents change) and Query (online, runs on every user request).
Indexing pipeline (offline): Document Loading → Chunking → Embedding → Vector Storage

1. Document Loading: Ingest documents from their source format. This includes PDFs, Word documents, HTML pages, Notion databases, Confluence spaces, code repositories, and more. Each source requires specialized parsing.
2. Chunking: Split documents into smaller pieces that fit within embedding model context limits and provide focused, retrievable units. This is far more nuanced than it sounds. See our Chunking Strategies article for why "just split every 1000 characters" fails.
3. Embedding: Convert each chunk into a vector (a list of floating-point numbers) using an embedding model. These vectors capture semantic meaning, allowing "vacation policy" to match documents about "PTO" and "time off"; a short similarity sketch follows this list. Our Embeddings Deep Dive covers model selection.
4. Vector Storage: Store the vectors in a database optimized for similarity search. Options range from managed services (Pinecone) to open source (Weaviate, Qdrant) to PostgreSQL extensions (pgvector). See our Vector Database Comparison.
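To make the "semantic meaning" claim in the embedding step concrete, here is a minimal sketch of how closeness between two embedding vectors is typically measured (cosine similarity). It reuses the OpenAI embeddings call from the code example later in this article; the similarity math is identical for any embedding model.

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model works the same way; this uses OpenAI's API for illustration.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "vacation policy" and "PTO accrual rules" should score much closer to each
# other than either does to an unrelated sentence.
vectors = {text: embed(text) for text in [
    "vacation policy", "PTO accrual rules", "database connection pooling"
]}
print(cosine_similarity(vectors["vacation policy"], vectors["PTO accrual rules"]))
print(cosine_similarity(vectors["vacation policy"], vectors["database connection pooling"]))
```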
Query pipeline (online): User Query → Query Embedding → Vector Search → Context Building → LLM Generation

1. Query Embedding: The user's question is converted to a vector using the same embedding model used during indexing. This is critical: you cannot mix vectors from different models.
2. Vector Search: The query vector is compared against the stored vectors to find the most similar chunks. This uses algorithms like HNSW (Hierarchical Navigable Small World graphs) for fast approximate nearest-neighbor search.
3. Context Building: The retrieved chunks are assembled into a prompt, typically with a system message explaining the task, the retrieved context, and the user's original question.
4. LLM Generation: The augmented prompt is sent to an LLM, which generates a response grounded in the retrieved context. The model can cite sources, quote passages, or synthesize across multiple documents.
Here is a simplified Python example showing the core RAG flow. Platform-specific implementations will be covered in subsequent articles in this series.
# Conceptual RAG pipeline - not production code
# Platform-specific implementations in later articles
from openai import OpenAI

client = OpenAI()


# === INDEXING (run once per document) ===

def create_embedding(text: str) -> list[float]:
    """Convert text to a vector using the embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size splitter with overlap, for illustration only."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def index_document(doc_id: str, content: str, vector_db):
    """Chunk, embed, and store a document."""
    # In production: use semantic chunking, not naive splitting
    chunks = split_into_chunks(content, chunk_size=500, overlap=50)
    for i, chunk in enumerate(chunks):
        embedding = create_embedding(chunk)
        # vector_db stands in for your vector store client (Pinecone, Qdrant, pgvector, ...)
        vector_db.upsert(
            id=f"{doc_id}_{i}",
            vector=embedding,
            metadata={"doc_id": doc_id, "text": chunk}
        )


# === QUERY (run on every user request) ===

def rag_query(question: str, vector_db) -> str:
    """Answer a question using RAG."""
    # 1. Embed the query
    query_vector = create_embedding(question)

    # 2. Retrieve relevant chunks
    results = vector_db.query(
        vector=query_vector,
        top_k=5,
        include_metadata=True
    )

    # 3. Build context from retrieved chunks
    context = "\n\n---\n\n".join([
        result.metadata["text"] for result in results
    ])

    # 4. Generate response with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant. Answer the user's
question based on the provided context. If the context doesn't
contain the answer, say so. Cite your sources."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}

Answer based on the context above:"""
            }
        ]
    )
    return response.choices[0].message.content

This example omits many production concerns: error handling, hybrid search, reranking, caching, and more. The platform-specific articles in this series will build production-ready implementations.
One of the challenges in building RAG systems is the fragmented tooling landscape. There are dozens of frameworks, each with different philosophies, strengths, and trade-offs.
This series will provide hands-on implementations with six major platforms:
LangChain: The most widely adopted RAG framework. Python-first with a JavaScript port. Known for its extensive integrations (200+ document loaders, 50+ vector stores) and composable "chains" architecture. Best for teams who want maximum flexibility and do not mind some abstraction overhead.
LlamaIndex: Originally called "GPT Index." Deeply focused on document retrieval and indexing. Excellent for complex document structures (hierarchical, graph-based). Best for teams building knowledge bases from diverse document types.
Haystack: Enterprise-grade framework from deepset. Strong focus on production readiness, evaluation, and observability. Offers both OSS and managed cloud. Best for teams who need enterprise features out of the box.
Semantic Kernel: Microsoft's AI orchestration framework. First-class .NET support with Python available. Tight integration with Azure AI services. Best for teams in the Microsoft ecosystem building copilots.
AWS Bedrock Knowledge Bases: Fully managed RAG service. No infrastructure to manage: point it at S3 documents and query via API. Best for teams who want turnkey RAG without managing vector databases or embedding pipelines.
Vercel AI SDK: Streaming-first, built for Next.js and edge deployments. Not a full RAG framework, but provides excellent primitives for building RAG into web applications. Best for teams building AI-native web applications.
| Platform | Primary Language | Managed Option | Key Strength |
|---|---|---|---|
| LangChain | Python/JS | LangSmith | Flexibility, integrations |
| LlamaIndex | Python | LlamaCloud | Document-centric indexing |
| Haystack | Python | deepset Cloud | Enterprise features |
| Semantic Kernel | C#/Python | Azure | Microsoft ecosystem |
| Bedrock KB | Any (API) | Fully managed | Zero infrastructure |
| Vercel AI SDK | TypeScript | Vercel | Web/streaming focus |
Having built RAG systems across industries, we see these mistakes most often:
The default RecursiveCharacterTextSplitter(chunk_size=1000) in most tutorials is a starting point, not a solution. Naive fixed-size chunking splits sentences and tables mid-thought and produces chunks stripped of the surrounding context they need to be retrieved reliably.
The Fix: Use document-aware chunking. For contracts, chunk by clause. For code, chunk by function. For manuals, chunk by section header. Consider parent-child chunking where you embed small chunks but retrieve their larger parents.
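As one deliberately simplified illustration of document-aware chunking, here is a sketch that splits a Markdown manual at section headers rather than at a fixed character count. The header regex and the max_chars fallback are illustrative choices, not recommendations.

```python
import re

def chunk_markdown_by_header(text: str, max_chars: int = 2000) -> list[str]:
    """Split a Markdown document at section headers, falling back to
    paragraph-level splits only when a section exceeds max_chars."""
    # Split at newlines that are immediately followed by a Markdown header.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Oversized section: fall back to paragraph splits.
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append(para.strip())
    return [c for c in chunks if c]
```

The same idea applies to other formats: split contracts at clause boundaries, code at function boundaries, and manuals at section headers, then embed those units instead of arbitrary character windows.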
Pure vector search has a critical weakness: it fails on exact matches.
Query: "Error code XJ-445"
Vector match: "Error code YK-112" (similar format, wrong code)
Technical content, product codes, proper nouns, and acronyms need keyword matching to work reliably.
The Fix: Implement hybrid search combining semantic vectors (for conceptual matching) with BM25 or sparse vectors (for exact matching). Use Reciprocal Rank Fusion to merge results. This is now table stakes for production RAG.
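Reciprocal Rank Fusion itself is only a few lines. A minimal sketch, assuming you already have ranked lists of document IDs from a semantic search and a BM25 search; the k=60 constant is the commonly used default from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked lists of document IDs into one ranking.
    Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers partially disagree; RRF rewards documents both liked.
semantic_hits = ["doc_42", "doc_7", "doc_13"]
keyword_hits = ["doc_7", "doc_99", "doc_42"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# doc_7 and doc_42 rank above documents found by only one retriever.
```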
Teams spend weeks building RAG systems, then ask "does it work?" by trying a few queries manually. This is insufficient.
RAG has multiple failure modes: retrieval can miss the relevant documents entirely, and generation can ignore or contradict the context it was given.
The Fix: Build an evaluation pipeline with metrics for each stage: retrieval quality (did the relevant chunks come back in the top-k?) and generation quality (is the answer faithful to the retrieved context, and does it actually address the question?).
Tools like RAGAS, LangSmith, and Phoenix provide evaluation frameworks specifically for RAG.
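Even before adopting one of those frameworks, a small hand-rolled check beats eyeballing a few queries. Here is a minimal sketch of one retrieval metric, recall@k over a small hand-labeled set; the eval_set structure and the retrieve callable are assumptions for illustration, not any framework's API.

```python
def retrieval_recall_at_k(
    eval_set: list[dict],  # e.g. [{"question": "...", "relevant_ids": ["chunk_3", ...]}, ...]
    retrieve,              # callable: question -> list of chunk IDs, best match first
    k: int = 5,
) -> float:
    """Fraction of labeled relevant chunks that appear in the top-k results."""
    hits, total = 0, 0
    for example in eval_set:
        retrieved = set(retrieve(example["question"])[:k])
        relevant = set(example["relevant_ids"])
        hits += len(retrieved & relevant)
        total += len(relevant)
    return hits / total if total else 0.0
```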
RAG introduces unique security considerations: retrieved documents can surface content to users who should never see it, and document content flows directly into the LLM's prompt, where it must be treated as untrusted input.
The Fix: Enforce access control at retrieval time by filtering the vector search on the requesting user's permissions, and treat retrieved content as untrusted data rather than trusted instructions.
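A minimal sketch of permission-aware retrieval under those constraints: store an allowed_groups list in each chunk's metadata at indexing time, then filter the vector search at query time. The filter syntax below is an assumption modeled on Pinecone-style metadata filters; adapt it to your vector database's API.

```python
def secure_rag_query(question: str, user_groups: list[str], vector_db) -> list[dict]:
    """Retrieve only chunks the requesting user is allowed to see.
    Assumes each chunk was indexed with metadata={"allowed_groups": [...], "text": ...}."""
    query_vector = create_embedding(question)  # same helper as the earlier example
    results = vector_db.query(
        vector=query_vector,
        top_k=5,
        include_metadata=True,
        # Hypothetical filter syntax; check your vector database's filter API.
        filter={"allowed_groups": {"$in": user_groups}},
    )
    return [result.metadata for result in results]
```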
This article establishes the foundations. The remaining eleven articles will provide hands-on, production-ready implementations:
| Part | Title | Focus |
|---|---|---|
| 2 | LangChain RAG: From Prototype to Production | Complete LangChain implementation with LCEL |
| 3 | LlamaIndex: Document-Centric RAG | Hierarchical indexing, query engines |
| 4 | Haystack: Enterprise RAG Pipelines | Production patterns, evaluation |
| 5 | Semantic Kernel: RAG in the Microsoft Ecosystem | .NET implementation, Azure integration |
| 6 | AWS Bedrock Knowledge Bases | Fully managed RAG, zero infrastructure |
| 7 | Vercel AI SDK: Streaming RAG for Next.js | Edge deployment, real-time UX |
| 8 | Advanced Retrieval: Hybrid Search and Reranking | Cross-platform techniques |
| 9 | Chunking and Indexing Strategies | Document processing deep dive |
| 10 | RAG Evaluation and Testing | Metrics, benchmarks, CI/CD |
| 11 | Security and Access Control | Enterprise-grade patterns |
| 12 | Choosing Your Platform: A Decision Framework | Comparative analysis |
If you are new to RAG: Read this article, then work sequentially through the platform articles (Parts 2-7).
If you have a specific platform in mind: Jump directly to that platform's article after reading this foundation.
If you are optimizing an existing system: Skip to Parts 8-11 for advanced patterns.
This series builds on our existing RAG content. For deeper dives into specific concepts, see the articles referenced throughout this piece: RAG vs Fine-tuning, Chunking Strategies, the Embeddings Deep Dive, the Vector Database Comparison, and Hybrid Search.
Before diving into platform-specific implementations, ensure you have an API key for your preferred LLM provider, a working development environment for your chosen platform's language, and access to a vector database (managed or self-hosted).
The next article, LangChain RAG: From Prototype to Production, will walk through a complete implementation using the most popular framework in the ecosystem.
This is Part 1 of the "Building RAG Systems: A Platform-by-Platform Guide" series. Next up: LangChain RAG: From Prototype to Production.