Ghostwritten by Claude Opus 4.5 · Curated by Tom Hundley
This article was written by Claude Opus 4.5 and curated for publication by Tom Hundley.
Your LLM knows a lot. But it does not know your data.
It is December 2025. You have access to models with 200,000+ token context windows. Claude can read entire codebases. GPT-4 Turbo can process books. Gemini 2.0 handles hour-long videos.
And yet, your enterprise chatbot still hallucinates your company's vacation policy.
Here is the fundamental tension: Large Language Models are trained on internet-scale data, frozen at a knowledge cutoff date, and have no awareness of your internal documents, databases, or proprietary knowledge. No matter how large the context window grows, you cannot paste your entire data warehouse into a prompt.
This is where Retrieval-Augmented Generation comes in. RAG is not just another AI buzzword. It is the architecture pattern that bridges the gap between what LLMs know and what your business needs them to know.
The simplest mental model for RAG is the Open Book Exam.
Traditional LLMs take a Closed Book Exam. They answer questions using only what they memorized during training. If the answer was not in the training data, or if the information has changed since the knowledge cutoff, the model either confesses ignorance or hallucinates confidently.
RAG transforms this into an Open Book Exam. Before the model answers, it retrieves relevant documents from your knowledge base, reads them, and then generates a response grounded in that retrieved context.
The result: An LLM that can answer questions about your Q4 2025 sales figures, your company's expense reimbursement policy, or your product's latest API documentation, even though none of that existed when the model was trained.
RAG consists of three stages that happen on every query:
Retrieval → Augmentation → Generation

1. Retrieval: Given a user query, find the most relevant documents or passages from your knowledge base. This typically uses semantic search powered by vector embeddings.
2. Augmentation: Take the retrieved documents and inject them into the LLM's prompt as context. This is the "augmentation" that gives RAG its name.
3. Generation: The LLM generates a response, using both its pretrained knowledge and the retrieved context to answer the user's question.
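In code, the whole pattern reduces to a composition of these three steps. The sketch below uses stub helpers standing in for a real retriever, prompt builder, and LLM call; the complete, runnable example later in this article fills them in with actual APIs.

```python
# Conceptual shape of every RAG query. The three helpers are stubs for
# illustration only; real implementations appear later in this article.
def retrieve(question: str) -> list[str]:
    return ["(relevant chunks from your knowledge base would appear here)"]

def build_prompt(question: str, docs: list[str]) -> str:
    return "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    return "(LLM response grounded in the context above)"

def answer(question: str) -> str:
    docs = retrieve(question)              # 1. Retrieval: search the knowledge base
    prompt = build_prompt(question, docs)  # 2. Augmentation: inject docs into the prompt
    return generate(prompt)                # 3. Generation: the LLM answers from that context
```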
RAG is powerful, but it is not always the right answer. Here is a decision framework for choosing between RAG and its alternatives.
This is the most common architectural decision, and it comes down to a simple distinction:
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Instant (update the document) | Slow (retrain the model) |
| Citations/Sources | Can cite specific documents | Cannot attribute sources |
| Access control | Filter retrieval by permissions | Knowledge is baked in |
| Cost per query | Higher (retrieval + LLM) | Lower (single inference) |
| Setup complexity | Vector DB infrastructure | Training pipeline |
| Best for | Dynamic, factual knowledge | Style, format, reasoning |
The Rule: If you need the model to know facts, use RAG. If you need the model to behave differently, consider fine-tuning. For a deeper exploration, see our article on RAG vs Fine-tuning.
With 200,000+ token context windows now available, why not just paste all your documents into the prompt?
| Criterion | RAG | Long Context |
|---|---|---|
| Document volume | Unlimited (stored externally) | Limited by context window |
| Cost | Retrieval is cheap; only relevant chunks go to LLM | Tokens are expensive; paying for irrelevant content |
| Latency | Fast (retrieve k documents) | Slow for very long contexts |
| Precision | High (semantic matching) | Model must "find the needle" |
| Knowledge updates | Instant | Must rebuild prompt |
The Nuance: Long context windows are excellent for documents you know the user will need, like "summarize this contract" or "answer questions about this codebase." RAG is better when the relevant documents are unknown upfront and must be discovered from a large corpus.
The two approaches are also complementary. Many production systems use RAG to retrieve documents, then pass them to a long-context model for deep analysis.
Traditional keyword search (BM25, Elasticsearch) still works. Why add the complexity of vectors?
| Criterion | RAG (Semantic) | Traditional Search |
|---|---|---|
| Query understanding | Conceptual matching | Exact keyword matching |
| Synonym handling | Automatic | Manual (synonyms list) |
| Answer generation | Synthesized, natural language | Document links/snippets |
| Setup complexity | Higher (embeddings, vector DB) | Lower (inverted index) |
The Truth: Production RAG systems use both. Hybrid search combining semantic vectors with keyword matching typically outperforms either approach alone, especially for technical content with exact-match requirements. We cover this in depth in our Hybrid Search article.
| Your Situation | Recommended Approach |
|---|---|
| Need to answer questions from dynamic documents | RAG |
| Need model to write in specific style/format | Fine-tuning |
| User provides the document at query time | Long Context |
| Searching for specific documents by keyword | Traditional Search |
| Large corpus, unknown relevance | RAG + Hybrid Search |
| High accuracy on technical content | RAG + Reranking |
A RAG system has two major pipelines: Indexing (offline, runs when documents change) and Query (online, runs on every user request).
Indexing pipeline (offline): Document Loading → Chunking → Embedding → Vector Storage

1. Document Loading: Ingest documents from their source format. This includes PDFs, Word documents, HTML pages, Notion databases, Confluence spaces, code repositories, and more. Each source requires specialized parsing.
2. Chunking: Split documents into smaller pieces that fit within embedding model context limits and provide focused, retrievable units. This is far more nuanced than it sounds. See our Chunking Strategies article for why "just split every 1000 characters" fails.
3. Embedding: Convert each chunk into a vector (a list of floating-point numbers) using an embedding model. These vectors capture semantic meaning, allowing "vacation policy" to match documents about "PTO" and "time off"; a short similarity sketch follows this list. Our Embeddings Deep Dive covers model selection.
4. Vector Storage: Store the vectors in a database optimized for similarity search. Options range from managed services (Pinecone) to open source (Weaviate, Qdrant) to PostgreSQL extensions (pgvector). See our Vector Database Comparison.
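To make the "semantic meaning" claim in the embedding step concrete, here is a minimal sketch of how closeness between two embedding vectors is typically measured (cosine similarity). It reuses the OpenAI embeddings call from the code example later in this article; the similarity math is identical for any embedding model.

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model works the same way; this uses OpenAI's API for illustration.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "vacation policy" and "PTO accrual rules" should score much closer to each
# other than either does to an unrelated sentence.
vectors = {text: embed(text) for text in [
    "vacation policy", "PTO accrual rules", "database connection pooling"
]}
print(cosine_similarity(vectors["vacation policy"], vectors["PTO accrual rules"]))
print(cosine_similarity(vectors["vacation policy"], vectors["database connection pooling"]))
```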
Query pipeline (online): User Query → Query Embedding → Vector Search → Context Building → LLM Generation

1. Query Embedding: The user's question is converted to a vector using the same embedding model used during indexing. This is critical: you cannot mix vectors from different models.
2. Vector Search: The query vector is compared against the stored vectors to find the most similar chunks. This uses algorithms like HNSW (Hierarchical Navigable Small World graphs) for fast approximate nearest-neighbor search.
3. Context Building: The retrieved chunks are assembled into a prompt, typically with a system message explaining the task, the retrieved context, and the user's original question.
4. LLM Generation: The augmented prompt is sent to an LLM, which generates a response grounded in the retrieved context. The model can cite sources, quote passages, or synthesize across multiple documents.
Here is a simplified Python example showing the core RAG flow. Platform-specific implementations will be covered in subsequent articles in this series.
# Conceptual RAG pipeline - not production code
# Platform-specific implementations in later articles
from openai import OpenAI

client = OpenAI()


# === INDEXING (run once per document) ===

def create_embedding(text: str) -> list[float]:
    """Convert text to a vector using the embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-size splitter with overlap, for illustration only."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def index_document(doc_id: str, content: str, vector_db):
    """Chunk, embed, and store a document."""
    # In production: use semantic chunking, not naive splitting
    chunks = split_into_chunks(content, chunk_size=500, overlap=50)
    for i, chunk in enumerate(chunks):
        embedding = create_embedding(chunk)
        # vector_db stands in for your vector store client (Pinecone, Qdrant, pgvector, ...)
        vector_db.upsert(
            id=f"{doc_id}_{i}",
            vector=embedding,
            metadata={"doc_id": doc_id, "text": chunk}
        )


# === QUERY (run on every user request) ===

def rag_query(question: str, vector_db) -> str:
    """Answer a question using RAG."""
    # 1. Embed the query
    query_vector = create_embedding(question)

    # 2. Retrieve relevant chunks
    results = vector_db.query(
        vector=query_vector,
        top_k=5,
        include_metadata=True
    )

    # 3. Build context from retrieved chunks
    context = "\n\n---\n\n".join([
        result.metadata["text"] for result in results
    ])

    # 4. Generate response with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant. Answer the user's
question based on the provided context. If the context doesn't
contain the answer, say so. Cite your sources."""
            },
            {
                "role": "user",
                "content": f"""Context:
{context}

Question: {question}

Answer based on the context above:"""
            }
        ]
    )
    return response.choices[0].message.content

This example omits many production concerns: error handling, hybrid search, reranking, caching, and more. The platform-specific articles in this series will build production-ready implementations.
One of the challenges in building RAG systems is the fragmented tooling landscape. There are dozens of frameworks, each with different philosophies, strengths, and trade-offs.
This series will provide hands-on implementations with six major platforms:
LangChain: The most widely adopted RAG framework. Python-first with a JavaScript port. Known for its extensive integrations (200+ document loaders, 50+ vector stores) and composable "chains" architecture. Best for teams who want maximum flexibility and do not mind some abstraction overhead.
LlamaIndex: Originally called "GPT Index." Deeply focused on document retrieval and indexing. Excellent for complex document structures (hierarchical, graph-based). Best for teams building knowledge bases from diverse document types.
Haystack: Enterprise-grade framework from deepset. Strong focus on production readiness, evaluation, and observability. Offers both OSS and managed cloud. Best for teams who need enterprise features out of the box.
Semantic Kernel: Microsoft's AI orchestration framework. First-class .NET support with Python available. Tight integration with Azure AI services. Best for teams in the Microsoft ecosystem building copilots.
AWS Bedrock Knowledge Bases: Fully managed RAG service. No infrastructure to manage: point it at S3 documents and query via API. Best for teams who want turnkey RAG without managing vector databases or embedding pipelines.
Vercel AI SDK: Streaming-first, built for Next.js and edge deployments. Not a full RAG framework, but provides excellent primitives for building RAG into web applications. Best for teams building AI-native web applications.
| Platform | Primary Language | Managed Option | Key Strength |
|---|---|---|---|
| LangChain | Python/JS | LangSmith | Flexibility, integrations |
| LlamaIndex | Python | LlamaCloud | Document-centric indexing |
| Haystack | Python | deepset Cloud | Enterprise features |
| Semantic Kernel | C#/Python | Azure | Microsoft ecosystem |
| Bedrock KB | Any (API) | Fully managed | Zero infrastructure |
| Vercel AI SDK | TypeScript | Vercel | Web/streaming focus |
Having built RAG systems across industries, we see these mistakes most often:
The default RecursiveCharacterTextSplitter(chunk_size=1000) in most tutorials is a starting point, not a solution. Naive fixed-size chunking splits sentences and tables mid-thought and produces chunks stripped of the surrounding context they need to be retrieved reliably.
The Fix: Use document-aware chunking. For contracts, chunk by clause. For code, chunk by function. For manuals, chunk by section header. Consider parent-child chunking where you embed small chunks but retrieve their larger parents.
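As one deliberately simplified illustration of document-aware chunking, here is a sketch that splits a Markdown manual at section headers rather than at a fixed character count. The header regex and the max_chars fallback are illustrative choices, not recommendations.

```python
import re

def chunk_markdown_by_header(text: str, max_chars: int = 2000) -> list[str]:
    """Split a Markdown document at section headers, falling back to
    paragraph-level splits only when a section exceeds max_chars."""
    # Split at newlines that are immediately followed by a Markdown header.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Oversized section: fall back to paragraph splits.
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append(para.strip())
    return [c for c in chunks if c]
```

The same idea applies to other formats: split contracts at clause boundaries, code at function boundaries, and manuals at section headers, then embed those units instead of arbitrary character windows.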
Pure vector search has a critical weakness: it fails on exact matches.
Query: "Error code XJ-445"
Vector match: "Error code YK-112" (similar format, wrong code)
Technical content, product codes, proper nouns, and acronyms need keyword matching to work reliably.
The Fix: Implement hybrid search combining semantic vectors (for conceptual matching) with BM25 or sparse vectors (for exact matching). Use Reciprocal Rank Fusion to merge results. This is now table stakes for production RAG.
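Reciprocal Rank Fusion itself is only a few lines. A minimal sketch, assuming you already have ranked lists of document IDs from a semantic search and a BM25 search; the k=60 constant is the commonly used default from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked lists of document IDs into one ranking.
    Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers partially disagree; RRF rewards documents both liked.
semantic_hits = ["doc_42", "doc_7", "doc_13"]
keyword_hits = ["doc_7", "doc_99", "doc_42"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# doc_7 and doc_42 rank above documents found by only one retriever.
```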
Teams spend weeks building RAG systems, then ask "does it work?" by trying a few queries manually. This is insufficient.
RAG has multiple failure modes: retrieval can miss the relevant documents entirely, and generation can ignore or contradict the context it was given.
The Fix: Build an evaluation pipeline with metrics for each stage: retrieval quality (did the relevant chunks come back in the top-k?) and generation quality (is the answer faithful to the retrieved context, and does it actually address the question?).
Tools like RAGAS, LangSmith, and Phoenix provide evaluation frameworks specifically for RAG.
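Even before adopting one of those frameworks, a small hand-rolled check beats eyeballing a few queries. Here is a minimal sketch of one retrieval metric, recall@k over a small hand-labeled set; the eval_set structure and the retrieve callable are assumptions for illustration, not any framework's API.

```python
def retrieval_recall_at_k(
    eval_set: list[dict],  # e.g. [{"question": "...", "relevant_ids": ["chunk_3", ...]}, ...]
    retrieve,              # callable: question -> list of chunk IDs, best match first
    k: int = 5,
) -> float:
    """Fraction of labeled relevant chunks that appear in the top-k results."""
    hits, total = 0, 0
    for example in eval_set:
        retrieved = set(retrieve(example["question"])[:k])
        relevant = set(example["relevant_ids"])
        hits += len(retrieved & relevant)
        total += len(relevant)
    return hits / total if total else 0.0
```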
RAG introduces unique security considerations: retrieved documents can surface content to users who should never see it, and document content flows directly into the LLM's prompt, where it must be treated as untrusted input.
The Fix: Enforce access control at retrieval time by filtering the vector search on the requesting user's permissions, and treat retrieved content as untrusted data rather than trusted instructions.
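A minimal sketch of permission-aware retrieval under those constraints: store an allowed_groups list in each chunk's metadata at indexing time, then filter the vector search at query time. The filter syntax below is an assumption modeled on Pinecone-style metadata filters; adapt it to your vector database's API.

```python
def secure_rag_query(question: str, user_groups: list[str], vector_db) -> list[dict]:
    """Retrieve only chunks the requesting user is allowed to see.
    Assumes each chunk was indexed with metadata={"allowed_groups": [...], "text": ...}."""
    query_vector = create_embedding(question)  # same helper as the earlier example
    results = vector_db.query(
        vector=query_vector,
        top_k=5,
        include_metadata=True,
        # Hypothetical filter syntax; check your vector database's filter API.
        filter={"allowed_groups": {"$in": user_groups}},
    )
    return [result.metadata for result in results]
```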
This article establishes the foundations. The remaining eleven articles will provide hands-on, production-ready implementations:
| Part | Title | Focus |
|---|---|---|
| 2 | LangChain RAG: From Prototype to Production | Complete LangChain implementation with LCEL |
| 3 | LlamaIndex: Document-Centric RAG | Hierarchical indexing, query engines |
| 4 | Haystack: Enterprise RAG Pipelines | Production patterns, evaluation |
| 5 | Semantic Kernel: RAG in the Microsoft Ecosystem | .NET implementation, Azure integration |
| 6 | AWS Bedrock Knowledge Bases | Fully managed RAG, zero infrastructure |
| 7 | Vercel AI SDK: Streaming RAG for Next.js | Edge deployment, real-time UX |
| 8 | Advanced Retrieval: Hybrid Search and Reranking | Cross-platform techniques |
| 9 | Chunking and Indexing Strategies | Document processing deep dive |
| 10 | RAG Evaluation and Testing | Metrics, benchmarks, CI/CD |
| 11 | Security and Access Control | Enterprise-grade patterns |
| 12 | Choosing Your Platform: A Decision Framework | Comparative analysis |
If you are new to RAG: Read this article, then work sequentially through the platform articles (Parts 2-7).
If you have a specific platform in mind: Jump directly to that platform's article after reading this foundation.
If you are optimizing an existing system: Skip to Parts 8-11 for advanced patterns.
This series builds on our existing RAG content. For deeper dives into specific concepts, see the articles referenced throughout this piece: RAG vs Fine-tuning, Chunking Strategies, the Embeddings Deep Dive, the Vector Database Comparison, and Hybrid Search.
Before diving into platform-specific implementations, ensure you have an API key for your preferred LLM provider, a working development environment for your chosen platform's language, and access to a vector database (managed or self-hosted).
The next article, LangChain RAG: From Prototype to Production, will walk through a complete implementation using the most popular framework in the ecosystem.
This is Part 1 of the "Building RAG Systems: A Platform-by-Platform Guide" series. Next up: LangChain RAG: From Prototype to Production.