Most of the AI implementations we inherit from clients who tried to build them in-house share the same problem: a raw ChatGPT call with a giant system prompt instead of a real knowledge retrieval system. Documents get truncated. Contradictory information coexists in the same context window. The model interpolates between conflicting facts and produces confident-sounding nonsense.
RAG — Retrieval Augmented Generation — solves this. Despite the intimidating name, the concept is straightforward: instead of feeding all your documents to the LLM at once, you build a search engine that retrieves only the most relevant documents for each query, then feed those retrieved documents as context to the LLM.
For a law firm we worked with, a RAG system over 5,000 case documents answers 78% of attorney questions correctly. Setup cost: $2,400 (40 hours). Ongoing operational cost: $85/month (embeddings + LLM + Supabase). The alternative — two paralegals doing the same research — was $8,000/month.
What RAG Actually Is
Strip away the hype. RAG is a search engine that feeds its results to an LLM as context. That is the entire concept. The "retrieval" part is the search engine. The "augmented generation" part is the LLM answering using the search results as its source material.
What makes it powerful is that the search is semantic, not keyword-based. Instead of searching for exact word matches, you search for conceptual similarity. A query about "cancellation terms" will retrieve a document about "subscription termination policy" even if the words "cancel" and "terms" do not appear in the document.
RAG vs Fine-Tuning: The Case Is Not Even Close
Fine-tuning means training the model itself on your data. It is expensive ($1,000+ per training run on GPT-4), takes days, requires hundreds to thousands of high-quality training examples, and produces a model with knowledge frozen at training time. Update your policies? Retrain. Add new products? Retrain.
RAG is better for 95% of business use cases because:
- Freshness: Update a document in your knowledge base and the system immediately uses the new information. No retraining.
- Cost: Building and running a RAG system costs $50–$300/month. Fine-tuning costs $1,000–$10,000+ per training run.
- Hallucination reduction: RAG grounds the model in your specific documents. The model answers from sources rather than from parametric memory, which dramatically reduces confident-but-wrong answers.
- Auditability: You can show exactly which documents were retrieved for each answer. Fine-tuning is a black box.
Fine-tuning is appropriate when you need the model to learn a specific writing style, output format, or reasoning pattern — not when you want it to know specific facts. Facts belong in RAG. Style belongs in fine-tuning.
The RAG Pipeline: Every Layer Explained
A production RAG system has six layers. Most tutorials cover only layers 3–5 and skip the critically important work in layers 1–2.
Layer 1: Document Ingestion
Getting your documents into the system is not a simple file upload. Different source types require different extraction approaches, and poor extraction is the most common source of RAG failures.
PDF Parsing
PDFs are the bane of knowledge base builders. Two options:
- pdf-parse (Node.js): Free, works for text-based PDFs, returns raw text. Fails on scanned PDFs (image-based), tables, and multi-column layouts. Good for legal documents, reports, policy documents where text flows linearly.
- Unstructured.io: $14/1,000 pages on their API. Handles scanned PDFs with OCR, preserves table structure, extracts text from multi-column layouts. Worth the cost for financial documents, forms, and anything with complex layout. We use this for all client PDF ingestion.
Web Scraping
For help centers, public documentation, and product pages, scrape with Puppeteer or Playwright and extract the main content. The critical step is removing navigation, footer, sidebar, and advertising content — these add noise that degrades retrieval. Use Mozilla's Readability library (the same library Firefox uses for Reader View) to extract main content reliably.
CRM and SaaS Exports
Most CRMs export to CSV or JSON. Export customer records, case histories, or product data, then convert to structured text documents before ingestion. Format matters: "Customer name: John Smith. Issue: Login failure. Resolution: Reset password via admin panel." is much better for retrieval than a raw CSV row.
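A minimal sketch of that CSV-to-text conversion. The column names (`name`, `issue`, `resolution`) are illustrative — map them to whatever your CRM actually exports:

```python
import csv
import io

def row_to_document(row: dict) -> str:
    """Convert one CRM export row into a retrieval-friendly text document.
    Column names here are hypothetical -- adapt to your export schema."""
    return (
        f"Customer name: {row['name']}. "
        f"Issue: {row['issue']}. "
        f"Resolution: {row['resolution']}."
    )

# Example: a one-row CSV export converted before ingestion
raw = "name,issue,resolution\nJohn Smith,Login failure,Reset password via admin panel\n"
docs = [row_to_document(r) for r in csv.DictReader(io.StringIO(raw))]
```

The point is that labeled prose sentences embed far better than a comma-separated row, because the embedding model sees the semantic role of each field.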
Layer 2: Chunking
Chunking is splitting your documents into the segments that will be stored and retrieved individually. This is the layer most implementations get wrong, and it is the one that has the biggest impact on retrieval quality.
Fixed-Size Chunking (512 or 1024 tokens)
The simplest approach. Split every document into equal-sized segments with a small overlap (50–100 tokens) between adjacent chunks. Easy to implement, predictable costs. The problem: it ignores document structure. A chunk might start mid-sentence or contain half an explanation that continues in the next chunk.
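The sliding-window logic is a few lines. This sketch operates on a pre-tokenized sequence; in practice the tokens would come from a tokenizer such as tiktoken:

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 64) -> list[list]:
    """Split a token sequence into fixed-size chunks, with `overlap` tokens
    shared between adjacent chunks so a thought cut at a boundary still
    appears whole in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap is what mitigates (but does not eliminate) the mid-sentence-split problem the paragraph above describes.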
Semantic Chunking (by heading or paragraph)
Split at natural semantic boundaries: headings, section breaks, paragraph boundaries. This preserves logical units of information. Each chunk contains a complete thought. Harder to implement but significantly better for retrieval on documents with clear structure (documentation, policy docs, FAQs).
We tested both strategies on a 200-page employee handbook for an HR automation client. Semantic chunking improved answer accuracy by 23% over fixed-size chunking. The improvement was even larger (31%) for questions that required synthesizing information from a specific policy section.
Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter is a good middle ground. It attempts to split on semantic boundaries (paragraphs, then sentences, then words) but falls back to character-level splitting when sections are too long. We use this as the default for mixed content (some structured, some flowing prose).
Target Chunk Size
For customer service and Q&A applications: 200–400 tokens per chunk. Smaller chunks enable more precise retrieval. For research and analysis applications where context matters: 512–800 tokens. Larger chunks preserve more context around each fact.
Layer 3: Embeddings
An embedding is a numerical representation of text that captures its meaning. Similar texts have similar embeddings (close in vector space). This is what enables semantic search.
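"Close in vector space" usually means cosine similarity. A minimal implementation, shown here on toy vectors rather than real 1,536-dimension embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 means the
    vectors point the same way (similar meaning), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In production you never compute this in application code — the vector store does it (pgvector's `<=>` operator is cosine distance, i.e. 1 minus this value) — but it is the operation underneath every semantic search call.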
OpenAI text-embedding-3-small
Our default recommendation for most business RAG systems. At $0.02 per million tokens, embedding a 500-document knowledge base (average 500 tokens per document) costs approximately $0.005 — less than one cent. Strong performance across English language retrieval tasks. 1,536 dimensions. Use this unless you have a specific reason not to.
OpenAI text-embedding-3-large
At $0.13 per million tokens, this is 6.5x more expensive than text-embedding-3-small. Performance improvement is measurable but not dramatic for typical business use cases. We use this for: multilingual knowledge bases where the user may query in a different language than the source documents, or for highly technical domains (legal, medical) where subtle semantic distinctions matter. 3,072 dimensions.
Open-Source Embeddings (nomic-embed-text)
nomic-embed-text is free, runs locally or via Ollama, and performs comparably to text-embedding-3-small on many benchmarks. The trade-off is infrastructure overhead: you need to run a model server or pay for a hosted endpoint. For privacy-sensitive deployments where documents cannot leave your infrastructure, this is the right choice.
Layer 4: Vector Storage
Vector storage is where your embeddings live. You query it with a new embedding and get back the most similar stored embeddings (and their associated text chunks).
Supabase pgvector (Our Primary Recommendation)
pgvector is a PostgreSQL extension that adds vector similarity search. Supabase makes it trivially easy to set up. The free tier handles up to 500MB of data — enough for tens of thousands of chunks.
Setup SQL:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Chunk storage: text, 1536-dim embedding (text-embedding-3-small), metadata
CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536),
  metadata jsonb,
  created_at timestamptz DEFAULT now()
);

-- HNSW index for fast approximate nearest-neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Top-5 similarity query ($1 is the query embedding)
SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```
The HNSW index (Hierarchical Navigable Small World) dramatically speeds up similarity search on large datasets. On 100,000 chunks without an index, a query takes 2–5 seconds. With the HNSW index, the same query takes 20–50ms. Always add this index before going to production.
Pinecone ($70/month starter)
Managed vector database with excellent developer experience. The starter pod handles 100,000 vectors with fast query times. Worth considering when: you are not already using Supabase, you need the managed infrastructure guarantee, or your team is unfamiliar with PostgreSQL administration. We do not recommend it for clients who already have Supabase — the additional cost and complexity are unjustified.
Weaviate (Self-Hosted, Free)
Feature-rich vector database with hybrid search (vector + keyword) built in. More complex to set up and operate than pgvector. Recommend only when you specifically need hybrid search and do not want to implement it manually in pgvector.
Layer 5: Retrieval
Retrieval is where most RAG implementations have significant room for improvement beyond the basic similarity search.
Basic Similarity Search
Embed the query, find the 4–8 most similar chunks by cosine similarity. This is the baseline. It works reasonably well but misses cases where the best answer is phrased so differently from the query that its similarity score is low.
Hybrid Search (Vector + Keyword)
Combine vector similarity with BM25-style keyword search and merge the results. This handles cases where exact terminology matters (product names, technical codes, named entities). In pgvector, implement the keyword side with PostgreSQL full-text search (tsvector with ts_rank) alongside the vector search — pg_trgm does fuzzy trigram matching, which is a different tool. Use Reciprocal Rank Fusion to merge the two result sets.
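Reciprocal Rank Fusion itself is simple enough to fit in a few lines. A sketch, with illustrative document ids; `k = 60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. vector search + keyword
    search). Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so documents ranked well by both searches float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, you never have to normalize cosine similarities against BM25 scores — which is why it is the standard fusion method for hybrid search.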
Re-ranking with Cohere
After initial retrieval (get top 20 chunks), use Cohere's Rerank API ($1 per 1,000 requests) to re-score and re-order results based on relevance to the specific query. Re-ranking consistently improves answer quality by 15–25% in our evaluations. We use it in all production deployments that justify the $5–$30/month additional cost.
Metadata Filtering
Before or during retrieval, filter by document metadata. If a user asks about the return policy and your knowledge base covers multiple product categories, filter to only retrieve from documents tagged with the relevant category. This eliminates irrelevant results that might dilute the context.
In Supabase: add a WHERE clause to your similarity search query. WHERE metadata->>'category' = 'returns'. This requires storing category metadata during ingestion — another reason the metadata column in your documents table is essential.
Layer 6: Generation
The final layer: using the retrieved chunks as context to generate the answer.
System Prompt Structure for RAG
The system prompt for a RAG application differs from a general ChatGPT prompt. It must:
- Instruct the model to answer ONLY from the provided context
- Tell the model how to handle cases where context does not contain the answer
- Specify citation format (if you want the model to reference source documents)
- Define the response format and length constraints
Template: "You are [role] for [company]. Answer questions using ONLY the provided context documents. If the answer is not in the provided context, say: 'I don't have specific information about that in my knowledge base. Please [action].' Do not make up information. When relevant, cite the source document title."
Confidence Scoring
Implement a pre-response confidence check: if the highest similarity score from retrieval is below 0.70 (on a 0–1 cosine similarity scale), treat this as a low-confidence retrieval and route to your fallback behavior instead of attempting to answer. This prevents the model from fabricating answers when your knowledge base genuinely does not contain relevant information.
Citation Formatting
For professional applications (legal, medical, compliance), include source citations in responses. Store the document title, source URL, and section header in your metadata. Pass these to the model with instructions to cite specific sources when making factual claims. This builds trust and enables verification.
Advanced Patterns
Parent-Child Chunking
Store documents in two ways simultaneously: small chunks (200 tokens) for precise retrieval, and large parent chunks (1,000 tokens) for response generation. When a small chunk is retrieved, look up its parent chunk and use the larger context for the LLM. This gives you retrieval precision with generation context — the best of both chunk sizes.
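The lookup step reduces to a child-to-parent mapping built at ingestion time. A sketch with illustrative ids:

```python
def retrieve_with_parents(query_matches: list[str],
                          child_to_parent: dict[str, str],
                          parents: dict[str, str]) -> list[str]:
    """Parent-child retrieval: similarity search matched on small child
    chunks, but the LLM receives each child's larger parent chunk.
    Parents are deduplicated so the same section is never sent twice."""
    seen: set[str] = set()
    context: list[str] = []
    for child_id in query_matches:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

In pgvector this maps naturally onto the metadata column: store each child chunk with a `parent_id` key and fetch parents in a second query after retrieval.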
Multi-Index RAG
Maintain separate vector indexes for different document types. For a law firm: one index for case law, one for contracts, one for internal procedures. Route the query to the appropriate index based on intent classification before retrieval. This prevents irrelevant cross-category contamination in results.
Hypothetical Document Embedding (HyDE)
Before retrieving, generate a hypothetical ideal answer to the query using the LLM, then use that hypothetical answer as the query for retrieval instead of the original question. This bridges the vocabulary gap between short questions and long document chunks. Particularly effective for technical domains. The trade-off is an additional LLM call per query, adding $0.001–$0.003 per query.
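The HyDE flow is two calls chained together. In this sketch, `generate` and `embed` are placeholders for your LLM and embedding clients — any callables with these shapes work:

```python
def hyde_query(question: str, generate, embed) -> list[float]:
    """HyDE: instead of embedding the user's short question, generate a
    hypothetical passage that would answer it, embed that passage, and
    use the resulting vector for similarity search. A long hypothetical
    answer sits closer in vector space to real document chunks than a
    terse question does."""
    hypothetical = generate(
        f"Write a short passage that would answer this question:\n{question}"
    )
    return embed(hypothetical)
```

The returned vector goes into the same similarity query as before; only the query-side embedding changes.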
Evaluating RAG Quality
You cannot improve what you do not measure. These three metrics form the minimum evaluation framework:
- Answer Relevance: Does the generated answer address the question? Score 0–5. Evaluate by having a human (or a separate LLM call) judge whether the answer is responsive to the question.
- Faithfulness: Are all claims in the answer supported by the retrieved context? Score 0–5. Check for hallucinated facts not present in retrieved chunks.
- Context Recall: Were the right documents retrieved? If you have ground-truth Q&A pairs, check whether the correct source document appears in the top-5 retrieved results.
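Context recall is the easiest of the three to automate yourself. A sketch, assuming each test item pairs a ground-truth document id with the ranked retrieval results:

```python
def recall_at_k(test_set: list[dict], k: int = 5) -> float:
    """Context recall: the fraction of test queries whose ground-truth
    source document appears in the top-k retrieved results. Each item
    maps 'expected' (correct doc id) and 'retrieved' (ranked doc ids)."""
    hits = sum(1 for t in test_set if t["expected"] in t["retrieved"][:k])
    return hits / len(test_set)
```

Answer relevance and faithfulness need an LLM-as-judge (or a human), which is what RAGAS automates.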
The RAGAS framework (available on GitHub and PyPI) automates this evaluation using an LLM-as-judge approach. Run RAGAS on a test set of 50–100 representative queries before deploying to production. We use this to validate every knowledge base we build.
Production Considerations
Document Update Pipeline
RAG systems require ongoing maintenance as your documents change. Build an update pipeline that: detects document changes (webhook from your CMS, scheduled crawl, or file modification timestamps), re-chunks the changed document, re-embeds the new chunks, and upserts to the vector store using the source URL as a unique key to replace old chunks.
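The change-detection step can be as simple as hashing document bodies. A sketch — `stored_hashes` stands in for whatever table holds your per-source ingestion state:

```python
import hashlib

def needs_reembedding(source_url: str, content: str,
                      stored_hashes: dict[str, str]) -> bool:
    """Change detection for the update pipeline: hash the document body and
    compare against the hash recorded at last ingestion. Only a changed
    document gets re-chunked, re-embedded, and upserted (keyed on
    source_url so stale chunks are replaced, not duplicated)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if stored_hashes.get(source_url) == digest:
        return False  # unchanged: skip the embedding cost entirely
    stored_hashes[source_url] = digest
    return True
```

This also doubles as a cost control: a nightly crawl that re-embeds only changed pages costs a fraction of re-embedding everything.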
Embedding Cost Optimization
Batch embedding calls: instead of embedding one chunk at a time, group many chunks per request, and use OpenAI's Batch API for bulk jobs — it processes asynchronous requests at a 50% discount. For a 10,000-chunk knowledge base at roughly 300 tokens per chunk (~3M tokens), that cuts the initial embedding cost from about $0.06 to $0.03 with text-embedding-3-small. Trivial on its own, but the savings compound at scale with frequent re-embedding.
Caching
Cache embedding results for common queries. If 30% of your queries are variations of the same 10 questions, caching those embeddings and their retrieval results eliminates repeated computation. Use Redis with a 1-hour TTL for dynamic content or 24-hour TTL for stable knowledge bases.
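The caching pattern, sketched as an in-memory TTL cache. In production you would back this with Redis as suggested above; the interface stays the same:

```python
import time

class QueryCache:
    """Minimal in-memory TTL cache for query embeddings and retrieval
    results. A production version would swap the dict for Redis with
    the same get/set semantics."""

    def __init__(self, ttl_seconds: float = 3600.0):  # 1h TTL default
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, query: str):
        entry = self.store.get(query)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self.store[query]  # expired: evict and miss
            return None
        return value

    def set(self, query: str, value) -> None:
        self.store[query] = (time.monotonic() + self.ttl, value)
```

Normalize queries (lowercase, strip whitespace) before using them as cache keys, or trivially different phrasings of the same question will all miss.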
Common RAG Failure Modes
- Chunk size too large. 2,000-token chunks containing multiple topics contaminate each other during retrieval. The model receives context about Topic A when the user asked about Topic B, and the answer blends both incorrectly.
- No metadata filtering. Without filtering, a query about "pricing" retrieves pricing information from the wrong product line or the wrong time period. Every knowledge base needs document-level metadata.
- No re-ranking. Basic similarity search returns results ordered by vector similarity, not answer relevance. A chunk that is vectorially similar but not actually useful gets ranked above a chunk that is slightly less similar but directly answers the question. Re-ranking fixes this.
- Outdated documents in the knowledge base. Old pricing, superseded policies, and discontinued products all produce incorrect answers that are impossible to detect without source citations. Build the update pipeline before you need it, not after your first "why did the AI tell a customer our 2024 price?" incident.
- No evaluation before deployment. Deploying a RAG system without running RAGAS or equivalent evaluation means you do not know your baseline quality. You cannot measure improvement or detect regression. Always evaluate on a held-out test set before and after changes.
Building Your First RAG System
Start with Supabase pgvector and text-embedding-3-small. For document ingestion, begin with your top 50 support articles or FAQ pages — clean them manually, chunk semantically, and embed. Build the retrieval function. Write the generation prompt. Test with 20 real questions before anything else.
The first version will not be perfect. The chunking will be imperfect, some retrievals will miss, and the generation prompt will need iteration. That is normal. Run RAGAS evaluation, identify the failure modes, and fix the highest-impact issues first.
For context on how RAG fits into broader AI integration strategies, read our RAG explained: how to connect AI to your business data. For implementation as part of a managed deployment, see our AI integration services.