Most of the AI implementations we inherit from clients who tried to build them in-house share the same problem: a raw ChatGPT call with a giant system prompt instead of a real knowledge retrieval system. Documents get truncated. Contradictory information coexists in the same context window. The model interpolates between conflicting facts and produces confident-sounding nonsense.
RAG — Retrieval Augmented Generation — solves this. Despite the intimidating name, the concept is straightforward: instead of feeding all your documents to the LLM at once, you build a search engine that retrieves only the most relevant documents for each query, then feed those retrieved documents as context to the LLM.
For a law firm we worked with, a RAG system over 5,000 case documents answers 78% of attorney questions correctly. Setup cost: $2,400 (40 hours). Ongoing operational cost: $85/month (embeddings + LLM + Supabase). The alternative — two paralegals doing the same research — was $8,000/month.
What RAG Actually Is
Strip away the hype. RAG is a search engine that feeds its results to an LLM as context. That is the entire concept. The "retrieval" part is the search engine. The "augmented generation" part is the LLM answering using the search results as its source material.
What makes it powerful is that the search is semantic, not keyword-based. Instead of searching for exact word matches, you search for conceptual similarity. A query about "cancellation terms" will retrieve a document about "subscription termination policy" even if the words "cancel" and "terms" do not appear in the document.
RAG vs Fine-Tuning: The Case Is Not Even Close
Fine-tuning means training the model itself on your data. It is expensive ($1,000+ per training run on GPT-4), takes days, requires hundreds to thousands of high-quality training examples, and produces a model with knowledge frozen at training time. Update your policies? Retrain. Add new products? Retrain.
RAG is better for 95% of business use cases because:
- Freshness: Update a document in your knowledge base and the system immediately uses the new information. No retraining.
- Cost: Building and running a RAG system costs $50–$300/month. Fine-tuning costs $1,000–$10,000+ per training run.
- Hallucination reduction: RAG grounds the model in your specific documents. The model answers from sources rather than from parametric memory, which dramatically reduces confident-but-wrong answers.
- Auditability: You can show exactly which documents were retrieved for each answer. Fine-tuning is a black box.
Fine-tuning is appropriate when you need the model to learn a specific writing style, output format, or reasoning pattern — not when you want it to know specific facts. Facts belong in RAG. Style belongs in fine-tuning.
The RAG Pipeline: Every Layer Explained
A production RAG system has six layers. Most tutorials cover only layers 3–5 and skip the critically important work in layers 1–2.
Layer 1: Document Ingestion
Getting your documents into the system is not a simple file upload. Different source types require different extraction approaches, and poor extraction is the most common source of RAG failures.
PDF Parsing
PDFs are the bane of knowledge base builders. Two options:
- pdf-parse (Node.js): Free, works for text-based PDFs, returns raw text. Fails on scanned PDFs (image-based), tables, and multi-column layouts. Good for legal documents, reports, policy documents where text flows linearly.
- Unstructured.io: $14/1,000 pages on their API. Handles scanned PDFs with OCR, preserves table structure, extracts text from multi-column layouts. Worth the cost for financial documents, forms, and anything with complex layout. We use this for all client PDF ingestion.
Web Scraping
For help centers, public documentation, and product pages, scrape with Puppeteer or Playwright and extract the main content. The critical step is removing navigation, footer, sidebar, and advertising content — these add noise that degrades retrieval. Use Mozilla's Readability library (the same library Firefox uses for Reader View) to extract main content reliably.
CRM and SaaS Exports
Most CRMs export to CSV or JSON. Export customer records, case histories, or product data, then convert to structured text documents before ingestion. Format matters: "Customer name: John Smith. Issue: Login failure. Resolution: Reset password via admin panel." is much better for retrieval than a raw CSV row.
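A minimal sketch of that CSV-to-text conversion. The column names (`name`, `issue`, `resolution`) are illustrative — map them to whatever your CRM actually exports:

```python
import csv
import io

def row_to_document(row: dict) -> str:
    """Convert one CRM export row into a retrieval-friendly text document.
    Column names here are hypothetical -- adapt to your export schema."""
    return (
        f"Customer name: {row['name']}. "
        f"Issue: {row['issue']}. "
        f"Resolution: {row['resolution']}."
    )

# Example: a one-row CSV export converted before ingestion
raw = "name,issue,resolution\nJohn Smith,Login failure,Reset password via admin panel\n"
docs = [row_to_document(r) for r in csv.DictReader(io.StringIO(raw))]
```

The point is that labeled prose sentences embed far better than a comma-separated row, because the embedding model sees the semantic role of each field.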
Layer 2: Chunking
Chunking is splitting your documents into the segments that will be stored and retrieved individually. This is the layer most implementations get wrong, and it is the one that has the biggest impact on retrieval quality.
Fixed-Size Chunking (512 or 1024 tokens)
The simplest approach. Split every document into equal-sized segments with a small overlap (50–100 tokens) between adjacent chunks. Easy to implement, predictable costs. The problem: it ignores document structure. A chunk might start mid-sentence or contain half an explanation that continues in the next chunk.
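The sliding-window logic is a few lines. This sketch operates on a pre-tokenized sequence; in practice the tokens would come from a tokenizer such as tiktoken:

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 64) -> list[list]:
    """Split a token sequence into fixed-size chunks, with `overlap` tokens
    shared between adjacent chunks so a thought cut at a boundary still
    appears whole in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap is what mitigates (but does not eliminate) the mid-sentence-split problem the paragraph above describes.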
Semantic Chunking (by heading or paragraph)
Split at natural semantic boundaries: headings, section breaks, paragraph boundaries. This preserves logical units of information. Each chunk contains a complete thought. Harder to implement but significantly better for retrieval on documents with clear structure (documentation, policy docs, FAQs).
We tested both strategies on a 200-page employee handbook for an HR automation client. Semantic chunking improved answer accuracy by 23% over fixed-size chunking. The improvement was even larger (31%) for questions that required synthesizing information from a specific policy section.
Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter is a good middle ground. It attempts to split on semantic boundaries (paragraphs, then sentences, then words) but falls back to character-level splitting when sections are too long. We use this as the default for mixed content (some structured, some flowing prose).
Target Chunk Size
For customer service and Q&A applications: 200–400 tokens per chunk. Smaller chunks enable more precise retrieval. For research and analysis applications where context matters: 512–800 tokens. Larger chunks preserve more context around each fact.
Layer 3: Embeddings
An embedding is a numerical representation of text that captures its meaning. Similar texts have similar embeddings (close in vector space). This is what enables semantic search.
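"Close in vector space" usually means cosine similarity. A minimal implementation, shown here on toy vectors rather than real 1,536-dimension embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: 1.0 means the
    vectors point the same way (similar meaning), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In production you never compute this in application code — the vector store does it (pgvector's `<=>` operator is cosine distance, i.e. 1 minus this value) — but it is the operation underneath every semantic search call.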
OpenAI text-embedding-3-small
Our default recommendation for most business RAG systems. At $0.02 per million tokens, embedding a 500-document knowledge base (average 500 tokens per document) costs approximately $0.005 — less than one cent. Strong performance across English language retrieval tasks. 1,536 dimensions. Use this unless you have a specific reason not to.
OpenAI text-embedding-3-large
At $0.13 per million tokens, this is 6.5x more expensive than text-embedding-3-small. Performance improvement is measurable but not dramatic for typical business use cases. We use this for: multilingual knowledge bases where the user may query in a different language than the source documents, or for highly technical domains (legal, medical) where subtle semantic distinctions matter. 3,072 dimensions.
Open-Source Embeddings (nomic-embed-text)
nomic-embed-text is free, runs locally or via Ollama, and performs comparably to text-embedding-3-small on many benchmarks. The trade-off is infrastructure overhead: you need to run a model server or pay for a hosted endpoint. For privacy-sensitive deployments where documents cannot leave your infrastructure, this is the right choice.
Layer 4: Vector Storage
Vector storage is where your embeddings live. You query it with a new embedding and get back the most similar stored embeddings (and their associated text chunks).
Supabase pgvector (Our Primary Recommendation)
pgvector is a PostgreSQL extension that adds vector similarity search. Supabase makes it trivially easy to set up. The free tier handles up to 500MB of data — enough for tens of thousands of chunks.
Setup SQL:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Chunk storage: text, 1536-dim embedding (text-embedding-3-small), metadata
CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536),
  metadata jsonb,
  created_at timestamptz DEFAULT now()
);

-- HNSW index for fast approximate nearest-neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Top-5 similarity query ($1 is the query embedding)
SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```
The HNSW index (Hierarchical Navigable Small World) dramatically speeds up similarity search on large datasets. On 100,000 chunks without an index, a query takes 2–5 seconds. With the HNSW index, the same query takes 20–50ms. Always add this index before going to production.
Pinecone ($70/month starter)
Managed vector database with excellent developer experience. The starter pod handles 100,000 vectors with fast query times. Worth considering when: you are not already using Supabase, you need the managed infrastructure guarantee, or your team is unfamiliar with PostgreSQL administration. We do not recommend it for clients who already have Supabase — the additional cost and complexity are unjustified.
Weaviate (Self-Hosted, Free)
Feature-rich vector database with hybrid search (vector + keyword) built in. More complex to set up and operate than pgvector. Recommend only when you specifically need hybrid search and do not want to implement it manually in pgvector.
Layer 5: Retrieval
Retrieval is where most RAG implementations have significant room for improvement beyond the basic similarity search.
Basic Similarity Search
Embed the query, find the 4–8 most similar chunks by cosine similarity. This is the baseline. It works reasonably well but misses cases where the best answer is phrased so differently from the query that its similarity score is low.
Hybrid Search (Vector + Keyword)
Combine vector similarity with BM25-style keyword search and merge the results. This handles cases where exact terminology matters (product names, technical codes, named entities). In pgvector, implement the keyword side with PostgreSQL full-text search (tsvector with ts_rank) alongside the vector search — pg_trgm does fuzzy trigram matching, which is a different tool. Use Reciprocal Rank Fusion to merge the two result sets.
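Reciprocal Rank Fusion itself is simple enough to fit in a few lines. A sketch, with illustrative document ids; `k = 60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. vector search + keyword
    search). Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so documents ranked well by both searches float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, you never have to normalize cosine similarities against BM25 scores — which is why it is the standard fusion method for hybrid search.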
Re-ranking with Cohere
After initial retrieval (get top 20 chunks), use Cohere's Rerank API ($1 per 1,000 requests) to re-score and re-order results based on relevance to the specific query. Re-ranking consistently improves answer quality by 15–25% in our evaluations. We use it in all production deployments that justify the $5–$30/month additional cost.
Metadata Filtering
Before or during retrieval, filter by document metadata. If a user asks about the return policy and your knowledge base covers multiple product categories, filter to only retrieve from documents tagged with the relevant category. This eliminates irrelevant results that might dilute the context.
In Supabase: add a WHERE clause to your similarity search query. WHERE metadata->>'category' = 'returns'. This requires storing category metadata during ingestion — another reason the metadata column in your documents table is essential.
Layer 6: Generation
The final layer: using the retrieved chunks as context to generate the answer.
System Prompt Structure for RAG
The system prompt for a RAG application differs from a general ChatGPT prompt. It must:
- Instruct the model to answer ONLY from the provided context
- Tell the model how to handle cases where context does not contain the answer
- Specify citation format (if you want the model to reference source documents)
- Define the response format and length constraints
Template: "You are [role] for [company]. Answer questions using ONLY the provided context documents. If the answer is not in the provided context, say: 'I don't have specific information about that in my knowledge base. Please [action].' Do not make up information. When relevant, cite the source document title."
Confidence Scoring
Implement a pre-response confidence check: if the highest similarity score from retrieval is below 0.70 (on a 0–1 cosine similarity scale), treat this as a low-confidence retrieval and route to your fallback behavior instead of attempting to answer. This prevents the model from fabricating answers when your knowledge base genuinely does not contain relevant information.
Citation Formatting
For professional applications (legal, medical, compliance), include source citations in responses. Store the document title, source URL, and section header in your metadata. Pass these to the model with instructions to cite specific sources when making factual claims. This builds trust and enables verification.
Advanced Patterns
Parent-Child Chunking
Store documents in two ways simultaneously: small chunks (200 tokens) for precise retrieval, and large parent chunks (1,000 tokens) for response generation. When a small chunk is retrieved, look up its parent chunk and use the larger context for the LLM. This gives you retrieval precision with generation context — the best of both chunk sizes.
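The lookup step reduces to a child-to-parent mapping built at ingestion time. A sketch with illustrative ids:

```python
def retrieve_with_parents(query_matches: list[str],
                          child_to_parent: dict[str, str],
                          parents: dict[str, str]) -> list[str]:
    """Parent-child retrieval: similarity search matched on small child
    chunks, but the LLM receives each child's larger parent chunk.
    Parents are deduplicated so the same section is never sent twice."""
    seen: set[str] = set()
    context: list[str] = []
    for child_id in query_matches:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context
```

In pgvector this maps naturally onto the metadata column: store each child chunk with a `parent_id` key and fetch parents in a second query after retrieval.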
Multi-Index RAG
Maintain separate vector indexes for different document types. For a law firm: one index for case law, one for contracts, one for internal procedures. Route the query to the appropriate index based on intent classification before retrieval. This prevents irrelevant cross-category contamination in results.
Hypothetical Document Embedding (HyDE)
Before retrieving, generate a hypothetical ideal answer to the query using the LLM, then use that hypothetical answer as the query for retrieval instead of the original question. This bridges the vocabulary gap between short questions and long document chunks. Particularly effective for technical domains. The trade-off is an additional LLM call per query, adding $0.001–$0.003 per query.
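The HyDE flow is two calls chained together. In this sketch, `generate` and `embed` are placeholders for your LLM and embedding clients — any callables with these shapes work:

```python
def hyde_query(question: str, generate, embed) -> list[float]:
    """HyDE: instead of embedding the user's short question, generate a
    hypothetical passage that would answer it, embed that passage, and
    use the resulting vector for similarity search. A long hypothetical
    answer sits closer in vector space to real document chunks than a
    terse question does."""
    hypothetical = generate(
        f"Write a short passage that would answer this question:\n{question}"
    )
    return embed(hypothetical)
```

The returned vector goes into the same similarity query as before; only the query-side embedding changes.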
Evaluating RAG Quality
You cannot improve what you do not measure. These three metrics form the minimum evaluation framework:
- Answer Relevance: Does the generated answer address the question? Score 0–5. Evaluate by having a human (or a separate LLM call) judge whether the answer is responsive to the question.
- Faithfulness: Are all claims in the answer supported by the retrieved context? Score 0–5. Check for hallucinated facts not present in retrieved chunks.
- Context Recall: Were the right documents retrieved? If you have ground-truth Q&A pairs, check whether the correct source document appears in the top-5 retrieved results.
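Context recall is the easiest of the three to automate yourself. A sketch, assuming each test item pairs a ground-truth document id with the ranked retrieval results:

```python
def recall_at_k(test_set: list[dict], k: int = 5) -> float:
    """Context recall: the fraction of test queries whose ground-truth
    source document appears in the top-k retrieved results. Each item
    maps 'expected' (correct doc id) and 'retrieved' (ranked doc ids)."""
    hits = sum(1 for t in test_set if t["expected"] in t["retrieved"][:k])
    return hits / len(test_set)
```

Answer relevance and faithfulness need an LLM-as-judge (or a human), which is what RAGAS automates.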
The RAGAS framework (available on GitHub and PyPI) automates this evaluation using an LLM-as-judge approach. Run RAGAS on a test set of 50–100 representative queries before deploying to production. We use this to validate every knowledge base we build.
Production Considerations
Document Update Pipeline
RAG systems require ongoing maintenance as your documents change. Build an update pipeline that: detects document changes (webhook from your CMS, scheduled crawl, or file modification timestamps), re-chunks the changed document, re-embeds the new chunks, and upserts to the vector store using the source URL as a unique key to replace old chunks.
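The change-detection step can be as simple as hashing document bodies. A sketch — `stored_hashes` stands in for whatever table holds your per-source ingestion state:

```python
import hashlib

def needs_reembedding(source_url: str, content: str,
                      stored_hashes: dict[str, str]) -> bool:
    """Change detection for the update pipeline: hash the document body and
    compare against the hash recorded at last ingestion. Only a changed
    document gets re-chunked, re-embedded, and upserted (keyed on
    source_url so stale chunks are replaced, not duplicated)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if stored_hashes.get(source_url) == digest:
        return False  # unchanged: skip the embedding cost entirely
    stored_hashes[source_url] = digest
    return True
```

This also doubles as a cost control: a nightly crawl that re-embeds only changed pages costs a fraction of re-embedding everything.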
Embedding Cost Optimization
Batch embedding calls: instead of embedding one chunk at a time, group many chunks per request, and use OpenAI's Batch API for bulk jobs — it processes asynchronous requests at a 50% discount. For a 10,000-chunk knowledge base at roughly 300 tokens per chunk (~3M tokens), that cuts the initial embedding cost from about $0.06 to $0.03 with text-embedding-3-small. Trivial on its own, but the savings compound at scale with frequent re-embedding.
Caching
Cache embedding results for common queries. If 30% of your queries are variations of the same 10 questions, caching those embeddings and their retrieval results eliminates repeated computation. Use Redis with a 1-hour TTL for dynamic content or 24-hour TTL for stable knowledge bases.
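The caching pattern, sketched as an in-memory TTL cache. In production you would back this with Redis as suggested above; the interface stays the same:

```python
import time

class QueryCache:
    """Minimal in-memory TTL cache for query embeddings and retrieval
    results. A production version would swap the dict for Redis with
    the same get/set semantics."""

    def __init__(self, ttl_seconds: float = 3600.0):  # 1h TTL default
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, query: str):
        entry = self.store.get(query)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self.store[query]  # expired: evict and miss
            return None
        return value

    def set(self, query: str, value) -> None:
        self.store[query] = (time.monotonic() + self.ttl, value)
```

Normalize queries (lowercase, strip whitespace) before using them as cache keys, or trivially different phrasings of the same question will all miss.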
Common RAG Failure Modes
- Chunk size too large. 2,000-token chunks containing multiple topics contaminate each other during retrieval. The model receives context about Topic A when the user asked about Topic B, and the answer blends both incorrectly.
- No metadata filtering. Without filtering, a query about "pricing" retrieves pricing information from the wrong product line or the wrong time period. Every knowledge base needs document-level metadata.
- No re-ranking. Basic similarity search returns results ordered by vector similarity, not answer relevance. A chunk that is vectorially similar but not actually useful gets ranked above a chunk that is slightly less similar but directly answers the question. Re-ranking fixes this.
- Outdated documents in the knowledge base. Old pricing, superseded policies, and discontinued products all produce incorrect answers that are impossible to detect without source citations. Build the update pipeline before you need it, not after your first "why did the AI tell a customer our 2024 price?" incident.
- No evaluation before deployment. Deploying a RAG system without running RAGAS or equivalent evaluation means you do not know your baseline quality. You cannot measure improvement or detect regression. Always evaluate on a held-out test set before and after changes.
Building Your First RAG System
Start with Supabase pgvector and text-embedding-3-small. For document ingestion, begin with your top 50 support articles or FAQ pages — clean them manually, chunk semantically, and embed. Build the retrieval function. Write the generation prompt. Test with 20 real questions before anything else.
The first version will not be perfect. The chunking will be imperfect, some retrievals will miss, and the generation prompt will need iteration. That is normal. Run RAGAS evaluation, identify the failure modes, and fix the highest-impact issues first.
For context on how RAG fits into broader AI integration strategies, read our RAG explained: how to connect AI to your business data. For implementation as part of a managed deployment, see our AI integration services.