Building a RAG System That Doesn't Hallucinate

A practical guide to retrieval-augmented generation with quality guardrails — what works, what doesn't, and how to know the difference.

August 15, 2024 · 8 min read · Ashkan Kardan

Most RAG tutorials show you how to stuff documents into a vector database and call it a day. That works great in a demo. It falls apart in production.

The problem isn't the retrieval mechanism itself; it's everything around it: chunking strategy, embedding model choice, re-ranking, prompt design, and what to do when retrieval fails. Get any one of these wrong and your system confidently answers questions with made-up information.

Here's how to build one that actually works.

The Three Failure Modes You Need to Understand

Before writing a single line of code, understand where RAG systems break:

1. Retrieval misses — The right chunk exists in your knowledge base, but the semantic search doesn't find it. This happens when your query and the relevant document use different vocabulary, or when the answer is spread across multiple chunks.

2. Context flooding — You retrieve too many chunks, the relevant information gets buried in a long context window, and the model ignores it. More context isn't always better.

3. Faithful but wrong — The model accurately reports what's in the retrieved context — but the retrieved context was the wrong thing. The model never says "I don't know."

The core principle

A RAG system can only be as good as your retrieval. Improving your retrieval strategy almost always beats improving your prompt.

Chunking: The Decision That Affects Everything Downstream

Document chunking is where most RAG systems go wrong first. Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is easy to implement but semantically blind — it splits sentences and cuts concepts in half.
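To see the problem concretely, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words as a stand-in for real tokens, and the tiny window (8 tokens, 2 overlapping) is chosen only to make the effect visible; `fixed_size_chunks` is an illustrative helper, not a library function.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


text = (
    "Refunds are issued within 30 days. "
    "Contact support to start a return. "
    "Shipping fees are not refundable."
)
tokens = text.split()  # whitespace "tokens" stand in for a real tokenizer
chunks = [" ".join(c) for c in fixed_size_chunks(tokens, size=8, overlap=2)]
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends with "Contact support", cut off from the rest of its sentence. The overlap repeats those two words at the start of the next chunk, but neither chunk knows where a sentence begins or ends.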

Better approaches:

Semantic chunking: Split at natural boundaries (paragraphs, sections, sentences) rather than at fixed token or character counts. LangChain's experimental SemanticChunker and LlamaIndex's node parsers both support this; SemanticChunker goes a step further and places breaks where the embedding similarity between adjacent sentences drops.

Hierarchical chunking: Maintain both small chunks (for precise retrieval) and larger parent chunks (for context). Retrieve the small chunk, but pass the parent to the model. This is called the "parent document retriever" pattern.

Proposition-based chunking: Extract atomic factual claims from your documents and index each claim as its own chunk. This is more expensive to build, but it can substantially improve retrieval precision for question-answering tasks.

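# Older LangChain releases expose this class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try paragraph breaks first, then line breaks, then sentence punctuation,
# and fall back to spaces or single characters only when nothing else fits.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # measured in characters by default, not tokens
    chunk_overlap=100,  # characters shared between neighboring chunks
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
# `docs` is a list of LangChain Document objects loaded elsewhere.
chunks = splitter.split_documents(docs)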
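The parent document retriever pattern described above can be sketched without any framework. This is an illustration under stated assumptions, not LangChain's implementation: paragraphs act as parents, sentences act as children, and a toy word-overlap score stands in for embedding similarity. The names `Child`, `build_index`, and `retrieve_parent` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int


def build_index(document):
    """Split into paragraph-sized parents and sentence-sized children."""
    parents = [p.strip() for p in document.split("\n\n") if p.strip()]
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent):
            if sentence:
                children.append(Child(sentence, parent_id))
    return parents, children


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_parent(query, parents, children):
    """Match against the small chunks, return the larger parent for context."""
    # Toy scorer: word overlap stands in for embedding similarity.
    best = max(children, key=lambda c: len(words(query) & words(c.text)))
    return parents[best.parent_id]


document = (
    "Refunds are issued within 30 days. Contact support to start a return.\n\n"
    "Standard shipping takes five business days. Express shipping takes two."
)
parents, children = build_index(document)
context = retrieve_parent("how long does express shipping take", parents, children)
print(context)
```

The query matches the short sentence about express shipping, but the model receives the whole shipping paragraph, including the standard shipping time it would need for a comparison.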

Re-ranking: The Step Most People Skip

Vector similarity search is fast but imprecise. The top-10 results by cosine similarity are not necessarily the top-10 by relevance to your query.

Cross-encoder re-rankers fix this. They take your query and each retrieved chunk as a pair and score them together — much more accurate than vector similarity, but too slow to run on the full corpus.

The pattern is: retrieve 20–50 candidates with vector search, then re-rank with a cross-encoder, then pass the top 4–6 to the model.

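from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Stage 1: cast a wide net with vector search.
# `vectorstore` and `query` are assumed to be defined elsewhere.
candidates = vectorstore.similarity_search(query, k=20)

# Stage 2: score each (query, chunk) pair jointly with the cross-encoder.
pairs = [(query, chunk.page_content) for chunk in candidates]
scores = reranker.predict(pairs)

# Keep the five highest-scoring chunks for the prompt.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]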

Teaching the Model to Say "I Don't Know"

This is the hardest part. By default, language models will generate a plausible-sounding answer even when the retrieved context doesn't support it.

Two techniques help:

Explicit instruction in the system prompt:

Answer the question based ONLY on the context provided below.
If the context does not contain enough information to answer the question,
say "I don't have enough information to answer that accurately" —
do NOT make up or infer an answer.

Grounding verification — After generation, run a second LLM call that checks whether each claim in the response is supported by the retrieved context. Flag or filter responses that contain unsupported claims.
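Both techniques can be wired together roughly as follows. Treat this as a sketch under assumptions: `build_prompt`, `split_claims`, and `unsupported_claims` are hypothetical helpers, claims are approximated as sentences, and `is_supported` is a hook where the second LLM call would go. The `substring_judge` used in the demo is a deliberately crude stand-in so the example runs without a model.

```python
import re


def build_prompt(instructions, chunks, question):
    """Number each chunk so claims can be traced back to a source."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"


def split_claims(answer):
    """Naive claim splitter: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def unsupported_claims(answer, chunks, is_supported):
    """Return the claims the judge could not ground in the context.

    `is_supported(claim, context) -> bool` is a hypothetical hook; in a real
    system it would wrap the second LLM call.
    """
    context = "\n\n".join(chunks)
    return [c for c in split_claims(answer) if not is_supported(c, context)]


def substring_judge(claim, context):
    """Stand-in judge for the demo: exact-substring check, not an LLM."""
    return claim.rstrip(".!?").lower() in context.lower()


chunks = ["Refunds are issued within 30 days of delivery."]
prompt = build_prompt(
    "Answer based ONLY on the context.", chunks, "What is the refund window?"
)
answer = "Refunds are issued within 30 days. Shipping fees are refunded too."
flagged = unsupported_claims(answer, chunks, substring_judge)
print(flagged)
```

Anything that lands in `flagged` can be stripped from the response, or the whole response can be replaced with the "I don't have enough information" fallback. Which one is right depends on how costly a wrong answer is in your application.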

Monitor in production

Log every query, the chunks retrieved for it, and the response. Review the misses weekly. You'll quickly see patterns: common queries your chunking doesn't handle, vocabulary gaps between queries and documents, or whole topic areas that aren't covered.
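One simple way to capture this is a JSON line per request. The sketch below assumes each retrieved chunk is a dict with `id`, `score`, and `text` keys; those field names and the `log_interaction` helper are illustrative, not a standard schema.

```python
import io
import json
import time
import uuid


def log_interaction(sink, query, chunks, response):
    """Append one JSON line per request to a file-like `sink`."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        # Keep ids and scores so a miss can be traced back to the retriever.
        "chunks": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "response": response,
    }
    sink.write(json.dumps(record) + "\n")
    return record


sink = io.StringIO()  # in production this would be a log file or a queue
log_interaction(
    sink,
    query="What is the refund window?",
    chunks=[{"id": "policy-3", "score": 0.82, "text": "Refunds are issued within 30 days."}],
    response="Refunds are issued within 30 days.",
)
logged = json.loads(sink.getvalue())
print(logged["query"], logged["chunks"])
```

Storing chunk ids and scores rather than full text keeps the log small. During the weekly review you can join the ids back against the index to see exactly what the model was shown.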

The Production Checklist

Before shipping a RAG system:

[ ] Evaluate retrieval quality separately from generation quality
[ ] Test with adversarial queries that should return "I don't know"
[ ] Measure hallucination rate on a holdout set
[ ] Set up query logging and response feedback collection
[ ] Define a clear escalation path for failed retrievals
[ ] Monitor embedding API costs and latency
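For the first checklist item, retrieval can be scored on its own with a small labeled set of queries and the chunk ids that answer them. The sketch below computes hit rate at k and mean reciprocal rank; the `retrieval_metrics` helper, the chunk ids, and the sample data are made up for illustration.

```python
def retrieval_metrics(results, relevant, k=5):
    """Compute hit rate@k and mean reciprocal rank over labeled queries.

    results:  {query: [chunk_id, ...]} ranked best first
    relevant: {query: set of chunk ids that answer the query}
    """
    hits, reciprocal_ranks = 0, []
    for query, ranked_ids in results.items():
        # Rank (1-based) of the first relevant chunk within the top k, if any.
        rank = next(
            (i for i, cid in enumerate(ranked_ids[:k], start=1)
             if cid in relevant[query]),
            None,
        )
        hits += rank is not None
        reciprocal_ranks.append(1 / rank if rank else 0.0)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}


results = {
    "refund window": ["c7", "c2", "c9"],
    "express shipping time": ["c4", "c1", "c3"],
    "warranty length": ["c5", "c6", "c8"],
}
relevant = {
    "refund window": {"c2"},
    "express shipping time": {"c4"},
    "warranty length": {"c0"},
}
metrics = retrieval_metrics(results, relevant, k=3)
print(metrics)
```

Here two of the three queries find a relevant chunk in the top 3, and the third is a retrieval miss that no amount of prompt tuning would fix. Tracking these numbers before and after a chunking or re-ranking change shows whether the change actually helped.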

RAG done right is genuinely powerful. RAG done wrong is a liability. The difference is usually in the details above.
