Technical Deep Dive

The RAG Pipeline Playbook: Lessons from Building Production AI

August 5, 2025 · 7 min read · Six One · AI & Engineering

Retrieval-Augmented Generation sounds simple in theory — chunk documents, embed them, retrieve relevant context, generate answers. In practice, every step hides pitfalls. Here's what we've learned building RAG systems that actually work in production.

Every team building with LLMs eventually arrives at the same conclusion: the model alone isn't enough. You need grounding. You need your own data in the loop. And that means building a RAG pipeline. The concept is straightforward — retrieve relevant document chunks, inject them as context, let the LLM generate an answer. But the gap between a demo and a production system is enormous.

Chunking strategy is where most teams go wrong first. Fixed-size chunks (500 tokens, 1000 tokens) are easy to implement but terrible for retrieval quality. A chunk that splits a paragraph mid-thought, or separates a table header from its data, produces garbage results. We've moved to semantic chunking — splitting on natural boundaries like paragraphs, sections, and page breaks — with overlap to preserve context at boundaries.
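A minimal sketch of the idea, using paragraph boundaries and a character budget (a real pipeline would count tokens and also split on section headings and page breaks; the function name and sizes are illustrative, not from any library):

```python
# Semantic chunking sketch: split on blank-line paragraph boundaries,
# pack paragraphs into chunks up to a size budget, and carry the last
# paragraph forward into the next chunk as overlap.

def semantic_chunks(text: str, max_chars: int = 1000) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-1:]        # overlap: repeat the last paragraph
            size = len(current[0])
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The overlap means a thought that straddles a boundary appears whole in at least one chunk, at the cost of some duplicated storage.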

Embedding quality determines everything downstream. We've tested dozens of embedding models and consistently found that domain matters. A general-purpose model might rank a loosely related passage above the exact answer because it optimizes for semantic similarity, not factual relevance. For specialized domains (legal, medical, financial), fine-tuning embeddings on domain-specific data delivers measurable retrieval improvements.
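Claims like "domain matters" only hold up if you measure them. A minimal recall@k harness for comparing embedding models on your own data might look like this; `embed` is an injected stand-in for whatever model you're testing, and the chunk/query shapes are assumptions, not any library's API:

```python
# Compare embedding models on labeled (query, relevant-chunk) pairs.
# `embed(texts)` should return one vector per input text.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall_at_k(embed, chunks, labeled_queries, k=5):
    chunk_vecs = embed([c["text"] for c in chunks])
    hits = 0
    for query, relevant_id in labeled_queries:
        qv = embed([query])[0]
        # Rank all chunks by similarity to the query.
        ranked = sorted(range(len(chunks)),
                        key=lambda i: cosine(qv, chunk_vecs[i]),
                        reverse=True)
        if relevant_id in [chunks[i]["id"] for i in ranked[:k]]:
            hits += 1
    return hits / len(labeled_queries)
```

Run the same labeled set through each candidate model; the one with higher recall@k on your domain wins, regardless of its leaderboard position.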

The retrieval step needs more than vector similarity. Pure semantic search misses keyword-exact matches (like contract clause numbers or product IDs), while keyword search misses semantic intent. Hybrid retrieval — combining dense vector search with sparse BM25 scoring — consistently outperforms either approach alone. We use reciprocal rank fusion to merge results from both retrievers.
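Reciprocal rank fusion itself is only a few lines. A sketch, taking the ranked id lists produced by the dense and sparse retrievers (the `k=60` constant comes from the original RRF formulation):

```python
# Merge ranked result lists from multiple retrievers with RRF:
# each document scores the sum of 1 / (k + rank) across lists.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.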

Context window management is an underappreciated art. Just because a model accepts 200K tokens doesn't mean you should stuff the context window. More context means more noise, longer latency, and higher cost. We've found that carefully selecting 3-5 highly relevant chunks outperforms dumping 50 marginally relevant ones. A reranking step between retrieval and generation — using a cross-encoder model — dramatically improves precision.
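The rerank-then-truncate step is simple once the scorer exists. In this sketch, `cross_encoder_score` is an injected callable standing in for a real cross-encoder (for example, a sentence-transformers `CrossEncoder` predict call); it is not a real API:

```python
# Score every retrieved chunk jointly with the query, then keep only
# the top_n for the generation prompt.

def rerank(query, chunks, cross_encoder_score, top_n=5):
    scored = [(cross_encoder_score(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```

The design point: first-stage retrieval can stay cheap and high-recall (fetch 50 candidates), because the cross-encoder only has to score that shortlist, not the whole corpus.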

Citation and traceability close the trust gap. In every production RAG system we've built, users need to verify answers. That means every generated claim needs to link back to a specific source passage, with page number, section, or timestamp. We accomplish this by asking the model to cite its sources inline and then validating those citations against the retrieved context programmatically. It's extra work, but it's the difference between a toy and a tool.
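A minimal sketch of the programmatic check, assuming the model was prompted to cite sources inline as `[1]`, `[2]`, ... in the order of the retrieved chunks (the bracket convention and the report shape are assumptions for illustration):

```python
# Validate inline citations against the retrieved context: flag
# citations that point at no chunk, and chunks the answer never used.
import re

def validate_citations(answer: str, retrieved_chunks: list[str]) -> dict:
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = {i for i in cited if 1 <= i <= len(retrieved_chunks)}
    return {
        "cited": sorted(cited),
        "invalid": sorted(cited - valid),   # citation with no matching chunk
        "uncited_chunks": [i for i in range(1, len(retrieved_chunks) + 1)
                           if i not in valid],
    }
```

An `invalid` entry is a strong hallucination signal worth surfacing or retrying; a production version would also check that the cited passage actually supports the claim, not just that the index exists.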

Ready to build something similar?

We'd love to hear about your project. Let's discuss how we can deliver the same kind of results for your team.

Start a Project