Building a RAG demo takes an afternoon. Building a RAG system that stays accurate, observable, and maintainable at enterprise scale takes months — and the gap between the two is where most AI projects stall.
Over the past 18 months, we've delivered five production RAG deployments: a compliance knowledge assistant for a financial services firm, a product documentation search for a manufacturing company, an HR policy chatbot, a technical support assistant, and an internal procurement search system. This is what we learned.
The retrieval problem is harder than it looks
In every project, the retrieval step (finding the right chunks to pass to the LLM) turned out to matter more than the choice of LLM. A well-prompted GPT-4o with bad retrieval gives worse answers than a smaller model backed by well-configured retrieval.
The failure modes we saw most often:
- Chunk size too large: the relevant sentence was buried in a 1,000-token chunk that diluted the semantic signal
- No hybrid search: pure vector similarity missed exact-match queries that keyword search handles trivially
- Stale embeddings: the document corpus was updated but the vector index wasn't re-synced, so the assistant returned outdated information confidently
- Missing metadata filters: queries ran across every document type, so retrieval could surface content the user wasn't authorised to access
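To make the hybrid search point concrete, here is a minimal sketch of reciprocal rank fusion, one common way to merge a vector-search ranking with a keyword-search ranking. The chunk IDs and the query are hypothetical, and the constant k=60 is a widely used default rather than a value tuned on any of the projects described here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    rankings: list of lists, each ordered best-first (for example one list
    from vector search and one from keyword search).
    k: smoothing constant; 60 is a common default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more score the higher it ranks in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for an exact-match style query such as "error E-4012".
vector_hits = ["chunk_overview", "chunk_troubleshooting", "chunk_e4012"]
keyword_hits = ["chunk_e4012", "chunk_error_table"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

In this toy example the chunk that keyword search ranks first moves to the top of the fused list even though vector search ranked it last, which is the behaviour pure vector similarity misses.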
Evaluation before deployment — and continuously after
The projects that went well had evaluation frameworks built before the first production prompt was written. The projects that struggled were evaluated informally — someone clicked around the demo and said 'looks good'.
Our evaluation stack for production RAG:
- A golden dataset of 50–200 question-answer pairs built with domain experts, not engineers
- Automated retrieval precision and recall scoring against the golden dataset on every index change
- LLM-as-judge evaluation for response quality (factual accuracy, completeness, hallucination detection)
- A/B testing infrastructure so we could compare retrieval strategies without a full redeploy
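The automated retrieval scoring can be sketched in a few lines. This is an illustrative version only: it assumes each golden example has been annotated with the IDs of the chunks that should be retrieved (the field name relevant_chunk_ids is hypothetical), and the retriever is passed in as a plain function.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved chunk IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_retriever(retrieve, golden_dataset, k=5):
    """Average precision@k and recall@k over a golden dataset."""
    if not golden_dataset:
        return {"precision_at_k": 0.0, "recall_at_k": 0.0}
    precisions, recalls = [], []
    for example in golden_dataset:
        retrieved = retrieve(example["question"])
        relevant = set(example["relevant_chunk_ids"])
        p, r = precision_recall_at_k(retrieved, relevant, k)
        precisions.append(p)
        recalls.append(r)
    n = len(golden_dataset)
    return {"precision_at_k": sum(precisions) / n, "recall_at_k": sum(recalls) / n}


# Hypothetical golden examples and a stub retriever standing in for the index.
golden = [
    {"question": "What is the notice period?", "relevant_chunk_ids": ["hr-12", "hr-13"]},
    {"question": "Who approves purchases over 10k?", "relevant_chunk_ids": ["proc-4"]},
]
stub_results = {
    "What is the notice period?": ["hr-12", "hr-40", "hr-13", "hr-02"],
    "Who approves purchases over 10k?": ["proc-9", "proc-1"],
}

report = evaluate_retriever(lambda q: stub_results[q], golden, k=2)
print(report)
```

Running a script like this on every index change gives a number to compare against the previous run, which is what turns "looks good" into a regression check.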
The document processing step is a product decision
Every enterprise knowledge base has messy documents: scanned PDFs, tables embedded in Word files, PowerPoints with text in image form, HTML pages with navigation and footer noise. How you handle this is a product decision, not just a technical one.
We use Docling as our primary document processing layer — it handles layout-aware PDF extraction, table structure preservation, and image-based text extraction better than most alternatives we've tested. For HTML content, we've built custom extractors that strip navigation and boilerplate before chunking.
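As a rough illustration of the HTML side, the sketch below strips navigation and boilerplate using only the Python standard library. It is not the custom extractor described above; the set of skipped tags is an assumption, and real pages usually need site-specific rules on top of it.

```python
from html.parser import HTMLParser

# Assumed set of tags whose contents count as boilerplate.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = """<html><body>
<nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
<main><h1>Reset procedure</h1><p>Hold the button for 5 seconds.</p></main>
<footer>Copyright Example Corp</footer>
</body></html>"""

clean_text = extract_main_text(page)
print(clean_text)
```

Only the heading and the paragraph inside the main element survive, so the menu links and footer never reach the chunker or the embedding model.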
Chunking strategy matters more than model choice
We've settled on a hybrid chunking approach that picks the strategy by content type: semantic chunking for narrative content, fixed-size chunks with overlap for technical documentation, and table-aware chunking that preserves row and column relationships for structured data. The same model with different chunking strategies can produce answer-quality differences of 20–30% on domain-specific queries.
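Of the three strategies, fixed-size chunking with overlap is the simplest to show. The sketch below is a minimal version that assumes the text has already been split into tokens; whitespace splitting stands in for a real tokenizer, and the sizes are illustrative rather than recommended values.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=40):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace tokens stand in for model tokens in this example.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.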
Observability is non-negotiable
Production RAG without observability is a black box. You can't improve what you can't measure, and you can't debug what you can't trace. Every production deployment we run has:
- Per-query logging of the retrieved chunks, the prompt sent to the LLM, and the response
- Latency tracking across retrieval, prompt assembly, and inference stages separately
- User feedback capture (thumbs up/down minimum, free text when possible)
- Automated alerts for retrieval failures, LLM API errors, and latency spikes
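The first two items can be sketched as a small trace object that wraps each stage of a query. This is an illustrative structure, not a description of any particular tracing product; the class name, field names, and the stub functions in the example are all hypothetical.

```python
import json
import time
from contextlib import contextmanager


class QueryTrace:
    """Collects per-query data: chunks, prompt, response, stage latencies."""

    def __init__(self, query):
        self.record = {"query": query, "latency_ms": {}}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Latency is recorded even if the stage raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.record["latency_ms"][name] = elapsed_ms

    def to_json(self):
        return json.dumps(self.record)


def answer_query(query, retrieve, build_prompt, call_llm):
    """Run the pipeline, timing each stage separately."""
    trace = QueryTrace(query)
    with trace.stage("retrieval"):
        chunks = retrieve(query)
    with trace.stage("prompt_assembly"):
        prompt = build_prompt(query, chunks)
    with trace.stage("inference"):
        response = call_llm(prompt)
    trace.record.update(chunks=chunks, prompt=prompt, response=response)
    return response, trace


# Stub pipeline stages so the example runs without an index or an LLM API.
response, trace = answer_query(
    "How do I reset the device?",
    retrieve=lambda q: ["Hold the button for 5 seconds."],
    build_prompt=lambda q, chunks: "Context: " + " ".join(chunks) + "\nQuestion: " + q,
    call_llm=lambda prompt: "Hold the button for 5 seconds.",
)
print(trace.to_json())
```

Writing one JSON line per query to a log store makes it possible to replay a bad answer later and see whether retrieval, prompt assembly, or the model was at fault.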
What we'd tell clients before starting
RAG is a retrieval and orchestration problem, not an LLM problem. The model is the easy part. Invest in your document processing pipeline, your evaluation framework, and your observability infrastructure — these are what determine whether the system is still working six months after go-live.