ATS Global

Building production RAG systems: what we learned from five enterprise deployments

Published: JUN 15, 2026 | 10 min read | By Abdul Kaleem

Building a RAG demo takes an afternoon. Building a RAG system that stays accurate, observable, and maintainable at enterprise scale takes months — and the gap between the two is where most AI projects stall.

Over the past 18 months, we've delivered five production RAG deployments: a compliance knowledge assistant for a financial services firm, a product documentation search for a manufacturing company, an HR policy chatbot, a technical support assistant, and an internal procurement search system. This is what we learned.

The retrieval problem is harder than it looks

In every project, the retrieval step (finding the right chunks to pass to the LLM) turned out to be more important than the LLM itself. In our deployments, a well-prompted GPT-4o with bad retrieval gave worse answers than a smaller model behind a well-configured retrieval system.

The failure modes we saw most often:

  • Chunk size too large: the relevant sentence was buried in a 1,000-token chunk that diluted the semantic signal
  • No hybrid search: pure vector similarity missed exact-match queries that keyword search handles trivially
  • Stale embeddings: the document corpus was updated but the vector index wasn't re-synced, so the assistant confidently returned outdated information
  • Missing metadata filters: retrieval ran across all document types, so users received chunks from documents they weren't authorised to access
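
To make two of these failure modes concrete, here is a minimal sketch of hybrid search with an access filter. It is an illustration, not the code from our deployments. It assumes you already have one ranked list of chunk ids from a vector index and one from a keyword index, and it merges them with reciprocal rank fusion, which is one common fusion technique among several. The `Chunk` class, the `doc_type` field, and the `hybrid_retrieve` function are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_type: str  # hypothetical metadata field, e.g. "hr_policy"
    text: str


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    so ids ranked highly by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(vector_hits, keyword_hits, chunks, allowed_doc_types, top_n=3):
    """Fuse both rankings, then drop chunks the user is not allowed to see."""
    fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
    permitted = [cid for cid in fused if chunks[cid].doc_type in allowed_doc_types]
    return permitted[:top_n]


# Toy data standing in for real index results (assumed, not from a real system).
chunks = {
    "c1": Chunk("c1", "product_docs", "Torque spec for model X"),
    "c2": Chunk("c2", "hr_policy", "Parental leave policy"),
    "c3": Chunk("c3", "product_docs", "Model X maintenance schedule"),
    "c4": Chunk("c4", "compliance", "Retention rules"),
}
vector_hits = ["c1", "c2", "c3"]   # ranked by embedding similarity
keyword_hits = ["c3", "c1", "c4"]  # ranked by keyword match

result = hybrid_retrieve(vector_hits, keyword_hits, chunks, {"product_docs"})
print(result)
```

In this sketch the filter runs after fusion to keep the example short. In a real system it is usually safer to push the filter into the index query itself, so unauthorised chunks never leave the store.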

Evaluation before deployment — and continuously after

The projects that went well had evaluation frameworks built before the first production prompt was written. The projects that struggled were evaluated informally — someone clicked around the demo and said 'looks good'.

Our evaluation stack for production RAG:

  • A golden dataset of 50–200 question-answer pairs built with domain experts, not engineers
  • Automated retrieval precision and recall scoring against the golden dataset on every index change
  • LLM-as-judge evaluation for response quality (factual accuracy, completeness, hallucination detection)
  • A/B testing infrastructure so we could compare retrieval strategies without a full redeploy
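
As a rough sketch of the automated scoring step, the snippet below computes precision and recall at k against a small golden dataset. The dataset layout (`question`, `relevant_ids`) and the `retrieve` callable are assumptions made for illustration; a real golden dataset would be built with domain experts and the retriever would query your actual index.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_retriever(retrieve, golden, k=5):
    """Average precision@k and recall@k over every golden example."""
    if not golden:
        return {"precision_at_k": 0.0, "recall_at_k": 0.0}
    precisions, recalls = [], []
    for example in golden:
        retrieved = retrieve(example["question"])
        p, r = precision_recall_at_k(retrieved, example["relevant_ids"], k)
        precisions.append(p)
        recalls.append(r)
    return {
        "precision_at_k": sum(precisions) / len(golden),
        "recall_at_k": sum(recalls) / len(golden),
    }


# Hypothetical golden examples: each question lists the chunk ids an expert
# marked as relevant.
golden = [
    {"question": "What is the notice period?", "relevant_ids": {"hr-12", "hr-13"}},
    {"question": "Who approves purchase orders?", "relevant_ids": {"proc-4"}},
]

# Stand-in retriever: a lookup table instead of a real index query.
fake_index = {
    "What is the notice period?": ["hr-12", "hr-40", "hr-13", "hr-02"],
    "Who approves purchase orders?": ["proc-9", "proc-4"],
}

metrics = evaluate_retriever(lambda q: fake_index[q], golden, k=2)
print(metrics)
```

Running a check like this on every index change, and failing the change when the numbers drop below an agreed threshold, is one way to turn the golden dataset into a regression test.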

The document processing step is a product decision

Every enterprise knowledge base has messy documents: scanned PDFs, tables embedded in Word files, PowerPoints with text in image form, HTML pages with navigation and footer noise. How you handle this is a product decision, not just a technical one.

We use Docling as our primary document processing layer — it handles layout-aware PDF extraction, table structure preservation, and image-based text extraction better than most alternatives we've tested. For HTML content, we've built custom extractors that strip navigation and boilerplate before chunking.
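
The HTML side can be illustrated with a small standard-library sketch. This is not our production extractor; it only shows the basic idea of skipping text inside navigation and boilerplate elements before chunking. The set of tags to skip and the names `MainTextExtractor` and `extract_main_text` are assumptions for this example, and real pages often need site-specific rules on top.

```python
from html.parser import HTMLParser

# Elements whose text is treated as boilerplate (an assumed list; tune per site).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped elements we are currently inside
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


page = """<html><body>
<nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<main><h1>Returns policy</h1><p>Items can be returned within 30 days.</p></main>
<footer>Copyright Example Corp</footer>
</body></html>"""

clean = extract_main_text(page)
print(clean)
```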

Chunking strategy matters more than model choice

We've settled on a hybrid chunking approach: semantic chunking for narrative content, fixed-size chunks with overlap for technical documentation, and table-aware chunking that preserves row and column relationships for structured data. In our experience, the same model with different chunking strategies can differ by 20–30% in answer quality on domain-specific queries.
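
Of the three strategies, fixed-size chunking with overlap is the simplest to show. The sketch below is a minimal version that works on a list of tokens. Splitting on whitespace is an assumption made to keep the example dependency-free; in practice you would count tokens with the tokenizer of your embedding model. The function name and the default sizes are illustrative, not recommendations.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=40):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.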

Observability is non-negotiable

Production RAG without observability is a black box. You can't improve what you can't measure, and you can't debug what you can't trace. Every production deployment we run has:

  • Per-query logging of the retrieved chunks, the prompt sent to the LLM, and the response
  • Latency tracking across retrieval, prompt assembly, and inference stages separately
  • User feedback capture (thumbs up/down minimum, free text when possible)
  • Automated alerts for retrieval failures, LLM API errors, and latency spikes
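
A minimal sketch of the first two items, per-query logging and per-stage latency, might look like the following. The `QueryTrace` class and the `retrieve`, `build_prompt`, and `call_llm` callables are hypothetical stand-ins; a real deployment would plug in its own components and ship the record to a logging or tracing backend instead of printing it.

```python
import json
import time
from contextlib import contextmanager


class QueryTrace:
    """Collect the inputs, outputs, and stage timings for one query."""

    def __init__(self, query):
        self.record = {"query": query, "latency_ms": {}}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the stage raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["latency_ms"][name] = round(elapsed_ms, 3)

    def to_json(self):
        return json.dumps(self.record)


def answer(query, retrieve, build_prompt, call_llm):
    """Run the pipeline and time each stage separately."""
    trace = QueryTrace(query)
    with trace.stage("retrieval"):
        chunks = retrieve(query)
    with trace.stage("prompt_assembly"):
        prompt = build_prompt(query, chunks)
    with trace.stage("inference"):
        response = call_llm(prompt)
    trace.record.update(chunks=chunks, prompt=prompt, response=response)
    return response, trace


# Stub components so the example runs without an index or an LLM API.
response, trace = answer(
    "What is the notice period?",
    retrieve=lambda q: ["Notice period is 30 days."],
    build_prompt=lambda q, chunks: "Context: " + " ".join(chunks) + "\nQuestion: " + q,
    call_llm=lambda prompt: "30 days",
)
print(trace.to_json())
```

Keeping the three timings separate is what lets you tell a slow index from a slow model when a latency alert fires.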

What we'd tell clients before starting

RAG is a retrieval and orchestration problem, not an LLM problem. The model is the easy part. Invest in your document processing pipeline, your evaluation framework, and your observability infrastructure — these are what determine whether the system is still working six months after go-live.
