Semantic Search as the Brain of Your LLM App: Building a Production RAG Pipeline

Large language models are only as good as the context you give them. Without reliable retrieval, even the most capable LLM will hallucinate, go stale, or answer from generic training data instead of your domain knowledge. Retrieval-Augmented Generation (RAG) fixes this — but the quality of your RAG system lives or dies on the design of its retrieval layer.

This guide walks through the five engineering decisions that separate a demo-quality RAG prototype from a production-grade AI assistant.


1. Why Retrieval Precision Directly Controls LLM Quality

The LLM in a RAG system is a reader, not a searcher. Its job is to reason over the context you inject into the prompt — and if that context is noisy, off-topic, or missing the key passage, no amount of model sophistication will compensate.

Research consistently shows that retrieval recall and ranking quality are the primary drivers of RAG answer accuracy. Poor retrieval produces two failure modes:

  • Hallucination: The model fills context gaps with confident-sounding invention.
  • Answer drift: Correct documents exist in the index but weren’t retrieved, so the model defaults to parametric (training-time) knowledge.

Every architectural decision downstream is really about maximizing the signal-to-noise ratio of what lands in the prompt window.


2. Chunking: The Often-Underestimated Foundation

Before you can embed or index anything, you need to decide how to split your documents. Chunking strategy has an outsized effect on retrieval quality because embeddings encode meaning at the chunk level — get the boundaries wrong, and you corrupt the semantic signal.

Chunk size tradeoffs:

  • Small chunks (128–256 tokens): Higher embedding precision, easier to isolate specific facts, but risk breaking explanations mid-thought. Works well for FAQ-style or structured data.
  • Large chunks (512–1024 tokens): Better context coherence, but embeddings become diluted across multiple topics, reducing retrieval precision.
  • Sentence-window or hierarchical chunking: A practical middle ground — embed small units for precise retrieval, then expand to surrounding context before injecting into the prompt.

Overlap matters too. A 10–20% token overlap between adjacent chunks prevents boundary artifacts where a key sentence is split across two non-adjacent retrievals. For technical documentation, consider chunking by logical structure (sections, code blocks, headings) rather than token count alone.
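The sliding-window idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production chunker: whitespace splitting stands in for a real tokenizer (e.g. tiktoken), and the 38-token overlap is ~15% of a 256-token chunk, inside the 10–20% range mentioned above.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=38):
    """Split a token list into fixed-size chunks with overlap.

    Adjacent chunks share `overlap` tokens, so a sentence that
    straddles a boundary appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace split stands in for a real tokenizer.
tokens = ("word " * 600).split()
chunks = chunk_tokens(tokens, chunk_size=256, overlap=38)
```

For structure-aware chunking, the same loop would run per section or code block rather than over the whole document at once.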


3. Embedding and Indexing: Choosing the Right Tools

Once your chunks are defined, you need to embed them into a vector space and store them for fast retrieval.

Bi-encoder models (e.g., sentence-transformers/all-mpnet-base-v2, text-embedding-3-large) map both queries and documents into the same dense vector space. They’re fast at inference time because document embeddings are precomputed. For domain-specific corpora — legal, medical, financial — fine-tuning a bi-encoder on in-domain pairs delivers measurable gains over general-purpose models.
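The bi-encoder retrieval step reduces to cosine similarity between a query vector and precomputed document vectors. The sketch below uses toy 3-dimensional vectors (real embeddings are 768+ dimensions) and hypothetical document names, just to make the scoring explicit:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Document embeddings are precomputed once at indexing time.
doc_embeddings = {
    "refund-policy": [0.9, 0.1, 0.0],
    "api-reference": [0.1, 0.9, 0.2],
}

# The query is embedded at request time by the same bi-encoder,
# so both live in one shared vector space.
query = [0.8, 0.2, 0.1]

ranked = sorted(doc_embeddings,
                key=lambda d: cosine(query, doc_embeddings[d]),
                reverse=True)
```

Because only the query needs embedding at request time, this stage stays fast no matter how large the corpus is; the expensive work moved to indexing.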

Index algorithm: HNSW

Hierarchical Navigable Small World (HNSW) graphs are the de facto standard for approximate nearest neighbor (ANN) search. Libraries like FAISS, Qdrant, and Weaviate all support HNSW and offer sub-millisecond query times at million-document scale. Key parameters to tune:

  • M (connections per node): Higher values improve recall but increase memory.
  • ef_construction: Controls index build quality vs. build time.
  • ef_search: Query-time recall/latency tradeoff — increase it in production if recall is more critical than P99 latency.

Metadata filtering is often the difference between a generic and a truly useful retrieval system. Tag every chunk with structured metadata (author, date, document type, product version, access tier) and apply pre-filters before the ANN search. This narrows the search space to relevant partitions and keeps results domain-coherent without sacrificing speed.
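A minimal sketch of pre-filtering, with brute-force cosine scoring standing in for the HNSW index and made-up metadata fields (`doc_type`, `version`) for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

chunks = [
    {"id": 1, "vec": [0.9, 0.1], "meta": {"doc_type": "guide", "version": "v2"}},
    {"id": 2, "vec": [0.95, 0.05], "meta": {"doc_type": "guide", "version": "v1"}},
    {"id": 3, "vec": [0.2, 0.9], "meta": {"doc_type": "faq", "version": "v2"}},
]

def search(query_vec, filters, top_k=5):
    # Pre-filter: keep only chunks whose metadata matches every
    # filter, then score the survivors (brute force here; a
    # filtered ANN search in production).
    candidates = [c for c in chunks
                  if all(c["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in candidates[:top_k]]

hits = search([1.0, 0.0], {"version": "v2"})
```

Note the order: the filter runs before scoring, so an outdated `v1` chunk can never outrank a `v2` chunk no matter how similar its vector is.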


4. The Two-Stage Retrieve-Then-Rerank Pipeline

A single-stage vector search is fast but imprecise. Production RAG systems use a two-stage pipeline:

Stage 1 — Fast recall (bi-encoder): Retrieve the top-K candidates (typically 20–50) using your ANN index. Speed is the priority here.

Stage 2 — Precision reranking (cross-encoder): Pass the query and each candidate chunk together through a cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2 or Cohere Rerank). Cross-encoders perform full attention over the query-document pair, producing a relevance score far more accurate than cosine similarity. You then select the top-N reranked results (typically 3–10) for the prompt.

The performance gains are significant: reranking consistently improves NDCG@10 by 10–25% in benchmark evaluations, and qualitative answer quality improvements in user-facing systems are even more pronounced.

Implementation tip: If cross-encoder latency is a concern, run reranking asynchronously or cache reranked results for frequent queries.
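The two stages can be sketched end to end with stub scorers: a dict of precomputed similarities stands in for the bi-encoder + ANN lookup, and another dict stands in for cross-encoder relevance scores. The shapes and flow are the point, not the scoring:

```python
def retrieve_candidates(query, ann_index, top_k=20):
    """Stage 1 -- fast recall. `ann_index` maps doc id to a
    similarity score (a stub for a real bi-encoder + HNSW lookup)."""
    ranked = sorted(ann_index.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

def rerank(query, doc_ids, cross_scores, top_n=3):
    """Stage 2 -- precision. `cross_scores` stands in for a
    cross-encoder attending over each (query, document) pair."""
    return sorted(doc_ids, key=lambda d: cross_scores[d], reverse=True)[:top_n]

ann_scores = {"a": 0.91, "b": 0.90, "c": 0.88, "d": 0.40}
cross_scores = {"a": 0.2, "b": 0.95, "c": 0.7, "d": 0.99}

candidates = retrieve_candidates("q", ann_scores, top_k=3)
final = rerank("q", candidates, cross_scores, top_n=2)
```

One consequence worth noticing: document "d" has the highest cross-encoder score but never reaches stage 2 because stage 1 missed it, which is exactly why top-K in stage 1 should be generous.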


5. Knowledge Freshness: Keeping Your Index in Sync

A RAG system that can’t reflect recent information is a liability in production. Stale retrieval is one of the most common real-world failure modes.

Strategies for fresh indexes:

  • Streaming ingestion pipelines: Connect document sources (Confluence, Notion, S3, databases) to ingestion workers that trigger re-embedding and upsert on every change event. Tools like Kafka, Pub/Sub, or webhook-driven pipelines work well here.
  • Document versioning and soft deletes: Never hard-delete outdated chunks immediately. Maintain version metadata and filter by is_latest=true at query time. This lets you roll back if an ingestion job produces bad embeddings.
  • TTL-based cache invalidation: If you cache reranked results for popular queries, set conservative TTLs (minutes, not hours) for time-sensitive domains, or invalidate cache entries when source documents change.
  • Incremental re-indexing: For large corpora, avoid full re-index jobs. Track document content hashes and only re-embed chunks whose source content has changed.
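The content-hash approach in the last bullet can be sketched with the standard library. The source and index stores here are plain dicts for illustration; in a real pipeline the hashes would live alongside the vectors in your index metadata:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_docs(sources: dict, indexed_hashes: dict) -> list:
    """Return ids of documents that are new or whose content changed
    since the last run; only these need re-embedding and upserting."""
    return [doc_id for doc_id, text in sources.items()
            if indexed_hashes.get(doc_id) != content_hash(text)]

# Hashes recorded at the previous indexing run.
indexed = {"doc1": content_hash("old policy"),
           "doc2": content_hash("api guide")}

# Current state of the source-of-truth documents.
sources = {"doc1": "new policy",      # changed -> re-embed
           "doc2": "api guide",       # unchanged -> skip
           "doc3": "release notes"}   # new -> embed

to_reembed = stale_docs(sources, indexed)
```

Hashing is orders of magnitude cheaper than embedding, so this check makes frequent sync runs affordable even on large corpora.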

Putting It Together

A production RAG pipeline is a system with multiple interdependent moving parts — each layer amplifying or degrading the quality of the next. The mental model to keep:

  • Chunk for semantic coherence.
  • Embed with a domain-appropriate model.
  • Index with HNSW and metadata filters.
  • Retrieve broadly, then rerank precisely.
  • Keep the index fresh with event-driven updates.

Get these five layers right, and your LLM stops hallucinating and starts answering from ground truth. That’s the difference between a clever demo and a reliable AI system your users can trust.
