Build a Semantic Search Engine from Scratch: A Step-by-Step Python Guide

Keyword search is fast, simple, and well-understood — but it fails the moment a user types something your documents don’t literally contain. Search for “cardiac arrest” in a corpus full of “heart attack” and you’ll get nothing back. Semantic search fixes this by working in the space of meaning rather than tokens, unlocking retrieval that actually understands intent.

In this guide you’ll build a production-ready semantic search stack in Python, end-to-end: chunking, embedding, vector indexing, hybrid retrieval, and cross-encoder re-ranking — all wrapped in a deployable FastAPI service.


1. Why Keyword Search Falls Short

BM25, the backbone of Elasticsearch and most search engines, scores documents by term-frequency statistics. It has no awareness that “automobile” and “car” are synonyms, or that “cheap flight” and “budget airline ticket” describe the same intent. The result: high precision for exact matches, poor recall for everything else.
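
To make the lexical gap concrete, here is a minimal, self-contained sketch of the classic BM25 scoring formula on a toy two-document corpus (illustrative only; real engines like Elasticsearch use tuned, inverted-index implementations):

```python
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one document against a query with the classic BM25 formula."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    n = len(corpus)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]                                    # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["heart", "attack", "symptoms"], ["sprained", "ankle", "care"]]
print(bm25_score(["cardiac", "arrest"], corpus[0], corpus))  # 0.0 (no token overlap)
print(bm25_score(["heart", "attack"], corpus[0], corpus))    # positive score
```

The synonym query contributes zero to every term's score because the term frequency f is zero, which is exactly the failure mode semantic search addresses.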

Semantic search encodes both queries and documents as dense vectors in a shared embedding space. Documents semantically close to a query land nearby, regardless of surface wording. The practical upside: dramatically better recall, better multilingual support, and results that actually reflect what users mean.


2. Document Ingestion: Chunking and Embedding

Large documents must be split before embedding — most models cap at 512–8192 tokens, and smaller, focused chunks produce sharper embeddings.

Three chunking strategies to know:

  • Fixed-size chunking — split every N tokens with an overlap window. Fast, predictable, but can cut mid-sentence.
  • Sentence-aware chunking — use nltk or spaCy to split on sentence boundaries, then group sentences until a token budget is hit. Preserves semantic units.
  • Recursive character splitting — LangChain’s RecursiveCharacterTextSplitter tries paragraph → sentence → word splits in order. Best general-purpose choice.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
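
The fixed-size strategy needs no dependencies. A minimal sketch over a pre-tokenized list, with illustrative window and overlap values:

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` tokens per step."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = "the quick brown fox jumps over the lazy dog again".split()
print(fixed_size_chunks(tokens, size=4, overlap=1))
# [['the', 'quick', 'brown', 'fox'], ['fox', 'jumps', 'over', 'the'], ['the', 'lazy', 'dog', 'again']]
```

Note how each chunk repeats the last token of its predecessor; the overlap window is what keeps a sentence cut at a boundary from losing its context entirely.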

Generating Embeddings

For open-source embeddings, sentence-transformers is the go-to library. BAAI/bge-large-en-v1.5 and intfloat/e5-large-v2 top the MTEB leaderboard and run comfortably on a single GPU.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(chunks, batch_size=64, normalize_embeddings=True)
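
The normalize_embeddings=True flag matters: with unit-length vectors, cosine similarity reduces to a plain dot product, which is the cheap operation vector databases optimize for. A toy illustration with hand-picked 2-D vectors:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([0.6, 0.8])  # already unit-length
b = np.array([0.8, 0.6])
print(round(cosine(a, b), 6))            # 0.96
print(round(float(np.dot(a, b)), 6))     # 0.96 (identical, since both vectors are normalized)
```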

For hosted embeddings, OpenAI’s text-embedding-3-large (3072 dimensions, reducible via MRL) offers strong performance with zero infrastructure:

from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=chunks,
    dimensions=1536  # reduce for cost savings
)
vectors = [r.embedding for r in response.data]

3. Indexing into a Vector Database

Once you have vectors, you need a store that supports ANN (approximate nearest-neighbor) search. Qdrant is an excellent choice — it’s open-source, Docker-friendly, and supports payload filtering out of the box.

docker run -p 6333:6333 qdrant/qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)

# Vector size must match your embedding model:
# 1024 for bge-large-en-v1.5, 1536 if you use the reduced OpenAI embeddings above.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

points = [
    PointStruct(
        id=i,
        vector=vectors[i].tolist(),
        payload={"text": chunks[i], "source": "manual.pdf", "page": i // 5}
    )
    for i in range(len(chunks))
]
client.upsert(collection_name="docs", points=points)

Payload metadata enables filtered search — restrict results to a specific document, date range, or category before the ANN step runs:

from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="docs",
    query_vector=query_vec,
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="manual.pdf"))]
    ),
    limit=20
)

4. Hybrid Retrieval: BM25 + Dense with Reciprocal Rank Fusion

Dense retrieval excels at semantic similarity; BM25 excels at exact keyword matches (product codes, names, acronyms). Hybrid retrieval runs both in parallel and merges results with Reciprocal Rank Fusion (RRF).

RRF formula for a document d across ranked lists: RRF(d) = Σ 1 / (k + rank(d)), where k=60 is a smoothing constant.
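
A quick sanity check of the formula: a document ranked 1st by the dense retriever and 3rd by BM25 scores:

```python
# RRF(d) = 1/(k + 1) + 1/(k + 3) with k = 60
score = 1 / (60 + 1) + 1 / (60 + 3)
print(round(score, 4))  # 0.0323
```

Because ranks enter the denominator, RRF rewards documents that appear near the top of *any* list without ever comparing the retrievers' raw (incompatible) score scales.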

from rank_bm25 import BM25Okapi
from collections import defaultdict

# Build BM25 index
tokenized = [chunk.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized)

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Query both retrievers
query = "heart attack symptoms"
query_vec = model.encode([query], normalize_embeddings=True)[0]

dense_hits = [str(r.id) for r in client.search("docs", query_vec, limit=20)]
bm25_scores = bm25.get_scores(query.lower().split())
bm25_hits = [str(i) for i in bm25_scores.argsort()[::-1][:20]]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
top_candidates = fused[:20]  # pass to re-ranker

5. Re-ranking with a Cross-Encoder

Bi-encoder retrieval (query and document encoded separately) trades precision for speed. For your top-20 candidates, a cross-encoder processes the query and each document jointly, producing far more accurate relevance scores — at a cost acceptable for a small candidate set.

Options:

  • Cohere Rerank API — hosted, production-grade, single API call
  • BAAI/bge-reranker-large — open-source, runs locally

# Option A: Cohere Rerank (hosted)
import cohere
co = cohere.Client("YOUR_API_KEY")

candidate_texts = [chunks[int(i)] for i in top_candidates]
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidate_texts,
    top_n=5
)
final_results = [candidate_texts[r.index] for r in reranked.results]

# Option B: BGE Reranker (local)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large")
scores = reranker.predict([(query, t) for t in candidate_texts])
final_results = [t for _, t in sorted(zip(scores, candidate_texts), reverse=True)][:5]


6. Wrapping It in FastAPI

Tie everything together in a deployable API:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/search")
def search(req: SearchRequest):
    query_vec = model.encode([req.query], normalize_embeddings=True)[0]
    dense_hits = [str(r.id) for r in client.search("docs", query_vec, limit=20)]
    bm25_hits = [str(i) for i in bm25.get_scores(req.query.lower().split()).argsort()[::-1][:20]]
    fused = reciprocal_rank_fusion([dense_hits, bm25_hits])[:20]
    candidates = [chunks[int(i)] for i in fused]
    scores = reranker.predict([(req.query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)[:req.top_k]
    return {"results": [text for _, text in ranked]}

Run with uvicorn app:app --reload and test at http://localhost:8000/docs.
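
A quick smoke test from the command line (payload shape matches the SearchRequest model above):

```shell
curl -s -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "heart attack symptoms", "top_k": 3}'
```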


Production Checklist

Before going live, verify:

  • Embedding model pinned — embedding model updates break index compatibility; version-lock your model.
  • Async upserts — batch index updates off the request path using a queue (Celery, ARQ).
  • Metadata filtering first — apply Qdrant filters before ANN to shrink the search space.
  • Monitor latency per stage — retrieval, re-rank, and total P95; set alerts on each.
  • Cache hot queries — Redis TTL cache on (query_hash, filters) cuts repeat latency to milliseconds.
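
The caching idea in the last bullet can be sketched without Redis. A minimal in-process stand-in keyed on a hash of (query, filters); in production you would swap in redis-py with SETEX/GET and the same key scheme:

```python
import hashlib
import json
import time

class QueryCache:
    """In-process TTL cache keyed on (query, filters); illustrative stand-in for Redis."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[object, float]] = {}

    def _key(self, query: str, filters: dict) -> str:
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically
        raw = json.dumps({"q": query, "f": filters}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, query: str, filters: dict):
        entry = self._store.get(self._key(query, filters))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired

    def put(self, query: str, filters: dict, results) -> None:
        self._store[self._key(query, filters)] = (results, time.monotonic())

cache = QueryCache(ttl_seconds=60)
cache.put("heart attack symptoms", {"source": "manual.pdf"}, ["chunk-1", "chunk-7"])
print(cache.get("heart attack symptoms", {"source": "manual.pdf"}))  # ['chunk-1', 'chunk-7']
print(cache.get("heart attack symptoms", {}))                        # None (different filters)
```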

Summary

You now have a complete semantic search pipeline:

  • Chunk documents with recursive splitting
  • Embed with bge-large or text-embedding-3-large
  • Index into Qdrant with rich metadata payloads
  • Retrieve with hybrid BM25 + dense via RRF
  • Re-rank top candidates with a cross-encoder
  • Serve via FastAPI

This architecture handles the full spectrum from exact keyword hits to deep semantic matches — giving your users results that feel genuinely intelligent rather than just lexically lucky.
