Re-ranking: Why do we do it after the initial search?
Re-ranking means taking the top-K candidates returned by a fast ANN search in your vector database and scoring them again with a slower, more accurate model (often a cross-encoder). You get better precision on the final list without running the expensive model on the entire collection. It is a standard two-stage pattern in production RAG, search, and recommendation systems.
Summary
- Initial search: cheap bi-encoder + cosine in VDB → top 50–200 candidates.
- Re-rank: cross-encoder (query + each candidate) → relevance score; too slow for full collection, fine on 50–200.
- Pipeline: VDB top 100 → re-ranker scores those 100 → return top 10; improves relevance for RAG, search, and recommendation; helps when ANN recall is imperfect and when merging hybrid results.
- Re-ranking improves precision (order of top results), not recall—recall is capped by the initial retrieval.
- Typical candidate counts: 50–200; tune vs. latency and quality; many sentence-transformers and dedicated reranker models exist.
The two-stage pipeline
The initial search uses cheap vector similarity (e.g. bi-encoder embeddings + cosine). That’s fast but can miss nuance or rank some items poorly. A re-ranker takes the query and each candidate (e.g. query + passage) and outputs a relevance score; it’s too slow to run on millions of items but fine on 50–200 candidates. So the pipeline is: VDB returns top 100 → re-ranker scores those 100 → return top 10.
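The pipeline above can be sketched end to end in a few lines. This is a minimal pure-Python/numpy sketch with toy stand-ins for both models: `embed` and `cross_score` are illustrative placeholders, not a real bi-encoder or cross-encoder API, and the "VDB" is just a brute-force cosine scan.

```python
import numpy as np

# Toy stand-ins (assumptions, not a real API):
# - embed() plays the bi-encoder: one vector per text, compared with cosine.
# - cross_score() plays the cross-encoder: scores (query, doc) pairs jointly.
rng = np.random.default_rng(0)
VOCAB: dict[str, np.ndarray] = {}

def embed(text: str) -> np.ndarray:
    """Bi-encoder stand-in: sum of fixed random word vectors, normalized."""
    vec = np.zeros(64)
    for word in text.lower().split():
        if word not in VOCAB:
            VOCAB[word] = rng.standard_normal(64)
        vec += VOCAB[word]
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cross_score(query: str, doc: str) -> float:
    """Cross-encoder stand-in: word overlap, seeing query and doc together."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def search(query: str, docs: list[str], candidates: int = 100, k: int = 10) -> list[str]:
    # Stage 1: cheap vector similarity over the whole collection.
    q = embed(query)
    sims = [float(q @ embed(d)) for d in docs]
    top = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:candidates]
    # Stage 2: expensive scoring on the small candidate set only.
    reranked = sorted(top, key=lambda i: cross_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:k]]

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
print(search("cat on a mat", docs, candidates=3, k=2))
```

In production, stage 1 would be an ANN query against the VDB and stage 2 a real cross-encoder; the structure, scan many cheaply then score few expensively, stays the same.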
That improves relevance for RAG, search, and recommendation without scanning the full collection with the expensive model. Pipeline tip: use the same bi-encoder at index time and query time so the VDB returns a sensible candidate set; the re-ranker then refines only that set. See the lifecycle of a vector query for where re-ranking fits in the end-to-end flow.
Why not use the re-ranker for everything?
Cross-encoders and similar re-rankers process query and document together; they’re accurate but O(n) in the number of documents. Running them on the whole collection would make every query far too slow. So we use the VDB for fast ANN recall (sublinear), then apply the expensive model only on a small candidate set.
Re-ranking also helps when the initial retrieval is approximate (ANN trades recall for latency; see the trade-off between recall and latency in HNSW) or when you combine vector and keyword results and need a single ordering. The candidate count is the knob: more candidates give the re-ranker a better chance of containing the true top-k but raise latency; fewer are faster but risk missing good items.
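One common way to merge vector and keyword result lists into a single ordering before re-ranking is reciprocal rank fusion (RRF). A minimal sketch (doc IDs are toy data; k=60 is the conventional RRF constant):

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank)
    over every result list it appears in, then sort by that score."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["d1", "d2", "d3", "d4"]   # from the VDB
keyword_hits = ["d3", "d1", "d5"]          # from BM25/keyword search
print(rrf_merge([vector_hits, keyword_hits]))
# docs ranked highly in both lists (d1, d3) float to the top
```

The fused list is then a reasonable candidate set to feed the re-ranker, which imposes the final order.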
When to add re-ranking
It’s a standard pattern: fast recall with the VDB, then expensive precision with a small re-ranking step. Add re-ranking when bi-encoder-only ranking isn’t good enough or when you merge hybrid results and want one consistent order. You can tune how many candidates to pass to the re-ranker (e.g. 100 or 200) vs. final k (e.g. 10) to balance latency and quality.
Practical tips: use a dedicated reranker model (e.g. cross-encoder trained on relevance data) rather than a general embedder; cache re-ranker results when the same query is repeated; consider running the re-ranker on a GPU for throughput. Re-ranking does not add new items—it only reorders the candidate set—so recall is capped by the initial retrieval; to improve recall, improve the initial retrieval (e.g. higher k from VDB, better embeddings, or hybrid search).
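The caching tip can be as simple as memoizing scores per (query, document) pair. A sketch, where `score_pair` is a hypothetical stand-in for the expensive cross-encoder call:

```python
from functools import lru_cache

CALLS = 0  # counts how often the "expensive" model actually runs

@lru_cache(maxsize=100_000)
def score_pair(query: str, doc: str) -> float:
    """Hypothetical stand-in for an expensive cross-encoder call."""
    global CALLS
    CALLS += 1
    return float(len(set(query.split()) & set(doc.split())))  # toy score

def rerank(query: str, docs: tuple[str, ...]) -> list[str]:
    return sorted(docs, key=lambda d: score_pair(query, d), reverse=True)

docs = ("a b c", "b c d", "x y z")
rerank("b c", docs)
rerank("b c", docs)  # repeated query: every score served from the cache
print(CALLS)         # the model ran once per unique (query, doc) pair
```

With a real re-ranker you would key the cache on stable document IDs rather than full text, and bound or expire it so stale scores are dropped when documents change.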
Frequently Asked Questions
What model is used for re-ranking?
Often a cross-encoder: it takes query and document together and outputs a single relevance score. It is typically trained on (query, document, relevance label) triples. Many sentence-transformers and dedicated reranker models exist. See cross-encoders vs. bi-encoders for the difference.
How many candidates should I pass to the re-ranker?
Typical: 50–200. More candidates → better chance the true top-k is in the set but higher re-rank latency. Fewer → faster but risk missing good items. A/B test with your recall and latency targets; see measuring latency and measuring recall at k.
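One way to pick the candidate count is to measure, on a labeled query set, how often the known relevant items survive the first stage at each cutoff. A minimal sketch with illustrative toy data:

```python
def candidate_recall(relevant: set[str], candidates: list[str], n: int) -> float:
    """Fraction of the relevant docs present in the first n candidates.
    This caps what any re-ranker can achieve downstream."""
    hits = relevant & set(candidates[:n])
    return len(hits) / len(relevant) if relevant else 0.0

# Toy data: first-stage ANN results for one query, plus its known relevant docs.
relevant = {"d2", "d7"}
candidates = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

for n in (2, 5, 8):
    print(n, candidate_recall(relevant, candidates, n))
```

Averaged over many queries, the curve of candidate recall vs. n shows where extra candidates stop buying recall, which is the point past which a larger n only adds re-rank latency.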
Can the VDB do re-ranking internally?
Some VDBs or search platforms integrate a reranker step (e.g. call a reranker service on the VDB’s top-k). Often re-ranking is still done in your app layer so you can choose the model and parameters.
Does re-ranking improve recall?
Re-ranking doesn’t add new items; it only reorders the candidate set. So recall is capped by the initial retrieval. Re-ranking improves precision (order of the top results). To improve recall, improve the initial retrieval (e.g. higher k from VDB, better embeddings, or hybrid).