Cross-encoders vs. Bi-encoders
Bi-encoders encode the query and each document separately into vectors, then similarity (e.g. cosine or dot product) is used to rank. Cross-encoders take the query and one document together as input and output a single relevance score. The choice determines how you use a vector database and whether you add a re-ranking step. Typical pattern: bi-encoder for retrieval, cross-encoder for re-ranking top-K.
Summary
- Bi-encoder: query and document encoded separately → vectors → similarity (e.g. cosine); precompute doc vectors, store in VDB; fast at scale.
- Cross-encoder: (query, document) together → one relevance score; more accurate, but needs one forward pass per candidate, i.e. O(n) per query; use for re-ranking top-K only.
- Typical: bi-encoder for retrieval + cross-encoder for re-ranking; when choosing model for VDB, you choose a bi-encoder.
- Cross-encoders cannot be used for initial VDB retrieval because you cannot precompute document vectors; re-ranking often improves precision of top 5–10 noticeably.
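The two interfaces in the summary can be sketched as plain function signatures. This is a toy illustration: `bi_encode` and `cross_score` are hypothetical stand-ins (a hashed random vector and a token-overlap score), not real models — a production system would call e.g. a sentence-transformers bi-encoder and cross-encoder here.

```python
import numpy as np

def bi_encode(text: str) -> np.ndarray:
    """Bi-encoder interface: maps ONE text to a vector, independently of any other text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # toy stand-in for a real model
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)  # unit-norm, so dot product = cosine similarity

def bi_score(query: str, doc: str) -> float:
    """Bi-encoder ranking: similarity of two independently computed vectors."""
    return float(bi_encode(query) @ bi_encode(doc))

def cross_score(query: str, doc: str) -> float:
    """Cross-encoder interface: sees query AND doc together, returns one relevance score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)  # toy overlap score in place of a real model
```

The key structural difference is visible in the signatures: `bi_encode` takes one text (so document vectors can be precomputed and stored), while `cross_score` needs the pair (so it can only be run at query time).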
Bi-encoders for retrieval
Bi-encoders are the standard for semantic search at scale: you precompute document embeddings and store them in the VDB, encode the query once, and run ANN search. No need to touch every document at query time, so latency stays low even with millions of vectors.
The downside is that the query never “sees” the document—similarity is computed from two independent vectors—so fine-grained relevance (e.g. exact entailment or nuanced matching) can be weaker. When to use: for all initial retrieval in a vector database; when choosing an embedding model for the VDB, you’re choosing a bi-encoder. Pipeline: index with bi-encoder → store vectors → query with same bi-encoder → ANN → optionally re-rank with cross-encoder.
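The indexing and query pipeline above can be sketched in a few lines. Assumptions are labeled in the code: `embed` is a toy stand-in for a real bi-encoder, and the brute-force matrix product stands in for an ANN index.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real bi-encoder (e.g. a sentence-transformers model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)  # unit-norm, so dot product = cosine similarity

docs = ["how to reset a password", "pasta recipes", "vector databases 101"]
doc_matrix = np.stack([embed(d) for d in docs])  # indexing time: precompute once, store in VDB

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)                  # query time: encode the query once
    sims = doc_matrix @ q             # similarity against all stored vectors
    top = np.argsort(-sims)[:k]      # brute-force top-K; a real VDB uses ANN here
    return [docs[i] for i in top]
```

Note that no document is re-encoded at query time — the expensive per-document work happened once at indexing, which is what keeps latency low at scale.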
Cross-encoders for re-ranking
Cross-encoders take (query, document) as a pair and output a score. They are much more accurate for relevance because the model can attend across query and document, but they are expensive: you’d need to run the model once per candidate, which doesn’t scale to the full corpus.
So the typical pattern is bi-encoder for retrieval + cross-encoder for re-ranking: use the VDB to get the top-K candidates (e.g. 100) with the bi-encoder, then run the cross-encoder on those 100 (query, doc) pairs and re-sort. You get the speed of vector search with the accuracy of a cross-encoder on a small set. The bi-encoder and cross-encoder do not need to be from the same family, but they are often from the same provider for consistency. You cannot use two different bi-encoders for queries vs. documents unless they are trained to produce comparable vectors (i.e. embeddings in the same space); asymmetric encoders exist as a special case. Re-ranking with a cross-encoder often gives a noticeable gain in precision of the top 5–10 (e.g. 10–20% in nDCG or MRR); see re-ranking after initial search.
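The retrieve-then-re-rank pattern can be sketched end to end. Both scorers are toy stand-ins (hashed random vectors for the bi-encoder, token overlap for the cross-encoder); the structure — cheap top-K first, expensive pair scoring only on those K — is the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy bi-encoder stand-in (hashed random unit vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def cross_score(query: str, doc: str) -> float:
    """Toy cross-encoder stand-in: token overlap as a relevance score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def search(query: str, docs: list[str], k_retrieve: int = 100, k_final: int = 10) -> list[str]:
    doc_matrix = np.stack([embed(d) for d in docs])  # precomputed in a real system
    sims = doc_matrix @ embed(query)
    candidates = [docs[i] for i in np.argsort(-sims)[:k_retrieve]]  # cheap bi-encoder top-K
    # The expensive cross-encoder runs only on K candidates, never the full corpus.
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:k_final]
```

With K fixed (e.g. 100), the cross-encoder cost per query is constant regardless of corpus size, which is why this pattern scales where full-corpus cross-encoding does not.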
Frequently Asked Questions
Can I use a cross-encoder for the initial retrieval in a VDB?
No. Cross-encoders need (query, doc) together so you can’t precompute document vectors. You’d have to run the cross-encoder on every doc per query, which doesn’t scale. Use a bi-encoder for retrieval, then cross-encoder on the top-K.
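The scaling argument is easy to make concrete with back-of-envelope numbers. The figures below (~10 ms per cross-encoder forward pass, a 1M-document corpus, top-K of 100) are illustrative assumptions, not measurements:

```python
# Assumed, illustrative numbers — actual latency depends on model and hardware.
ms_per_pair = 10        # one cross-encoder forward pass on a (query, doc) pair
corpus = 1_000_000      # documents in the VDB
top_k = 100             # candidates passed to the re-ranker

full_corpus_ms = ms_per_pair * corpus  # cross-encoder over every document
rerank_ms = ms_per_pair * top_k        # cross-encoder over top-K only

print(full_corpus_ms / 1000 / 60)  # ~167 minutes per query: unusable
print(rerank_ms)                   # 1000 ms per query: workable
```

Under these assumptions the gap is four orders of magnitude, and it grows linearly with corpus size — which is the whole case for restricting the cross-encoder to the top-K.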
Do bi-encoder and cross-encoder need to be from the same family?
No, but they’re often from the same provider (e.g. sentence-transformers) for consistency. The bi-encoder feeds the VDB; the cross-encoder only re-ranks the list the VDB returns.
Can I use two bi-encoders (e.g. one for query, one for doc)?
Only if they’re trained to produce comparable vectors (e.g. in the same space). Typically one bi-encoder encodes both; query and doc vectors are then comparable. Asymmetric encoders exist for query vs. doc but are a special case.
How much does re-ranking with a cross-encoder improve results?
Often a noticeable gain in precision of the top 5–10 (e.g. 10–20% in metrics like nDCG or MRR). Depends on task and data. See re-ranking after initial search.