Similarity Metrics (Mathematical Foundations) · Topic 55

Impact of distance metrics on recall

Recall (e.g. Recall@K) measures how many of the “true” nearest neighbors (under the metric you care about) are actually returned by the system. The choice of distance metric affects both the definition of “true” neighbors and how well an ANN index approximates them.

Summary

  • If you train embeddings for one metric but query with another, ranking can change. Define recall with the same metric used at query time. Many ANN algorithms (e.g. HNSW) are tuned for a specific metric.
  • Use one metric consistently; match it to your embedding model; tune ANN parameters (e.g. efSearch). Benchmark recall with your chosen metric.
  • For normalized vectors, L2 and cosine give the same neighbor ordering, so recall@K is identical; for unnormalized data, metric choice can materially change recall.
  • Pipeline: choose metric at collection creation, match to embedding model, tune efSearch/efConstruction (or equivalent), then measure Recall@K with brute-force ground truth.
  • Trade-off: wrong metric can hurt both ranking and ANN approximation quality; consistent metric + tuned index parameters maximize recall for given latency.

Metric consistency

If you train or tune your embeddings for one metric (e.g. cosine) but query with another (e.g. L2), the ranking of neighbors can change. So “good” recall must be defined with respect to the same metric you use at query time.

Many ANN algorithms are designed and tuned for a specific metric; using a different one can hurt recall–latency trade-offs or require different index parameters. See recall–latency trade-offs in HNSW for how efSearch and other knobs interact with the chosen distance. Practical tip: when evaluating a new embedding model, run Recall@K with the same metric the model was trained for (e.g. cosine or dot product).
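To see how metric choice changes a ranking, here is a minimal numpy sketch on synthetic, unnormalized vectors (toy dimensions and sizes chosen for illustration): the same query produces different top-K sets under L2 and cosine.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unnormalized document vectors: random directions scaled by varying norms.
docs = rng.normal(size=(100, 8)) * rng.uniform(0.5, 3.0, size=(100, 1))
q = rng.normal(size=8)

# L2 ranking: smaller distance is better.
l2_rank = np.argsort(np.linalg.norm(docs - q, axis=1))

# Cosine ranking: larger similarity is better.
cos = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
cos_rank = np.argsort(-cos)

k = 10
overlap = len(set(l2_rank[:k]) & set(cos_rank[:k]))
print(f"top-{k} overlap between L2 and cosine rankings: {overlap}/{k}")
```

Because the document norms vary, the two metrics generally disagree on the top-K set, which is exactly why "recall" must name the metric it is measured against.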

Normalized vectors and tuning

For unit-normalized vectors, ‖a − b‖² = 2 − 2·cos(a, b), so cosine and L2 order neighbors identically and Recall@K is the same under either metric. When vectors are not normalized, metric choice can materially change which items are “true” nearest neighbors and thus recall.
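The equivalence can be checked directly. This numpy sketch (synthetic data for illustration) normalizes all vectors to unit length and confirms that the L2 and cosine orderings coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(50, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize rows
q = rng.normal(size=16)
q /= np.linalg.norm(q)

# For unit vectors, ||docs - q||^2 = 2 - 2 * (docs @ q), a monotone
# (decreasing) function of cosine similarity, so the orderings match.
l2_rank = np.argsort(np.linalg.norm(docs - q, axis=1))
cos_rank = np.argsort(-(docs @ q))

assert np.array_equal(l2_rank, cos_rank)
```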

To maximize recall you should: (1) pick one metric and use it consistently for indexing and querying, (2) match the metric to your embedding model’s training objective, and (3) tune ANN parameters for that metric. Benchmarking recall with your chosen metric (e.g. via the ANN-Benchmarks suite or a custom eval) is the only way to know how your system really performs. Trade-off: higher efSearch usually improves recall at the cost of latency; see the recall–latency curve for your index type.

Practical tip: compute ground-truth top-K with brute-force (same metric as your index) on a sample of queries, then run ANN and measure overlap. Reporting Recall@K alongside the recall–latency trade-off curve guides both how to present results and how to tune. Thresholding and normalized vs. unnormalized scores affect how you interpret “good” matches but not the definition of recall itself.
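The evaluation loop above can be sketched as follows. This toy example uses brute-force cosine search for the ground truth and a hand-perturbed result list as a stand-in for ANN output (in practice you would substitute your index's query results); `recall_at_k` is an illustrative helper, not a library function.

```python
import numpy as np

def recall_at_k(ground_truth_ids, retrieved_ids, k):
    """Fraction of the true top-k that the retrieval system returned."""
    return len(set(ground_truth_ids[:k]) & set(retrieved_ids[:k])) / k

rng = np.random.default_rng(2)
docs = rng.normal(size=(1000, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
q = rng.normal(size=32)
q /= np.linalg.norm(q)

# Brute-force ground truth under cosine (same metric the index would use).
true_top = np.argsort(-(docs @ q))[:10]

# Stand-in for an ANN result: 8 correct hits plus 2 misses.
truth = set(true_top.tolist())
misses = [i for i in range(1000) if i not in truth][:2]
ann_top = list(true_top[:8]) + misses

print(recall_at_k(list(true_top), ann_top, k=10))  # 8/10 = 0.8
```

The same overlap computation applies unchanged whatever metric you pick, as long as the ground truth and the index use it consistently.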

Frequently Asked Questions

Does recall depend on the metric?

Yes. “True” nearest neighbors are defined by the metric. Recall@K is the fraction of the true top-K that is returned; change the metric and the true top-K changes.

Can I get high recall with the wrong metric?

You might get high recall for that metric’s ranking, but if it doesn’t match user expectations or model training, quality can be poor. Match metric to model and use case.

Do L2 and cosine give the same recall for normalized vectors?

They give the same ordering of neighbors, so Recall@K is identical. If vectors aren’t normalized, the ordering, and hence recall, can differ.

How do I measure recall?

Compute exact top-K with your metric (brute-force), then run ANN and count overlap.