Indexing Algorithms - HNSW · Topic 78

The trade-off between Recall and Latency in HNSW

In HNSW, recall (how often the true nearest neighbors appear in the result set) and latency (time per query) are in tension. Higher recall usually requires visiting more nodes and doing more distance comparisons; that increases latency.
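Recall@k, as defined above, can be stated as a one-line helper (a minimal sketch; the id lists in the example are illustrative):

```python
def recall_at_k(approx_ids, true_ids):
    """Fraction of the true k nearest neighbors present in the returned set."""
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# Example: 4 of the true 5 neighbors were returned -> recall@5 = 0.8
print(recall_at_k([7, 3, 9, 1, 4], [7, 3, 9, 1, 2]))  # 0.8
```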

Summary

  • Main query-time lever: efSearch. Build-time levers: efConstruction and M, which set graph quality. Pick a target recall or latency and tune the parameters to meet it.
  • Plot the recall–latency curve for your dataset; for low latency use a modest efSearch, and for maximum recall increase efSearch at the cost of higher p99 latency.
  • Pipeline: more candidates (larger efSearch) → more distance computations and heap operations → higher recall and higher latency; graph quality (M, efConstruction) shifts the curve.
  • Trade-off: there is no free lunch; higher recall costs more time, but a well-built graph lets you reach good recall at a lower efSearch.

Parameters that drive the trade-off

The main lever at query time is efSearch: a larger candidate list during the greedy search tends to improve recall@K but increases work per query and thus latency. At build time, efConstruction and M affect graph quality: higher values generally yield better recall for a given efSearch but use more memory and build time. So you can “spend” build-time cost (efConstruction, M) to get a better graph, which then lets you achieve the same recall with a lower efSearch and thus lower query latency.

M also affects query latency directly: more neighbors per node means more candidates to expand at each hop, so higher M can increase query time as well as memory. Balancing M for recall vs. latency and memory is part of tuning.

Pipeline: from efSearch to recall and latency

The query pipeline expands a candidate set of size up to efSearch. Larger efSearch means more nodes are considered, more distance computations, and more heap operations—so recall tends to go up and latency goes up. Graph quality (from M and efConstruction) determines how quickly the search finds good candidates; a better graph means you can reach the same recall with a smaller efSearch, shifting the recall–latency curve in your favor.
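The candidate-expansion loop described above can be sketched as a simplified single-layer best-first search (a toy illustration, not the full multi-layer HNSW algorithm; the graph is a random directed graph, and `search_layer`, `coords`, etc. are names invented for this sketch):

```python
import heapq
import math
import random

def search_layer(graph, coords, query, entry, ef):
    """Simplified single-layer HNSW search: best-first expansion that keeps
    the ef closest nodes found so far and stops once the nearest unexpanded
    candidate is farther than the worst of those ef."""
    def dist(n):
        return math.dist(coords[n], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: next nodes to expand
    results = [(-dist(entry), entry)]     # max-heap (negated): best ef so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:            # nearest candidate is worse than the
            break                         # worst kept result: stop searching
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest kept result
    top = sorted((-nd, n) for nd, n in results)
    return top, len(visited)

# Toy data: a random directed graph standing in for one HNSW layer.
rng = random.Random(42)
N, DIM, M = 200, 4, 8
coords = {i: [rng.random() for _ in range(DIM)] for i in range(N)}
graph = {i: rng.sample([j for j in range(N) if j != i], M) for i in range(N)}
query = [0.5] * DIM

for ef in (2, 8, 32):
    top, visited = search_layer(graph, coords, query, entry=0, ef=ef)
    print(f"ef={ef:2d}  visited={visited:3d}  best_dist={top[0][0]:.3f}")
```

Running this typically shows the visited-node count growing with ef: more heap operations and distance computations buy a better chance of containing the true neighbors.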

Tuning in practice

Tuning involves picking a target recall or latency and adjusting efSearch (and if needed, efConstruction and M) to meet it. The recall–latency trade-off curve is often plotted for a dataset to choose an operating point: fix k, run queries at several efSearch values, measure recall@k and latency (e.g. p50, p99), then plot and pick. For low-latency serving, use a modest efSearch and accept slightly lower recall; for maximum recall (e.g. offline evaluation), increase efSearch at the cost of higher p99 latency.
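The measurement recipe above can be sketched end to end. Since the exact index is beside the point here, a brute-force scan over a truncated candidate pool stands in for the ANN query (`truncated_query` is a stand-in invented for this sketch; in practice `query_fn` would wrap a real index, e.g. hnswlib's `set_ef` plus `knn_query`):

```python
import math
import random
import time

def brute_force_knn(data, q, k):
    """Exact k-NN by scanning all points; the ground truth for recall."""
    return sorted(range(len(data)), key=lambda i: math.dist(data[i], q))[:k]

def recall_at_k(approx_ids, exact_ids):
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def sweep(query_fn, data, queries, k, ef_values):
    """Measure mean recall@k and approximate p50/p99 latency per efSearch value."""
    exact = [brute_force_knn(data, q, k) for q in queries]
    rows = []
    for ef in ef_values:
        lats, recs = [], []
        for q, gt in zip(queries, exact):
            t0 = time.perf_counter()
            ids = query_fn(q, ef)
            lats.append(time.perf_counter() - t0)
            recs.append(recall_at_k(ids, gt))
        lats.sort()
        p50 = lats[len(lats) // 2]
        p99 = lats[min(len(lats) - 1, int(0.99 * len(lats)))]
        rows.append((ef, sum(recs) / len(recs), p50, p99))
    return rows

rng = random.Random(0)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
queries = [[rng.random() for _ in range(8)] for _ in range(50)]
K = 10

def truncated_query(q, ef):
    # Stand-in for a real ANN query: larger ef widens the candidate pool,
    # mimicking efSearch's effect on work done and recall achieved.
    pool_size = min(ef * 16, len(data))
    return brute_force_knn(data[:pool_size], q, K)

rows = sweep(truncated_query, data, queries, K, ef_values=[1, 4, 16, 64])
for ef, rec, p50, p99 in rows:
    print(f"efSearch={ef:3d}  recall@{K}={rec:.3f}  p50={p50*1e3:.2f}ms  p99={p99*1e3:.2f}ms")
```

The printed table is exactly the curve to plot: recall rises with ef while latency grows, and the knee of the curve is usually where you want to operate.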

The curve is dataset-dependent: dimension, distribution, and size all affect how much efSearch you need for a given recall; always benchmark on your data.

Trade-offs and practical tips

Trade-off summary: recall and latency are in tension; a well-built graph (high efConstruction, good M) lets you get high recall with moderate efSearch, but beyond that more recall needs more efSearch and more time. Practical tip: measure recall@k and latency (p50, p99) for a range of efSearch values; choose the smallest efSearch that meets your recall target so latency stays within SLA.
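Choosing the operating point from a measured table is then mechanical (`pick_ef` is a helper invented here, and the numbers in `rows` are made up purely for illustration):

```python
def pick_ef(rows, recall_target, p99_budget):
    """rows: list of (efSearch, recall, p50_ms, p99_ms), sorted by efSearch
    ascending. Return the smallest efSearch that meets the recall target
    within the p99 latency budget, or None if no setting qualifies."""
    for ef, recall, _p50, p99 in rows:
        if recall >= recall_target and p99 <= p99_budget:
            return ef
    return None

# Hypothetical benchmark results (efSearch, recall@k, p50 ms, p99 ms):
rows = [(8, 0.82, 0.4, 1.1), (16, 0.91, 0.7, 1.9),
        (32, 0.97, 1.2, 3.4), (64, 0.99, 2.1, 6.0)]
print(pick_ef(rows, recall_target=0.95, p99_budget=5.0))  # 32
```

Because the rows are scanned in ascending efSearch order, the first qualifying row is the cheapest one, which is exactly the "smallest efSearch that meets your recall target" rule.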

Frequently Asked Questions

Can I have high recall and low latency?

Up to a point. A well-built graph (high efConstruction, good M) lets you get high recall with a moderate efSearch; beyond that, more recall needs more efSearch and more time.

Does M affect latency?

Yes. Higher M means more neighbors to expand at each hop, so query time can increase. Balance M for recall vs. latency and memory.

How do I measure the trade-off?

Fix k, run queries at several efSearch values, measure recall@k and latency; plot and pick an operating point.

Is the curve dataset-dependent?

Yes. Dimension, distribution, and size all affect how much efSearch you need for a given recall; always benchmark on your data.