
Filtering & Querying · Topic 128

Why pre-filtering is hard for ANN indexes

Pre-filtering would ideally restrict the search to only points that match the metadata filter before or during traversal. The problem is that ANN indexes (e.g. HNSW, IVF) are built assuming all points are valid; when you skip a subset, the structure can’t guarantee that the true nearest neighbors within the filtered set are still reachable. This topic explains why and what workarounds exist.

Summary

  • Pre-filtering restricts search to points matching the metadata filter; ANN indexes (HNSW, IVF) are built over all points, so skipping a subset can make true nearest neighbors unreachable.
  • In HNSW, paths to the true nearest neighbor in the filtered set may run through ineligible nodes; in IVF, the clusters closest to the query may contain mostly ineligible points. Result: lower recall, or higher latency from oversearching.
  • Workarounds: post-filtering with larger candidate set; separate indexes per filter value (e.g. per tenant); in-bitmap checks during traversal; hybrid strategies when the filter is very selective.
  • Trade-off: pre-filter gives correct subset but risks recall; post-filter or oversearch trades latency for recall; per-tenant indexes preserve connectivity but add operational cost.
  • Practical tip: use in-bitmap filtering when the vector database (VDB) supports it and tune efSearch; fall back to post-filtering with oversampling for very selective filters; consider per-tenant indexes only when cardinality is low.

Why connectivity and recall suffer

In HNSW, each node is linked to its approximate nearest neighbors across the whole dataset, not just within any one filtered subset. If you ignore half the nodes (those that don’t match the filter), the path from the entry point to the true nearest neighbor in the filtered set might go through a skipped node, so the traversal never finds it. You either accept lower recall or you oversearch (visit more nodes and re-check the filter), which increases latency.
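The reachability problem can be seen on a toy graph. The sketch below is not HNSW: it is a single-layer, 1-D graph with an unbounded best-first search, and every name in it (`best_first`, `traverse_ineligible`, the data) is made up for illustration. It only shows how treating ineligible nodes as missing can cut the path to the true nearest eligible neighbor.

```python
import heapq

# Toy 1-D "graph index": node i sits at position i and links to i-1 and i+1.
points = {i: float(i) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if j in points] for i in points}
eligible = {0, 1, 9}  # nodes that pass the (hypothetical) metadata filter


def best_first(query, entry, traverse_ineligible):
    """Return the closest eligible node reachable from the entry point."""
    def dist(n):
        return abs(points[n] - query)

    visited = {entry}
    frontier = [(dist(entry), entry)]
    best = None
    while frontier:
        d, node = heapq.heappop(frontier)
        # Only eligible nodes may be returned as results.
        if node in eligible and (best is None or d < dist(best)):
            best = node
        for nb in graph[node]:
            if nb in visited:
                continue
            # Strict pre-filter: behave as if ineligible nodes did not exist.
            if nb not in eligible and not traverse_ineligible:
                continue
            visited.add(nb)
            heapq.heappush(frontier, (dist(nb), nb))
    return best


strict = best_first(query=9.0, entry=0, traverse_ineligible=False)
relaxed = best_first(query=9.0, entry=0, traverse_ineligible=True)
print(strict, relaxed)
```

With the strict variant the search stalls at node 1, because nodes 2 through 8 are ineligible and the only route to node 9 passes through them. Walking through ineligible nodes while returning only eligible ones finds node 9, at the cost of visiting every node in between.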

In IVF, clusters are built over the full set. A filter might leave only a few points in each cluster, and the query’s nearest cluster might contain mostly ineligible points, again hurting recall.

A typical pre-filtering pipeline looks like this: evaluate the filter and build the eligible set (e.g. a bitmap); during ANN traversal, consider only eligible points; if too few are found, expand the search (more nodes or clusters) at the cost of latency.
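A minimal sketch of that pipeline for an IVF-style layout follows. It assumes 1-D vectors, hand-placed centroids, and a Python set standing in for the bitmap; `ivf_filtered_search` and its parameters are illustrative names, not any library's API.

```python
def ivf_filtered_search(query, centroids, lists, eligible, k, nprobe):
    """Probe the nprobe nearest clusters, then keep expanding to further
    clusters until at least k eligible points have been collected."""
    order = sorted(range(len(centroids)), key=lambda c: abs(centroids[c] - query))
    hits = []
    probed = 0
    for c in order:
        if probed >= nprobe and len(hits) >= k:
            break
        # Eligibility check: skip points that are not in the eligible set.
        hits.extend((abs(vec - query), pid) for pid, vec in lists[c] if pid in eligible)
        probed += 1
    hits.sort()
    return [pid for _, pid in hits[:k]], probed


centroids = [0.0, 10.0, 20.0]
lists = [
    [(0, -1.0), (1, 0.5), (2, 1.5)],    # cluster around 0
    [(3, 9.0), (4, 10.5), (5, 11.0)],   # cluster around 10
    [(6, 19.0), (7, 20.0), (8, 21.0)],  # cluster around 20
]
eligible = {2, 4, 7}  # one eligible point per cluster

ids, probed = ivf_filtered_search(1.0, centroids, lists, eligible, k=2, nprobe=1)
print(ids, probed)
```

Although `nprobe` is 1, the nearest cluster holds only one eligible point, so the search has to probe a second cluster to return two results. That extra probing is the latency cost described above.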

Workarounds

Workarounds include: (1) post-filtering with a larger candidate set; (2) building separate indexes per filter value (e.g. one index per tenant) when the filter is low-cardinality; (3) in-bitmap checks during traversal and continuing to expand until enough filtered results are found; (4) hybrid approaches that pre-filter only when the filter is very selective and the index supports it.
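Workaround (1) can be sketched as follows. The "index" here is an exact brute-force sort standing in for a real ANN search, and `ann_top`, `post_filter_search`, and the `oversample` factor are hypothetical names chosen for this example.

```python
def ann_top(query, vectors, n):
    """Stand-in for an ANN index: exact top-n point ids by 1-D distance."""
    return sorted(vectors, key=lambda pid: abs(vectors[pid] - query))[:n]


def post_filter_search(query, vectors, tenant_of, tenant, k, oversample):
    """Fetch k * oversample unfiltered candidates, then apply the filter."""
    candidates = ann_top(query, vectors, k * oversample)
    return [pid for pid in candidates if tenant_of[pid] == tenant][:k]


vectors = {i: float(i) for i in range(10)}
# Only points 0 and 5 belong to tenant "a"; the rest belong to "b".
tenant_of = {i: ("a" if i % 5 == 0 else "b") for i in vectors}

narrow = post_filter_search(2.0, vectors, tenant_of, "a", k=2, oversample=1)
wide = post_filter_search(2.0, vectors, tenant_of, "a", k=2, oversample=3)
print(narrow, wide)
```

Without oversampling, the two nearest candidates both belong to tenant "b" and the filtered result is empty. Tripling the candidate set recovers both matching points, which illustrates why selective filters need a larger candidate set and therefore more work per query.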

No single strategy fits all workloads. Understanding this tension helps when tuning and choosing a vector database. Practical tip: benchmark recall and latency at your actual filter selectivity; if in-bitmap filtering is available, increase efSearch when filters are active to improve recall.
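For the benchmarking step, recall is usually measured against an exact (brute-force) search restricted to the filtered set. A minimal helper might look like the one below; the function name and the sample ids are illustrative only.

```python
def recall_at_k(returned_ids, true_ids):
    """Fraction of the exact filtered top-k that the search actually returned."""
    if not true_ids:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(returned_ids) & set(true_ids)) / len(true_ids)


true_top = [0, 5]   # exact nearest neighbors within the filtered set
returned = [0, 7]   # what a filtered ANN query returned
score = recall_at_k(returned, true_top)
print(score)
```

Here one of the two true neighbors was returned, so recall is 0.5. Tracking this value alongside query latency for several filter selectivities shows where a given strategy starts to degrade.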

Frequently Asked Questions

Why does pre-filtering hurt recall in HNSW?

HNSW graph links connect nearest neighbors over all points. When you skip nodes that don’t match the filter, the path from the entry point to the true nearest neighbor in the filtered set might go through a skipped node, so traversal never finds it. See HNSW and in-bitmap filtering.

What can I do instead of pre-filtering?

Use post-filtering with a larger candidate set; build separate indexes per filter value (e.g. per tenant) when cardinality is low; or use in-bitmap checks and expand until you have enough results. Compare metadata filtering basics.

Does IVF have the same problem?

Yes. IVF clusters are built over the full set; a filter may leave few points per cluster, and the query’s nearest cluster might have mostly ineligible points, hurting recall. Oversearching (more clusters or more points per cluster) increases latency.

When should I use a separate index per tenant?

When the filter is low-cardinality (e.g. tenant_id) and you query by that filter often. Each index is built over a subset, so connectivity is preserved. Trade-off: more indexes to build and maintain. See multi-tenant isolation.
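As a rough sketch of the routing involved, the class below keeps one small index per tenant and sends each query only to that tenant's index. It assumes 1-D vectors and uses an exact sort in place of a real ANN structure; `PerTenantIndex` is a hypothetical name, not part of any vector database.

```python
from collections import defaultdict


class PerTenantIndex:
    """One independent index per tenant; a query never sees other tenants."""

    def __init__(self):
        self._by_tenant = defaultdict(dict)  # tenant -> {point id: vector}

    def add(self, tenant, pid, vec):
        self._by_tenant[tenant][pid] = vec

    def search(self, tenant, query, k):
        # .get avoids creating an empty index for an unknown tenant.
        points = self._by_tenant.get(tenant, {})
        return sorted(points, key=lambda pid: abs(points[pid] - query))[:k]


index = PerTenantIndex()
for i in range(10):
    index.add("a" if i % 5 == 0 else "b", i, float(i))

hits = index.search("a", query=2.0, k=2)
missing = index.search("unknown", query=2.0, k=2)
print(hits, missing)
```

Because each index contains only one tenant's points, no filter is applied at query time and nothing ineligible sits between the query and its neighbors. The cost is one structure to build and maintain per tenant.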