Indexing Algorithms - IVF & Quantization · Topic 81

What is IVF (Inverted File Index)?

An Inverted File Index (IVF) is a cluster-based index that divides the vector space into regions (clusters). At query time, only the clusters closest to the query are searched, reducing the number of distance computations compared to brute force and speeding up approximate nearest neighbor (ANN) search.

Summary

Clustering (e.g. K-means) gives centroids; each vector assigned to nearest → Voronoi cells. Inverted file: per cluster, list of vector IDs. Query: find nearest centroids (nprobe), search only those clusters. Often combined with PQ or scalar quantization.
Tune nlist and nprobe for recall vs. speed. Core block in Faiss and many vector databases.
Pipeline: cluster at build time (nlist centroids), assign vectors to nearest centroid, store inverted lists; at query time find nprobe nearest centroids and search only those lists (flat or with PQ).
Trade-off: higher nprobe improves recall and latency; larger nlist gives finer partitioning but more centroid comparisons; combine with quantization (IVFPQ) for billion-scale.
Comparing graph-based vs. cluster-based indexes and combining IVF and PQ (IVFPQ) describe alternatives and hybrid designs.

How IVF works

IVF works by first running a clustering algorithm (typically K-means) on the vectors to get a set of centroids. Each vector is assigned to its nearest centroid, forming Voronoi cells. The “inverted file” stores, for each cluster, the list of vector IDs (and optionally the vectors themselves) belonging to that cluster. To answer a query, you find the nearest centroids to the query vector (controlled by nprobe), then search only the vectors in those clusters—often combined with product quantization (PQ) or scalar quantization for further speed and memory savings.

The role of Voronoi cells in IVF is that each cell contains all points closer to one centroid than to any other; the inverted file is simply an index from centroid (cell) to the list of vectors in that cell. Search space reduction comes from only scanning the nprobe cells closest to the query instead of the full dataset.

Recall vs. speed

IVF trades some recall for speed: if the true nearest neighbor lives in a cluster you didn’t probe, you’ll miss it. Tuning the number of clusters (nlist) and how many to probe at query time (nprobe) controls this trade-off.

Practical tip: start with nlist on the order of sqrt(n) to 4×sqrt(n) and nprobe a small fraction of nlist (e.g. 1–10% of nlist); measure recall@k and latency. Increase nprobe for higher recall at the cost of query time; increase nlist for finer partitioning but more centroid comparisons at query time. Flat indexing (brute force) can be better when the dataset is small enough that scanning all vectors is fast.

Pipeline and combinations

Build pipeline: run K-means to get nlist centroids, assign each vector to its nearest centroid, build the inverted lists (optionally storing quantized codes). Query pipeline: compute distances to all centroids (or use a small index on centroids), select nprobe nearest clusters, search only those lists (with flat vectors, SQ, or PQ). Combining IVF and PQ (IVFPQ) is common for very large scale: coarse quantization by cluster plus fine quantization by PQ codes per vector.

Trade-off: IVF reduces search space but adds centroid-comparison cost; for very large n and moderate d, the reduction dominates. Memory can be reduced by storing PQ codes instead of full vectors in each list; see how PQ compresses vectors into codes and the concept of codebooks in PQ. Search space reduction techniques and flat indexing (when brute force is better) help choose the right design.

Frequently Asked Questions

Is IVF exact or approximate?

Approximate. You only search a subset of clusters; true neighbors in unprobed clusters are missed. Recall depends on nlist and nprobe.

Can I change nprobe without rebuilding?

Yes. nprobe is a query-time parameter. nlist is fixed at build time when centroids are computed.

How does IVF compare to HNSW?

IVF is cluster-based; HNSW is graph-based. IVF often uses less memory with PQ and scales to billions; HNSW often gives better recall at low latency for in-memory.

What distance metric does IVF use?

Usually L2 or inner product; must match the metric used for clustering and for search. Centroids are typically in the same space as the vectors.