
Indexing Algorithms - IVF & Quantization · Topic 83

K-means clustering for index centroids

In an IVF index, the clusters are defined by centroids. These are typically computed with K-means clustering: an algorithm that partitions the vectors into K groups so that each vector is assigned to the cluster whose centroid is nearest (by L2 or cosine distance), and centroids are updated as the mean of the vectors in each cluster. The resulting centroids define the Voronoi cells used at search time.
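The assign-then-update loop described above can be sketched in a few lines of NumPy. This is an illustrative implementation of Lloyd's algorithm, not a production trainer; the function name and defaults are hypothetical.

```python
import numpy as np

def kmeans(vectors, k, iters=25, tol=1e-4, seed=0):
    """Lloyd's K-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k random data points.
    centroids = vectors[rng.choice(len(vectors), k, replace=False)]
    assign = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid (L2).
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned vectors.
        new = np.array([vectors[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        # Stop early once centroids barely move.
        if np.linalg.norm(new - centroids) < tol:
            centroids = new
            break
        centroids = new
    return centroids, assign
```

The returned centroids are exactly the points that define the Voronoi cells: at search time, a query is compared against them to decide which inverted lists to probe.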

Summary

  • K-means is run at build time; K = nlist. More clusters → smaller cells, finer partition; tune nprobe at query time.
  • Often multiple restarts or mini-batch K-means for speed. Distance metric for K-means should match query-time metric.
  • Pipeline: run K-means at build → K = nlist centroids → assign each vector to nearest centroid → build inverted lists. Query uses same metric to find nearest centroids (nprobe).
  • Trade-off: more clusters (nlist) = finer partition and longer K-means; fewer = coarser partition, faster build, more vectors per probed cell.
  • The related topics on the role of Voronoi cells in IVF and on nlist / nprobe complete the IVF parameter picture.

Effect of K (nlist)

K-means is run once during index build (or periodically when the index is rebuilt). The number of clusters K is the IVF parameter usually called nlist. More clusters mean smaller cells and a more fine-grained partition—better potential recall but more clusters to consider at query time unless you increase nprobe. Fewer clusters mean larger cells: faster search (fewer vectors per probed cluster on average) but coarser partitioning and higher risk of missing true neighbors that lie in other cells.
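The trade-off above can be made concrete with rough arithmetic. The numbers below are hypothetical, and the calculation assumes vectors are spread evenly across cells, which real data only approximates.

```python
N = 1_000_000   # vectors in the index (hypothetical)
nlist = 4096    # number of K-means centroids
nprobe = 16     # cells scanned per query

avg_cell = N / nlist               # average vectors per cell: ~244
candidates = nprobe * avg_cell     # distances computed per query: ~3906
scanned_fraction = candidates / N  # fraction of the dataset scanned: ~0.4%
```

Doubling nlist halves the average cell size, so the same nprobe scans half as many vectors, but the query must also rank twice as many centroids and the K-means build takes longer.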

Practical details

In practice, K-means for IVF is often run with multiple restarts or with a fast approximate variant (e.g. mini-batch K-means) to keep build time manageable on large datasets. The distance metric used for K-means should match the one used at query time (e.g. L2 for IVF that uses L2 for search) so that Voronoi boundaries align with how similarity is measured.

Pipeline: sample the data (or use the full set) for K-means; iterate until convergence or a maximum iteration count; assign all vectors to their nearest centroid; build the inverted file. The related topics on measuring index build time and on flat indexing (when brute force is better) help decide when IVF is worth the build cost versus scanning the full set.
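The last two pipeline steps — building the inverted file and probing it at query time — can be sketched as follows, assuming centroids have already been trained. The function names are hypothetical, and a real index would store quantized codes rather than raw ids.

```python
import numpy as np

def build_inverted_lists(vectors, centroids):
    """Assign each vector to its nearest centroid; group vector ids per cell."""
    d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    return {j: np.flatnonzero(assign == j) for j in range(len(centroids))}

def search(query, vectors, centroids, lists, nprobe, k):
    """Rank centroids, probe the nprobe nearest cells, score candidates exactly."""
    cd = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(cd)[:nprobe]
    cand = np.concatenate([lists[j] for j in probed])
    vd = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[np.argsort(vd)[:k]]  # ids of the k nearest candidates
```

Note that both functions use the same L2 metric: this is the metric-matching requirement from the previous paragraph, ensuring the cells probed at query time are the cells the vectors were filed into at build time.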

Frequently Asked Questions

Can I use a different clustering algorithm?

Yes. Some systems use hierarchical or density-based clustering; K-means is standard because it’s fast and produces a fixed number of centroids (nlist).

How many K-means iterations?

Until convergence or a maximum iteration count. Faiss and similar libraries often use 25–50 iterations, or stop early when centroid movement falls below a small threshold.

What if a cluster is empty?

K-means can produce empty clusters; their centroids are kept but their inverted lists contain no vectors. Probing an empty cell at query time simply yields no candidates, and since nprobe selects the cells closest to the query, empty cells are rarely wasteful in practice.

Does K-means scale to billions of vectors?

Mini-batch or sampled K-means is used: train centroids on a subset, then assign all vectors to their nearest centroid. See the topic on measuring index build time.
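The sampled-training pattern can be sketched as below: the expensive iterative loop runs only on a subset, and the full dataset gets a single assignment pass. This is an illustrative sketch with hypothetical names, not any library's actual trainer.

```python
import numpy as np

def train_on_sample(vectors, k, sample_size, iters=10, seed=0):
    """Train K-means centroids on a random subset, then assign every vector."""
    rng = np.random.default_rng(seed)
    sample = vectors[rng.choice(len(vectors), sample_size, replace=False)]
    centroids = sample[rng.choice(sample_size, k, replace=False)]
    # Iterative training touches only the sample.
    for _ in range(iters):
        a = np.linalg.norm(sample[:, None] - centroids[None], axis=2).argmin(1)
        centroids = np.array([sample[a == j].mean(0) if (a == j).any()
                              else centroids[j] for j in range(k)])
    # One full pass over all vectors: assignment only, no centroid updates.
    assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=2).argmin(1)
    return centroids, assign
```

At billion scale the full assignment pass is still a large job, but it is embarrassingly parallel and needs only one pass over the data, unlike full K-means.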