The role of Voronoi cells in IVF
In an IVF (Inverted File Index), the vector space is partitioned into regions where every point in a region is closer to one centroid than to any other. These regions are Voronoi cells; they define which vectors belong to which cluster and thus which subset of the dataset is searched at query time.
Summary
- Voronoi cell of a centroid = all points whose nearest centroid is that one. Inverted file: per centroid, list of vectors in that cell. Query: find cell(s) for query (and neighbors via nprobe), search only those vectors.
- Partition quality depends on centroids; tuning nlist and nprobe balances recall and speed.
- Pipeline: K-means yields centroids → assign each vector to nearest centroid (Voronoi cell) → build inverted list per cell; query finds nprobe nearest cells and searches only those lists.
- Trade-off: finer partition (more nlist) reduces vectors per cell but increases risk of missing neighbors across boundaries; nprobe compensates by probing more cells.
- Space-partitioning trees use different (e.g. axis-aligned) splits; Voronoi is centroid-based. What is IVF and the role of Voronoi cells tie together.
Definition and use in IVF
Given a set of centroids (from K-means or similar), the Voronoi cell of a centroid is the set of all points whose nearest centroid is that one. In IVF, each stored vector is assigned to the centroid it’s closest to—so each vector lies in exactly one Voronoi cell. The inverted file is then: for each centroid (cell), the list of vectors in that cell. When you get a query vector, you find which cell it falls into (and optionally a few neighboring cells, controlled by nprobe), and you only search those vectors. That’s how IVF reduces the search space from the full dataset to a small number of clusters.
When the partition hurts recall
The quality of the partition depends on the centroids: if clusters are well separated, Voronoi boundaries are meaningful and search is both fast and accurate. If the data is spread out or clusters overlap in the embedding space, the true nearest neighbor might sit just across a Voronoi boundary in an unprobed cell, which is why IVF is approximate.
Practical tip: use nprobe > 1 to probe the query’s cell plus a few nearest neighboring cells; that reduces boundary effects. K-means clustering for index centroids and what is nlist define how many cells you have; what is nprobe controls how many you search. Combining IVF and PQ (IVFPQ) adds a second level of quantization inside each cell.
Frequently Asked Questions
How many Voronoi cells are there in IVF?
One per centroid; the number of centroids is nlist. So there are nlist cells partitioning the space.
Can a query fall on a cell boundary?
Yes. Ties are rare in practice; the query is assigned to the nearest centroid. Probing multiple cells (nprobe > 1) reduces the risk of missing neighbors near boundaries.
Are Voronoi cells convex?
Yes. For L2 distance, each cell is a convex polytope (intersection of half-spaces). That’s why nearest-centroid assignment is well defined.
Do other indexes use Voronoi?
Any partition based on “nearest centroid” is a Voronoi partition. Space-partitioning trees use different (e.g. axis-aligned) splits.