
Indexing Algorithms - Space-Partitioning Trees · Topic 101

Space-partitioning trees

Space-partitioning trees recursively split the vector space with hyperplanes so that each leaf holds a subset of points. Examples include KD-trees, ball trees, and the random-projection trees used in Annoy. At query time you descend from the root, following the side of each split the query lies in, and optionally search nearby branches to improve recall. They provide a structured form of search space reduction; in high dimensions they can suffer from the curse of dimensionality, which is why graph-based and cluster-based indexes are often preferred for vector DBs. This topic covers types, query pipeline, trade-offs, and when to use trees versus other index families.

Summary

  • Space-partitioning trees split the space with hyperplanes (axis-aligned in KD-tree, random in Annoy); you descend at query time and may backtrack or search multiple leaves for ANN.
  • Compare to Voronoi (IVF) and HNSW: trees and IVF partition by geometry; graph indexes navigate by neighbor links.
  • KD-trees excel in low dimensions; ball trees suit certain metrics; random projection trees (Annoy) use many trees for robustness and better recall.
  • Trade-offs: interpretable structure and low memory per point vs. curse of dimensionality in high dimensions and costly or static updates.
  • Practical tip: prefer trees for low-dimensional or small datasets; for high-dimensional ANN at scale, IVF or HNSW usually offer better recall-latency and tuning.

Types and behavior

KD-tree: Splits on a single coordinate (axis-aligned hyperplane), cycling through the dimensions. Works well in low dimensions; in high dimensions many splits are needed and the tree becomes deep and less selective. Build is O(n log n) with median splits (using linear-time median selection); query is O(log n) to reach a leaf when no backtracking is needed.
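A minimal sketch of this split-and-descend behavior in plain Python (dict-based nodes and the `leaf_size` parameter are illustrative only; real implementations use flat arrays and add backtracking):

```python
import random

def build_kdtree(points, depth=0, leaf_size=4):
    """Recursively split on the median of one coordinate, cycling axes."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % len(points[0])           # alternate dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "axis": axis,
        "split": points[mid][axis],          # axis-aligned hyperplane
        "left": build_kdtree(points[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points[mid:], depth + 1, leaf_size),
    }

def descend(node, q):
    """Greedy O(depth) descent to the leaf whose cell contains q."""
    while "leaf" not in node:
        side = "left" if q[node["axis"]] < node["split"] else "right"
        node = node[side]
    return node["leaf"]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(100)]
q = (0.5, 0.5)
leaf = descend(build_kdtree(pts), q)
# Distances are computed only within the reached leaf; without
# backtracking this is approximate, not exact, nearest neighbor.
best = min(leaf, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
```

Note how the greedy descent visits only one leaf: that is exactly where the approximation (and the need for backtracking) comes from.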

Ball tree: Partitions by enclosing points in nested balls; each node stores a centroid and radius, and the query descends by distance to the child balls, pruning any ball that cannot contain a point closer than the best found so far. Because it relies only on distances rather than coordinate axes, it can suit data with non-axis-aligned structure and works with any metric that obeys the triangle inequality. Often used in scikit-learn for nearest-neighbor and kernel methods.
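The pruning rule behind the centroid-and-radius layout follows from the triangle inequality: no point inside a ball of radius r centered at c can be closer to q than dist(q, c) − r. A minimal sketch (the node layout and names are illustrative, not a library API):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ball_lower_bound(q, centroid, radius):
    """Closest any point inside the ball could possibly be to q."""
    return max(0.0, dist(q, centroid) - radius)

# A node whose lower bound exceeds the best distance found so far
# cannot contain a nearer neighbor, so its whole subtree is skipped.
best_so_far = 0.3
node = {"centroid": (2.0, 2.0), "radius": 0.5}
lb = ball_lower_bound((0.0, 0.0), node["centroid"], node["radius"])
prunable = lb > best_so_far   # True here: sqrt(8) - 0.5 ≈ 2.33 > 0.3
```

This bound is what lets a ball tree skip entire subtrees without looking at the points inside them.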

Random projection trees (e.g. Annoy): Split on a random direction (linear combination of coordinates); building many trees and querying all of them improves recall. Common theme: the partition is geometric and hierarchical, so query cost is O(depth) to reach a leaf, plus the cost of searching the leaf (and possibly backtracking). Multiple trees reduce variance and improve robustness to unlucky splits.
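The core idea can be sketched by treating each "tree" as a single random-projection split and taking the union of the query's side across many splits (a deliberately shallow toy that mimics Annoy's multi-tree candidate set, not Annoy's actual API):

```python
import random

def project(p, d):
    """Project point p onto direction d (a random linear combination)."""
    return sum(x * w for x, w in zip(p, d))

def rp_split(points, rng):
    """One random-projection split: random direction, median threshold."""
    d = [rng.gauss(0, 1) for _ in range(len(points[0]))]
    proj = sorted(project(p, d) for p in points)
    return d, proj[len(proj) // 2]

rng = random.Random(0)
pts = [tuple(rng.random() for _ in range(8)) for _ in range(200)]
q = tuple(0.5 for _ in range(8))

# Query every "tree" and union the points on q's side of each split;
# multiple random splits reduce the variance of any single unlucky cut.
candidates = set()
for _ in range(10):
    d, t = rp_split(pts, rng)
    side = project(q, d) >= t
    candidates |= {p for p in pts if (project(p, d) >= t) == side}

best = min(candidates, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))
```

With deeper trees the candidate set shrinks per tree, and the union across trees is what restores recall.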

Query pipeline

At index time: recursively partition the dataset until leaves are small enough (e.g. bucket size < L). At query time: start at the root, decide which child the query vector lies in (e.g. which side of the hyperplane), descend until a leaf is reached, then compute distances to points in that leaf. For ANN, implementations often allow backtracking (visit sibling nodes) or searching multiple leaves (e.g. in Annoy, aggregate over many trees) to improve recall at the cost of latency.

Pipeline summary: embed query → descend tree(s) → reach leaf/buckets → compute distances (and optionally backtrack) → return top-k. No neighbor links are stored; only the partition geometry drives the path.
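The pipeline summary can be illustrated with a toy 1-D analogue: sorted split values play the role of the tree, buckets are the leaves, and an `extra_leaves` budget stands in for backtracking (all names and data here are hypothetical):

```python
import bisect, heapq

# Toy 1-D "tree": sorted split values define the buckets (leaves).
splits = [10, 20, 30, 40]
buckets = [[3, 7], [12, 18], [21, 29], [33, 38], [44, 51]]

def query(q, k=2, extra_leaves=0):
    """Descend (bisect), scan the leaf, optionally widen to neighbors."""
    i = bisect.bisect_left(splits, q)                   # which leaf q falls in
    lo = max(0, i - extra_leaves)
    hi = min(len(buckets), i + extra_leaves + 1)
    cands = [x for b in buckets[lo:hi] for x in b]      # distances in leaves
    return heapq.nsmallest(k, cands, key=lambda x: abs(x - q))   # top-k

query(19, k=2)                  # only the containing leaf: misses 21
query(19, k=2, extra_leaves=1)  # neighbors too: better recall, more work
```

The second call finds 21 (distance 2) that the first call misses because 21 sits just across a split boundary — the same recall/latency trade that backtracking or multi-leaf search makes in a real tree index.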

Trade-offs and limitations

Pros: Clear geometric interpretation; no per-point links (lower memory than graph indexes); deterministic partition structure; good for low-dimensional data and small-to-medium scale. Cons: In high dimensions, the number of cells grows and backtracking can visit many nodes, eroding the speed advantage over brute force (curse of dimensionality). Inserts and updates often require rebalancing or full rebuilds; many ANN tree implementations are static.

Compared to IVF: both partition space; IVF uses a fixed set of cells (e.g. Voronoi) and probes nprobe cells, while trees use a hierarchical split. Compared to HNSW: trees do not store explicit neighbor edges; graph indexes can achieve higher recall at low latency in high dimensions because navigation is driven by nearest-neighbor links rather than hyperplane sides.
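The IVF side of this comparison can be sketched with hand-picked centroids and inverted lists (toy data, not a library API); `nprobe` controls how many flat cells are scanned, where a tree would instead descend a hierarchy:

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy IVF: a fixed set of centroids, each owning an inverted list
# of the vectors assigned to its (Voronoi) cell.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
lists = {0: [(0.2, 0.1), (0.9, 0.4)],
         1: [(5.1, 4.8), (4.7, 5.3)],
         2: [(9.8, 0.3), (10.2, 0.1)]}

def ivf_query(q, nprobe=1, k=1):
    """Rank centroids, scan only the nprobe nearest cells, return top-k."""
    order = sorted(range(len(centroids)), key=lambda c: dist2(q, centroids[c]))
    cands = [v for c in order[:nprobe] for v in lists[c]]
    return sorted(cands, key=lambda v: dist2(q, v))[:k]

ivf_query((4.9, 5.0), nprobe=1)   # probes only the cell around (5, 5)
```

Raising `nprobe` plays the same role for IVF that backtracking or extra leaves play for trees: more cells scanned, higher recall, higher latency.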

Role in vector databases

Pure space-partitioning trees are less common in production vector DBs than IVF or HNSW because high-dimensional data often makes tree pruning less effective and build/update more brittle. They still appear in libraries (e.g. Annoy, scikit-learn’s BallTree) and in hybrid or educational settings.

Understanding them clarifies the difference between “partition by geometry” (trees, IVF) and “navigate by neighbor links” (graph indexes). Practical tip: for low-dimensional or small datasets, a KD-tree or ball tree can be sufficient and easy to reason about; for high-dimensional ANN at scale, prefer IVF or HNSW and tune nprobe or efSearch for your recall-latency target.

Frequently Asked Questions

What is the main drawback of KD-trees for vectors?

In high dimensions, the number of cells grows quickly and backtracking can visit many nodes, so you lose the speed advantage over brute force (curse of dimensionality).

How do random projection trees differ from KD-trees?

KD-trees split on one coordinate at a time; random projection trees split on a random linear combination of coordinates. Multiple random trees (as in Annoy) improve robustness and recall.

Can space-partitioning trees support inserts?

KD-trees can be updated with care but rebalancing is costly. Many ANN tree implementations are static and require rebuild for large updates.

When would I choose a tree over IVF?

For low-dimensional data or when you want a simple, interpretable partition. For high-dimensional ANN at scale, IVF or HNSW usually give better recall-latency and easier tuning.