
Indexing Algorithms - IVF & Quantization · Topic 100

Tree-based indexing (e.g., Annoy)

Tree-based ANN indexes partition the vector space using trees. Annoy (Approximate Nearest Neighbors Oh Yeah) builds multiple binary trees by splitting on random projections; at query time it descends the trees to reach leaf nodes and then searches within a small candidate set. This is a form of search space reduction that is simple, memory-efficient, and well suited to static or rarely updated data. Space-partitioning trees generalize the idea; in practice, HNSW and IVF often achieve better recall-latency trade-offs for vector databases.

Summary

  • Annoy: build n_trees binary trees (random hyperplane splits), query by descending to leaves and merging candidate sets; search_k controls candidate size. Compare with graph-based (HNSW) and cluster-based (IVF) approaches.
  • Pipeline: at build time, repeatedly split on random hyperplanes until leaves are small; at query time, descend each tree to a leaf, merge candidates, and run k-NN on the merged set. More trees → better recall, but more memory and slower queries.
  • Read-optimized, supports mmap from disk; good for static or batch-updated data. For dynamic updates and often better recall–latency, HNSW or IVF are preferred.
  • Trade-off: simple and memory-efficient vs. less flexible for updates; in high dimensions may need more trees for same recall.
  • Practical tip: tune n_trees and search_k until recall@K meets your target, then benchmark against HNSW and IVF on the same data. Random projections (related to the Johnson-Lindenstrauss lemma) underlie the splits.

How Annoy works

Each tree is built by repeatedly splitting the data: pick two points, define a hyperplane midway between them (or via random projection), and assign points to the left or right side. Recursively split until leaves hold at most a few vectors. At query time, you descend each tree (following the side of the hyperplane the query lies on) to a leaf, collect the vectors in the leaves you reached across all trees, and run k-NN within that candidate set (or merge and sort by distance). The parameter search_k controls how many tree nodes are inspected at query time; higher values explore more of each tree and yield a larger candidate set. More trees generally improve recall but increase memory and query time.
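The build-and-descend procedure above can be sketched in pure Python. This is an illustrative toy, not the actual Annoy implementation (Annoy is C++ with many optimizations); all function names here are made up for the sketch:

```python
import random
import math

def build_tree(points, leaf_size=4, rng=random):
    """Recursively split points with random hyperplanes (Annoy-style sketch)."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    # Pick two points; the split is (roughly) their perpendicular bisector.
    a, b = rng.sample(points, 2)
    normal = [ai - bi for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left = [p for p in points if sum(n * x for n, x in zip(normal, p)) <= offset]
    right = [p for p in points if sum(n * x for n, x in zip(normal, p)) > offset]
    if not left or not right:  # degenerate split: stop recursing
        return {"leaf": points}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, leaf_size, rng),
            "right": build_tree(right, leaf_size, rng)}

def descend(tree, q):
    """Follow the query's side of each hyperplane down to a leaf."""
    while "leaf" not in tree:
        side = sum(n * x for n, x in zip(tree["normal"], q))
        tree = tree["left"] if side <= tree["offset"] else tree["right"]
    return tree["leaf"]

def query(forest, q, k):
    """Merge leaf candidates from every tree, then exact k-NN on the merged set."""
    candidates = {tuple(p) for t in forest for p in descend(t, q)}
    return sorted(candidates, key=lambda p: math.dist(p, q))[:k]

random.seed(0)
data = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
forest = [build_tree(data) for _ in range(10)]  # n_trees = 10
neighbors = query(forest, data[0], k=5)
```

Because a query descends each tree by the same rule used at build time, a query vector that is in the dataset always reaches a leaf containing itself, so its nearest candidate is itself at distance zero.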

When to use tree-based indexes

Annoy is read-optimized and supports memory-mapping the index from disk; it is a good fit for static or batch-updated data and simple deployments. For dynamic updates and often for better recall at low latency, HNSW or IVF are preferred in production vector DBs. Tree-based indexing remains useful for teaching, prototyping, and when you want minimal dependencies or disk-backed static indexes.

Pipeline and trade-offs

Build pipeline: for each tree, pick random points, define a hyperplane, split the data, and recurse until the leaf-size limit is reached. Query pipeline: descend each tree to a leaf, collect the leaf vectors, merge, and compute k-NN on the merged set (or sort by distance). Trade-off: a larger n_trees improves recall but increases memory and build time, and a higher search_k improves recall but increases query time; Annoy trades simplicity and disk-friendliness for less flexibility on updates and often lower recall than HNSW/IVF at similar latency.

Practical tip: increase n_trees and search_k until recall@K meets your target; then compare latency and memory to HNSW and IVF on the same data. Search space reduction techniques and space-partitioning trees give broader context; the recall-latency trade-off curve helps compare methods.
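When tuning n_trees and search_k against a recall@K target, the metric itself is just the overlap between the approximate and exact top-K sets. A minimal helper (function names are illustrative, and the approximate result here is constructed by hand for the demo):

```python
import math
import random

def exact_knn(data, q, k):
    """Ground-truth top-k by brute-force distance sort; returns indices."""
    return sorted(range(len(data)), key=lambda i: math.dist(data[i], q))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of true top-K neighbors that the ANN index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(1)
data = [[random.gauss(0, 1) for _ in range(8)] for _ in range(100)]
q = data[0]
truth = exact_knn(data, q, k=10)
# Pretend the index found 8 of the 10 true neighbors plus 2 ids outside the set:
approx = truth[:8] + [100, 101]
r = recall_at_k(approx, truth)
print(r)  # → 0.8
```

In practice you would sample a few hundred queries, average recall@K over them, and plot it against measured latency for each (n_trees, search_k) setting to get the recall-latency curve.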

Frequently Asked Questions

What are n_trees and search_k in Annoy?

n_trees: the number of independent trees built at index time. search_k: the number of tree nodes inspected at query time; if left unset, Annoy defaults it to roughly n_trees × k, where k is the number of neighbors requested. Higher search_k → more candidates, better recall, slower queries.
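A toy calculation of that default, assuming the n_trees × k rule described in Annoy's documentation (the helper name is made up):

```python
def default_search_k(n_trees, k):
    # Approximate default Annoy uses when search_k is not set explicitly.
    return n_trees * k

# 50 trees, top-10 query → roughly 500 nodes inspected by default.
print(default_search_k(50, 10))  # → 500
```

Raising search_k above this default is the usual first knob for recovering recall without rebuilding the index.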

Does Annoy support incremental insert?

Annoy is designed for static indexes; rebuilding is the standard way to add data. Some forks or wrappers support limited updates by rebuilding in the background.

How does Annoy handle high dimensions?

Random projection splits can still partition the space, but in very high dimensions the “curse of dimensionality” can reduce the effectiveness of axis-like splits; Annoy may need more trees for the same recall.

Can I use Annoy with cosine?

Yes. Normalize vectors and use Euclidean distance in Annoy (equivalent to cosine ranking for normalized vectors), or use Annoy's built-in angular metric, which measures this directly.
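The equivalence rests on the identity that for unit vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so ranking by either gives the same neighbor order. A quick numeric check:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit vectors, the dot product is exactly the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))
# Identity: ||a - b||^2 == 2 - 2*cos(a, b) when a and b are unit vectors.
print(abs(sq_euclid - (2 - 2 * cosine(a, b))) < 1e-12)  # → True
```

This is why normalizing embeddings before indexing lets a Euclidean index answer cosine queries correctly.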