Indexing Algorithms - IVF & Quantization · Topic 104

DiskANN: How to run ANN on SSDs

When the vector index is too large for RAM, disk-based ANN keeps the index on SSD and answers queries by reading only the pages it needs. DiskANN (and the underlying Vamana graph) is designed for this: the graph has bounded out-degree and is laid out so that most query-time access is localized, keeping the number of random reads per query small. Combined with caching (e.g. hot nodes in RAM), this enables ANN on SSDs at billion scale, with latency dominated by a handful of SSD reads per query rather than by loading the full index. It fits between GPU search (index in GPU RAM) and distributed or tiered storage. This topic covers why SSD, how DiskANN works, and practical tuning.

Summary

  • Index lives on SSD; query loads only graph nodes (and optionally vectors) needed for the search. Vamana gives a disk-friendly graph; layout and caching are key. Enables billion-scale without proportional RAM.
  • SSD offers large capacity at lower cost than RAM; the challenge is minimizing random I/O via graph layout and caching.
  • Pipeline: entry point → traverse graph, loading only visited nodes from disk → cache hot nodes in RAM → return top-k.
  • Trade-offs: no need to fit full index in RAM vs. higher latency than in-memory; tune block size, cache size, and parallelism.
  • Practical tip: align block size to SSD page size; size the cache for your working set; consider multiple SSDs for throughput.

Why SSD and not just more RAM?

At billion scale, full-precision vectors and graph links can require hundreds of gigabytes to terabytes. SSDs offer far more capacity than typical server RAM at much lower cost per byte. The catch is that random access on SSD is orders of magnitude slower than RAM, so the index must be designed to minimize random reads, e.g. by clustering related graph nodes on disk and reading sequentially where possible.
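The capacity argument can be made concrete with back-of-envelope arithmetic. The numbers below (1B vectors, 128 dims, out-degree 64) are illustrative assumptions, not figures from any specific deployment:

```python
# Back-of-envelope sizing for a billion-scale index (assumed parameters).
n, dims, bytes_per_float = 1_000_000_000, 128, 4   # 1B float32 vectors
R, bytes_per_id = 64, 4                            # bounded out-degree graph

vector_bytes = n * dims * bytes_per_float          # raw vectors
graph_bytes = n * R * bytes_per_id                 # adjacency lists

print(vector_bytes / 1e9)  # 512.0 GB of vectors
print(graph_bytes / 1e9)   # 256.0 GB of edges
```

Roughly three quarters of a terabyte before any overhead, which is why "just add RAM" stops being attractive well before billion scale.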

Use disk-based ANN when the index exceeds available RAM and you want to avoid the cost of a huge RAM machine or complex sharding. It sits between in-memory (lowest latency, limited scale) and fully distributed or tiered storage (vectors on object store, with different access patterns).

How DiskANN-style systems work

The graph (e.g. Vamana) is built with a bounded out-degree so each node fits in a small number of disk blocks. At query time, you start from an entry point (or a small set), follow edges to neighbors, and load only the nodes you traverse. Layout algorithms try to place neighbors close on disk so that one read fetches many relevant nodes. A cache keeps frequently accessed nodes (e.g. top layers or popular regions) in RAM. Result: thousands of queries per second per SSD with recall comparable to in-memory graph search, at much higher scale.
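A minimal sketch of this lazy, graph-guided search, with a dict standing in for the on-disk node store and a counter simulating block reads. All names and the toy graph are illustrative assumptions, not DiskANN's API; real systems bound the walk with a beam width and visit budget:

```python
import heapq

# Toy greedy search over a graph whose nodes live "on disk": fetch() stands
# in for an SSD block read and is only called for nodes the walk touches.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(entry, query, disk_read, k=2, budget=4):
    reads = [0]
    cache = {}                          # RAM cache of already-read nodes

    def fetch(node):                    # one simulated SSD read per new node
        if node not in cache:
            cache[node] = disk_read(node)
            reads[0] += 1
        return cache[node]

    vec, _ = fetch(entry)
    frontier = [(dist(vec, query), entry)]   # min-heap ordered by distance
    visited, results = set(), []
    while frontier and len(visited) < budget:
        d, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        results.append((d, node))
        _, nbrs = fetch(node)                # edges come from the same read
        for nb in nbrs:
            if nb not in visited:
                nvec, _ = fetch(nb)
                heapq.heappush(frontier, (dist(nvec, query), nb))
    results.sort()
    return [n for _, n in results[:k]], reads[0]

# node -> (vector, neighbor list); each entry plays the role of a disk block
store = {
    0: ((0.0, 0.0), [1, 2]),
    1: ((1.0, 0.0), [0, 3]),
    2: ((0.0, 1.0), [0, 3]),
    3: ((1.0, 1.0), [1, 2]),
}
top, n_reads = greedy_search(0, (0.9, 0.1), store.__getitem__)
print(top, n_reads)   # [1, 0] 4
```

The toy graph is too small to show the savings, but on a billion-node graph the same walk touches only a tiny fraction of nodes, and the cache absorbs repeat fetches.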

Pipeline and trade-offs

Build: construct the graph (e.g. Vamana) in memory or in a disk-friendly order, then write nodes to SSD with a layout optimized for locality. Query: load the entry point from cache or disk → greedy walk, loading only the nodes visited → merge freshly read nodes with the cache → return top-k. Trade-off: you avoid loading the full index but pay per-query I/O; latency is dominated by the number of SSD reads per query and the cache hit rate.

Practical tips: align block size to the SSD page/block size (e.g. 4 KB); size the cache to your working set (e.g. the top of the graph or the hottest regions); use multiple SSDs for parallel I/O if QPS is high. IVF on disk is an alternative (sequential scan of a few lists per query); a graph is optimized for fewer, more localized reads.
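What "block alignment" means in practice can be sketched by packing a node record (vector plus fixed-length neighbor list) into a 4 KB block, so one aligned read returns both the vector and the edges needed to continue the walk. The field layout and sizes are assumptions for illustration, not DiskANN's on-disk format:

```python
import struct

BLOCK = 4096          # match the SSD page size
DIMS, R = 128, 64     # vector dims, max out-degree

def pack_node(vector, neighbors):
    """Pack one node's float32 vector and padded neighbor list into a block."""
    assert len(vector) == DIMS and len(neighbors) <= R
    nbrs = list(neighbors) + [-1] * (R - len(neighbors))  # pad with sentinel
    payload = struct.pack(f"<{DIMS}f{R}i", *vector, *nbrs)  # 768 bytes here
    return payload.ljust(BLOCK, b"\x00")                  # align to block size

rec = pack_node([0.0] * DIMS, [1, 2, 3])
print(len(rec))   # 4096: one aligned SSD read per node visit
```

Since the payload is only 768 bytes, a real layout would co-locate several neighboring nodes in the same 4 KB block instead of padding, so one read prefetches likely next hops.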

Frequently Asked Questions

What is the main bottleneck for disk ANN?

SSD read latency and bandwidth. Random reads are costlier than sequential ones, so graph layout and caching matter more than raw compute.
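A rough expected-latency model makes the point; all numbers below are assumed for illustration:

```python
# Each query fetches `hops` nodes; a cache hit costs ~RAM latency,
# a miss costs one random SSD read. Every value here is an assumption.
hops = 100                   # nodes fetched per query
hit_rate = 0.8               # fraction of fetches served from the RAM cache
ram_us, ssd_us = 0.2, 100.0  # illustrative per-access latencies (microseconds)

expected_us = hops * (hit_rate * ram_us + (1 - hit_rate) * ssd_us)
print(round(expected_us, 1))  # 2016.0 -- the miss term dominates
```

Even at an 80% hit rate, SSD misses account for nearly all of the ~2 ms, which is why cache hit rate and reads-per-query matter more than compute.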

Can I use IVF on disk instead of a graph?

Yes. IVF lists can be stored on disk and read sequentially when probing clusters; that’s also disk-friendly. Graph (Vamana) is optimized for fewer, more localized reads per query.
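The disk-friendliness of IVF comes from each inverted list being a contiguous run on disk, so probing a cluster is one sequential read followed by an in-memory scan. A sketch with an in-memory buffer standing in for the file; the record layout and function names are assumptions for illustration:

```python
import io
import struct

REC = struct.calcsize("<iff")   # 12 bytes: (vector id, x, y) per record

def write_lists(lists):
    """Serialize each cluster's records contiguously; return buffer + offsets."""
    buf, offsets = io.BytesIO(), {}
    for cid, recs in lists.items():
        offsets[cid] = (buf.tell(), len(recs))
        for vid, x, y in recs:
            buf.write(struct.pack("<iff", vid, x, y))
    return buf, offsets

def probe(buf, offsets, cid, query):
    """One sequential read of a cluster's list, then a linear scan."""
    off, n = offsets[cid]
    buf.seek(off)
    data = buf.read(n * REC)                     # single contiguous read
    best = None
    for i in range(n):
        vid, x, y = struct.unpack_from("<iff", data, i * REC)
        d = (x - query[0]) ** 2 + (y - query[1]) ** 2
        if best is None or d < best[0]:
            best = (d, vid)
    return best[1]

buf, offs = write_lists({0: [(10, 0.0, 0.0), (11, 1.0, 1.0)],
                         1: [(20, 5.0, 5.0)]})
print(probe(buf, offs, 0, (0.9, 0.9)))   # 11
```

The contrast with the graph approach: IVF does a few large sequential reads and scans everything it fetched, while the graph does many small, localized reads but touches far fewer vectors.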

How do I tune for SSD?

Match block size to SSD page/block alignment; tune cache size (e.g. how much of the graph to keep in RAM); and consider multiple SSDs for parallelism.

Is DiskANN a product or an algorithm?

“DiskANN” usually refers to the research system and, by extension, the idea of running a Vamana graph from disk. The search algorithm is Vamana; the engineering (on-disk layout, I/O, caching) is what makes it practical on SSDs.