Overcoming RAM limitations for billion-scale vectors
At billion-scale, holding all vectors and index structures in RAM is often impractical (e.g. 1B × 768 dimensions × 4 bytes ≈ 3 TB). Vector databases overcome RAM limits by combining on-disk indexes, quantization, sharding, and tiered storage so that only hot data and index metadata need to fit in memory.
Summary
- At billion-scale, full index in RAM is often impractical (e.g. 1B × 768d × 4 bytes ≈ 3 TB). Combine on-disk indexes, quantization, sharding, and tiered storage so only hot data and index metadata need to fit in memory.
- DiskANN-style designs: index on SSD, stream the pages touched during search, cache hot regions. Quantization (SQ, PQ) shrinks vector size 4–8× or more. Sharding: split across nodes, coordinator merges. Cold data on object storage.
- Goal: keep latency and recall acceptable; accept more disk I/O and recall/speed trade-off (e.g. efSearch, coarser quantization); design for horizontal scaling and distributed index building.
- Pipeline: query hits coordinator → fan-out to shards or disk index → each shard (or disk search) returns candidates → merge and optionally refine; caching and quantization reduce the per-node footprint.
- Practical tip: size the deployment starting from memory usage per million vectors; combine a disk index with quantization for single-node scale, then add sharding when QPS or capacity requires it.
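The sizing arithmetic behind the summary can be sketched as a small helper. This is back-of-envelope math only (float32 components, decimal GB), not a capacity plan; the function names are illustrative, not from any library:

```python
# Rough sizing helpers: raw vector memory and a quantized footprint.

def bytes_per_million(dim: int, bytes_per_component: int = 4) -> int:
    """Raw storage for 1M vectors of the given dimension (float32 by default)."""
    return 1_000_000 * dim * bytes_per_component

def estimate_total_gb(n_vectors: int, dim: int, compression: float = 1.0) -> float:
    """Total size in GB after an optional compression factor (e.g. 4 for int8 SQ)."""
    raw_bytes = n_vectors * dim * 4
    return raw_bytes / compression / 1e9

# 1B x 768d float32 ~= 3 TB raw; ~768 GB with 4x scalar quantization.
print(estimate_total_gb(1_000_000_000, 768))        # 3072.0 (GB)
print(estimate_total_gb(1_000_000_000, 768, 4.0))   # 768.0 (GB)
```

Even at 4× compression, ~768 GB still exceeds most single nodes' RAM, which is why disk indexes and sharding enter the picture.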
Strategies: disk indexes, quantization, sharding, tiering
DiskANN and similar designs keep the index on SSD and stream in only the pages touched during search, with caching for hot regions. Scalar and product quantization shrink vector size by 4–8× or more, so more vectors fit in the same RAM or disk bandwidth. Sharding splits the dataset across nodes so each node handles a subset; the coordinator merges results. Cold data can sit on object storage and be probed less frequently or with a separate index.
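To make the quantization point concrete, here is a minimal int8 scalar quantizer: one byte per component plus a per-vector scale and offset, a 4× reduction over float32. This is an illustrative sketch, not how a production SQ/PQ implementation handles training or distance computation:

```python
import numpy as np

def sq_encode(x: np.ndarray):
    """Per-vector int8 scalar quantization: map each row's range onto 0..255."""
    lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero for constant vectors
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def sq_decode(codes, lo, scale):
    """Approximate reconstruction of the original float32 vectors."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 64)).astype(np.float32)
codes, lo, scale = sq_encode(x)
err = np.abs(x - sq_decode(codes, lo, scale)).max()
print(codes.nbytes, x.nbytes)  # 64000 vs 256000 bytes: a 4x reduction
```

The reconstruction error stays small (bounded by half a quantization step per component), which is why distances computed on codes remain usable for candidate generation.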
Pipeline: client sends query to coordinator → coordinator fans out to shards (or single-node disk index); each shard runs ANN and returns candidates → coordinator merges (e.g. k-way merge or RRF) and returns top-k. Quantization and caching reduce the amount of data each node must hold or read.
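The merge step of the pipeline can be sketched with a k-way merge over per-shard candidate lists. This assumes each shard returns `(distance, id)` pairs already sorted ascending by distance; the function name is illustrative:

```python
import heapq

def merge_topk(shard_results, k):
    """k-way merge of sorted per-shard candidate lists into the global top-k."""
    return heapq.nsmallest(k, heapq.merge(*shard_results))

# Each shard's local top-k, sorted by distance (smaller = closer).
shard_a = [(0.12, "a1"), (0.40, "a2"), (0.95, "a3")]
shard_b = [(0.08, "b1"), (0.33, "b2"), (0.70, "b3")]
print(merge_topk([shard_a, shard_b], 3))
# [(0.08, 'b1'), (0.12, 'a1'), (0.33, 'b2')]
```

In a real coordinator the shard calls run concurrently and IDs may be re-ranked against full-precision vectors before returning, but the merge itself is this simple.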
Trade-offs and design goals
The goal is to keep latency and recall acceptable while scaling beyond a single machine’s RAM. That typically means accepting more disk I/O and some recall/speed trade-off (e.g. higher efSearch or coarser quantization) and designing for horizontal scaling and distributed index building.
Trade-off: scale and cost vs. latency and recall; disk and quantization add latency and may reduce recall unless tuned. Practical tip: benchmark at target scale with your dimension and distance; use memory usage per million vectors and build-time metrics to plan capacity.
Frequently Asked Questions
How much RAM do I need for 1B vectors?
Raw float32: 1B × 768 × 4 bytes ≈ 3 TB—usually not feasible on one node. With PQ or SQ (4–8× compression), sharding across many nodes, and tiered storage, each node may hold a fraction in RAM and the rest on disk or S3. Scale the estimate to your own dimension using memory usage per million vectors.
Does billion-scale always mean lower recall?
Not necessarily, but you often trade something: a higher efSearch (slower), coarser quantization (smaller but less accurate), or probing more IVF clusters. With DiskANN and good caching, you can keep recall high at the cost of latency and infrastructure.
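The "probe more IVF clusters" trade-off can be shown with a toy inverted-file index: vectors are assigned to the nearest of a few centroids, and search scans only the `nprobe` closest clusters. A sketch under simplifying assumptions (random centroids instead of k-means, brute-force distances):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 32)).astype(np.float32)

# "Train" a toy IVF: pick 16 centroids and assign every vector to its nearest one.
n_clusters = 16
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
assign = np.argmin(((data[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(q, nprobe, k=10):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probed = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, probed))
    d = ((data[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]

q = rng.standard_normal(32).astype(np.float32)
exact = np.argsort(((data - q) ** 2).sum(-1))[:10]
for nprobe in (1, 4, n_clusters):
    recall = len(set(ivf_search(q, nprobe)) & set(exact)) / 10
    print(nprobe, recall)  # recall reaches 1.0 when every cluster is probed
```

Higher `nprobe` means scanning more vectors (more I/O at billion scale) in exchange for recall; probing every cluster degenerates to exact search.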
Sharding vs. single-node disk index?
Sharding spreads load and memory across nodes; each node serves a subset. A single-node disk index (e.g. DiskANN) keeps everything on one machine’s SSD. Use sharding when one node can’t hold the index or handle the QPS; use a single-node disk index when the dataset fits and you want simpler operations.
What about indexing time at billion-scale?
Building an index over 1B vectors is CPU and I/O intensive. Distributed index building parallelizes across nodes; incremental or streaming builds avoid full rebuilds. Expect hours or more for a full build depending on index type and hardware; see measuring index build time.