Indexing Algorithms - HNSW · Topic 67

Memory consumption of HNSW indexes

HNSW memory has two main parts: (1) storing the vectors themselves (e.g. float32 → 4 bytes per dimension), and (2) storing the graph—each node has up to M neighbor IDs per layer, and each vector appears on multiple layers. So total memory grows with dataset size, dimensionality, M, and the distribution of node levels (how many nodes exist at each layer).

Summary

  • Vector storage: n × d × 4 bytes (float32). Graph: roughly n × 2M × id size (layer 0 holds up to 2M neighbor IDs per node; upper layers add only a few percent on average). Quantization, lower M, or pruning reduce the footprint.
  • Billion-scale can exceed RAM → on-disk or hybrid; see memory per million vectors and comparison with IVF.
  • Graph overhead is on the order of 2M × 4–8 bytes per vector for neighbor IDs (the geometric level distribution means most nodes exist only at layer 0); for large M or low d, the graph can be a significant fraction of total memory.
  • Pipeline: there is no separate “graph build” memory spike; memory grows incrementally as vectors and their links are added. Query time adds a temporary efSearch-sized candidate list per query.
  • Trade-off: lowering M or using quantization saves memory but can hurt recall; for very large scale, consider disk-backed HNSW or IVF-based indexes.

Breaking down the footprint

Roughly: vector storage is n × d × 4 bytes for n float32 vectors of dimension d. Graph overhead is on the order of n × (average neighbors per node) × size_of_id. The graph as a whole has O(log n) layers, but the level distribution is geometric, so the expected number of layers per node is a small constant (roughly 1 + 1/(M−1) with the usual mL = 1/ln M). Each node stores up to 2M neighbor IDs at layer 0 and up to M per upper layer, so with 4- or 8-byte IDs the graph adds tens to a few hundred bytes per vector, depending on M.

For large M (or low d, where vectors are cheap), the graph can be a significant fraction of total memory. Quantization (e.g. int8 or PQ) reduces vector size; lowering M or using pruning reduces graph size at the cost of recall.
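The breakdown above can be turned into a small estimator. This is an illustrative model, not tied to any particular library: it assumes float32 vectors, 4-byte neighbor IDs, up to 2M links at layer 0, M per upper layer, and the usual mL = 1/ln M level distribution.

```python
def estimate_hnsw_memory(n, d, M, bytes_per_component=4, bytes_per_id=4):
    """Rough in-memory HNSW footprint, returned as (vector_bytes, graph_bytes).

    Model assumptions (illustrative, not a specific library's layout):
      - vectors: n x d components, `bytes_per_component` bytes each (4 = float32)
      - graph: up to 2*M neighbor IDs at layer 0, plus M IDs per upper layer;
        with mL = 1/ln(M) the expected number of upper layers per node is 1/(M-1)
    """
    vector_bytes = n * d * bytes_per_component
    avg_links = 2 * M + M / (M - 1)          # layer 0 plus expected upper-layer links
    graph_bytes = int(n * avg_links * bytes_per_id)
    return vector_bytes, graph_bytes

vec_bytes, graph_bytes = estimate_hnsw_memory(1_000_000, 384, 16)
# vectors ~1.5 GB, graph links ~0.13 GB
```

Real indexes add per-node bookkeeping (level, offsets, deleted flags), so treat the result as a lower bound.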

Vector storage vs. graph storage

Vector storage dominates when d is large and M is modest; graph storage grows with M. For 1M vectors of dimension 384 (float32), vectors alone are about 1.5 GB. With M=16, layer 0 holds up to 2M = 32 neighbor IDs per node and the upper layers add roughly one more on average, so graph links come to about 1M × 33 × 4 bytes ≈ 0.13 GB, roughly a tenth of the vector storage. For higher M or lower d, the graph share increases.
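As a sanity check on the 1M × 384 example, a quick back-of-envelope, under the common assumption that layer 0 holds up to 2M 4-byte neighbor IDs per node and upper layers contribute only about one extra link per node on average:

```python
n, d, M = 1_000_000, 384, 16

vector_gb = n * d * 4 / 1e9              # float32: 4 bytes per component
links_per_node = 2 * M + 1               # up to 2M IDs at layer 0; upper layers ~1 more
graph_gb = n * links_per_node * 4 / 1e9  # 4-byte neighbor IDs

print(f"vectors: {vector_gb:.2f} GB, graph: {graph_gb:.2f} GB")
# vectors: 1.54 GB, graph: 0.13 GB
```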

Scale and alternatives

For billion-scale vectors, in-memory HNSW can exceed available RAM, which motivates on-disk or hybrid designs and tiered storage. Memory per million vectors is a common benchmark; HNSW typically uses more memory than IVF-style indexes but offers better recall for a given query latency.

When RAM is a hard limit, options include: quantizing vectors (int8, PQ) to shrink vector storage; reducing M or using aggressive pruning; moving to disk-backed HNSW (load segments on demand); or using IVF (e.g. IVFPQ), which trades some recall for lower memory and better scalability to billions.
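A minimal sketch of the int8 option, using a simple per-vector symmetric scalar quantizer (real libraries may use per-dimension scales or PQ instead; this only illustrates the 4x storage reduction on the vector side):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.standard_normal((10_000, 384)).astype(np.float32)

# Symmetric scalar quantization to int8: one scale per vector so that the
# largest component maps to +/-127.
scales = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
q = np.round(vecs / scales).astype(np.int8)

# Vector storage shrinks 4x; graph links are unaffected.
assert q.nbytes * 4 == vecs.nbytes

# Approximate reconstruction, as used by quantized distance computations.
recon = q.astype(np.float32) * scales
max_err = float(np.abs(recon - vecs).max())   # bounded by scale/2 per component
```

The recall cost of this scheme depends on the data distribution; libraries that support scalar quantization typically expose it as an index option rather than requiring manual conversion.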

Pipeline and query-time memory

Index memory grows as vectors are inserted: each new node adds its vector and its neighbor lists. There is no separate “graph build” phase that doubles memory; growth is incremental. At query time, each search allocates a temporary candidate list of size efSearch (and possibly other buffers), so peak memory during a query is index size plus O(efSearch) per concurrent query. For high throughput, limit concurrency or efSearch to avoid memory spikes.
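The peak-memory observation above can be written as a back-of-envelope formula; `bytes_per_candidate` here is an assumed, illustrative figure, since the actual per-entry overhead depends on the library's heap and visited-set implementation:

```python
def peak_serving_bytes(index_bytes, ef_search, concurrent_queries,
                       bytes_per_candidate=16):
    """Index size plus per-query scratch: each in-flight query keeps an
    efSearch-sized candidate heap (id + distance + bookkeeping).
    `bytes_per_candidate` is an assumed, illustrative figure."""
    return index_bytes + concurrent_queries * ef_search * bytes_per_candidate

# e.g. a 2 GB index, efSearch=200, 64 concurrent queries:
# scratch = 64 * 200 * 16 bytes (~0.2 MB), negligible next to the index,
# though visited lists and result buffers add more in practice.
```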

Trade-offs and practical tips

Trade-off summary: memory scales with n, d, and M; reducing any of these or using quantization saves memory but can hurt recall or build cost. Practical tip: estimate the footprint before scaling, using roughly n × (d × 4 + 2 × M × 4) bytes for an in-memory HNSW with float32 vectors, 4-byte IDs, and the default layer distribution. If the result exceeds available RAM, plan for quantization, a disk-backed index, or an IVF-based design.

Pipeline summary: when designing for scale, compute the vector + graph footprint first; use memory usage per million vectors as a reference. “How quantization reduces memory footprint” and “trade-off between compression ratio and accuracy” apply if you quantize; “HNSW on disk vs. HNSW in RAM” and “overcoming RAM limitations for billion-scale vectors” describe alternatives when in-memory HNSW does not fit.

Frequently Asked Questions

How much does the graph add over raw vectors?

Typically on the order of 2M × 4–8 bytes per vector in neighbor IDs (layer 0 holds up to 2M; upper layers add only a few percent). For M=16 and 1M vectors with 4-byte IDs, that’s on the order of 130 MB extra.

Can I use quantized vectors in HNSW?

Yes. Store int8 or PQ codes instead of float32; distance computations use the quantized representation. Saves memory and can speed up search.

Does HNSW memory scale linearly with n?

Roughly yes: vectors scale as n × d, and the graph as roughly n × 2M, since the expected number of layers per node is a small constant. Constants and the exact layer distribution affect the factor.

How do I reduce HNSW memory?

Lower M, use quantization for vectors, or move to disk-backed or IVF-based design for very large scale.