Cache management for vectors
Cache management in a vector database decides which vector data and index structures stay in fast memory (RAM or SSD cache) and which are read from disk or tiered storage. Good caching reduces latency and disk I/O for repeated and hot nearest neighbor queries. This topic covers what gets cached, sizing, and distributed behavior.
Summary
- Caches hold: raw vectors or quantized codes (refinement), index nodes (e.g. HNSW, IVF chunks), metadata/payloads, and sometimes full query results. Eviction policies are typically LRU or similar.
- With mmap, the OS page cache caches hot file regions; the VDB may add an application-level cache for decoded vectors or index structures.
- Cache size trades memory cost against hit rate. In distributed setups each node keeps its own local cache; load balancing spreads hot data across nodes. Good cache management is key to predictable p99 latency when the dataset exceeds RAM.
- Trade-off: larger cache improves hit rate and latency vs. memory cost; eviction policy (LRU, TTL) affects working-set behavior.
- Practical tip: profile hit rate and p99; size cache to cover the hot working set; use load balancing so heat is spread across nodes.
What gets cached
Caches can hold: (1) raw vectors or quantized codes for the “refinement” step after index traversal, (2) index nodes (e.g. HNSW graph nodes, IVF list chunks), (3) metadata or payloads for result assembly, and (4) sometimes full query results (result cache). Policies are often LRU or similar: evict least-recently-used pages when the cache is full.
With mmap, the OS page cache effectively caches hot file regions; the VDB may also maintain an explicit application-level cache for decoded vectors or index structures. Pipeline: query touches index and vectors → cache lookup for each needed page or block → on miss, read from disk and insert into cache → evict when full (e.g. LRU).
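The lookup → miss → insert → evict pipeline above can be sketched as a read-through LRU cache. This is a minimal illustration, not any specific vector database's implementation; `BlockCache` and `read_from_disk` are hypothetical names.

```python
from collections import OrderedDict

class BlockCache:
    """Read-through LRU cache for vector/index blocks (illustrative sketch)."""

    def __init__(self, capacity_blocks, read_from_disk):
        self.capacity = capacity_blocks
        self.read_from_disk = read_from_disk  # loader invoked on a miss
        self.blocks = OrderedDict()           # block_id -> data, in LRU order
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_from_disk(block_id)   # miss: read from disk
        self.blocks[block_id] = data           # insert into cache
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data
```

In a real system the cached unit would be a page, an IVF list chunk, or a decoded HNSW node rather than an opaque block, but the access pattern is the same.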
Sizing and distributed behavior
Cache sizing affects memory usage and hit rates: too small and you pay extra disk I/O on every miss; too large and the cache competes for memory with the index, query execution, and other processes. In distributed setups, each node has its own cache; load balancing can help spread heat. Cache management is key to predictable p99 latency when the dataset exceeds RAM.
Trade-off: a larger cache improves hit rate and lowers p99 but uses more memory; LRU eviction keeps the recent working set resident, while TTLs bound staleness. Practical tip: monitor cache hit rate and p99 under production load; if p99 spikes, consider increasing cache size or improving locality (e.g. segment layout).
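The monitoring tip could be wired into a simple health check. Everything here is a hypothetical sketch: `cache_health` and its default thresholds (95% hit rate, 50 ms p99) are illustrative assumptions, not product defaults.

```python
def cache_health(hits, misses, p99_ms, target_hit_rate=0.95, target_p99_ms=50):
    """Flag a cache that is likely undersized (illustrative heuristic).

    Returns (observed_hit_rate, undersized_flag). A low hit rate or a p99
    above target suggests growing the cache or improving data locality.
    """
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    undersized = hit_rate < target_hit_rate or p99_ms > target_p99_ms
    return hit_rate, undersized
```

In practice these counters would come from the database's metrics endpoint and feed an alert, not a return value.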
Frequently Asked Questions
What does a vector DB cache?
Typically: raw vectors or quantized codes used in the refinement step, index nodes (e.g. HNSW graph nodes, IVF list chunks), metadata or payloads, and sometimes full query results. Eviction is usually LRU or similar.
How does mmap relate to caching?
With memory-mapped files, the OS page cache holds hot regions of the mapped files. The VDB can also maintain an application-level cache for decoded vectors or index structures on top of that.
How do I size the cache?
Balance memory usage and hit rate: too small → more disk I/O and higher latency on misses; too large → the cache starves other processes (and the OS page cache) of memory. Profile hit rate and p99 under your workload.
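As a back-of-envelope sizing aid, the memory needed to hold the hot set of raw float32 vectors can be estimated from count and dimensionality. `hot_set_bytes` and its 25% overhead factor (for index nodes, metadata, and allocator slack) are assumptions for illustration.

```python
def hot_set_bytes(n_hot_vectors, dim, bytes_per_component=4, overhead=1.25):
    """Estimate RAM for caching the hot working set of raw float32 vectors.

    overhead is an assumed multiplier covering index structures, payloads,
    and allocator slack; tune it from real measurements.
    """
    return int(n_hot_vectors * dim * bytes_per_component * overhead)

# e.g. 1M hot 768-dim float32 vectors: ~3.07 GB raw, ~3.84 GB with overhead
budget = hot_set_bytes(1_000_000, 768)
```

Quantized codes shrink this substantially (e.g. 8-bit scalar quantization cuts the per-component cost from 4 bytes to 1).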
In a distributed VDB, is the cache shared?
Usually each node has its own local cache. Load balancing spreads queries across nodes so that “hot” data gets cached on the nodes that serve it; there is typically no global shared cache.
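Because caches are node-local, stable query routing matters: if the same shard always lands on the same node, that node's cache stays warm for that shard's hot vectors. Below is a minimal sketch using rendezvous (highest-random-weight) hashing; `route_shard` is a hypothetical helper, not any specific product's router.

```python
import hashlib

def route_shard(shard_id, nodes):
    """Pick a node deterministically for a shard (rendezvous hashing).

    The same shard always routes to the same node while the node set is
    stable, so that node's local cache accumulates the shard's hot data.
    """
    def score(node):
        digest = hashlib.sha256(f"{shard_id}:{node}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=score)
```

Rendezvous hashing also minimizes cache churn when a node joins or leaves: only the shards whose winning node changed move, and only those caches go cold.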