Warm-up time for in-memory indexes
An in-memory vector index (e.g. HNSW or IVF in RAM) often needs a warm-up period after load: pages are brought into CPU caches, the OS page cache is populated, and branch predictors and TLBs warm up. During this period, latency can be higher and more variable than after the index is “warm.”
Summary
- An in-memory vector index (e.g. HNSW or IVF in RAM) often needs a warm-up period after load: data is brought into CPU caches, the OS page cache is populated, and branch predictors warm up. During this period latency can be higher and more variable. See memory usage and loading from disk.
- Causes: cold caches; loading from disk (OS page cache not yet populated); JIT/CPU warm-up; connection/pool warm-up.
- Mitigations: run a warm-up phase (representative queries before production traffic); use load balancers that delay traffic until after warm-up; baseline latency only after warm-up when stress testing or setting SLOs.
- Pipeline: load index, run warm-up queries, then serve. Practical tip: run 100–1000 representative queries after deploy before accepting traffic.
Causes of cold latency
Several effects contribute to cold latency:
- Cold caches: first touches to index pages cause cache misses; repeated queries touch the same regions and latency drops.
- Loading from disk: if the index is loaded from disk at startup, the OS may not yet have all blocks in the page cache; background read-ahead or explicit preloading can help.
- JIT and CPU warm-up: some runtimes or native code paths optimize after a few invocations.
- Connection and pool warm-up: first requests may pay for connection setup or pool initialization.
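The cold-cache effect can be observed directly. The sketch below is a minimal illustration that uses a brute-force NumPy scan in place of a real HNSW/IVF index (the cache behavior is analogous): the first query pays first-touch costs, and repeated queries over the same data tend to run faster.

```python
import time
import numpy as np

def brute_force_search(index, query, k=10):
    # Exact nearest-neighbor scan; stands in for any in-memory index.
    dists = np.linalg.norm(index - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
index = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

timings = []
for _ in range(5):
    t0 = time.perf_counter()
    result = brute_force_search(index, query)
    timings.append(time.perf_counter() - t0)
# The first iteration typically pays cache-miss and page-fault costs;
# later iterations reuse warm caches and are usually faster.
```

On most machines the first timing is noticeably larger than the rest, though exact numbers depend on hardware and what else is running.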
Mitigations
Common mitigations:
- Run a warm-up phase after deploy or restart, e.g. send a batch of representative queries (or a random sample) before marking the node ready for production traffic.
- Use load balancers that delay sending traffic until health checks pass after warm-up.
- Baseline latency only after warm-up when stress testing or setting SLOs.
Pipeline: load index, run warm-up queries, then serve. Practical tip: run 100–1000 representative queries after deploy before accepting traffic.
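The load / warm-up / serve pipeline can be sketched as a small serving wrapper. This is a hypothetical `WarmableIndex` class (again using a brute-force NumPy scan as a stand-in for a real index) whose `ready` flag models the health check a load balancer would poll before routing traffic.

```python
import numpy as np

class WarmableIndex:
    """Serving wrapper: not ready until warm-up queries have run."""

    def __init__(self, vectors):
        self.vectors = vectors  # index loaded into memory
        self.ready = False      # health checks fail until warm-up completes

    def search(self, query, k=10):
        dists = np.linalg.norm(self.vectors - query, axis=1)
        return np.argsort(dists)[:k]

    def warm_up(self, sample_queries):
        # Touch the index with representative queries before accepting traffic.
        for q in sample_queries:
            self.search(q)
        self.ready = True

rng = np.random.default_rng(1)
idx = WarmableIndex(rng.standard_normal((50_000, 64)).astype(np.float32))
# Load balancer keeps traffic away while idx.ready is False.
idx.warm_up(rng.standard_normal((100, 64)).astype(np.float32))
# Health check now passes; the node can start serving.
```

In production the warm-up batch would be drawn from logged queries so the touched index regions match real traffic.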
Frequently Asked Questions
Why do in-memory indexes need warm-up?
After load, pages must be brought into CPU caches, the OS page cache must be populated, and branch predictors must warm up. Until then, latency can be higher and more variable than when the index is “warm.” See memory usage and loading from disk.
What causes cold latency?
First touches to index pages cause cache misses; repeated queries touch the same regions and latency drops. If the index is loaded from disk at startup, the OS may not have all blocks in page cache yet. JIT and connection/pool warm-up can also add to first-request cost.
How do I mitigate warm-up in production?
Run a warm-up phase after deploy or restart—e.g. send a batch of representative queries (or a random sample) before marking the node ready. Use load balancers that delay traffic until health checks pass after warm-up. See stress testing and load balancing.
When should I measure baseline latency?
Baseline latency and SLOs should be measured only after warm-up. When stress testing, run warm-up queries first so results reflect steady-state latency and QPS.
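One simple way to keep warm-up out of a baseline is to discard the first N samples of a latency trace before computing percentiles. The helper below is a sketch (the function name and the latency numbers are illustrative, not from any particular tool):

```python
import numpy as np

def steady_state_percentiles(latencies_ms, warmup_count=100):
    """Drop the first warmup_count samples so percentiles reflect warm behavior."""
    steady = np.asarray(latencies_ms[warmup_count:])
    return {"p50": float(np.percentile(steady, 50)),
            "p99": float(np.percentile(steady, 99))}

# Hypothetical trace: ~20 ms while cold, ~2 ms once warm.
latencies = [20.0] * 100 + [2.0] * 900
print(steady_state_percentiles(latencies))  # → {'p50': 2.0, 'p99': 2.0}
```

Including the cold samples here would inflate p99 tenfold, which is exactly the distortion that discarding the warm-up window avoids.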