Throughput (Queries Per Second – QPS)
Throughput, often expressed as queries per second (QPS), is the maximum rate at which the vector database can complete queries while still meeting latency or recall targets. It is a key metric for capacity planning and auto-scaling.
Summary
- Throughput (QPS): maximum rate at which the VDB can complete queries while meeting latency or recall targets. Key for capacity planning and auto-scaling.
- Measure: run a load test and increase the request rate until latency degrades (e.g. p99 > 200 ms) or the system starts dropping or queuing requests; the sustainable QPS is the rate just before that point.
- Dependencies: hardware, index type and parameters, and load balancing.
- Scaling: scaling out increases QPS until the coordinator or storage becomes the bottleneck. See cost per query and the recall-latency trade-off.
- Pipeline: ramp load, measure latency, find the knee.
- Practical tip: run stress tests at 2x expected peak QPS to validate headroom.
Measuring QPS
To measure QPS: run a load test that sends queries at an increasing rate until latency degrades beyond a threshold (e.g. p99 > 200 ms) or the system starts dropping or queuing requests. The sustainable QPS is the rate just before that point. Throughput depends on hardware (CPU, memory, disk), index type and parameters (e.g. HNSW ef), and load balancing across nodes.
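The p99 threshold above is a percentile over the per-query latencies collected at one request rate. A minimal sketch of that calculation follows, assuming the nearest-rank definition of a percentile (load-testing tools may interpolate differently). The helper name `percentile_ms` is hypothetical.

```python
from typing import Sequence


def percentile_ms(latencies_ms: Sequence[float], pct: int = 99) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(latencies_ms)
    # Nearest rank: the smallest sample with at least pct% of samples at or below it.
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]
```

With few samples the nearest-rank p99 is simply the slowest query, so each load step should collect enough queries for the tail to be meaningful.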
In short, the pipeline is: ramp the load, measure latency at each step, and find the knee, the point where latency starts to rise sharply. As a practical tip, run stress tests at 2x the expected peak QPS to validate headroom.
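The ramp-and-find-the-knee procedure can be sketched as a small post-processing step over load-test results. This is an illustration under stated assumptions, not a definitive method: the `RampStep` record, the function names, and the rule that any dropped request counts as degradation are all hypothetical, and the 200 ms limit is only the example threshold used above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RampStep:
    offered_qps: float  # request rate sent during this step
    p99_ms: float       # 99th-percentile latency observed at this rate
    error_rate: float   # fraction of requests dropped, rejected, or timed out


def sustainable_qps(
    steps: Sequence[RampStep],
    p99_limit_ms: float = 200.0,  # example threshold from the text
    max_error_rate: float = 0.0,  # assumption: any drop counts as degradation
) -> Optional[float]:
    """Return the highest offered rate before the first step that misses a target."""
    best = None
    for step in sorted(steps, key=lambda s: s.offered_qps):
        if step.p99_ms > p99_limit_ms or step.error_rate > max_error_rate:
            break  # knee reached: stop at the first violating step
        best = step.offered_qps
    return best  # None if even the lowest rate misses the targets


def has_headroom(sustainable: float, expected_peak_qps: float, factor: float = 2.0) -> bool:
    """Check the rule of thumb: the sustainable rate covers factor x expected peak."""
    return sustainable >= factor * expected_peak_qps


# Made-up ramp results for illustration only.
ramp = [
    RampStep(100, 20.0, 0.0),
    RampStep(200, 35.0, 0.0),
    RampStep(400, 90.0, 0.0),
    RampStep(800, 260.0, 0.0),
    RampStep(1600, 900.0, 0.05),
]
knee_qps = sustainable_qps(ramp)
```

The resolution of the result depends on the step size of the ramp: with doubling steps as in the sample data, the true knee lies somewhere between the last passing rate and the first failing one, so a finer ramp in that range gives a tighter estimate.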
Scaling and cost
Scaling out (adding query nodes or replicas) typically increases QPS close to linearly until the coordinator or storage becomes the bottleneck. Rate limiting and caching affect how much traffic reaches the index. For cost analysis, cost per query (CPQ) is often computed from QPS, node count, and cloud spend. Compare with the recall-latency trade-off curve: higher recall or lower latency usually means lower max QPS for the same hardware.
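One way to turn QPS, node count, and cloud spend into a cost per query is sketched below. It assumes spend is dominated by a flat hourly price per node and that the cluster runs at the given QPS for the whole hour; storage, network, and idle time are ignored. The function and parameter names are hypothetical.

```python
def cost_per_query(qps: float, node_count: int, node_cost_per_hour: float) -> float:
    """Cost of one query at a sustained rate, in the currency of node_cost_per_hour."""
    if qps <= 0:
        raise ValueError("qps must be positive")
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    queries_per_hour = qps * 3600
    cluster_cost_per_hour = node_count * node_cost_per_hour
    return cluster_cost_per_hour / queries_per_hour


# Illustrative numbers only: 3 nodes at 1.20 per hour serving 500 QPS.
example_cpq = cost_per_query(qps=500, node_count=3, node_cost_per_hour=1.20)
```

Because real traffic rarely sits at the sustainable maximum, using average served QPS rather than peak QPS in this formula gives a more realistic figure.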
Frequently Asked Questions
What is QPS for a vector database?
QPS stands for queries per second: the maximum rate at which the vector database can complete queries while still meeting latency or recall targets. It is a key metric for capacity planning and auto-scaling.
How do I measure sustainable QPS?
Run a load test that sends queries at an increasing rate until latency degrades beyond a threshold (e.g. p99 > 200 ms) or the system starts dropping or queuing requests. Sustainable QPS is the rate just before that point. It depends on hardware, index type and parameters (e.g. HNSW ef), and load balancing.
How does scaling out affect QPS?
Adding query nodes or replicas typically increases QPS linearly until the coordinator or storage becomes the bottleneck. Rate limiting and caching affect how much traffic reaches the index. See cost per query.
How does recall/latency affect max QPS?
Higher recall or lower latency usually means lower max QPS for the same hardware; see the recall-latency trade-off curve. Index parameters such as HNSW ef and IVF nprobe control this trade-off.