Auto-scaling triggers for VDBs
Auto-scaling adds or removes compute (and sometimes storage) capacity based on metrics so that the vector database can handle traffic spikes without over-provisioning. Triggers are the rules or thresholds that decide when to scale up or down.
Summary
- Triggers decide when to scale: e.g. CPU, memory, QPS, latency (p95/p99), queue depth, or, for indexing, job backlog / segment count.
- Scale-down needs tuning to avoid flapping: cooldowns, minimum cluster size, hysteresis (different thresholds for scale-up vs. scale-down).
- In compute-storage separated designs, query nodes scale independently; in sharded setups, scaling may mean more shards/replicas and rebalancing. Often integrated with Kubernetes HPA or cloud auto-scaling groups.
- Pipeline: metrics exceed threshold, scale policy triggers, add/remove nodes.
- Practical tip: use hysteresis (e.g. scale up at 70% CPU, scale down at 40%) to avoid flapping.
Common scale-up triggers
- CPU utilization: scale out when average CPU exceeds a threshold (e.g. 70%).
- Memory: scale when memory pressure is high, especially for in-memory indexes.
- QPS / request rate: scale when queries per second exceed a target.
- Latency: scale when p95 or p99 latency degrades.
- Queue depth: scale when too many requests are waiting.
- Job backlog / segment count: for batch or indexing workloads, scale on pending jobs or accumulated segments.
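As an illustration, several of these triggers can be evaluated together in a single check. A minimal sketch follows; the function name, threshold values, and SLO numbers are hypothetical, not recommendations:

```python
import statistics

def should_scale_up(cpu_avg, qps, latencies_ms,
                    cpu_max=0.70, qps_target=5000, p99_slo_ms=100):
    """Return True if any scale-up trigger fires.

    cpu_avg      : average CPU utilization across query nodes (0.0-1.0)
    qps          : current queries per second
    latencies_ms : recent request latencies, used to estimate p99
    """
    # Estimate p99 from a window of samples (n=100 -> index 98 is the 99th cut)
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return (cpu_avg > cpu_max        # CPU trigger
            or qps > qps_target      # throughput trigger
            or p99 > p99_slo_ms)     # tail-latency (SLO) trigger

# Example: CPU and QPS look fine, but 5% of requests are slow,
# so the tail-latency trigger fires.
lat = [20] * 95 + [250] * 5
print(should_scale_up(0.50, 3000, lat))  # → True
```

Combining triggers with `or` means any one degraded signal scales the cluster out, which matches the usual bias toward protecting latency SLOs over saving cost.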
Scale-down and integration
Scale-down triggers (remove nodes when load drops) help control cost but must be tuned to avoid flapping: use cooldown periods, a minimum cluster size, and hysteresis, i.e. different thresholds for scale-up and scale-down (e.g. scale up at 70% CPU, scale down at 40%). In compute-storage separated designs, query nodes can scale independently; in traditional sharded setups, scaling often means adding shards or replicas and rebalancing. Integration with Kubernetes HPA or cloud auto-scaling groups is common. The pipeline: metrics (CPU, QPS, latency) exceed a threshold, the scale policy triggers, and nodes are added or removed.
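The scale-down tuning above (hysteresis, cooldown, minimum size) can be sketched as a small control loop. The class name and all thresholds here are illustrative assumptions:

```python
import time

class ScalePolicy:
    """Hysteresis + cooldown controller: scale up at 70% CPU, down at 40%."""

    def __init__(self, up_at=0.70, down_at=0.40,
                 min_nodes=2, max_nodes=16, cooldown_s=300):
        self.up_at, self.down_at = up_at, down_at
        self.min_nodes, self.max_nodes = min_nodes, max_nodes
        self.cooldown_s = cooldown_s
        self.last_action = 0.0

    def decide(self, cpu_avg, nodes, now=None):
        """Return the new node count for the observed average CPU."""
        now = time.monotonic() if now is None else now
        if now - self.last_action < self.cooldown_s:
            return nodes                      # still in cooldown: do nothing
        if cpu_avg > self.up_at and nodes < self.max_nodes:
            self.last_action = now
            return nodes + 1                  # scale up
        if cpu_avg < self.down_at and nodes > self.min_nodes:
            self.last_action = now
            return nodes - 1                  # scale down
        return nodes                          # 40-70% band: hold steady

policy = ScalePolicy()
print(policy.decide(0.85, nodes=4, now=1000))  # → 5 (scale up)
print(policy.decide(0.30, nodes=5, now=1100))  # → 5 (cooldown blocks it)
print(policy.decide(0.30, nodes=5, now=1400))  # → 4 (cooldown expired)
```

The 40-70% dead band is the hysteresis: if a single threshold were used for both directions, load hovering near it would trigger the scale-down/scale-up flapping described above.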
Frequently Asked Questions
What metrics should I use to trigger scale-up?
Common choices: CPU and memory utilization, QPS vs. target, p95/p99 latency, and queue depth. For indexing, use job backlog or segment count. Pick metrics that match your SLOs.
Why does scale-down need different tuning?
If scale-down uses the same threshold as scale-up, the cluster can “flap”—scale down, then immediately scale up again when load fluctuates. Use hysteresis (e.g. scale up at 70% CPU, scale down at 40%), cooldowns, and a minimum size to avoid thrashing.
Can query and indexing capacity scale separately?
In compute-storage separated architectures, yes: query nodes and indexing workers can have separate auto-scaling policies. In a single-node or tightly coupled sharded design, scaling often adds full nodes (query + index).
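Assuming a compute-storage separated design, the two tiers can be sized by unrelated formulas. This is a hypothetical sketch; the per-node capacities are made-up numbers:

```python
def desired_query_nodes(qps, qps_per_node=1000, min_nodes=2):
    """Query tier scales on throughput."""
    return max(min_nodes, -(-qps // qps_per_node))  # ceiling division

def desired_index_workers(backlog_jobs, jobs_per_worker=50, min_workers=1):
    """Indexing tier scales on job backlog, independently of query load."""
    return max(min_workers, -(-backlog_jobs // jobs_per_worker))

# A query spike grows the query tier without touching indexing workers:
print(desired_query_nodes(7500))    # → 8
print(desired_index_workers(120))   # → 3
```

Keeping the two policies separate means a bulk-ingest burst cannot starve query capacity, and a traffic spike does not pay for idle indexing workers.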
How do I integrate VDB auto-scaling with Kubernetes?
Use Kubernetes Horizontal Pod Autoscaler (HPA) or custom controllers that watch VDB metrics (or a metrics exporter) and scale the number of query/indexing pods. Set resource requests/limits and HPA min/max replicas to match your triggers.
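For intuition, the HPA's documented scaling rule (desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), with a default ~10% tolerance band) can be sketched as follows; the function name and the min/max values are illustrative:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=2, max_replicas=10, tolerance=0.1):
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to [min, max]; no change while within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:   # close enough to target: hold
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 query pods averaging 90% CPU against a 60% target:
print(hpa_desired_replicas(4, 90, 60))  # → 6
```

Setting HPA `minReplicas`/`maxReplicas` to mirror your own min cluster size and budget cap keeps the Kubernetes-level policy consistent with the triggers discussed above.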