GPU vs. CPU for query serving
GPUs excel at parallel distance computation over large batches of vectors, while CPUs with SIMD are often better for low-latency, single-query or small-batch serving and for graph-based traversal (e.g. HNSW). The right choice depends on workload shape and cost.
Summary
- GPU shines at high QPS with batchable queries, or for brute-force/IVF search over millions of vectors.
- CPU is better for sub-ms single-query latency, graph-based traversal (e.g. HNSW), and avoiding GPU cost.
- Many production VDBs are CPU-only for serving and use GPU for index building.
- Choose by workload shape (batch vs. single-query): start with CPU + SIMD; add GPU if batch QPS justifies it. Benchmark with your actual query pattern and cost per query.
When to use GPU vs. CPU
GPU-based search (e.g. Faiss-GPU, RAFT) shines when you have high QPS with batchable queries or when doing brute-force or IVF-style search over millions of vectors; latency can be low if the batch is large enough to hide GPU launch and transfer overhead. CPUs are typically better for sub-millisecond single-query latency, for index types that are hard to parallelize on GPU (e.g. graph traversal with irregular memory access), and when you want to avoid GPU cost and complexity.
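To make the batch-shape point concrete, here is a minimal NumPy sketch of brute-force k-NN over a whole query batch (sizes and names are illustrative, not from any particular library). The key property is that one large matrix multiply does all the query-database distance work at once, which is exactly the shape of computation a GPU parallelizes well:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)    # database vectors
queries = rng.standard_normal((256, 64)).astype(np.float32)  # a batch of queries

def batch_search(queries: np.ndarray, db: np.ndarray, k: int = 10) -> np.ndarray:
    """Brute-force k-NN for a whole query batch in one shot."""
    # ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2; drop ||q||^2, which is
    # constant per query row and does not change the ranking.
    dists = (db ** 2).sum(axis=1) - 2.0 * (queries @ db.T)
    # argpartition selects the k smallest per row without a full sort
    idx = np.argpartition(dists, k, axis=1)[:, :k]
    # order those k candidates by actual distance
    rows = np.arange(queries.shape[0])[:, None]
    return idx[rows, np.argsort(dists[rows, idx], axis=1)]

neighbors = batch_search(queries, db)   # shape (256, 10)
```

The larger the batch, the better the single `queries @ db.T` product amortizes fixed per-call overhead; on a GPU this is also what hides kernel-launch and transfer cost.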
Rule of thumb: choose by workload shape (batch vs. single-query). Start with CPU + SIMD, and add GPU only once batch QPS justifies the cost.
Production practice
In practice, many production VDBs are CPU-only for serving and use GPU mainly for index building or for dedicated batch inference. Benchmark both with your actual query pattern (single vs. batched, p99 requirements) and factor in cost per query and operational overhead.
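To illustrate the build-offline/serve-online split, here is a toy IVF sketch in NumPy (a hypothetical stand-in; function names and parameters are made up, not a real library's API). The expensive k-means index build is the phase typically offloaded to a GPU, while the per-query probe of a few inverted lists is cheap enough to serve on CPU:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((5_000, 32)).astype(np.float32)

def build_ivf(db, n_lists=32, iters=10):
    """Offline index build: k-means centroids + inverted lists.

    This is the heavy, batch-friendly step that benefits from a GPU.
    """
    centroids = db[rng.choice(len(db), n_lists, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_lists):
            members = db[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    lists = [np.where(assign == c)[0] for c in range(n_lists)]
    return centroids, lists

def ivf_search(q, centroids, lists, db, nprobe=4, k=5):
    """Online CPU serving: scan only the nprobe nearest lists."""
    probe = np.argsort(((centroids - q) ** 2).sum(axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d = ((db[cand] - q) ** 2).sum(axis=1)
    return cand[np.argsort(d)[:k]]

centroids, lists = build_ivf(db)       # run offline, possibly on GPU
hits = ivf_search(db[0], centroids, lists, db)  # served on CPU per query
```

The serving path touches only a small fraction of the database per query, which is why a CPU handles it comfortably even when the build step was GPU-accelerated.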
Frequently Asked Questions
When should I use GPU for vector search?
When you have high QPS with batchable queries or brute-force/IVF over millions of vectors. GPU-based search (e.g. Faiss-GPU, RAFT) can have low latency if the batch is large enough to hide launch and transfer overhead. See GPU-accelerated indexing.
When is CPU better than GPU for serving?
For sub-millisecond single-query latency; for index types hard to parallelize on GPU (e.g. graph traversal with irregular memory access); when avoiding GPU cost and complexity. CPUs with SIMD (AVX-512, NEON) are often better for low-latency, small-batch serving.
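A small NumPy sketch of the single-query path (NumPy's vectorized kernels dispatch to SIMD instructions under the hood; the setup here is illustrative). Keeping rows contiguous in float32 and precomputing database norms at load time leaves just one SIMD-friendly matrix-vector product per query:

```python
import numpy as np

rng = np.random.default_rng(2)
# Contiguous float32 rows let the underlying SIMD/BLAS kernels stream memory.
db = np.ascontiguousarray(rng.standard_normal((100_000, 64)), dtype=np.float32)
db_norms = (db ** 2).sum(axis=1)      # precomputed once at load time

def single_query(q: np.ndarray, k: int = 10) -> np.ndarray:
    """One-query top-k via a single matrix-vector product."""
    # ||d||^2 - 2 q.d ranks identically to full L2 distance for a fixed q
    dists = db_norms - 2.0 * (db @ q)
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

q = db[42] + 0.01 * rng.standard_normal(64).astype(np.float32)
top = single_query(q)
```

There is no batching step and no device transfer, which is what keeps single-query latency low on CPU.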
Can I use GPU for indexing and CPU for serving?
Yes. Many production VDBs are CPU-only for serving and use GPU mainly for index building or dedicated batch inference. Benchmark both with your actual query pattern (single vs. batched, p99 requirements) and cost per query.
How do I choose between GPU and CPU?
Benchmark both with your actual query pattern (single vs. batched, p99 requirements). Factor in cost per query and operational overhead. See latency, QPS, and recall–latency trade-off.