Indexing Algorithms - IVF & Quantization · Topic 103

GPU-accelerated indexing (Faiss-GPU)

Faiss-GPU is the GPU backend of Faiss: it runs IVF, IVFPQ, and flat (brute-force) search on NVIDIA GPUs. Vectors and index structures live in GPU memory; distance computation and (where applicable) clustering and PQ lookups are parallelized across thousands of cores. This gives very high throughput for batch queries and large indexes that fit in GPU RAM, at the cost of GPU hardware and transfer overhead for small or single-query workloads. It complements cluster-based and disk-based designs. This topic covers what runs on GPU, pipeline, trade-offs, and when to choose GPU vs CPU.

Summary

  • Use Faiss-GPU for IVFFlat, IVFPQ, and flat indexes; both build and search run on GPU. It pays off most when batches are large and the index fits in GPU memory. HNSW on GPU is less standard in Faiss; see in-memory vs. on-disk for context.
  • GPU kernels cover flat distance, IVF centroid assignment and list scan, and PQ/IVFPQ distance tables and code scan; K-means and codebook training can run on GPU.
  • Trade-off: very high throughput when index and batch fit in GPU RAM vs. transfer and hardware cost for small QPS or single-query latency.
  • If the index does not fit in GPU RAM, use sharding, CPU fallback, or disk-based solutions.
  • Practical tip: benchmark with your batch size and index size; GPU pays off for bulk build and high QPS batch search.

What runs on GPU

Faiss-GPU implements GPU kernels for: (1) brute-force distance computation (flat index), (2) IVF assignment and search (find nearest centroids, then scan list vectors), (3) PQ and IVFPQ (distance tables and code scan). Build steps like K-means for IVF and PQ codebook training can run on GPU. The index and vectors are stored in GPU memory; you copy data to the GPU once and then run many queries without CPU–GPU transfer per query, which is why batch size matters.
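Of these kernel families, the PQ distance-table step is the least obvious. A minimal NumPy sketch of asymmetric distance computation (ADC) — toy sizes, random codebooks, a serial loop standing in for what the GPU does in parallel, not Faiss's actual implementation — looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, ks = 8, 4, 16              # dim, sub-quantizers, centroids per subspace
dsub = d // M                    # each sub-vector has d/M dims

# toy codebooks, and a database stored only as uint8 PQ codes
codebooks = rng.normal(size=(M, ks, dsub)).astype(np.float32)
codes = rng.integers(0, ks, size=(100, M), dtype=np.uint8)

def adc_search(query, k=5):
    # (1) distance table: squared L2 from each query sub-vector to every
    #     centroid in its subspace -- only M*ks floats, built once per query
    q = query.reshape(M, dsub)
    table = ((codebooks - q[:, None, :]) ** 2).sum(-1)      # (M, ks)
    # (2) code scan: each vector's distance is a sum of M table lookups
    dists = table[np.arange(M), codes].sum(axis=1)          # (100,)
    # (3) keep the k smallest distances
    top = np.argsort(dists)[:k]
    return top, dists[top]

q0 = rng.normal(size=d).astype(np.float32)
ids, ds = adc_search(q0)
```

The code scan in step (2) is pure table lookups and additions over all stored codes — exactly the kind of regular, data-parallel work a GPU excels at.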

Pipeline: copy or build index on GPU → for each query batch, compute distances (and IVF/PQ steps) in parallel → return top-k per query. No per-query CPU–GPU round trip if data stays on GPU; that maximizes throughput.
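The same pipeline can be sketched in plain NumPy — single-threaded and with toy sizes, so it mirrors the steps the GPU runs in parallel rather than the Faiss kernels themselves:

```python
import numpy as np

rng = np.random.default_rng(1)
d, nlist, nprobe, k = 16, 8, 2, 3

# toy "trained" centroids and a small database
centroids = rng.normal(size=(nlist, d)).astype(np.float32)
xb = rng.normal(size=(200, d)).astype(np.float32)

# build: assign each database vector to its nearest centroid (inverted lists)
assign = np.argmin(((xb[:, None] - centroids) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(queries, k=k):
    # step 1: query-to-centroid distances, pick the nprobe nearest lists
    qc = ((queries[:, None] - centroids) ** 2).sum(-1)       # (nq, nlist)
    probe = np.argsort(qc, axis=1)[:, :nprobe]               # (nq, nprobe)
    out_ids, out_d = [], []
    for qi, q in enumerate(queries):
        # step 2: scan only the vectors in the probed lists
        cand = np.concatenate([lists[int(c)] for c in probe[qi]])
        dist = ((xb[cand] - q) ** 2).sum(-1)
        # step 3: top-k per query
        top = np.argsort(dist)[:k]
        out_ids.append(cand[top]); out_d.append(dist[top])
    return np.array(out_ids), np.array(out_d)

xq = rng.normal(size=(4, d)).astype(np.float32)
I, D = ivf_search(xq)
```

On GPU, all queries in the batch go through steps 1–3 concurrently, which is why a batch of thousands costs little more than a batch of ten.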

When to use GPU vs CPU

GPU shines when you serve many queries per second or can batch many queries together; that is where massive parallelism pays off. For low-QPS or single-query-latency workloads, CPU Faiss (e.g. IVFPQ, HNSW) or a dedicated vector DB may be simpler and avoids GPU cost.
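The break-even point is simple arithmetic. The timings below are made-up illustrative numbers, not measurements — substitute values from your own hardware:

```python
# Illustrative (assumed) timings: when does a GPU batch beat the CPU?
transfer_ms = 5.0        # one-off host-to-GPU copy for a query batch (assumed)
gpu_per_query_ms = 0.02  # amortized GPU time per query in a batch (assumed)
cpu_per_query_ms = 0.5   # CPU time per query (assumed)

def gpu_ms(batch):
    # fixed transfer cost, then near-constant per-query cost
    return transfer_ms + batch * gpu_per_query_ms

def cpu_ms(batch):
    return batch * cpu_per_query_ms

# break-even batch size: transfer + b*gpu == b*cpu
breakeven = transfer_ms / (cpu_per_query_ms - gpu_per_query_ms)
```

With these numbers the crossover sits around 10 queries per batch: below it the CPU wins on latency, above it the GPU wins, and the gap widens as the batch grows.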

If the index does not fit in GPU RAM, you need sharding, CPU fallback, or disk-based solutions. Trade-off summary: GPU gives highest throughput for batch and build when data fits; CPU or disk-based designs are better for single-query latency, very large indexes, or cost-sensitive deployments. Practical tip: measure throughput (queries per second) at your target batch size; if you rarely batch, CPU may be sufficient.
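Sharding is conceptually simple: each shard answers locally with global ids, and merging the per-shard top-k gives the exact global top-k. A NumPy sketch of that merge — the idea behind Faiss's IndexShards, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, n_shards, k = 8, 120, 3, 4
xb = rng.normal(size=(n, d)).astype(np.float32)

# split the database across shards (as if each lived on its own GPU)
shards = np.array_split(np.arange(n), n_shards)

def shard_search(q):
    parts = []
    for ids in shards:
        # each shard computes its own local top-k, reported as global ids
        dist = ((xb[ids] - q) ** 2).sum(-1)
        top = np.argsort(dist)[:k]
        parts.append((ids[top], dist[top]))
    # merge: global top-k is among the union of per-shard top-k results
    all_ids = np.concatenate([p[0] for p in parts])
    all_d = np.concatenate([p[1] for p in parts])
    best = np.argsort(all_d)[:k]
    return all_ids[best], all_d[best]

q = rng.normal(size=d).astype(np.float32)
I, D = shard_search(q)
```

Because every shard returns at least k candidates, the merged result is identical to searching the whole database at once; only the merge step is serial.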

Frequently Asked Questions

Does Faiss-GPU support HNSW?

Standard Faiss-GPU focuses on IVF and flat; HNSW is primarily CPU in Faiss. Other libraries (e.g. RAFT, cuML) offer GPU graph ANN.

How do I get data onto the GPU?

Faiss-GPU APIs accept arrays; you copy from CPU to GPU (or build from data already on GPU). For very large indexes, build on CPU and move the index, or use multiple GPUs with Faiss’s multi-GPU tools.

What about multi-GPU?

Faiss supports multi-GPU via index sharding (split vectors across GPUs, query all, merge results) or replication (a full copy on each GPU, with queries split across them). See the Faiss documentation for IndexShards, IndexReplicas, and related APIs.

Is Faiss-GPU a vector database?

No. It’s a library for index build and search. A full vector DB adds persistence, APIs, filtering, and often runs Faiss (or similar) as the search engine underneath.