Performance, Evaluation & Benchmarking · Topic 185

Accuracy vs. Speed in quantization

Quantization compresses vectors (e.g. float32 → int8 or PQ codes) to reduce memory and speed up distance computation. There is a direct trade-off: more aggressive quantization usually improves speed and lowers memory but can reduce accuracy (recall) because distances become approximate.

Summary

  • More aggressive quantization improves speed and lowers memory, but recall drops because distances become approximate.
  • Scalar quantization (SQ) often has minimal recall loss; PQ and binary quantization trade more accuracy for speed. Compression vs. accuracy is tunable via code size, PQ parameters, or OPQ.
  • Fewer bits mean faster, SIMD-friendly distance math; PQ's asymmetric distance (full-precision query vs. database codes) avoids decoding all vectors.
  • Benchmark recall@k and latency at your target k and throughput, then choose the coarsest quantization that meets your accuracy SLO. Practical tip: start with 8-bit SQ and move to PQ only if you need more compression.

Accuracy side of the trade-off

Scalar quantization (SQ) preserves distance ordering well and often shows minimal recall loss at 8 bits. Product quantization (PQ) introduces more approximation; increasing the number of subquantizers or the codebook size improves recall at the cost of more computation. Binary quantization is fastest but can hurt recall on dense, high-dimensional vectors. Compression ratio vs. accuracy is a sliding scale: you can tune the code size, adjust PQ parameters, or apply OPQ (a learned rotation) to better preserve distances.
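To make the 8-bit SQ case concrete, here is a minimal sketch of uniform min/max scalar quantization on random data. The function names (`sq8_train`, `sq8_encode`, `sq8_decode`) and the toy dataset are illustrative, not any particular library's API:

```python
import numpy as np

def sq8_train(vectors):
    """Learn per-dimension min/max for 8-bit scalar quantization."""
    return vectors.min(axis=0), vectors.max(axis=0)

def sq8_encode(vectors, lo, hi):
    """Map each float32 dimension onto 256 uniform levels (one uint8 per dim)."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    codes = np.round((vectors - lo) / scale * 255.0)
    return np.clip(codes, 0, 255).astype(np.uint8)

def sq8_decode(codes, lo, hi):
    """Reconstruct approximate float32 vectors from the codes."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    return codes.astype(np.float32) / 255.0 * scale + lo

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64)).astype(np.float32)   # toy database
lo, hi = sq8_train(db)
approx = sq8_decode(sq8_encode(db, lo, hi), lo, hi)
# Worst-case per-dimension error is about half a quantization level,
# which is why 8-bit SQ usually preserves neighbor ordering well.
err = np.abs(db - approx).max()
```

Memory drops 4x (float32 → uint8) while the reconstruction stays within half a level per dimension, which is the intuition behind SQ's small recall loss.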

Speed side of the trade-off

Fewer bits per dimension mean faster, often SIMD-friendly distance math; PQ allows asymmetric distance computation (query kept in full precision, database stored as codes), which is faster than decoding all database vectors. In practice, measure recall@k and latency at your target k and throughput, then choose the coarsest quantization that still meets your accuracy SLO.

Pipeline: benchmark SQ, PQ, and binary quantization (BQ) at your scale and pick by recall SLO. Practical tip: start with 8-bit SQ; move to PQ if you need more compression.

Frequently Asked Questions

Which quantization gives the best accuracy with acceptable speed?

8-bit scalar quantization often has minimal recall loss and is fast. For larger scale, PQ or IVF+PQ with tuned codebook size and OPQ can preserve recall while staying fast. Always measure recall@k on your data.

Why is binary quantization so fast but sometimes inaccurate?

Binary quantization uses a single bit per dimension (e.g. the sign of each component), so distance is just XOR plus popcount, which is very fast. On dense, high-dimensional vectors much information is lost, so recall can drop. It works better when vectors are already “binary-like” or when speed matters more than accuracy.
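The XOR-plus-popcount idea can be sketched in a few lines. The helper names are illustrative; `np.unpackbits` plays the role of a hardware popcount here:

```python
import numpy as np

def binarize(vectors):
    """1 bit per dimension: keep only the sign, packed into uint8 bytes."""
    return np.packbits(vectors > 0, axis=1)

def hamming(query_code, db_codes):
    """XOR the packed bytes, then count set bits; no floating-point math at all."""
    xor = np.bitwise_xor(db_codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

rng = np.random.default_rng(2)
db = rng.normal(size=(1000, 128)).astype(np.float32)
codes = binarize(db)                       # 128 dims -> 16 bytes per vector (32x smaller)
dists = hamming(binarize(db[:1]), codes)   # Hamming distance to every DB vector
```

A vector compared against its own code gives Hamming distance 0, but two distinct vectors with the same sign pattern also collide at distance 0, which is exactly the information loss that costs recall.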

What is asymmetric distance in PQ?

In product quantization, the query vector is kept in full precision and its distances to every codebook centroid are precomputed into a lookup table. Database vectors are stored only as codes and are never decoded; scoring each one reduces to a handful of table lookups, which cuts computation and improves latency and throughput.
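A minimal sketch of asymmetric distance computation (ADC), assuming toy random codebooks in place of the k-means-trained ones real PQ uses; `pq_encode` and `pq_adc` are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, ks = 32, 4, 16            # dim, subquantizers, centroids per subspace
sub = d // m

db = rng.normal(size=(500, d)).astype(np.float32)
# Toy codebooks: random centroids per subspace (real PQ trains them with k-means).
codebooks = rng.normal(size=(m, ks, sub)).astype(np.float32)

def pq_encode(x):
    """Each subvector -> index of its nearest centroid (1 byte per subspace)."""
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        subv = x[:, j * sub:(j + 1) * sub]
        d2 = ((subv[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(1)
    return codes

def pq_adc(q, codes):
    """Asymmetric distance: build an m x ks table from the full-precision query,
    then score every code by table lookups only -- no decoding of DB vectors."""
    table = np.empty((m, ks), dtype=np.float32)
    for j in range(m):
        table[j] = ((q[j * sub:(j + 1) * sub] - codebooks[j]) ** 2).sum(-1)
    return table[np.arange(m), codes].sum(1)

codes = pq_encode(db)
dists = pq_adc(db[0], codes)    # approximate squared L2 to every DB vector
```

The table costs m × ks distance computations per query, after which each of the N database vectors is scored with just m lookups and adds, which is where the latency win comes from.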

How do I choose quantization in practice?

Set an accuracy SLO (e.g. recall@10 ≥ 0.95) and a latency budget. Benchmark recall@k and latency for SQ, PQ, and if needed BQ at your scale; pick the most aggressive (fastest, smallest) option that still meets the SLO. See also recall–latency trade-off.