The trade-off between compression ratio and accuracy
In vector indexes, compression (e.g. scalar quantization (SQ), product quantization (PQ), binary quantization (BQ)) reduces the memory footprint and often speeds up distance computation, but it approximates the stored vectors and can hurt recall and distance fidelity. This compression ratio vs. accuracy trade-off is central to choosing a code size, the number of subvectors in PQ, and whether to quantize at all; recall@K and recall-latency curves quantify it for your workload.
Summary
- Higher compression (fewer bits per vector) generally means lower accuracy and recall. Plot recall and distance error against code size and choose an operating point; see accuracy vs. speed in quantization.
- Pipeline: pick a code size (PQ: m × log₂ k bits per vector; SQ: a fixed number of bits per dimension) → measure recall@K and mean squared error or relative distance error → tune until the recall target is met within your memory and latency budget.
- Optimized PQ (OPQ) and residual quantization (RQ) improve accuracy at a given bit rate; whether to prefer full precision depends on dataset size and latency. See how quantization reduces memory footprint and measuring recall at K for ways to quantify the trade-off.
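The code-size arithmetic in the pipeline above is simple to work through. A minimal sketch (the dimensionality and configurations are illustrative, not recommendations):

```python
import math

d = 768  # example dimensionality; real values depend on your embedding model

def pq_bits(m, k):
    """Bits per vector for PQ: m subvectors, k centroids per sub-codebook."""
    return m * math.log2(k)

def sq_bits(bits_per_dim, dim):
    """Bits per vector for scalar quantization."""
    return bits_per_dim * dim

full_bits = 32 * d                 # float32 baseline: 24576 bits = 3072 bytes
pq96x256 = pq_bits(m=96, k=256)    # 96 * log2(256) = 768 bits = 96 bytes
sq8 = sq_bits(8, d)                # 6144 bits = 768 bytes

print(full_bits / pq96x256)  # → 32.0  (compression ratio vs. float32)
print(full_bits / sq8)       # → 4.0
```

This is where the trade-off starts: the 32× PQ configuration costs far less memory than 8-bit SQ, but each vector is reconstructed from only 96 codebook entries, so its recall at the same K will generally be lower.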
How the trade-off appears
Stronger quantization (e.g. 4-bit vs. 8-bit scalar, or fewer PQ subvectors / smaller codebooks) shrinks memory and can speed up distance computation, but increases reconstruction error. That error can reorder nearest neighbors, so recall@K drops. The curve is dataset- and metric-dependent: measure recall and (if needed) mean squared error or relative distance error across a range of compression settings, then pick the point that meets your recall target with acceptable memory and latency.
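A measurement like this can be sketched end to end with uniform scalar quantization on synthetic data; the round-trip quantizer and data here are stand-ins for your real encoder and corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nq, K = 64, 1000, 50, 10
base = rng.standard_normal((n, d)).astype(np.float32)
queries = rng.standard_normal((nq, d)).astype(np.float32)

def sq_round_trip(x, bits):
    """Uniform scalar quantization to `bits` bits per dimension, then reconstruct."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo) * levels)
    return (codes / levels) * (hi - lo) + lo

def topk_ids(db, q, k):
    """Exact top-k neighbor ids by squared L2 distance."""
    d2 = ((q[:, None, :] - db[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

true_ids = topk_ids(base, queries, K)
for bits in (8, 4, 2, 1):
    approx_ids = topk_ids(sq_round_trip(base, bits), queries, K)
    recall = np.mean([len(set(t) & set(a)) / K
                      for t, a in zip(true_ids, approx_ids)])
    print(f"{bits}-bit SQ: recall@{K} = {recall:.3f}")
```

On data like this, 8-bit codes lose almost nothing while 1-bit codes visibly reorder neighbors; the exact shape of the curve is what you need to measure on your own vectors.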
When to prefer full precision
When the dataset fits in RAM and latency is already good, full-precision (float32) or light scalar quantization may be best. When you need to scale to billions of vectors or run on memory- or I/O-bound hardware, heavier quantization (PQ, OPQ, or BQ) becomes necessary; then the trade-off is explicit and you tune for your recall and resource limits.
Practical tip: plot recall@K and (if useful) mean squared reconstruction error vs. code size for a representative query set; choose the leftmost (smallest-code) operating point that meets your recall target. IVF combined with PQ (IVFPQ) and binary quantization (BQ) are settings where this trade-off is central.
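The "leftmost operating point" rule is a one-liner once the sweep over compression settings has produced (code size, recall) pairs. The numbers below are illustrative, not from a real benchmark:

```python
# Measured (code_bytes, recall@K) pairs from sweeping compression settings.
measurements = [(96, 0.71), (192, 0.88), (384, 0.95), (768, 0.99)]
target_recall = 0.90

# Keep only settings that meet the target, then take the smallest code size.
viable = [(size, r) for size, r in measurements if r >= target_recall]
operating_point = min(viable) if viable else None
print(operating_point)  # → (384, 0.95)
```

If `viable` is empty, no measured setting meets the target and you need more bits, a better quantizer (e.g. OPQ), or a re-ranking stage.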
Frequently Asked Questions
Can I have high compression and high recall?
Up to a point. OPQ and RQ improve accuracy at a given bit rate; beyond that, more bits (e.g. more PQ subvectors or 8-bit SQ) improve recall at the cost of memory.
How do I measure “accuracy” for quantization?
Common metrics: recall@K (do the true neighbors stay in the top-K?), mean squared reconstruction error, and relative distance error (|d_approx − d_true| / d_true).
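All three metrics are a few lines each; a minimal sketch (function names are illustrative):

```python
import numpy as np

def recall_at_k(true_ids, approx_ids):
    """Mean fraction of true top-K neighbors retained, per query."""
    return float(np.mean([len(set(t) & set(a)) / len(t)
                          for t, a in zip(true_ids, approx_ids)]))

def reconstruction_mse(x, x_reconstructed):
    """Mean squared error between original and decoded vectors."""
    return float(np.mean((np.asarray(x) - np.asarray(x_reconstructed)) ** 2))

def relative_distance_error(d_true, d_approx):
    """Mean |d_approx - d_true| / d_true over query-neighbor pairs."""
    d_true, d_approx = np.asarray(d_true), np.asarray(d_approx)
    return float(np.mean(np.abs(d_approx - d_true) / d_true))

print(recall_at_k([[1, 2, 3]], [[1, 2, 9]]))            # two of three kept → 0.666...
print(relative_distance_error([2.0, 4.0], [2.2, 3.8]))  # → 0.075
```

Recall@K is the metric that matters to end users; MSE and relative distance error are cheaper diagnostics that help explain why recall moved.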
Does the trade-off depend on dimension?
Yes. Higher dimensions often need more bits per dimension (or more PQ subvectors) for the same recall; the exact curve is data-dependent.
When is binary quantization acceptable?
For first-stage retrieval (large candidate set, then re-rank with full precision), or when recall requirements are relaxed and speed/memory are critical.
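That two-stage pattern (cheap binary candidate generation, then exact re-ranking) can be sketched with sign-based codes and Hamming distance; the data, pool size `C`, and thresholds here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, K, C = 64, 5000, 10, 200
base = rng.standard_normal((n, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Stage 1: sign-based binary codes + Hamming distance for cheap candidate generation.
base_codes = base > 0
query_code = query > 0
hamming = (base_codes != query_code).sum(axis=1)
candidates = np.argsort(hamming)[:C]      # keep a generous candidate pool

# Stage 2: re-rank only the candidates with exact float32 distances.
exact = ((base[candidates] - query) ** 2).sum(axis=1)
result = candidates[np.argsort(exact)[:K]]

# How much of the true top-K did the two-stage pipeline keep?
true = np.argsort(((base - query) ** 2).sum(axis=1))[:K]
print(len(set(result) & set(true)) / K)
```

The knob is `C`: a larger candidate pool recovers more of the recall lost to the binary codes, at the cost of more full-precision distance computations in stage 2.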