Scalar Quantization (SQ): Moving from Float32 to INT8
Scalar quantization (SQ) converts each vector component from a full-precision type (e.g. Float32) to a smaller integer type (e.g. INT8 or UINT8). Each dimension is quantized independently, usually by mapping the range of values in that dimension to a fixed number of levels. The result is a large reduction in memory footprint and often faster distance computation, with a controllable trade-off between compression and accuracy.
Summary
- Per-dimension min-max or range mapping to e.g. [0, 255]; the query is quantized the same way and distances computed on integers. Simpler than PQ (no codebooks). Often combined with IVF. Sits between raw float and binary quantization on the accuracy vs. speed spectrum.
- Pipeline: compute per-dimension (or per-vector) scale/offset at build; quantize stored vectors and query to INT8; compute distance on integers. Store scale/offset for dequantization if needed.
- Trade-off: 4× memory reduction vs. Float32; small accuracy loss; integer ops often faster. Signed embeddings require handling negative values (symmetric range or offset mapping).
- The general points about how quantization reduces memory footprint, and the trade-off between compression ratio and accuracy, apply here as well. IVFPQ combines IVF with PQ inside each cell; SQ is a simpler alternative in that role (IVF + SQ).
How SQ works
A common approach is per-dimension min-max or range-based quantization: compute the min and max (or robust bounds such as percentiles) per dimension, then map each float value to an integer in [0, 255] for 8-bit. At search time, the query can be quantized the same way and distances computed directly on integers (e.g. L2 on INT8), which is faster and more cache-friendly than float. Some systems instead keep the query in full precision and compare it against the quantized stored vectors (asymmetric distance computation), trading a little speed for accuracy. The main downside is that very low bit widths (e.g. 4-bit) can hurt recall; 8-bit scalar quantization usually gives a good balance and is widely supported (e.g. in Faiss).
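A minimal NumPy sketch of this scheme, assuming per-dimension min-max bounds and an exhaustive integer L2 scan (the function names are illustrative, not any library's API):

```python
import numpy as np

def fit_bounds(vectors):
    """Per-dimension min/max bounds and the scale mapping [min, max] -> [0, 255]."""
    lo = vectors.min(axis=0)
    scale = (vectors.max(axis=0) - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    return lo, scale

def quantize_uint8(vectors, lo, scale):
    """Round each component to the nearest of 256 levels; out-of-range values are clipped."""
    return np.clip(np.round((vectors - lo) / scale), 0, 255).astype(np.uint8)

def l2_sq_int(query_code, db_codes):
    """Squared L2 on integer codes; widen to int32 so the sums cannot overflow."""
    diff = db_codes.astype(np.int32) - query_code.astype(np.int32)
    return (diff * diff).sum(axis=1)

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

lo, scale = fit_bounds(db)                  # build time: learn scale/offset
db_codes = quantize_uint8(db, lo, scale)    # quantize stored vectors
query_code = quantize_uint8(query[None, :], lo, scale)[0]  # quantize query
nearest = int(np.argmin(l2_sq_int(query_code, db_codes)))
```

With bounds learned from the stored vectors, query components that fall outside them are simply clipped, and dequantizing with `code * scale + lo` recovers each stored value to within half a quantization step.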
Compared to PQ
SQ is simpler than product quantization (PQ): no codebooks or subvectors, just a scale and offset per dimension (or per vector). It’s often combined with IVF (IVF + SQ) for both reduced search space and smaller vectors in memory.
Practical tip: use symmetric INT8 for signed embeddings (e.g. [-128, 127]); for unsigned or non-negative data, map [min, max] to [0, 255]. The memory and accuracy trade-offs above apply either way; note also that the computation overhead of different distance metrics changes when distances are computed in integer arithmetic.
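A sketch of the symmetric variant, clipping to [-127, 127] so the range stays symmetric around zero (one common convention; the names here are illustrative):

```python
import numpy as np

def quantize_symmetric_int8(vectors):
    """Symmetric INT8: zero maps exactly to zero; only a scale per dimension is stored."""
    max_abs = np.abs(vectors).max(axis=0)
    max_abs[max_abs == 0] = 1.0          # avoid division by zero on dead dimensions
    scale = max_abs / 127.0              # per-dimension scale, no offset needed
    codes = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 64)).astype(np.float32)
codes, scale = quantize_symmetric_int8(x)
approx = codes.astype(np.float32) * scale    # dequantize: codes * scale, no offset
```

Symmetric quantization needs only a scale per dimension, while the offset-based [0, 255] mapping needs both min and scale; in exchange, the offset variant uses the full 256 levels for skewed value ranges.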
Frequently Asked Questions
How much memory does INT8 save vs. Float32?
4×: 1 byte per dimension vs. 4. For 768-d, ~0.75 KB vs. ~3 KB per vector.
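The arithmetic, spelled out:

```python
dims = 768

float32_bytes = dims * 4            # 3072 bytes ≈ 3 KB per vector
int8_bytes = dims * 1               # 768 bytes ≈ 0.75 KB per vector
ratio = float32_bytes / int8_bytes  # 4x

# At scale, e.g. one million vectors:
n = 1_000_000
saved_gib = (float32_bytes - int8_bytes) * n / 2**30  # roughly 2.1 GiB saved
```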
Do I need to store scale/offset?
Yes, per dimension (or per vector) to dequantize or to compute distances. Overhead is small: e.g. 768 × 8 bytes ≈ 6 KB for 768-d with a Float32 scale and offset per dimension, stored once and shared by all vectors.
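For illustration, the per-index overhead and a dequantization round trip, using dummy scale/offset values (the numbers match the 768-d case):

```python
import numpy as np

dims = 768
lo = np.full(dims, -1.0, dtype=np.float32)            # per-dimension offset (dummy)
scale = np.full(dims, 2.0 / 255.0, dtype=np.float32)  # per-dimension scale (dummy)

# Stored once per index, shared by every vector:
overhead_bytes = lo.nbytes + scale.nbytes  # 768 * (4 + 4) = 6144 bytes

# Dequantize a stored UINT8 code back to approximate floats.
code = np.random.default_rng(2).integers(0, 256, dims, dtype=np.uint8)
approx = code.astype(np.float32) * scale + lo         # values land back in [-1, 1]
```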
Can SQ handle negative values?
Yes. Use symmetric INT8 (e.g. [-128, 127]) or an offset mapping: map [min, max] to [0, 255] and store min/max.
Is SQ faster than float for distance?
Often yes: integer ops and better cache utilization. SIMD can do more INT8 ops per cycle than float on many CPUs.
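Production kernels (e.g. Faiss's scalar quantizer code paths) use SIMD integer instructions; a small NumPy sketch of the one correctness detail that matters regardless of implementation, widening before subtraction so unsigned codes don't wrap:

```python
import numpy as np

a = np.array([200, 10], dtype=np.uint8)
b = np.array([10, 200], dtype=np.uint8)

# Pitfall: uint8 subtraction wraps modulo 256 (10 - 200 -> 66, not -190).
wrapped = a - b

# Widen to int32 first: differences and the sum of squares stay exact.
diff = a.astype(np.int32) - b.astype(np.int32)
dist = int((diff * diff).sum())  # 190**2 + (-190)**2 = 72200
```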