How quantization reduces memory footprint
Storing vectors in full float32 uses 4 bytes per dimension—expensive at scale. Quantization represents vectors with fewer bits: scalar quantization to INT8 uses 1 byte per dimension (a 4× reduction); product quantization stores short codes against shared codebooks; binary quantization uses 1 bit per dimension. This directly reduces memory footprint, so more vectors fit in RAM, index size drops, and systems can scale to billions of vectors without proportionally more hardware.
Summary
- Concrete: 1M vectors × 768-d float32 ≈ 3 GB; INT8 ≈ 0.75 GB; PQ (8×8 bit) ≈ 8 MB for codes + codebooks. Enables mmap and tiered storage; balance with accuracy trade-off.
- Pipeline: choose quantization (SQ/PQ/BQ) → encode vectors at ingest → store codes; at query compute distance on codes. Smaller codes mean more vectors in RAM or on disk per GB.
- Trade-off: lower footprint often comes with approximate distances and possible recall loss; see trade-off between compression ratio and accuracy. Combining IVF and PQ (IVFPQ) uses PQ to shrink in-cell vectors.
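The ingest-then-query pipeline above can be sketched with INT8 scalar quantization. This is a minimal illustration, not any particular library's implementation: vectors are encoded to one byte per dimension at ingest, and queries compute asymmetric distances by decoding codes on the fly while keeping the query in float32.

```python
import numpy as np

def sq_train(vectors):
    # Per-dimension min/max define the INT8 quantization range.
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on constant dimensions
    return lo, scale

def sq_encode(vectors, lo, scale):
    # Map each float32 value to one unsigned byte: 4x smaller storage.
    return np.clip(np.round((vectors - lo) / scale), 0, 255).astype(np.uint8)

def sq_distance(query, codes, lo, scale):
    # Asymmetric distance: full-precision query vs. decoded codes.
    decoded = codes.astype(np.float32) * scale + lo
    return np.sum((decoded - query) ** 2, axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 768)).astype(np.float32)
lo, scale = sq_train(base)
codes = sq_encode(base, lo, scale)   # 768 bytes/vector instead of 3072
query = base[0]
nearest = int(np.argmin(sq_distance(query, codes, lo, scale)))
```

Real systems typically precompute lookup tables rather than decoding every code per query, but the memory math is the same: the code array is exactly one quarter the size of the float32 array.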
By the numbers
For 1 million vectors of dimension 768: float32 is 1M × 768 × 4 = ~3 GB. INT8 scalar is ~0.75 GB. PQ with 8 subvectors and 8 bits per subvector code is 1M × 8 bytes = 8 MB for codes plus codebook storage (typically a few MB). Binary is 1M × 96 bytes = ~96 MB. So quantization can reduce vector storage by 4× (SQ), 32× (binary), or several hundred× (PQ), which directly lowers RAM needs and can make the difference between fitting an index in memory and having to use disk or distributed sharding.
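The arithmetic above is easy to reproduce for your own dimension and scale:

```python
N, D = 1_000_000, 768

float32_bytes = N * D * 4       # 3,072,000,000 bytes ≈ 3 GB
int8_bytes    = N * D * 1       # ≈ 0.75 GB
pq_bytes      = N * 8           # 8 subvectors x 8 bits = 8 bytes/vector
binary_bytes  = N * D // 8      # 1 bit/dim = 96 bytes/vector

print(float32_bytes / int8_bytes)    # -> 4.0
print(float32_bytes / binary_bytes)  # -> 32.0
print(float32_bytes / pq_bytes)      # -> 384.0
```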
Interaction with storage and tiering
Smaller footprint also improves cache utilization and makes memory-mapped files and tiered storage (e.g. hot data in RAM, cold on SSD or S3) more practical. The accuracy trade-off remains: measure recall and latency for your target compression level.
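One way to see why smaller codes make tiering practical: a quantized code array can be memory-mapped from disk, so the OS page cache keeps hot rows in RAM and cold rows stay on SSD. A minimal sketch using numpy's memmap (file path and sizes are illustrative):

```python
import numpy as np
import os
import tempfile

N, D = 10_000, 768
path = os.path.join(tempfile.mkdtemp(), "codes.u8")

# Write INT8 codes once at ingest (~7.5 MB here vs ~30 MB in float32).
codes = np.random.default_rng(0).integers(0, 256, size=(N, D), dtype=np.uint8)
codes.tofile(path)

# Search maps the file instead of loading it; touching a row faults in
# only the pages that row occupies.
view = np.memmap(path, dtype=np.uint8, mode="r", shape=(N, D))
row = view[42]
```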
Practical tip: start with concrete numbers for your dimension and scale (e.g. 1M × 768-d) and compare float32, INT8, PQ, and BQ sizes. The trade-off between compression ratio and accuracy, and the need to overcome RAM limitations at billion scale, together guide when to quantize and how much.
Frequently Asked Questions
Does quantization reduce index structure memory too?
Vector storage is usually the bulk. Graph (HNSW) or cluster (IVF) overhead is separate; PQ/SQ shrink only the vector storage, not the graph or inverted lists (though list entries can point to smaller codes).
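As a rough back-of-envelope (assuming an HNSW with M = 16, roughly 2M neighbor links per node at layer 0, and 4-byte node IDs; real implementations add upper layers and bookkeeping):

```python
N, M = 1_000_000, 16

# Graph links are stored regardless of how the vectors are compressed.
graph_bytes = N * (2 * M) * 4   # ≈ 128 MB of neighbor IDs
pq_code_bytes = N * 8           # ≈ 8 MB of PQ codes (8 bytes/vector)
```

At high compression the graph itself can dominate total memory, which is why PQ is often paired with IVF (flat inverted lists) rather than a graph at billion scale.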
How much do codebooks add?
For PQ: m × k × (D/m) × 4 bytes; e.g. 8×256×96 floats ≈ 0.8 MB for 768-d. Usually negligible vs. millions of vectors.
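Plugging the 768-d example into the formula:

```python
D, m, k = 768, 8, 256                 # dimension, subvectors, centroids each

codebook_floats = m * k * (D // m)    # = k * D = 196,608 floats
codebook_bytes = codebook_floats * 4  # = 786,432 bytes ≈ 0.75 MB
```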
Can I mix quantized and full-precision?
Some systems store quantized for search and keep full-precision for re-ranking or in a separate store; that increases total memory but can improve quality.
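A minimal sketch of that two-stage pattern, assuming a simple global INT8 scalar quantizer for the coarse stage (library-specific details will differ): search the quantized codes for a shortlist, then re-rank the shortlist with the retained full-precision vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
full = rng.normal(size=(5000, 64)).astype(np.float32)  # full-precision store

# Coarse store: INT8 codes with a single global scale (illustrative).
lo, hi = float(full.min()), float(full.max())
codes = np.round((full - lo) / (hi - lo) * 255).astype(np.uint8)

def search(query, k=10, shortlist=100):
    # Stage 1: cheap approximate distances over all quantized codes.
    approx = codes.astype(np.float32) / 255 * (hi - lo) + lo
    cand = np.argsort(np.sum((approx - query) ** 2, axis=1))[:shortlist]
    # Stage 2: exact re-ranking of the shortlist in full precision.
    exact = np.sum((full[cand] - query) ** 2, axis=1)
    return cand[np.argsort(exact)[:k]]

top = search(full[7])
```

Memory cost is the sum of both stores, but the full-precision copy can live on slower storage since only the shortlist is touched per query.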
What about metadata and IDs?
IDs and payloads are separate from vector storage. Quantization only reduces the vector array size; metadata footprint is unchanged.