Indexing Algorithms - IVF & Quantization · Topic 90

The concept of “Codebooks” in PQ

In product quantization (PQ), a codebook is the set of centroid vectors for one subspace. The vector space is split into m subspaces (e.g. 8), and each subspace has its own codebook—typically 256 centroids (so each code is 8 bits). A codebook is learned (e.g. by K-means) on the subvectors of the training data in that subspace. At encode time, each subvector is replaced by the index of its nearest centroid in that subspace’s codebook; at query time, approximate distances are computed by looking up, for each stored code, the distance from the query subvector to the centroid that code names in the same codebook.
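The learn–encode flow above can be sketched in plain Python. This is a toy illustration, not a production implementation: the sizes (D = 8, m = 2, k = 4), the random training data, and the minimal Lloyd's k-means are all stand-ins chosen to keep the example self-contained.

```python
import random

random.seed(0)

D, m, k = 8, 2, 4          # toy sizes: 8-D vectors, 2 subspaces, 4 centroids each
d_sub = D // m             # each subvector is 4-D

def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means; returns k centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute centroids (keep the old one if a cluster empties)
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = [sum(x) / len(cl) for x in zip(*cl)]
    return centroids

# training data: 100 random 8-D vectors
train = [[random.random() for _ in range(D)] for _ in range(100)]

# one codebook per subspace, learned on that subspace's subvectors
codebooks = []
for s in range(m):
    subvecs = [v[s * d_sub:(s + 1) * d_sub] for v in train]
    codebooks.append(kmeans(subvecs, k))

def encode(v):
    """Replace each subvector by the index of its nearest centroid."""
    code = []
    for s in range(m):
        sub = v[s * d_sub:(s + 1) * d_sub]
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, codebooks[s][c])))
        code.append(j)
    return code

code = encode(train[0])
print(code)   # m indices, each in 0..k-1
```

The codebooks are learned once and shared; only `code` is stored per database vector.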

Summary

  • m codebooks, each k entries (e.g. 256), each entry = short vector of dim D/m. Shared across all DB vectors; each vector has m indices. OPQ learns a rotation before PQ to improve subspace quality.
  • Pipeline: learn codebooks per subspace (e.g. K-means on subvectors) → encode vectors as indices → query uses same codebooks for table-based distance.
  • Trade-off: larger k = better approximation but more codebook storage; codebooks are small vs. vector storage. The related topics “How PQ compresses vectors into codes” and “Product quantization (PQ) explained” tie together here.
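The "table-based distance" step in the pipeline works like this: per query, build one lookup table per subspace (the distance from the query subvector to every centroid), then score each stored code by summing m table lookups. A minimal sketch with toy sizes; the codebooks here are random stand-ins, since in practice they come from per-subspace K-means:

```python
import random

random.seed(1)

D, m, k = 8, 2, 4          # toy sizes
d_sub = D // m

# stand-in codebooks (in practice learned by K-means per subspace)
codebooks = [[[random.random() for _ in range(d_sub)] for _ in range(k)]
             for _ in range(m)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adc_tables(query):
    """One table per subspace: squared distance from the query
    subvector to every centroid in that subspace's codebook."""
    tables = []
    for s in range(m):
        q_sub = query[s * d_sub:(s + 1) * d_sub]
        tables.append([sq_dist(q_sub, c) for c in codebooks[s]])
    return tables

def approx_dist(code, tables):
    """Approximate squared distance to a database vector:
    one table lookup per subspace, summed. No decompression needed."""
    return sum(tables[s][code[s]] for s in range(m))

query = [random.random() for _ in range(D)]
tables = adc_tables(query)
print(approx_dist([0, 3], tables))   # score for a vector encoded as [0, 3]
```

The tables cost m × k distance computations per query, after which every database vector is scored with just m additions.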

Structure and storage

So you have m codebooks, each with k entries (e.g. 256). Each entry is a short vector of dimension D/m. The codebooks are shared across all database vectors: every vector uses the same set of codebooks but has its own list of m indices (codes). Storing the codebooks takes relatively little space (e.g. 8 × 256 × 16 floats for 8 subspaces of 16-D with 256 centroids); the big saving is that each database vector is just m bytes of indices (one byte per subspace when k = 256) instead of D floats. The quality of the approximation depends on how well the codebook centroids represent the distribution of subvectors in that subspace; OPQ improves this by learning a rotation before PQ so that the subspaces are more amenable to quantization.
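The storage arithmetic above can be checked directly. Assuming float32 (4-byte) centroids and the figures from the text (D = 128, m = 8, k = 256):

```python
# Codebook vs. compressed-vector storage for float32 centroids.
D, m, k = 128, 8, 256
d_sub = D // m                           # 16 dims per subspace

codebook_bytes = m * k * d_sub * 4       # = D * k * 4, paid once per index
bytes_per_code = m                       # 1 byte per subspace when k = 256
bytes_per_raw = D * 4                    # the original float32 vector

print(codebook_bytes)    # 131072 bytes = 128 KB
print(bytes_per_code)    # 8 bytes per vector
print(bytes_per_raw)     # 512 bytes per vector -> codes are 64x smaller
```

The codebook cost is fixed regardless of database size, while the per-vector saving is multiplied by every vector stored.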

Summary

In summary: the codebook is the “dictionary” for one subspace—its entries are the only allowed representative vectors for that part of the space. PQ compresses by storing only which dictionary entry (code) is used per subspace, and approximate distances are computed using these dictionary entries instead of the original vectors.

Practical tip: codebooks are learned at build time; if the data distribution shifts, consider rebuilding. Combining IVF and PQ (IVFPQ) typically shares one set of codebooks across clusters; the related topic “How PQ compresses vectors into codes” describes the encode/decode and search flow.
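A rough sketch of how IVFPQ combines the two stages: a coarse quantizer assigns each vector to a cluster, and the residual (vector minus cluster centroid) is what gets PQ-encoded with the one shared set of codebooks. The coarse centroids below are random stand-ins for illustration; in a real index they come from K-means over the full vectors:

```python
import random

random.seed(2)

D, n_clusters = 8, 3       # toy sizes
# coarse (IVF) centroids -- random stand-ins for illustration
coarse = [[random.random() for _ in range(D)] for _ in range(n_clusters)]

def assign(v):
    """Nearest coarse centroid = the IVF cluster this vector lives in."""
    return min(range(n_clusters),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(v, coarse[c])))

def to_residual(v):
    """IVFPQ stores the cluster id plus the PQ code of the residual."""
    c = assign(v)
    return c, [a - b for a, b in zip(v, coarse[c])]

v = [random.random() for _ in range(D)]
cluster_id, res = to_residual(v)
# `res` is what would then be PQ-encoded with the shared codebooks
print(cluster_id, len(res))
```

Encoding residuals rather than raw vectors keeps the subvector distributions tighter around zero, which is one reason a single shared set of PQ codebooks works across all clusters.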

Frequently Asked Questions

How are codebooks learned?

Typically K-means on the subvectors of the training set: for each subspace, take all vectors’ j-th subvector and run K-means to get k centroids.

Are codebooks stored per cluster in IVFPQ?

Usually one set of codebooks for the whole index. Some variants use per-cluster codebooks for better accuracy at higher memory cost.

What if my data distribution changes?

Codebooks are fixed at build time. Significant distribution shift can hurt recall; consider rebuilding or using adaptive/residual quantization. See version drift.

How big is the codebook storage?

m × k × (D/m) × 4 bytes = D × k × 4 bytes for float centroids. For D=128, k=256: 128 KB. Small compared to the compressed vectors (e.g. 8 bytes × millions).