
Indexing Algorithms - IVF & Quantization · Topic 93

Random Projection (Johnson-Lindenstrauss lemma)

Random projection reduces the dimensionality of vectors by multiplying them by a random matrix. The Johnson-Lindenstrauss (JL) lemma guarantees that, with high probability, pairwise distances are preserved up to a small distortion when projecting into a lower-dimensional space of size logarithmic in the number of points. This is used in ANN and LSH to shrink vectors before indexing, reducing memory and speeding up search while keeping approximate nearest neighbor structure.

Summary

  • JL: n points in high dimension can be embedded into O(log n) dimensions with (1±ε) distance preservation. Random matrix (e.g. Gaussian, sparse) suffices. Used in LSH, Annoy, and dim reduction before indexing.
  • Pipeline: draw a random matrix R (k×d, where k is the target dimension); project vectors as y = R·x; build the index on y. Queries are projected the same way. No training; fast and theory-backed.
  • Trade-off: fast and training-free, but the target dimension k = O(ε⁻² log n) needed for (1±ε) preservation can be larger than a data-dependent method would need. PCA (data-dependent, minimizes reconstruction error) and t-SNE (visualization only) serve different purposes.
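The pipeline above can be sketched in a few lines of NumPy (a minimal illustration; the dimensions and seed are arbitrary choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 512, 64, 1000           # original dim, target dim, corpus size

# Gaussian projection matrix R (k x d), scaled by 1/sqrt(k) so that
# squared norms are preserved in expectation.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

X = rng.normal(size=(n, d))       # corpus vectors
Y = X @ R.T                       # projected corpus (n x k); build the index on Y

q = rng.normal(size=d)            # a query is projected with the same R
y_q = R @ q
```

Note that no fitting step touches the data: R is drawn once, independent of X, which is what makes the method training-free.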

Statement and use

Formally, for any set of n points and any ε in (0,1), there exists a linear map into dimension O(ε⁻² log n) such that all pairwise distances are preserved within factor (1±ε) with high probability. In practice you multiply by a random matrix (e.g. i.i.d. Gaussian or a sparse random matrix for speed); the result is a lower-dimensional representation. ANN indexes can then be built on the projected vectors: neighbors in the original space tend to remain neighbors in the projected space, so search in the reduced space is a good proxy.
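The distance-preservation claim is easy to check empirically. A quick sketch, assuming a Gaussian projection matrix scaled by 1/√k:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, k = 200, 1024, 256
X = rng.normal(size=(n, d))

R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
Y = X @ R.T

def pdist2(A):
    # All pairwise squared L2 distances via the Gram matrix.
    G = A @ A.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G

orig = pdist2(X)
proj = pdist2(Y)
mask = ~np.eye(n, dtype=bool)          # ignore the zero diagonal
ratio = proj[mask] / orig[mask]
# With k = 256, the distortion ratios concentrate near 1.
print(ratio.min(), ratio.max())
```

Shrinking k widens the spread of `ratio` around 1, which is the (1±ε) trade-off in action.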

In vector databases

Random projection is used in LSH (e.g. random hyperplanes are a form of projection), in tree-based methods like Annoy (splits use random projections), and as a fast alternative to PCA when you need dimensionality reduction without a training pass. Sparse random matrices (e.g. Achlioptas) reduce computation while still satisfying JL-style guarantees.
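A sketch of the Achlioptas-style sparse construction (entries +√3, 0, −√3 with probabilities 1/6, 2/3, 1/6, scaled by 1/√k): roughly two thirds of the entries are zero, so the projection needs about a third of the multiplications of a dense Gaussian matrix.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 512, 128

# Achlioptas sparse matrix: {+sqrt(3), 0, -sqrt(3)} with probs {1/6, 2/3, 1/6}.
vals = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                  size=(k, d), p=[1/6, 2/3, 1/6])
R = vals / np.sqrt(k)

x = rng.normal(size=d)
y = R @ x
# Squared norm is preserved in expectation, as with the dense case.
print(np.dot(x, x), np.dot(y, y))
```

In production you would store such a matrix in a sparse format to realize the compute savings; this dense version only shows the distribution of entries.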

Practical tip: use random projection when you need fast, training-free dimensionality reduction before building an index. Related considerations include the impact of embedding-model dimensionality on vector-database performance and how quantization reduces memory footprint. Locality-sensitive hashing (LSH) and tree-based indexing (Annoy) both use random projection in their design.

Frequently Asked Questions

How many dimensions for JL?

Target dimension is typically O(ε⁻² log n), but the hidden constant matters. The conservative Dasgupta–Gupta bound k ≥ 4 ln n / (ε²/2 − ε³/3) gives roughly 12,000 dimensions for n = 10⁶ and ε = 0.1, while in practice far fewer dimensions (hundreds to a few thousand) usually preserve neighbor structure well enough for ANN search.
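For concreteness, the standard Dasgupta–Gupta form of the bound can be computed directly; note how insensitive it is to n (logarithmic) and how sensitive it is to ε (quadratic):

```python
import math

def jl_dim(n, eps):
    # Conservative Johnson-Lindenstrauss bound (Dasgupta-Gupta form):
    # k >= 4 ln(n) / (eps^2/2 - eps^3/3)
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

print(jl_dim(1_000_000, 0.1))   # on the order of ~12,000
print(jl_dim(1_000_000, 0.3))   # much smaller for looser eps
```

Practical systems typically accept a looser effective ε (or simply tune k empirically against recall), which is why hundreds of dimensions often suffice despite the worst-case bound.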

Random projection vs. PCA?

PCA is data-dependent and minimizes reconstruction error; JL is data-independent and preserves pairwise distances with high probability. JL is faster (no eigendecomposition and no training pass), but the target dimension may need to be larger for the same guarantee.

Do I need to store the random matrix?

Yes: query vectors must be projected with the same matrix used at index time. The matrix has shape k×d (k = target dim, d = original dim), or the transpose depending on convention; rather than persisting k·d floats, it can be regenerated from a stored random seed.
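Regenerating the matrix from a seed can look like this (a sketch; `projection_matrix` is a hypothetical helper, not a library function):

```python
import numpy as np

def projection_matrix(seed, d, k):
    # Regenerate the same Gaussian matrix from a seed instead of
    # persisting k*d floats alongside the index.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

R1 = projection_matrix(1234, d=512, k=64)
R2 = projection_matrix(1234, d=512, k=64)
assert np.array_equal(R1, R2)   # identical: only the seed needs storing
```

The seed (plus d, k, and the generator type) fully determines the matrix, so it must be versioned with the index: changing any of them silently invalidates all stored projections.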

Does JL work for cosine distance?

JL is usually stated for L2. For unit-norm vectors, cosine similarity and squared L2 distance are linearly related (‖u − v‖² = 2 − 2·cos(u, v)), so a projection that preserves L2 distances approximately preserves cosine similarity. Normalize vectors before projecting, and re-normalize after projecting if downstream search assumes unit vectors.
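A small sanity check that cosine similarity survives projection for a correlated pair of vectors (a sketch with arbitrary dimensions and seed):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 1024, 256
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

a = rng.normal(size=d)
b = a + 0.5 * rng.normal(size=d)   # a correlated second vector

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Cosine similarity before and after projection should be close.
print(cos(a, b), cos(R @ a, R @ b))
```

As with distances, the approximation tightens as k grows; for very small k the projected cosine can drift noticeably.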