Locality-Sensitive Hashing (LSH) for Vector Databases
Locality-Sensitive Hashing (LSH) is a family of hash functions that map similar vectors to the same or nearby buckets with high probability, and dissimilar vectors to different buckets. In a vector database, LSH reduces the search space: instead of comparing the query to every vector, you only search within the buckets the query hashes to, giving approximate nearest neighbor (ANN) behavior with a tunable recall/speed trade-off.
Summary
- Variants: random hyperplane hashing (sign of a random dot product), SimHash for text, MinHash for set/Jaccard similarity.
- Pipeline: build L hash tables with independent random hash functions → hash each vector into a bucket per table → at query time, hash the query, union the candidates from the matching buckets, and optionally re-rank them with exact distance.
- Trade-off: more tables and more bits per hash improve recall but increase memory and query time; in practice, HNSW and IVF often give better recall–latency curves for vector ANN. Random projection (the Johnson-Lindenstrauss lemma) underlies many LSH schemes.
How LSH works
You build several hash tables, each with a different random LSH function. To index a vector, you hash it into one or more buckets in each table. To query, you hash the query into buckets and union the candidate vectors from those buckets, then (optionally) re-rank with exact distance. Similar vectors collide with higher probability; tuning the number of tables and the number of bits per hash controls the trade-off between recall and the size of the candidate set.
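The build/query flow above can be sketched with random hyperplane hashing. This is a minimal illustration, not a production index; the class name, parameter defaults, and use of plain Python dicts as hash tables are all choices made for this example:

```python
import numpy as np

class HyperplaneLSH:
    """Multi-table LSH index using random hyperplanes (sign of dot product)."""

    def __init__(self, dim, n_tables=8, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        # One (n_bits, dim) matrix of random hyperplane normals per table.
        self.planes = [rng.standard_normal((n_bits, dim)) for _ in range(n_tables)]
        self.tables = [dict() for _ in range(n_tables)]
        self.vectors = []

    def _key(self, planes, v):
        # Bucket key: the pattern of signs of v against each hyperplane.
        return tuple((planes @ v > 0).astype(np.int8))

    def add(self, v):
        idx = len(self.vectors)
        self.vectors.append(v)
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, v), []).append(idx)

    def query(self, q, k=5):
        # Union candidates from the bucket q falls into in each table...
        cand = set()
        for planes, table in zip(self.planes, self.tables):
            cand.update(table.get(self._key(planes, q), []))
        # ...then re-rank the candidates by exact cosine similarity.
        scored = sorted(
            cand,
            key=lambda i: -np.dot(q, self.vectors[i])
            / (np.linalg.norm(q) * np.linalg.norm(self.vectors[i]) + 1e-12),
        )
        return scored[:k]
```

Note that only vectors sharing a bucket with the query are ever scored exactly, which is where the speedup over brute-force search comes from.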
Variants and parameters
Common LSH families include random hyperplane hashing (the sign of a random projection, essentially SimHash) for cosine similarity and p-stable projections for L2 distance. Increasing the number of hash tables, or repeating the scheme with different seeds, improves recall but increases memory and query time. In practice, HNSW and IVF often give better recall–latency curves for vector search; LSH remains useful for set similarity (e.g. MinHash for Jaccard) and in some hybrid or legacy systems.
Practical tip: for vector ANN, prefer HNSW or IVF unless you have a specific need for LSH (e.g. set overlap, existing LSH infrastructure). Random projection (the Johnson-Lindenstrauss lemma) and Hamming distance over binary codes are closely related to LSH-style hashing. For alternatives, see search-space reduction techniques and the comparison of graph-based vs. cluster-based indexes.
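To show the set-similarity case mentioned above, here is a small MinHash sketch. It salts a standard-library hash with the function index to simulate independent hash functions; the function names and the choice of `blake2b` are this example's, not a particular library's API:

```python
import hashlib

def minhash_signature(items, n_hashes=256):
    """MinHash signature: for each of n_hashes hash functions, record the
    minimum hash value over the set's elements. Two sets agree at any given
    signature position with probability equal to their Jaccard similarity."""
    sig = []
    for i in range(n_hashes):
        # Salt the hash with the function index to derive n_hashes
        # (approximately) independent hash functions.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(100))      # {0, ..., 99}
b = set(range(50, 150))  # overlap 50, union 150 → true Jaccard = 1/3
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate converges to the true Jaccard similarity as `n_hashes` grows, which is exactly the locality-sensitive property for set overlap.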
Frequently Asked Questions
What does “locality-sensitive” mean?
Nearby points (in the distance you care about) have high probability of hashing to the same bucket; far points hash to different buckets with high probability.
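This can be made concrete for random hyperplane hashing: a random hyperplane separates two unit vectors at angle θ with probability θ/π, so a single sign bit collides with probability 1 − θ/π. A quick Monte Carlo check (the sample count and seed are arbitrary):

```python
import numpy as np

theta = np.pi / 4  # two unit vectors 45 degrees apart
a = np.array([1.0, 0.0])
b = np.array([np.cos(theta), np.sin(theta)])

rng = np.random.default_rng(0)
normals = rng.standard_normal((200_000, 2))  # random hyperplane normals
same_bit = (normals @ a > 0) == (normals @ b > 0)
print(same_bit.mean())  # empirical collision rate, close to 1 - 1/4 = 0.75
```

Closer pairs (smaller θ) collide more often, which is the "locality-sensitive" property stated formally.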
Why multiple hash tables?
One table can miss similar pairs (they land in different buckets). Multiple tables increase the chance that at least one table hashes both to the same bucket.
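The arithmetic behind this amplification: if a similar pair collides on one hash bit with probability p, a table keyed on k bits matches with probability p^k, and at least one of L independent tables matches with probability 1 − (1 − p^k)^L. A small calculation (p = 0.9 is an assumed per-bit collision probability, not a measured value):

```python
def collision_prob(p: float, k: int, L: int) -> float:
    # Probability that at least one of L independent k-bit tables
    # puts a similar pair in the same bucket.
    return 1 - (1 - p**k) ** L

p = 0.9  # assumed per-bit collision probability for a similar pair
print(collision_prob(p, k=12, L=1))  # single 12-bit table: ~0.28
print(collision_prob(p, k=12, L=8))  # eight tables: ~0.93
```

More bits per table (larger k) shrink each candidate set, while more tables (larger L) recover the recall that sharper keys give up.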
Does LSH work for inner product?
Partly. Signed random projection (the sign of a·r for random r) is a formal LSH family for cosine/angular similarity, and p-stable projections give one for L2. Maximum inner product search (MIPS) has no symmetric LSH family, but asymmetric LSH (ALSH) handles it by transforming the vectors so that MIPS reduces to a cosine or L2 search.
When to use LSH vs. IVF?
IVF is cluster-based and often gives better recall for vector search; LSH is simpler to implement and can be good for very high dimensions or when you need probabilistic guarantees. See graph vs. cluster indexes.