
Similarity Metrics (Mathematical Foundations) · Topic 45

Hamming Distance for binary vectors.

Hamming distance between two binary vectors (or bit strings) is the number of positions at which the bits differ. For vectors a and b with components in {0, 1}, it equals the number of indices i where aᵢ ≠ bᵢ—equivalently, the number of 1s in the XOR of the two vectors. It is the standard distance for binary codes and is central to binary quantization and many hashing-based methods.

Summary

  • Fast to compute: bitwise XOR + popcount. Used with learned or random binary codes to approximate similarity.
  • Locality-sensitive hashing (LSH) and binary quantization reduce storage and speed up search at the cost of accuracy. For full-precision vectors, L2 or cosine are typical.
  • Hamming is a true metric; after binarization (e.g. sign bit) you lose magnitude—use for extreme compression and low latency when some accuracy loss is acceptable.
  • Pipeline: binarize embeddings if needed, store binary codes, then search by Hamming (XOR + popcount); consider re-ranking with full-precision distance on a shortlist.
  • Trade-off: 32× compression vs. float32 and very fast distance, at the cost of recall; use for first-stage retrieval or when storage/latency dominate.

Computation and compactness

Binary vectors are extremely compact (one bit per dimension), so Hamming distance is fast to compute with bitwise XOR and popcount (count of set bits). Many systems use learned or random binary codes to approximate similarity: hash real-valued embeddings to bits, then retrieve by Hamming distance.

Techniques like locality-sensitive hashing (LSH) and binary quantization can reduce storage and speed up search at the cost of accuracy. Trade-off: you gain speed and compression (e.g. 32× smaller than float32 per dimension) but lose magnitude and some recall; a practical tip is to use Hamming for a fast first stage and optionally re-rank with L2 or cosine on a shortlist.
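The 32× figure follows directly from the storage arithmetic (the dimension here is chosen for illustration):

```python
dim = 1024                    # illustrative embedding dimension
float32_bytes = dim * 4       # 4 bytes per float32 component -> 4096 bytes
binary_bytes = dim // 8       # 1 bit per component -> 128 bytes
print(float32_bytes // binary_bytes)  # 32x compression
```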

Metric properties and when to use it

Hamming distance is a true metric (non-negative, symmetric, zero only when equal, and satisfies the triangle inequality). For non-binary vectors you can still use Hamming after binarization (e.g. sign of each dimension), though that throws away magnitude; for full precision, L2 or cosine are typically used instead.
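Sign binarization can be sketched as follows (assuming the common convention that a non-negative component maps to 1); note how magnitude is discarded:

```python
def binarize_sign(vec: list[float]) -> int:
    """Map each dimension to one bit by its sign; magnitude information is lost."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x >= 0 else 0)
    return code

# Two vectors with the same signs but very different magnitudes get identical codes:
print(binarize_sign([0.9, -0.1, 2.5]) == binarize_sign([0.1, -3.0, 0.4]))  # True
```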

When to use Hamming: when you need extreme compression (e.g. billion-scale vectors in memory), very low latency (popcount is highly optimized), or when your data is naturally binary (e.g. hashed fingerprints, binary features). When to avoid: when you need high recall and magnitude matters; then use full-precision L2 or cosine and consider quantization only for a fast first stage with re-ranking.

Pipeline and practical use

Typical pipeline: produce or binarize embeddings (e.g. sign of each dimension, or a learned binary layer), store only the binary codes in the index, and run k-NN by Hamming distance. For better accuracy, retrieve more candidates by Hamming, then re-rank with full-precision similarity on a small subset. See binary quantization (BQ) for when its speed advantages make binary search worth the accuracy trade-off.

Practical tip: use a two-stage pipeline—retrieve top-k by Hamming (e.g. k = 500), then re-rank the shortlist with L2 or cosine on full-precision vectors to recover recall. Thresholding on Hamming distance is dimension-dependent (e.g. “at most 10 bits different” out of 256); define a good match score based on your binarization and data; see thresholding for defining cutoffs.

Frequently Asked Questions

How do I compute Hamming distance in code?

XOR the two bit vectors, then count set bits (popcount). Many CPUs have a popcount instruction; vector libraries use it for speed.

Can I use Hamming for real-valued embeddings?

Yes, after binarization (e.g. sign bit per dimension). You lose magnitude; for full precision use L2 or cosine. See when to use L2 vs. cosine similarity.

What is the relationship to Jaccard?

For sets represented as binary vectors (element present = 1), Hamming relates to set difference; Jaccard is intersection over union. See Jaccard similarity for sparse vectors.

Why use binary codes in vector search?

Extreme compression (1 bit/dim) and very fast distance (XOR + popcount). Useful when storage or latency is critical and some accuracy loss is acceptable.