Jaccard Similarity for sparse vectors.
Jaccard similarity is defined for two sets as the size of their intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|. For sparse vectors, you treat the set of non-zero (or significant) dimensions as the set, so Jaccard measures overlap of “active” dimensions. It ranges from 0 (no overlap) to 1 (identical sets).
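The definition above can be sketched in a few lines. This is a minimal illustration, assuming sparse vectors are represented as `{dimension_index: weight}` dicts (an assumed representation, not a specific library's API):

```python
# Jaccard similarity over the non-zero dimensions of two sparse vectors,
# each given as a {index: weight} dict. Only which dimensions are "on" matters.

def jaccard(a: dict, b: dict) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the sets of non-zero indices."""
    A, B = set(a), set(b)        # dict keys = active dimensions
    union = A | B
    if not union:                # both vectors empty: define J = 0
        return 0.0
    return len(A & B) / len(union)

# Two sparse vectors sharing dimensions 1 and 4.
u = {1: 0.7, 4: 0.2, 9: 0.5}
v = {1: 0.3, 4: 0.9, 7: 0.1}
print(jaccard(u, v))             # 2 shared / 4 total = 0.5
```

Note that the weights (0.7, 0.3, …) never enter the computation; only the key sets do.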
Summary
- Natural for set-like or bag-of-words data; the corresponding distance is often 1 − J. For weighted sparse vectors, cosine or overlap coefficients are more common.
- In vector DBs, Jaccard is less common than cosine or L2 for dense embeddings; it is used in sparse retrieval, duplicate detection, and hybrid setups. MinHash and LSH can approximate Jaccard at scale.
- Jaccard distance 1 − J satisfies the triangle inequality for sets; Jaccard itself is a similarity (0 = no overlap, 1 = identical).
- Pipeline: represent items as sets of non-zero dimensions (or tokens), compute J = |A ∩ B| / |A ∪ B|, or use MinHash/LSH for large-scale approximate search.
- Trade-off: exact Jaccard is simple but O(|A| + |B|); MinHash gives a bounded-error approximation with sublinear time for large sets.
Set-like and sparse use cases
Jaccard is natural for set-like or bag-of-words data: document term sets, user interaction sets, or any sparse representation where presence/absence matters more than magnitude. The corresponding distance is often 1 − J (Jaccard distance).
It is not a metric in the strict sense when used on weighted vectors (e.g. TF-IDF) unless you binarize; for weighted sparse vectors, cosine or overlap coefficients are more common. Practical tip: use Jaccard when you care only about which dimensions are “on”; when weights matter (e.g. term importance), use cosine similarity instead.
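The contrast between the two measures can be made concrete. In this illustrative sketch (the vectors and weights are made up), two vectors activate the same dimensions with very different weights: Jaccard calls them identical, cosine does not:

```python
# Jaccard ignores weights; cosine uses them. Two sparse vectors with the same
# active dimensions but different magnitudes illustrate the difference.
import math

def jaccard(a: dict, b: dict) -> float:
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if (A | B) else 0.0

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

u = {0: 1.0, 1: 1.0}
v = {0: 1.0, 1: 0.1}   # same "on" dimensions, very different weights
print(jaccard(u, v))    # 1.0 — weights ignored entirely
print(cosine(u, v))     # < 1.0 — the weight difference is visible
```

If the weight difference should affect the score (as with TF-IDF term importance), cosine is the better choice; if only co-occurrence of dimensions matters, Jaccard is.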
Role in vector databases
In vector databases, Jaccard is less common than cosine or L2 for dense embeddings, but it appears in sparse retrieval, duplicate detection, and hybrid setups where one signal is set-based.
MinHash and LSH can approximate Jaccard quickly for large sets, enabling scalable set similarity search. Trade-off: exact Jaccard costs O(|A| + |B|) in the set sizes; MinHash gives a bounded-error estimate with sublinear query time. Locality-sensitive hashing (LSH) for VDBs describes how LSH is used in vector and set search.
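A hedged sketch of the MinHash idea: apply k independent hash functions to a set, keep the minimum hash value under each, and estimate Jaccard as the fraction of positions where two signatures agree. The hash mixing below (hashing a `(seed, element)` tuple) is illustrative, not a production scheme:

```python
# MinHash: the probability that two sets share the same minimum under a random
# hash function equals their Jaccard similarity, so agreement across k hashes
# estimates J with error shrinking as k grows.
import random

def minhash_signature(items: set, seeds: list) -> list:
    return [min(hash((seed, x)) for x in items) for seed in seeds]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(256)]
A = set(range(0, 80))        # true J = |A ∩ B| / |A ∪ B| = 60 / 100 = 0.6
B = set(range(20, 100))
sig_a = minhash_signature(A, seeds)
sig_b = minhash_signature(B, seeds)
print(estimate_jaccard(sig_a, sig_b))   # close to 0.6, within sampling error
```

The payoff is that signatures are fixed-size regardless of set size, so comparisons cost O(k) and signatures can be banded into LSH buckets for sublinear candidate retrieval.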
Practical tip: when combining set-based signals with dense vector search (e.g. hybrid search), use Jaccard or overlap on keyword/tag sets and cosine or L2 on embeddings; then fuse the two rankings (e.g. RRF or weighted combination). Metadata filtering can further restrict to a subset before or after Jaccard similarity.
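The rank-fusion step mentioned above can be sketched with Reciprocal Rank Fusion (RRF). The document IDs are made up for illustration; k = 60 is the constant commonly used with RRF:

```python
# Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) per document,
# and documents are sorted by their summed score. Rankings are illustrative.

def rrf_fuse(rankings: list, k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

jaccard_ranking = ["doc3", "doc1", "doc7"]   # e.g. from tag-set overlap
dense_ranking   = ["doc1", "doc7", "doc3"]   # e.g. from cosine on embeddings
print(rrf_fuse([jaccard_ranking, dense_ranking]))  # doc1 wins: strong in both
```

RRF needs no score calibration between the two signals, which is why it is a common choice when fusing a set-based ranking with a dense-vector ranking.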
When to choose Jaccard vs. cosine
Choose Jaccard when only presence/absence of dimensions (or tokens) matters—e.g. binary features, tag sets, or binarized bag-of-words. Choose cosine when vector components have weights (e.g. TF-IDF, embeddings); cosine uses magnitude, Jaccard ignores it. For hybrid search combining vector and keyword signals, Jaccard or overlap on keyword sets can be one of the signals alongside dense vector similarity.
Pipeline summary: represent documents or items as sets (e.g. token IDs or dimension indices), build an index that supports set or sparse similarity (or use MinHash signatures), and query with the same set representation. Not all vector DBs support Jaccard natively; some expose it for sparse or metadata-only use cases, or you compute Jaccard in application code after filtering by other criteria.
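The pipeline summary above can be sketched end to end for the application-code case, assuming exact Jaccard over an in-memory index (the tokenizer and corpus are illustrative, not a specific DB's API):

```python
# Pipeline sketch: tokenize documents into sets, index the sets, and answer
# queries with the same set representation, ranked by exact Jaccard.

def tokenize(text: str) -> frozenset:
    return frozenset(text.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

corpus = {
    "d1": "sparse vector similarity search",
    "d2": "dense vector embedding search",
    "d3": "set similarity with jaccard",
}
index = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

def query(text: str, top_k: int = 2) -> list:
    q = tokenize(text)
    ranked = sorted(index, key=lambda d: jaccard(q, index[d]), reverse=True)
    return ranked[:top_k]

print(query("vector similarity search"))   # d1 overlaps most with the query
```

For large collections, the linear scan in `query` is the part you would replace with MinHash signatures plus an LSH index.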
Frequently Asked Questions
How do I use Jaccard with sparse vectors?
Treat the set of non-zero (or above-threshold) dimension indices as the set; compute intersection and union sizes, then J = |A ∩ B| / |A ∪ B|.
Is Jaccard a metric?
1 − J (Jaccard distance) satisfies the triangle inequality for sets; J itself is a similarity. See mathematical properties of a metric space.
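A quick numeric spot-check of the metric property (not a proof; the sampled sets are arbitrary):

```python
# Sample random small sets and verify d(A, C) <= d(A, B) + d(B, C)
# for Jaccard distance d = 1 - J on every ordered triple.
import itertools
import random

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - (len(a & b) / len(a | b)) if (a | b) else 0.0

random.seed(1)
sets = [set(random.sample(range(10), random.randint(1, 6))) for _ in range(20)]
assert all(
    jaccard_distance(a, c) <= jaccard_distance(a, b) + jaccard_distance(b, c) + 1e-12
    for a, b, c in itertools.permutations(sets, 3)
)
print("triangle inequality held on all sampled triples")
```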
When to use Jaccard vs. cosine for sparse data?
Jaccard when only presence/absence matters. When weights matter (e.g. TF-IDF), cosine is usually better.
How does MinHash relate to Jaccard?
MinHash gives a fast approximation of Jaccard similarity; it’s used in large-scale set similarity and LSH pipelines.