Similarity Metrics (Mathematical Foundations) · Topic 50

Mahalanobis Distance

Mahalanobis distance measures the distance between a point and a distribution (or between two points) in a way that accounts for the covariance of the dimensions. For a vector x, mean μ, and covariance matrix Σ, it is D_M(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ)). When Σ is the identity, this reduces to Euclidean (L2); otherwise it scales and rotates the space so that equal Mahalanobis distance corresponds to equal probability under a Gaussian.
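The definition above can be checked directly in a few lines; a minimal NumPy sketch (function name and sample values are illustrative):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """D_M(x) = sqrt((x - mu)^T Sigma^-1 (x - mu))."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x = np.array([2.0, 0.0])
mu = np.zeros(2)

# With identity covariance, Mahalanobis reduces to plain L2: here 2.0.
d_identity = mahalanobis(x, mu, np.eye(2))

# A non-identity covariance rescales the space: high variance along a
# direction makes points in that direction look "closer" to the mean.
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])
d_corr = mahalanobis(x, mu, cov)
```

With this covariance, the first dimension has variance 4, so the same point is fewer "standard deviations" away and `d_corr` comes out smaller than `d_identity`.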

Summary

  • Useful when dimensions have different scales or are correlated (whitens the space). Common in anomaly detection; rarely the primary metric in vector DBs.
  • Requires estimating Σ (or Σ⁻¹); d×d cost is high for large d. Pre-transform with Σ⁻¹/² then use L2 for equivalent result. See custom distance functions for VDB support.
  • When Σ is the identity, Mahalanobis reduces to L2. For positive definite Σ, Mahalanobis is a true metric.
  • Pipeline: estimate Σ from data, compute Σ⁻¹/², transform vectors and query, then run standard L2 search in the transformed space.
  • Trade-off: principled, distribution-aware distance vs. O(d²) cost and lack of native ANN support; pre-whitening + L2 is the practical approach.

When Mahalanobis is used

Mahalanobis is useful when dimensions have different scales or are correlated: it effectively whitens the space. In anomaly detection and classification, it helps define “unusual” relative to the data distribution.

In vector databases and ANN, it is rarely used as the primary metric because (1) you need a good estimate of Σ (or Σ⁻¹), (2) the matrix is d×d and expensive for high d, and (3) most ANN indexes are built for L2 or inner product. So it appears more in statistical modeling and specialized retrieval than in general-purpose VDBs. Trade-off: Mahalanobis gives a principled, distribution-aware distance but at O(d²) storage and cost per comparison; pre-whitening + L2 avoids custom index support.

Using L2 in a transformed space

If you need scale- and correlation-aware distance, you can pre-transform vectors (multiply by Σ⁻¹/², which exists whenever Σ is positive definite) and then use L2 in the transformed space—that is exactly equivalent to Mahalanobis in the original space.

Practical tip: compute Σ (or its inverse square root) from a sample of your data; apply the same transform at ingest and at query time. Then create a standard L2 collection on the transformed vectors. This way you get Mahalanobis semantics without requiring the VDB to support a custom metric. For very high d, consider diagonal Σ (per-dimension scaling only) to avoid the full d×d matrix.
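The whitening equivalence can be verified numerically; a sketch assuming a positive definite Σ estimated from synthetic data (the mixing matrix and seed are arbitrary), with Σ⁻¹/² obtained by eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative corpus with correlated, mixed-scale dimensions.
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.5]])
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# W = Sigma^{-1/2} via eigendecomposition (requires positive definite Sigma).
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Apply the SAME transform at ingest and at query time.
corpus_t = (X - mu) @ W
q = rng.normal(size=3)
q_t = (q - mu) @ W

# L2 in the transformed space equals Mahalanobis in the original space,
# because W^T W = Sigma^{-1}.
d_l2 = np.linalg.norm(q_t - corpus_t[0])
diff = q - X[0]
d_mahal = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

In practice `corpus_t` is what you would ingest into a standard L2 collection, and `q_t` is the transformed query.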

When Mahalanobis equals L2

When the covariance Σ is the identity matrix (uncorrelated, unit variance), the formula reduces to standard Euclidean distance. So Mahalanobis generalizes L2 to non-identity covariance; many embedding spaces are already roughly whitened, in which case L2 is sufficient.

Pipeline summary: estimate Σ from a representative sample (or use a diagonal approximation for speed), compute the inverse square root, and apply it to all vectors at ingest and at query time. Create a standard L2 collection on the transformed vectors. For positive definite Σ, Mahalanobis satisfies the metric axioms, so thresholding and recall evaluation in the transformed space follow the same principles as for L2.
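For high d, the diagonal approximation mentioned above reduces the transform to per-dimension scaling, O(d) instead of O(d²). A small sketch (synthetic data; the per-dimension scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative corpus whose dimensions have very different scales.
X = rng.normal(size=(500, 128)) * rng.uniform(0.1, 10.0, size=128)

# Diagonal Sigma: keep only per-dimension standard deviations.
std = X.std(axis=0)

def transform(v):
    # Equivalent to multiplying by diag(Sigma)^{-1/2}.
    return v / std

corpus_t = transform(X)                  # index these in a standard L2 collection
query_t = transform(rng.normal(size=128))  # same transform at query time
```

After the transform every dimension has unit variance, so plain L2 on `corpus_t` behaves like diagonal Mahalanobis on the original vectors.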

Frequently Asked Questions

When does Mahalanobis equal L2?

When the covariance Σ is the identity matrix (uncorrelated, unit variance). Then the formula reduces to standard Euclidean distance.

Why is Mahalanobis uncommon in vector DBs?

Cost of Σ⁻¹ (or storing/using it) and lack of built-in ANN support for this metric. Pre-whitening + L2 is a common workaround.

Is Mahalanobis a metric?

Yes (for positive definite Σ). It satisfies the usual metric axioms; see mathematical properties of a metric space.

Can I use Mahalanobis in my vector DB?

Only if the engine supports custom metrics or you pre-transform data and use L2.