Thresholding: How to define a “good” match score
A threshold is a cutoff on similarity (or distance) used to decide which results count as matches. There is no universal “good” value—it depends on your metric, your data, and your application’s tolerance for false positives and false negatives.
Summary
- For cosine similarity (range −1 to 1), 0.7–0.8 is a loose starting range; for L2 distance, lower is better and the threshold is scale-dependent. Best: labeled data, score distributions, then tune.
- Use relative criteria (top-K, re-rank then threshold) or adaptive thresholds. Document normalized vs. unnormalized scores; keep thresholds in config.
- No universal “good” value—depends on metric, data, and tolerance for false positives vs. false negatives. Recalibrate when changing embedding model or collection.
- Pipeline: collect labeled pairs → score distributions → choose threshold (or use top-K / re-rank then threshold); store in config and revisit as data drifts.
- Trade-off: fixed thresholds are simple but brittle to distribution shift; adaptive or relative thresholds are more robust but less interpretable.
Choosing a threshold by metric
For cosine similarity (range −1 to 1), values like 0.7 or 0.8 are often used as loose starting points for “relevant” text, but they vary by embedding model and domain. For L2 distance, lower is better; the threshold is a maximum allowed distance and is scale-dependent (e.g. on embedding dimension and normalization).
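A minimal pure-Python illustration of why the two metrics need different thresholds (the toy vectors are made up for the example):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms; range -1..1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    # Euclidean distance; lower is better, and the scale depends on
    # vector magnitude (hence on normalization and dimension).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))  # 1.0 -- direction only, scale-invariant
print(l2_distance(a, b))        # ~3.74 -- grows with vector magnitude
```

Note that for unit-normalized embeddings the two metrics are related by L2² = 2 · (1 − cosine), so a cosine cutoff of 0.8 corresponds to an L2 cutoff of about 0.63.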
The same numerical threshold therefore rarely transfers across models or datasets. Best practice is to collect labeled query–document pairs (or use relevance judgments), compute score distributions for relevant vs. non-relevant pairs, and pick a threshold that balances precision and recall for your use case. Practical tip: plot score histograms for relevant and non-relevant pairs, then set the threshold at a value that separates them reasonably (e.g. high precision with acceptable recall).
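The tuning step above can be sketched as a simple threshold sweep over labeled pairs. This is a minimal sketch, not a standard API: `pick_threshold` and the precision target are illustrative names.

```python
def precision_recall_at(threshold, scored_pairs):
    """scored_pairs: list of (score, is_relevant) tuples; higher score = better."""
    predicted = [(s, rel) for s, rel in scored_pairs if s >= threshold]
    tp = sum(1 for _, rel in predicted if rel)
    fp = len(predicted) - tp
    fn = sum(1 for s, rel in scored_pairs if rel and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(scored_pairs, min_precision=0.9):
    # Scan observed scores from high to low and keep the lowest threshold
    # (best recall) that still meets the precision target.
    best = None
    for t in sorted({s for s, _ in scored_pairs}, reverse=True):
        p, _ = precision_recall_at(t, scored_pairs)
        if p >= min_precision:
            best = t
    return best

labeled = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(pick_threshold(labeled, min_precision=1.0))  # 0.8
```

In practice you would run this over hundreds of labeled pairs per model and collection, and rerun it whenever either changes.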
Relative and adaptive approaches
Alternatively, use relative criteria: take top-K by score and optionally apply a secondary filter, or use re-ranking and then threshold on the re-ranker’s score. Some systems use adaptive thresholds (e.g. top score minus a margin) when absolute scores are unstable.
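A top-score-minus-margin filter, for example, might look like this (hypothetical helper; the margin value is illustrative):

```python
def adaptive_filter(results, margin=0.1):
    """Keep results whose score is within `margin` of the top score.

    results: list of (doc_id, score) tuples, higher score = better match.
    Because the cutoff is relative to the best hit, it adapts per query
    even when absolute score scales are unstable.
    """
    if not results:
        return []
    top = max(score for _, score in results)
    return [(doc, score) for doc, score in results if score >= top - margin]

hits = [("a", 0.92), ("b", 0.88), ("c", 0.70)]
print(adaptive_filter(hits))  # keeps "a" and "b"; "c" falls outside the margin
```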
Document whether your scores are normalized or unnormalized, and keep thresholds in config so you can tune them as your data and model change. Trade-off: fixed thresholds are simple but can degrade when score distributions shift; adaptive thresholds (e.g. relative to top score) are more robust but less interpretable. See “Normalized vs. unnormalized distance scores” for how score scale affects threshold choice.
Pipeline summary: define “good” with labeled data and score distributions; set a cutoff (or use top-K / re-rank then cutoff) and store it in config. Recalibrate when you change the embedding model, collection, or metric. The impact of distance metrics on recall is a separate topic: recall measures retrieval completeness, while thresholding filters the returned set by score. Both matter for end-user quality.
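Keeping thresholds in config, keyed per embedding model, might look like the following sketch (model names, values, and the `filter_results` helper are placeholders, not recommendations):

```python
# Thresholds live in config, keyed by embedding model, so each model
# gets its own calibrated cutoff instead of a hard-coded constant.
RETRIEVAL_CONFIG = {
    "model-a": {"metric": "cosine", "threshold": 0.78, "top_k": 50},
    "model-b": {"metric": "l2", "threshold": 1.10, "top_k": 50},
}

def filter_results(results, model_name, config=RETRIEVAL_CONFIG):
    """results: list of (doc_id, score) tuples scored by `model_name`."""
    cfg = config[model_name]
    if cfg["metric"] == "cosine":
        # Similarity: higher is better, keep scores at or above the cutoff.
        return [r for r in results if r[1] >= cfg["threshold"]]
    # Distance (e.g. L2): lower is better, keep scores at or below the cutoff.
    return [r for r in results if r[1] <= cfg["threshold"]]
```

Swapping models then means recalibrating one config entry rather than hunting for constants in code.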
Frequently Asked Questions
Is cosine > 0.8 always “good”?
No. It depends on model, domain, and task. Use labeled data to find the threshold that works for your precision/recall goals.
Can I use the same threshold for different embedding models?
Usually not. Score distributions differ; recalibrate per model (and often per collection).
What is an adaptive threshold?
One that depends on the query or result set, e.g. “top score − 0.1” or “top 10% of scores.” Useful when absolute scale is unstable.
Should I threshold before or after re-ranking?
Often after: get top-K from vector search, re-rank, then threshold on re-ranker score. Re-ranker scores are usually more interpretable.
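That flow can be sketched as follows (the `search_fn` and `rerank_fn` callables are hypothetical stand-ins, not a specific library's API):

```python
def retrieve(query, search_fn, rerank_fn, k=50, rerank_threshold=0.5):
    """Sketch of: vector search -> re-rank -> threshold on re-ranker score.

    search_fn(query, k) -> list of (doc, vector_score)
    rerank_fn(query, doc) -> relevance score, assumed higher = better.
    """
    # Step 1: recall-oriented candidate retrieval from the vector index.
    candidates = search_fn(query, k)
    # Step 2: re-score each candidate with the (typically slower) re-ranker.
    reranked = [(doc, rerank_fn(query, doc)) for doc, _ in candidates]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: threshold on the re-ranker score, which is usually more stable.
    return [(doc, s) for doc, s in reranked if s >= rerank_threshold]
```

The vector-search threshold (if any) can then stay loose, since its job is only to feed enough candidates to the re-ranker.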