Basic Fundamentals · Topic 3

Why can’t traditional SQL indexes (B-Trees) work for vectors?

B-trees (and other classic SQL indexes) are built for a single ordered dimension: they answer “equal to” and “in range” fast by keeping data in sorted order. A vector has many dimensions, and the query you care about is nearest neighbor—not “exact match” or “between A and B.” There is no single global order that makes nearest-neighbor search efficient, so B-trees don’t fit.

Summary

  • B-trees assume one-dimensional order: compare two values, go left or right; great for equality and range.
  • Vectors are high-dimensional; the main query is nearest neighbor, not exact match or range.
  • Lexicographic (or any single) order on vectors does not preserve similarity: two close vectors can be far apart in that order.
  • Curse of dimensionality: in high dimensions, simple space splits don’t create small, tight regions, so B-tree–style partitioning doesn’t prune well.
  • Vector DBs use HNSW, IVF, and other indexes built for similarity and geometry—hence the need for a vector database.

What B-trees are good at

A B-tree keeps keys in sorted order along a single dimension. Given a key, you traverse the tree by comparing: if the key is less than the pivot, go left; if greater, go right. That gives O(log n) lookups for exact match and efficient range scans (“all rows where column BETWEEN a AND b”). The whole design assumes that “order” is meaningful and that you can partition the key space along that one dimension so that entire subtrees can be skipped when they’re outside the range. That works perfectly for integers, dates, strings (lexicographically), and other one-dimensional or low-dimensional keys.

Hash indexes are another classic structure: they support equality lookups in constant time on average but do not support range queries or ordering. Neither B-trees nor hash indexes are designed to answer “which keys are nearest to this key?” in a high-dimensional space—that question is at the heart of vector search.
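As a toy illustration (not a real B-tree, just the same one-dimensional idea), binary search over a sorted list reproduces the pruning behavior described above, and a plain dict plays the role of a hash index. The `range_scan` helper and the sample keys are invented for this sketch:

```python
import bisect

# Sorted keys stand in for a B-tree's single ordered dimension.
keys = [3, 8, 15, 21, 42, 57, 63, 90]

def range_scan(sorted_keys, lo, hi):
    """Return all keys in [lo, hi]: binary-search to the start of the
    range, then take a contiguous slice. Everything outside the slice is
    skipped -- the same pruning a B-tree does with whole subtrees."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]

print(range_scan(keys, 10, 60))   # [15, 21, 42, 57]

# A hash index: O(1) average equality lookup, but no ordering at all.
index = {k: f"row-{k}" for k in keys}
print(index[42])                  # row-42
```

Note how the range scan only works because "between 10 and 60" is a contiguous stretch of the sorted order; "nearest to this vector" has no such stretch.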

What vector search needs

For a vector, the query you care about is: “find the k vectors closest to this query vector” under a distance like cosine similarity or Euclidean distance. There is no single “sorted order” that aligns with “closest.” You could impose lexicographic order (first dimension, then second, and so on), but two vectors that are very close in distance can sit at very different lexicographic positions—so the index cannot prune the search space the way a B-tree prunes a range. An index that relies on one-dimensional order therefore cannot support efficient approximate nearest neighbor (ANN) search.

Even if you sort by the first dimension only, the “nearest” neighbor in full space might live in a different branch of the tree. B-trees excel when the query predicate aligns with the sort order; for nearest-neighbor in d dimensions, no single linear order captures proximity in all dimensions at once.
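A small sketch makes this concrete (the vectors `u`, `v`, `w` are made-up examples): sorting tuples lexicographically places a distant vector between two true nearest neighbors, so a tree built on that order could not prune correctly:

```python
import math

# u and v are nearly identical in L2 distance; w is far from both
# but lexicographic (tuple) order slots it between them.
u = (0.999, 5.0)
v = (1.001, 5.0)
w = (1.000, -100.0)

ranked = sorted([u, v, w])        # lexicographic order: u, w, v
print(ranked)

print(math.dist(u, v))            # 0.002  -- true nearest neighbors
print(math.dist(u, w))            # ~105   -- u's lexicographic neighbor is far away
```

The sort key (first component) simply carries no information about proximity in the remaining dimensions.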

The curse of dimensionality

In high dimensions, volume grows exponentially. Most of the space is “in the corners”; points tend to be roughly the same distance from each other, and simple axis-aligned or one-dimensional splits don’t create tight, small regions. So even if you built a tree over vectors (e.g. by lexicographic order), the partitions wouldn’t correspond to “neighborhoods” in terms of similarity. That’s the curse of dimensionality: naive space partitioning fails.
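A quick, illustrative experiment shows the distance-concentration effect (the `min_max_ratio` helper is invented for this sketch, and the exact numbers depend on the random seed): as the dimension grows, the nearest and farthest points from a query become nearly equidistant, so there is less and less for a partitioning index to exploit:

```python
import math
import random

random.seed(0)

def min_max_ratio(dim, n=200):
    """Draw n random points in the unit cube [0,1]^dim and return the
    ratio of the nearest to the farthest distance from one query point.
    A ratio near 1.0 means 'everything is about equally far away'."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    q = [random.random() for _ in range(dim)]
    dists = sorted(math.dist(q, p) for p in pts)
    return dists[0] / dists[-1]

ratios = {d: round(min_max_ratio(d), 3) for d in (2, 10, 100, 1000)}
print(ratios)   # the ratio climbs toward 1.0 as dimension grows
```

In 2D the nearest point is dramatically closer than the farthest; by 1000 dimensions the gap has largely collapsed, which is exactly why naive splits stop pruning.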

Vector databases avoid this by using indexes designed for similarity—graph-based (HNSW) or cluster-based (IVF)—that exploit geometry and distance, not scalar order. Those structures organize the space so that nearby vectors are discoverable in a small number of steps. That’s why you need a vector database or a dedicated vector index, not a B-tree, for vector search at scale.
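As a rough sketch of the cluster-based idea only (a toy IVF, nowhere near a production implementation; every name here is invented for illustration): run a few k-means steps to get centroids, keep an inverted list of vector IDs per centroid, and at query time scan only the vectors in the few closest clusters:

```python
import math
import random

random.seed(1)

def nearest(point, candidates):
    """Index of the candidate closest to point."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

def build_ivf(vectors, n_lists=4, iters=10):
    """Toy IVF build: k-means centroids plus one inverted list of
    vector IDs per centroid."""
    centroids = random.sample(vectors, n_lists)
    for _ in range(iters):
        buckets = [[] for _ in range(n_lists)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        centroids = [
            [sum(col) / len(b) for col in zip(*b)] if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return centroids, lists

def ivf_search(q, vectors, centroids, lists, nprobe=1):
    """Probe only the nprobe closest clusters instead of scanning all vectors."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(q, centroids[i]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return min(candidates, key=lambda i: math.dist(q, vectors[i]))

vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]
q = vectors[123]                        # query with a known true nearest neighbor
cent, lists = build_ivf(vectors)
found = ivf_search(q, vectors, cent, lists, nprobe=2)
print(found)                            # 123
```

The speedup comes from geometry, not order: each probe scans one cluster's list, so most of the dataset is never touched. Raising `nprobe` trades latency for recall, which is the characteristic ANN knob.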

Why not just scan?

For small datasets, brute-force scan (compute distance to every vector) is fine. For millions or billions of vectors, scan is too slow and doesn’t scale. ANN indexes trade a small loss in recall for large gains in latency and throughput. They’re built on the fact that “nearest” is a geometric notion in the latent space of embeddings, not a one-dimensional order. So the right tool is a vector DB with an ANN index, not a relational DB with a B-tree.
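The brute-force baseline itself is trivial to write, which is exactly why it wins at small scale (a minimal sketch in plain Python; `knn_scan` and the sample data are made up for illustration):

```python
import math

def knn_scan(q, vectors, k=3):
    """Exact k-NN by brute force: one distance computation per stored
    vector, i.e. O(n * d) work per query."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(q, vectors[i]))
    return order[:k]

vectors = [[0, 0], [1, 1], [5, 5], [0.1, 0.2], [9, 9]]
print(knn_scan([0, 0], vectors, k=2))   # [0, 3] -- indices of the two closest
```

Every query touches every vector, so latency grows linearly with the dataset; ANN indexes exist to break that linear dependence.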

Summary: B-trees and traditional SQL indexes are the wrong abstraction for “find similar vectors.” You need an index that respects distance or similarity in high-dimensional space—and that’s what vector databases and ANN indexes provide.

Frequently Asked Questions

Can I use a B-tree index on one dimension of my vector?

You could index a single dimension (e.g. the first component) for range filters on that dimension. That doesn’t solve “nearest neighbor in full space”—two vectors close in L2 or cosine can have very different first components. For full vector similarity you need a proper ANN index.

What about space-filling curves (e.g. Z-order) for vectors?

Space-filling curves map multi-dimensional points to one dimension while preserving some locality. They can help for low dimensions or specific workloads, but they generally don’t scale as well as graph-based or cluster-based ANN indexes for high-dimensional embeddings. See the comparison of graph-based vs. cluster-based indexes.
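For intuition, a Z-order (Morton) key interleaves the bits of the coordinates. The sketch below (a standard textbook construction; `morton2` is our own helper name) shows both the locality it preserves and where it breaks:

```python
def morton2(x, y, bits=8):
    """Interleave the bits of two small non-negative ints into one
    Z-order key: x's bits land in even positions, y's in odd ones."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Nearby 2-D points often get nearby keys...
print(morton2(3, 4), morton2(3, 5))   # 37 39
# ...but locality is imperfect: crossing a high bit boundary jumps far.
print(morton2(7, 7), morton2(8, 8))   # 63 192
```

The (7,7) to (8,8) jump is the failure mode: adjacent points on opposite sides of a power-of-two boundary land far apart on the curve, and in high dimensions such boundary crossings dominate.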

Do any SQL databases support vector search?

Yes. Extensions like pgvector (PostgreSQL) add vector types and indexes (e.g. IVFFlat, HNSW). They’re useful for smaller scale or when you want vectors in the same DB as relational data. For very large scale or lowest latency, dedicated vector databases are optimized for this workload.

Why is “nearest” hard but “equal” easy in SQL?

“Equal” is a discrete predicate: you can hash or sort and do a direct lookup. “Nearest” is a continuous optimization over all points; without structure that reflects distance, you can’t prune. B-trees give structure for order, not for distance in high dimensions.