Sharding: How to split a billion vectors across nodes
Sharding splits a large collection across multiple nodes so that no single machine holds the full dataset. At billion-vector scale this is necessary both to fit the data in RAM or on disk and to scale query throughput. The main design choice is sharding by ID vs. by vector (spatial): ID-based sharding is simple and evenly distributed, but can force every shard to be queried; spatial sharding routes by region of the vector space, so each query hits fewer shards, but rebalancing is harder. A coordinator typically fans out the query and merges the results.
Summary
- Sharding splits a collection across multiple nodes so no single machine holds the full dataset; necessary for billion-scale and throughput. Main choice: sharding by ID vs. by vector (spatial).
- ID-based: simple, even load; every shard may need to be queried. Spatial: route by vector space, fewer shards per query; rebalancing harder. A coordinator fans out the query and merges results. See replication, hot shards, and billion-scale architectures.
- Pipeline: route query to shard(s) by key or vector, run ANN per shard, merge top-k (e.g. take global top-k from shard results). Trade-off: ID sharding spreads load evenly but every query hits all shards; spatial reduces shards per query but rebalancing is harder.
- Practical tip: start with ID-based sharding for simplicity; consider spatial when query latency and fan-out cost matter. Each shard can have replicas for availability.
Shard design and scale
Shard key selection determines how data is split: by document or vector ID (hash or range) for even distribution, or by vector space (e.g. clustering) so each shard owns a region of the space. For billion-scale vectors, no single node can hold the full index in RAM; sharding spreads data and allows parallel query execution. Each shard can have replicas for availability and read scaling; writes go to the primary, reads can hit replicas.
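The two shard-key strategies above can be sketched in a few lines. This is a minimal illustration, not a real router: `NUM_SHARDS`, `shard_by_id`, and `shard_by_vector` are hypothetical names, and the spatial variant assumes precomputed cluster centroids (e.g. from k-means over a data sample).

```python
import hashlib

NUM_SHARDS = 8  # assumed cluster size, for illustration only

def shard_by_id(doc_id: str) -> int:
    """ID-based key: stable hash of the document ID, modulo shard count.
    Gives even distribution but no locality in vector space."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

def shard_by_vector(vector, centroids) -> int:
    """Spatial key: assign to the shard whose centroid is nearest
    (squared L2), so each shard owns a region of the space."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda i: sq_dist(vector, centroids[i]))
```

Note the trade-off baked into the keys: the hash key never needs rebalancing as long as the shard count is fixed, while the centroid key must be recomputed (and data moved) when the distribution drifts.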
Pipeline: client sends query to coordinator; coordinator determines which shards to query (all for ID-based, or nearest shards for spatial); each shard runs ANN and returns top-k; coordinator merges results (e.g. merge-sort by score, take global top-k). Trade-off: ID-based sharding is simple and evens load but every query typically hits all shards; spatial reduces shards per query but rebalancing when data drifts is harder.
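The coordinator's fan-out-and-merge step can be sketched as follows. This is a toy scatter-gather under assumed names (`query_shard`, `search`): each shard is stood in for by a plain list of `(doc_id, score)` pairs, and `query_shard` is a placeholder for a real per-shard ANN search.

```python
import heapq

def query_shard(shard, query, k):
    """Placeholder for a per-shard ANN search. Here `shard` is just a
    list of (doc_id, score) pairs; a real shard would run HNSW/IVF on
    `query` and return its local top-k."""
    return heapq.nlargest(k, shard, key=lambda item: item[1])

def search(shards, query, k):
    """Scatter-gather: fan the query out to every shard (ID-based
    sharding, so all shards participate), then merge the partial
    top-k lists into a global top-k by score."""
    partials = [query_shard(s, query, k) for s in shards]
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged, key=lambda item: item[1])
```

For spatial sharding, the only change is that `search` would select the shards whose centroids are nearest the query instead of iterating over all of them; the merge step is identical.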
Replication, hot shards, and billion-scale
Replication per shard provides high availability and read scaling. Hot shards (disproportionate load from a popular tenant or region) may require rebalancing, splitting, or routing adjustments. For details, see replication for high availability, managing hot shards, and overcoming RAM limitations for billion-scale vectors. Practical tip: start with ID-based sharding for simplicity; consider spatial when query latency and fan-out cost matter.
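Detecting a hot shard usually starts with per-shard load counters. A minimal sketch, assuming a query log that records which shard served each request (`find_hot_shards` and the `threshold` parameter are hypothetical names, not a real API):

```python
from collections import Counter

def find_hot_shards(shard_of_query, threshold=2.0):
    """Flag shards whose observed query share exceeds `threshold` times
    the even share. `shard_of_query` is one shard ID per logged query;
    only shards that received any traffic are considered."""
    counts = Counter(shard_of_query)
    even_share = len(shard_of_query) / len(counts)
    return [s for s, c in counts.items() if c > threshold * even_share]
```

Flagged shards are then candidates for the remedies above: adding replicas to absorb reads, splitting the shard, or adjusting routing so the popular tenant's traffic spreads out.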
Frequently Asked Questions
Why shard a vector collection?
To fit data in RAM or on disk and to scale throughput; no single machine holds the full dataset. For billion-scale vectors, sharding is necessary. A coordinator fans out queries and merges results.
Sharding by ID vs. by vector (spatial)?
ID-based: simple, even load; every query typically hits all shards. Spatial: route by vector space, query nearest shards only; fewer shards per query but rebalancing when data drifts is harder.
How does replication fit with sharding?
Each shard can have replicas for availability and read scaling. Writes go to the primary; reads can hit replicas. See replication for high availability and managing hot shards.
What about hot shards?
Some shards may get disproportionate load (e.g. popular tenant or region). Managing hot shards covers rebalancing, splitting, and routing. See also load balancing and scaling.