Distributed Systems & Scaling · Topic 147

Sharding by ID vs. Sharding by Vector (Spatial sharding)

Sharding by ID (e.g. hash of point ID or key range) assigns points to shards regardless of their vector; every query must be sent to all shards and results merged (e.g. top-K across shards). Load is even and implementation is straightforward. Spatial sharding (by vector) assigns points to shards by region in vector space (e.g. using IVF-style clusters); a query can be routed to the nearest shards only, reducing work per query but requiring routing logic and rebalancing when data drifts. The choice depends on scaling goals and query patterns.

Summary

Sharding by ID (hash of point ID or key range): points assigned regardless of vector; every query goes to all shards, results merged (e.g. top-K across shards). Even load, straightforward. Spatial sharding (by vector): points assigned by region in vector space (e.g. IVF-style clusters); query routed to nearest shards only—less work per query but routing and rebalancing when data drifts.
Choice depends on scaling goals and query patterns. See coordinator merge logic and splitting vectors across nodes.
Pipeline: ID—send query to all shards, merge top-k. Spatial—route query to nearest shards by vector, merge top-k. Trade-off: ID is simpler and even; spatial reduces fan-out but needs routing and rebalancing. Practical tip: use ID unless query latency and fan-out cost justify spatial.

Comparing strategies

ID-based sharding: assign each point to a shard by hash(document_id) mod N or by key range. Queries are sent to every shard; each shard returns its local top-k; the coordinator merge-sorts by distance and returns the global top-k. Load is even across shards; implementation is straightforward. No routing logic is needed.

Spatial sharding: assign points to shards by region in vector space (e.g. cluster centroids as in IVF; each shard owns a subset of clusters). At query time, the coordinator determines which shards are nearest to the query vector (e.g. by querying cluster centroids) and sends the query only to those shards. Fewer shards per query mean lower latency and less merge work, but routing logic and rebalancing when data or query distribution drifts are required.

Routing and merge logic

The coordinator fans out the query to the relevant shards, collects top-k (or more) from each, and merges into a global top-k. For ID sharding, all shards are queried; for spatial, only the nearest shards. Merge logic and how many candidates to request per shard affect recall and latency. Pipeline: route to shards, run ANN per shard, merge by distance. Trade-off: ID is simpler and even; spatial reduces fan-out but needs routing and rebalancing. Practical tip: use ID unless query latency and fan-out cost justify spatial.

Frequently Asked Questions

What is sharding by ID?

Points are assigned to shards by hash of point ID or key range, regardless of their vector. Every query is sent to all shards; results are merged (e.g. top-K across shards). Load is even; implementation is straightforward. See sharding across nodes.

What is spatial (vector) sharding?

Points are assigned by region in vector space (e.g. using IVF-style clusters). A query can be routed to the nearest shards only, reducing work per query. Requires routing logic and rebalancing when data drifts. See coordinator role.

When to use ID vs. spatial sharding?

ID: when you want simple, even load and can afford to query all shards. Spatial: when you want to reduce shards per query and have routing/rebalancing capability. Depends on scaling goals and query patterns.

How does the coordinator merge results?

The coordinator fans out the query to shards, collects top-K (or more) from each, and merges into a global top-K (e.g. by distance). Merge logic and how many candidates to request per shard affect recall and latency.