Global secondary indexes for metadata
A global secondary index (GSI) on metadata is an index structure that allows efficient lookup or filtering by non-primary attributes (e.g. user_id, category) across the entire dataset, even when vectors are sharded by ID or by vector space.
Summary
- A global secondary index (GSI) on metadata allows efficient lookup or filtering by non-primary attributes (e.g. user_id, category) across the entire dataset, even when vectors are sharded by ID or vector space.
- Without a GSI, a query like “all vectors where tenant_id = X” may require scanning every shard and applying a metadata filter locally. A GSI maps index key to matching vector IDs (or shard + local IDs) so the coordinator can target only relevant shards and reduce scatter-gather cost.
- Implementation: the index is updated on every insert/update/delete and is often kept in a consensus-backed store. High-cardinality or frequently changing metadata adds write amplification, so build GSIs for attributes commonly used in pre-filters or routing. See range queries. Pipeline: each write updates the GSI, and the coordinator uses the GSI to target shards. Practical tip: add GSIs only for hot filter attributes to limit write cost.
Why use a GSI
Without a secondary index, a query like “all vectors where tenant_id = X” may require scanning every shard and applying a metadata filter locally. A GSI maintains a mapping from index key (e.g. tenant_id) to the set of vector IDs (or shard + local IDs) that match, so the coordinator can target only the relevant shards or segments and reduce scatter-gather cost.
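The mapping described above can be illustrated with a minimal in-memory sketch. It is not taken from any particular vector database: the class name MetadataGSI, its methods, and the shard and vector IDs are all hypothetical, and a real GSI would live in a distributed store rather than a Python dictionary.

```python
from collections import defaultdict


class MetadataGSI:
    """Toy GSI (hypothetical): (attribute, value) -> {shard_id: {vector_id, ...}}."""

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, attribute, value, shard_id, vector_id):
        # Record that vector_id on shard_id carries attribute == value.
        self._postings[(attribute, value)][shard_id].add(vector_id)

    def shards_for(self, attribute, value):
        # Shards holding at least one match; empty set if the key is unknown.
        entry = self._postings.get((attribute, value))
        return set(entry) if entry else set()

    def ids_for(self, attribute, value, shard_id):
        # Local vector IDs on one shard that match the key.
        entry = self._postings.get((attribute, value), {})
        return set(entry.get(shard_id, ()))


gsi = MetadataGSI()
gsi.add("tenant_id", "acme", shard_id=0, vector_id="v1")
gsi.add("tenant_id", "acme", shard_id=2, vector_id="v7")
gsi.add("tenant_id", "globex", shard_id=1, vector_id="v3")

all_shards = {0, 1, 2, 3}
targets = gsi.shards_for("tenant_id", "acme")
print(sorted(targets))               # [0, 2]
print(sorted(all_shards - targets))  # [1, 3] (shards the coordinator can skip)
```

In this sketch the coordinator queries two of four shards for tenant "acme" instead of fanning out to all of them.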
Trade-off: building and maintaining a GSI adds write amplification and storage. Use a GSI when the same attribute is frequently used in filters or routing; avoid GSIs for rarely used or very high-cardinality attributes unless the query pattern justifies the cost.
Implementation challenges
The index must be updated on every insert/update/delete and kept consistent across nodes; that often implies a distributed index (e.g. stored in the same consensus-backed metadata store or a dedicated index layer). For high-cardinality or frequently changing metadata, GSIs can add write amplification and storage overhead.
Pipeline: each write updates the GSI; at query time the coordinator consults the GSI to target only relevant shards. Practical tip: add GSIs only for hot filter attributes to limit write cost; see metadata cardinality and performance for how cardinality affects index size and update cost.
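The write side of this pipeline can be sketched as follows. This is a single-process illustration under stated assumptions: GsiWriter and its methods are hypothetical names, a missing attribute and a value of None are treated the same, and the consensus or replication layer that a distributed index would need is left out entirely. The index_writes counter is only there to make the extra index work per write visible.

```python
from collections import defaultdict


class GsiWriter:
    """Hypothetical write path that keeps a GSI in step with vector metadata."""

    def __init__(self, indexed_attributes):
        self.indexed_attributes = set(indexed_attributes)
        self.postings = defaultdict(set)  # (attr, value) -> {(shard_id, vector_id)}
        self.metadata = {}                # (shard_id, vector_id) -> metadata dict
        self.index_writes = 0             # index mutations caused by writes

    def _remove(self, attr, value, key):
        entries = self.postings[(attr, value)]
        entries.discard(key)
        self.index_writes += 1
        if not entries:
            del self.postings[(attr, value)]  # drop empty posting lists

    def upsert(self, shard_id, vector_id, metadata):
        key = (shard_id, vector_id)
        old = self.metadata.get(key, {})
        for attr in self.indexed_attributes:
            old_val, new_val = old.get(attr), metadata.get(attr)
            if old_val == new_val:
                continue  # unchanged attribute: no index write needed
            if old_val is not None:
                self._remove(attr, old_val, key)
            if new_val is not None:
                self.postings[(attr, new_val)].add(key)
                self.index_writes += 1
        self.metadata[key] = dict(metadata)

    def delete(self, shard_id, vector_id):
        key = (shard_id, vector_id)
        old = self.metadata.pop(key, {})
        for attr in self.indexed_attributes:
            if old.get(attr) is not None:
                self._remove(attr, old[attr], key)

    def target_shards(self, attr, value):
        # Query side: shards that hold at least one matching vector.
        return {shard for shard, _ in self.postings.get((attr, value), ())}


writer = GsiWriter(indexed_attributes=["tenant_id"])
writer.upsert(0, "v1", {"tenant_id": "acme", "category": "docs"})  # 1 index write
writer.upsert(1, "v2", {"tenant_id": "acme"})                      # 1 index write
writer.upsert(1, "v2", {"tenant_id": "globex"})                    # remove + add
writer.delete(0, "v1")                                             # 1 index write
print(writer.index_writes)                          # 5
print(writer.target_shards("tenant_id", "globex"))  # {1}
```

Only tenant_id is indexed here, so changes to category cost nothing in the index; indexing more attributes would add index writes to every insert, update, and delete that touches them.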
Frequently Asked Questions
What is a global secondary index (GSI) for metadata?
An index structure that allows efficient lookup or filtering by non-primary attributes (e.g. user_id, category) across the entire dataset, even when vectors are sharded. The coordinator can target only relevant shards instead of scanning all. See metadata filtering.
When do I need a GSI?
When you frequently filter by an attribute (e.g. tenant_id, category) and want to avoid scanning every shard. Without a GSI, “all vectors where tenant_id = X” requires scatter-gather with a local metadata filter on each shard. A GSI maps each key to its matching IDs so only relevant shards are queried.
What are the trade-offs?
The index must be updated on every insert/update/delete and kept consistent across nodes; high-cardinality or frequently changing metadata adds write amplification and storage overhead. Build GSIs only for attributes commonly used in pre-filters or routing. See metadata cardinality and performance.
How is a GSI implemented in a distributed VDB?
It is often stored in the same consensus-backed metadata store or in a dedicated index layer, and it must be kept consistent across nodes. See coordinator role for how queries use the GSI to route to shards.