Global secondary indexes for metadata

A global secondary index (GSI) on metadata is an index structure that allows efficient lookup or filtering by non-primary attributes (e.g. user_id, category) across the entire dataset, even when vectors are sharded by ID or by vector space.

Summary

A GSI lets a sharded vector database filter or look up by non-primary metadata attributes (e.g. user_id, category) without touching every shard.
Without a GSI, a query like “all vectors where tenant_id = X” may require scanning every shard and applying a metadata filter locally. A GSI maps each index key to the matching vector IDs (or shard + local IDs), so the coordinator can target only the relevant shards and reduce scatter-gather cost.
The index must be updated on every insert, update, and delete, and it is often kept in a consensus-backed store. In the pipeline, each write updates the GSI, and at query time the coordinator uses the GSI to target shards. High-cardinality or frequently changing metadata adds write amplification, so build GSIs only for hot attributes that are commonly used in pre-filters or routing. See range queries.

Why use a GSI

Without a secondary index, a query like “all vectors where tenant_id = X” may require scanning every shard and applying a metadata filter locally. A GSI maintains a mapping from index key (e.g. tenant_id) to the set of vector IDs (or shard + local IDs) that match, so the coordinator can target only the relevant shards or segments and reduce scatter-gather cost.
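The mapping described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the shard count, the `shard_for` placement rule (ID modulo shard count), the sample tenants, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(vector_id: int) -> int:
    """Hypothetical placement rule: vectors are sharded by ID."""
    return vector_id % NUM_SHARDS


# Hypothetical dataset: vector ID -> metadata
metadata = {
    0: {"tenant_id": "acme"},
    4: {"tenant_id": "acme"},
    8: {"tenant_id": "acme"},
    1: {"tenant_id": "globex"},
    3: {"tenant_id": "globex"},
    2: {"tenant_id": "initech"},
}

# GSI on tenant_id: index key -> shard -> set of vector IDs on that shard
gsi = defaultdict(lambda: defaultdict(set))
for vector_id, meta in metadata.items():
    gsi[meta["tenant_id"]][shard_for(vector_id)].add(vector_id)


def shards_without_gsi() -> list[int]:
    """No index: the coordinator must fan out to every shard."""
    return list(range(NUM_SHARDS))


def shards_with_gsi(tenant_id: str) -> list[int]:
    """With the index: only shards that hold a matching vector are queried."""
    return sorted(gsi.get(tenant_id, {}))


print(shards_without_gsi())       # every shard
print(shards_with_gsi("acme"))    # only the shard(s) holding acme's vectors
```

In this toy data set all of acme's vectors land on one shard, so the coordinator contacts one shard instead of four. A tenant with no entries in the index yields an empty shard list, so no shard is contacted at all.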

Trade-off: building and maintaining a GSI adds write amplification and storage. Use a GSI when the same attribute is frequently used in filters or routing; avoid GSIs for rarely used or very high-cardinality attributes unless the query pattern justifies the cost.
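One rough way to reason about the write cost is to count storage writes per logical operation. The model below is a simplifying assumption, not a measurement: it charges one write for the base record plus one write per GSI entry added or removed, and it ignores replication, batching, and any engine-specific behavior. The function and parameter names are hypothetical.

```python
def physical_writes(op: str, indexed_attrs: int, changed_indexed_attrs: int = 0) -> int:
    """Estimate storage writes for one logical operation.

    Simplified model (an assumption): one write to the base record,
    plus one write for each GSI entry that is added or removed.
    """
    if op in ("insert", "delete"):
        # One index entry is added (or removed) per indexed attribute.
        return 1 + indexed_attrs
    if op == "update":
        # Each changed indexed attribute removes its old entry and adds a new one.
        return 1 + 2 * changed_indexed_attrs
    raise ValueError(f"unknown operation: {op}")


print(physical_writes("insert", indexed_attrs=0))  # no GSIs
print(physical_writes("insert", indexed_attrs=3))  # three GSIs
print(physical_writes("update", indexed_attrs=3, changed_indexed_attrs=1))
```

Under this model an insert with three GSIs costs four writes instead of one, while an update that changes no indexed attribute costs the same as it would without any index. That is why frequently changing indexed attributes are the expensive case.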

Implementation challenges

The index must be updated on every insert/update/delete and kept consistent across nodes; that often implies a distributed index (e.g. stored in the same consensus-backed metadata store or a dedicated index layer). For high-cardinality or frequently changing metadata, GSIs can add write amplification and storage overhead.

Pipeline: each write updates the GSI; at query time the coordinator consults the GSI to target only relevant shards. Practical tip: add GSIs only for hot filter attributes to limit write cost; see metadata cardinality and performance for how cardinality affects index size and update cost.
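The write and query paths above can be sketched as a small class. It is a single-process illustration under stated assumptions: the class name `TenantIndex`, the modulo placement rule, and the method names are hypothetical, and the replication or consensus needed to keep the index consistent across nodes is not modeled.

```python
from collections import defaultdict


class TenantIndex:
    """Toy in-memory GSI on a single attribute (assumed here to be tenant_id)."""

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        self.key_to_ids = defaultdict(set)  # index key -> vector IDs
        self.id_to_key = {}                 # vector ID -> current index key

    def shard_for(self, vector_id: int) -> int:
        # Hypothetical placement rule: vectors are sharded by ID.
        return vector_id % self.num_shards

    def upsert(self, vector_id: int, key: str) -> None:
        """Insert or update: drop any stale entry, then add the new one."""
        self.delete(vector_id)
        self.key_to_ids[key].add(vector_id)
        self.id_to_key[vector_id] = key

    def delete(self, vector_id: int) -> None:
        """Remove the vector's index entry, if it has one."""
        old_key = self.id_to_key.pop(vector_id, None)
        if old_key is None:
            return
        ids = self.key_to_ids[old_key]
        ids.discard(vector_id)
        if not ids:
            del self.key_to_ids[old_key]  # avoid keeping empty posting sets

    def target_shards(self, key: str) -> list[int]:
        """Query path: shards the coordinator needs to contact for this key."""
        ids = self.key_to_ids.get(key, ())
        return sorted({self.shard_for(v) for v in ids})


index = TenantIndex(num_shards=4)
index.upsert(1, "acme")
index.upsert(5, "acme")
index.upsert(2, "globex")
index.upsert(5, "globex")  # metadata change: vector 5 moves from acme to globex
index.delete(1)            # acme now has no vectors left
print(index.target_shards("globex"))
print(index.target_shards("acme"))
```

Note that an update touches the index twice (remove the old entry, add the new one), which is where the extra write cost for frequently changing metadata comes from.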

Frequently Asked Questions

What is a global secondary index (GSI) for metadata?

An index structure that allows efficient lookup or filtering by non-primary attributes (e.g. user_id, category) across the entire dataset, even when vectors are sharded. The coordinator can target only the relevant shards instead of scanning all of them. See metadata filtering.

When do I need a GSI?

When you frequently filter by an attribute (e.g. tenant_id, category) and want to avoid scanning every shard. Without a GSI, “all vectors where tenant_id = X” requires a scatter-gather with a local metadata filter on each shard. A GSI maps each key to the matching IDs so that only the relevant shards are queried.

What are the trade-offs?

The index must be updated on every insert/update/delete and kept consistent across nodes; high-cardinality or frequently changing metadata adds write amplification and storage overhead. Build GSIs only for attributes commonly used in pre-filters or routing. See metadata cardinality and performance.

How is a GSI implemented in a distributed VDB?

It is often stored in the same consensus-backed metadata store or in a dedicated index layer, and it must be kept consistent across nodes. See coordinator role for how queries use the GSI to route to shards.