The importance of IDs and Metadata

Every point in a vector database needs a unique ID so you can update, delete, or upsert it without creating duplicates. Metadata (often called the payload) attaches attributes such as category, source, and timestamp, so you can run filtered queries like “nearest neighbors where category = X” and return useful context (e.g. title, URL) with each hit instead of only the raw vector.

Summary

IDs: unique per point; enable update, delete, upsert; often your primary key for idempotent re-ingestion.
Metadata: filterable attributes (category, date, etc.) and display data (title, URL); enables filtered vector search and useful results.
Good metadata design determines what you can filter on and what you can return in semantic search and RAG; storing only vectors without IDs or metadata makes maintenance and app integration hard.

Why IDs matter

IDs are usually assigned by you (e.g. from your primary key or a stable document/chunk identifier) so that re-ingestion is idempotent: upserting the same ID again overwrites the previous point instead of adding a second copy. That matters after a model update or a data correction, because you can re-embed and re-insert without manually deleting first. IDs also let you delete or update a single point and join results back to your application’s data (e.g. fetch the full document from your app DB by ID). Without stable IDs, you can’t reliably update or deduplicate.
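As a minimal sketch of the idempotent re-ingestion described above, the example below derives a deterministic point ID from a document ID and chunk position, then upserts twice. The `InMemoryIndex` class, the `point_id` helper, and the namespace URL are hypothetical stand-ins for illustration, not any particular product's API; using `uuid.uuid5` is one possible way to get stable IDs, assumed here because many engines accept UUIDs.

```python
import uuid

# Hypothetical namespace for this corpus; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-corpus")


def point_id(doc_id: str, chunk_index: int) -> str:
    """Derive a deterministic point ID from a document ID and chunk position."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}#{chunk_index}"))


class InMemoryIndex:
    """Toy stand-in for a vector database collection, keyed by point ID."""

    def __init__(self):
        self.points = {}

    def upsert(self, pid, vector, payload):
        # Writing to an existing ID replaces the old point (no duplicate).
        self.points[pid] = {"vector": list(vector), "payload": dict(payload)}


index = InMemoryIndex()
pid = point_id("doc-42", 0)

# First ingestion with embedding model "v1".
index.upsert(pid, [0.1, 0.2], {"doc_id": "doc-42", "model": "v1"})

# Re-ingestion after re-embedding: the same inputs give the same ID,
# so the point is overwritten rather than duplicated.
index.upsert(point_id("doc-42", 0), [0.3, 0.4], {"doc_id": "doc-42", "model": "v2"})
```

Because the ID is a pure function of the document and chunk identifiers, a nightly pipeline can re-run without tracking which points it wrote previously.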

Auto-generated IDs are fine when you never need to update by identity or when you maintain a separate mapping from your business IDs to the generated ones. For pipelines that re-run (e.g. nightly re-embedding), using your own stable ID is usually simpler and avoids orphaned or duplicate points.

Why metadata matters

Metadata enables filtered vector search: results must satisfy both “near this vector” and “matches these conditions” (e.g. category = X, date > Y). Depending on the engine, the filter is applied before the similarity search (pre-filtering), after it (post-filtering), or during index traversal. Metadata also stores what to return to the user: title, URL, snippet, doc_id, chunk_id. In RAG, you typically store document and chunk identifiers so you can fetch the right passage and its source; for recommendations you might store product_id and category. Good metadata design means storing enough to filter on and enough to display or route each result.
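The idea of combining similarity with a metadata condition can be sketched with a brute-force search over a handful of points. This is an illustrative pre-filtering example in plain Python, assuming cosine similarity; the `filtered_search` function and the sample data are hypothetical, and real engines use approximate indexes rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(points, query, predicate, k=3):
    """Pre-filter by payload, then rank the survivors by similarity."""
    candidates = [p for p in points if predicate(p["payload"])]
    candidates.sort(key=lambda p: cosine(p["vector"], query), reverse=True)
    # Return the ID plus payload so the caller has context, not just a vector.
    return [(p["id"], p["payload"]) for p in candidates[:k]]


points = [
    {"id": "a", "vector": [1.0, 0.0], "payload": {"category": "news", "title": "A"}},
    {"id": "b", "vector": [0.9, 0.1], "payload": {"category": "blog", "title": "B"}},
    {"id": "c", "vector": [0.0, 1.0], "payload": {"category": "news", "title": "C"}},
]

# "Nearest neighbors where category = news": point "b" is close to the
# query but is excluded by the filter.
hits = filtered_search(points, [1.0, 0.0], lambda pl: pl["category"] == "news", k=2)
```

Each hit carries its payload, so the application can show the title or look up the source document without a second query.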

Filterable fields should match how you query: tenant_id, category, date range, language. Display-only fields (title, snippet) don’t need to be indexed but add payload size. Some engines support indexing a subset of metadata fields for fast pre-filtering while allowing arbitrary JSON for the rest.
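To illustrate indexing only a subset of metadata fields, here is a rough sketch of an inverted index over chosen payload fields, used to compute the allowed point IDs before any vector comparison. The function names and sample payloads are hypothetical, and actual engines implement payload indexes differently; the point is only that display fields such as title stay out of the index.

```python
from collections import defaultdict


def build_payload_index(points, indexed_fields):
    """Map each indexed field value to the set of point IDs that carry it."""
    index = {field: defaultdict(set) for field in indexed_fields}
    for pid, payload in points.items():
        for field in indexed_fields:
            if field in payload:
                index[field][payload[field]].add(pid)
    return index


def candidates(index, **conditions):
    """Intersect the ID sets for every equality condition (pre-filter step)."""
    id_sets = [index[field].get(value, set()) for field, value in conditions.items()]
    return set.intersection(*id_sets) if id_sets else set()


points = {
    "p1": {"tenant_id": "acme", "language": "en", "title": "Intro"},
    "p2": {"tenant_id": "acme", "language": "de", "title": "Einführung"},
    "p3": {"tenant_id": "globex", "language": "en", "title": "Overview"},
}

# Only the fields we filter on are indexed; "title" is display-only.
payload_index = build_payload_index(points, ["tenant_id", "language"])
allowed = candidates(payload_index, tenant_id="acme", language="en")
```

The resulting ID set would then restrict which vectors the similarity search is allowed to consider.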

Schema and cardinality

Metadata can be fixed-schema (defined fields and types) or schemaless (flexible JSON). High-cardinality fields (e.g. unique IDs) suit exact-match filters, where each value selects very few points; low-cardinality fields (e.g. category, status) suit broad filters that split the collection into a small number of groups. When designing payloads, also consider range queries on metadata and how null or missing metadata is handled.
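A range condition has to decide what happens when the field is null or missing. The sketch below assumes that such points simply do not match, which is one possible policy rather than a universal rule; the `matches_date_range` helper and the sample payloads are hypothetical.

```python
from datetime import date


def matches_date_range(payload, field, start, end):
    """Inclusive range check; null or missing values never match (assumed policy)."""
    value = payload.get(field)
    if value is None:
        return False
    return start <= value <= end


payloads = [
    {"doc_id": "a", "published": date(2024, 1, 15)},
    {"doc_id": "b", "published": date(2023, 6, 1)},
    {"doc_id": "c", "published": None},  # explicit null
    {"doc_id": "d"},                     # field absent
]

matched = [
    p["doc_id"]
    for p in payloads
    if matches_date_range(p, "published", date(2024, 1, 1), date(2024, 12, 31))
]
```

Whatever policy you choose, applying it consistently at ingestion time (for example, always writing the field) avoids surprises when filtered results silently omit points.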

What if you store only vectors?

Storing only vectors without IDs or metadata makes it hard to maintain and use the index in a real application: you can’t update or delete by identity, you can’t filter, and you can’t return meaningful context. So IDs and metadata are not optional for production use—they’re part of the point model and the query workflow.

Frequently Asked Questions

Can I use a string as an ID?

Many vector DBs support string or UUID IDs in addition to integers. Use whatever is stable and unique in your system (e.g. doc_id or composite key) so upserts and joins are straightforward.

How much metadata can I attach to a point?

It depends on the product; there is usually a limit on payload size or field count. Large payloads can increase storage and latency, so keep payloads lean and put very large content in object storage, referenced by ID or URL in metadata. See storing payloads.

Does metadata affect vector search performance?

Yes. Filtering (pre or post) and metadata cardinality affect latency and recall. Indexes on metadata (e.g. for pre-filtering) and payload size also matter. Design metadata for your query patterns.

Can I change metadata without re-embedding?

Usually yes. Many systems allow updating metadata (payload) for an existing ID without changing the vector. That’s useful for correcting labels or adding fields. Updating the vector typically requires an upsert with the new vector and same ID.
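The difference between a payload-only update and a vector update can be sketched with a toy collection. `ToyCollection`, `set_payload`, and the sample ID are hypothetical names for illustration; check your engine's documentation for its actual payload-update call.

```python
class ToyCollection:
    """Toy stand-in for a collection that supports payload-only updates."""

    def __init__(self):
        self.points = {}

    def upsert(self, pid, vector, payload):
        # Replaces both vector and payload for this ID.
        self.points[pid] = {"vector": list(vector), "payload": dict(payload)}

    def set_payload(self, pid, fields):
        # Merges new fields into the payload; the stored vector is untouched.
        self.points[pid]["payload"].update(fields)


col = ToyCollection()
col.upsert("doc-7#0", [0.2, 0.8], {"category": "draft"})

# Correct a label and add a field without re-embedding.
col.set_payload("doc-7#0", {"category": "published", "reviewed": True})
vector_after_payload_update = list(col.points["doc-7#0"]["vector"])

# Changing the vector itself goes through an upsert with the same ID.
col.upsert("doc-7#0", [0.5, 0.5], col.points["doc-7#0"]["payload"])
```

The payload-only path avoids the cost of recomputing embeddings when only labels or display fields change.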