
Storing “Payloads” (JSON) alongside vectors

A payload is the extra data you store with each point in a vector database (VDB): IDs, labels, timestamps, or the original text. Many VDBs represent this as JSON (or a key-value map), so you can filter at query time and return only the fields you need without a separate lookup. This topic covers why to store payloads, how schema and indexing work, and the trade-offs involved.

Summary

A payload is the extra data stored per point (IDs, labels, timestamps, original text), usually as JSON or a key-value map. It enables filtering and rich results without a separate store.
Storing payloads avoids round-trips; the cost is storage, since large payloads bloat the index. Store only filter fields plus a reference (e.g. doc_id) and keep full content elsewhere. Payloads are often stored in columnar or compressed form for efficient pre-filtering.
Schema: some VDBs allow schemaless JSON; others require a defined schema for indexed fields. Indexed fields support Boolean and range filters; unindexed fields can be returned but typically not filtered on. For RAG, payloads often hold chunk text and a source ID.
Trade-off: convenience and low latency versus storage and memory. Keep payloads lean and index only the fields used in filters.
Practical tip: for RAG, store chunk text and a source ID in the payload; for generic search, store filter attributes and a doc_id; avoid the full document body unless it is needed.

Why store payloads with vectors

Storing payloads avoids a round-trip to another store: after nearest-neighbor search you get back the vector ID plus the payload (e.g. title, snippet, category). That simplifies application code and reduces latency. The trade-off is storage and memory: large payloads (e.g. full document text) bloat the index.
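To make the single round-trip concrete, here is a minimal sketch of a brute-force search over an in-memory dictionary. The `points` layout, the `search` function, and its `fields` parameter are hypothetical stand-ins for illustration, not the API of any particular vector database.

```python
import math

# Hypothetical in-memory points: each holds a vector and a JSON-like payload.
points = {
    1: {"vector": [1.0, 0.0], "payload": {"title": "Intro to LSM trees", "category": "storage"}},
    2: {"vector": [0.0, 1.0], "payload": {"title": "HNSW basics", "category": "indexing"}},
    3: {"vector": [0.9, 0.1], "payload": {"title": "Columnar formats", "category": "storage"}},
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, k, fields=None):
    """Brute-force top-k search that returns IDs, scores, and payload fields together."""
    scored = [(cosine(query, point["vector"]), pid) for pid, point in points.items()]
    scored.sort(reverse=True)  # highest similarity first
    results = []
    for score, pid in scored[:k]:
        payload = points[pid]["payload"]
        if fields is not None:
            # Return only the requested payload fields.
            payload = {name: payload[name] for name in fields if name in payload}
        results.append({"id": pid, "score": score, "payload": payload})
    return results


# One call yields both the nearest neighbors and their titles; no second lookup.
hits = search([1.0, 0.0], k=2, fields=["title"])
```

The application reads `hits[0]["payload"]["title"]` directly instead of issuing a follow-up query keyed by ID.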

A common pattern is to store only what you need for filtering (e.g. category, date) plus a reference (e.g. doc_id), and to keep full content elsewhere. Payloads are often stored in columnar or compressed form for efficient pre-filtering and retrieval. The pipeline looks like this: you insert a point with its vector and payload; the database stores them together (e.g. in an LSM-tree or columnar layout); a query returns the top-k IDs along with the requested payload fields (or the full payload).
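The lean-payload pattern can be sketched as follows. The names `FILTER_FIELDS`, `document_store`, and `make_point` are assumptions chosen for this example, and the dictionary stands in for whatever object store or database holds the full content.

```python
import json

# Assumed set of fields the application filters on.
FILTER_FIELDS = ("category", "date")

# Stand-in for an object store or database, keyed by doc_id.
document_store = {}


def make_point(doc_id, vector, document):
    """Keep filter fields and a reference in the payload; park the body elsewhere."""
    document_store[doc_id] = document["body"]
    payload = {field: document[field] for field in FILTER_FIELDS}
    payload["doc_id"] = doc_id
    return {"vector": vector, "payload": payload}


doc = {"category": "storage", "date": "2024-01-15", "body": "lorem ipsum " * 500}
point = make_point("doc-42", [0.1, 0.2, 0.3], doc)

# Compare the serialized size of the lean payload with the full document.
lean_size = len(json.dumps(point["payload"]))
full_size = len(json.dumps(doc))

# The full text is fetched through the reference only when it is needed.
full_text = document_store[point["payload"]["doc_id"]]
```

Here the payload stays a few dozen bytes regardless of document length, at the cost of one extra fetch when the body is actually required.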

Schema and indexing

Schema flexibility varies: some VDBs allow arbitrary JSON (schemaless); others expect a defined schema for indexed fields. Indexed payload fields can be used in Boolean and range filters. Unindexed payload fields can still be returned with results but may not be filterable. For RAG, payloads often hold the chunk text, a source ID, and sometimes a score, so the application can pass the right context to the LLM without a second query.
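The split between indexed and unindexed fields can be illustrated with a toy payload index. The `PayloadIndex` class below is a hypothetical sketch: it builds a posting set per value for each indexed field and rejects filters on anything else. Real engines differ in detail (for example, they typically use sorted structures for range filters rather than the linear scan used here).

```python
from collections import defaultdict


class PayloadIndex:
    """Toy payload store that builds filter indexes only for declared fields."""

    def __init__(self, indexed_fields):
        self.indexed_fields = set(indexed_fields)
        self.payloads = {}  # point ID -> full payload (always returnable)
        self.postings = {name: defaultdict(set) for name in self.indexed_fields}

    def insert(self, pid, payload):
        self.payloads[pid] = payload
        for name in self.indexed_fields:
            if name in payload:
                self.postings[name][payload[name]].add(pid)

    def _require_indexed(self, name):
        if name not in self.indexed_fields:
            raise ValueError(f"field {name!r} is not indexed and cannot be filtered on")

    def match(self, name, value):
        """Equality filter: IDs whose indexed field equals value."""
        self._require_indexed(name)
        return set(self.postings[name].get(value, set()))

    def range(self, name, low, high):
        """Inclusive range filter; scans distinct values for simplicity."""
        self._require_indexed(name)
        ids = set()
        for value, pids in self.postings[name].items():
            if low <= value <= high:
                ids |= pids
        return ids


idx = PayloadIndex(["category", "year"])
idx.insert(1, {"category": "storage", "year": 2021, "note": "a"})
idx.insert(2, {"category": "storage", "year": 2024, "note": "b"})
idx.insert(3, {"category": "indexing", "year": 2024, "note": "c"})

# Boolean AND of an equality filter and a range filter via set intersection.
candidates = idx.match("category", "storage") & idx.range("year", 2023, 2025)
```

The unindexed `note` field is still available through `idx.payloads[2]["note"]`, but calling `idx.match("note", "b")` raises `ValueError` in this sketch.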

Trade-off: rich payloads improve developer experience and reduce round-trips but increase storage and can affect filter performance. Practical tip: index only the fields you filter on, and pay attention to cardinality when a field has many unique values.
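One rough way to spot high-cardinality fields before deciding what to index is to count distinct values per field over a sample of payloads. The `field_cardinality` helper is a hypothetical sketch and assumes payload values are hashable (strings, numbers, booleans).

```python
from collections import defaultdict


def field_cardinality(payloads):
    """Count distinct values per payload field across a sample of payloads."""
    distinct = defaultdict(set)
    for payload in payloads:
        for name, value in payload.items():
            distinct[name].add(value)
    return {name: len(values) for name, values in distinct.items()}


sample = [
    {"category": "storage", "doc_id": "doc-1"},
    {"category": "storage", "doc_id": "doc-2"},
    {"category": "indexing", "doc_id": "doc-3"},
]
cardinality = field_cardinality(sample)
```

In this sample `category` has two distinct values while `doc_id` is unique per point, which marks `doc_id` as the high-cardinality field to think about before indexing it.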

Frequently Asked Questions

Should I store full document text in the payload?

Usually no, because it bloats the index. Store filter fields and a reference (e.g. doc_id), and fetch the full content from an object store or database when needed. RAG is the common exception: there the payload usually holds the chunk text (not the whole document), so you can return context without a second lookup.
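For the RAG case, a minimal sketch of assembling LLM context straight from search hits might look like this. The hit layout and the payload keys `chunk_text` and `source_id` are assumptions for illustration, as is the `max_chars` budget.

```python
def build_context(hits, max_chars=2000):
    """Concatenate chunk text from payloads, tagged by source, within a size budget."""
    parts = []
    used = 0
    for hit in hits:
        payload = hit["payload"]
        entry = f"[{payload['source_id']}] {payload['chunk_text']}"
        if used + len(entry) > max_chars:
            break  # stop once the budget would be exceeded
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)


# Hits as they might come back from a search, already carrying their payloads.
rag_hits = [
    {"id": 1, "payload": {"source_id": "doc-1", "chunk_text": "LSM trees batch writes."}},
    {"id": 2, "payload": {"source_id": "doc-2", "chunk_text": "HNSW builds a layered graph."}},
]
context = build_context(rag_hits)
```

Because the chunk text travels with each hit, the context string is built without touching a second store.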

Can I filter on any payload field?

Only on fields that are indexed. Unindexed fields can be returned with results but typically can’t be used in filter predicates. Check your VDB’s docs for which types support indexing and pre-filtering.

What is the difference between payload and metadata?

In practice they’re often the same: key-value or JSON stored with each point. “Payload” often means the data returned with search results; “metadata” emphasizes use in filtering. Both are stored per point and can be columnar for efficiency.

Do payloads affect vector search performance?

Large payloads increase storage and memory use, and pre-filtering on payload fields adds work to each query. Keep payloads lean and index only the fields you filter on. Cardinality matters as well: a field with many unique values tends to produce a larger payload index.