Embeddings & Data Prep · Topic 34

Storing raw data vs. storing only vectors

A vector database (VDB) can store only the embedding vectors and their IDs, or it can store vectors plus metadata and even the original raw data (e.g. text, image URLs). The choice affects storage cost, latency, and how you build your application. Most production systems use vectors plus metadata, with the full content held either in the VDB payload or in a separate store keyed by ID.

Summary

Vectors only: VDB returns IDs; you look up content elsewhere; small/fast VDB, second round-trip, need ID sync.
Vectors + metadata: metadata (category, date) enables filtering; many VDBs support payloads/columns.
Vectors + raw payload: store original text/snippets in VDB for direct return; higher storage; for RAG, often store minimal metadata in VDB + full docs elsewhere keyed by ID.
Vectors only for compliance: reduces raw content exposure in the VDB (see privacy and reconstruction); the external store still needs protection and access control.
Large payloads can affect storage and response size; lazy-load or store references for low latency.

Vectors only: lean VDB, external content

Storing only vectors keeps the VDB small and fast: you do a nearest-neighbor search, get back IDs, then look up the actual content in another store (e.g. PostgreSQL, object store). That separates search from storage and lets you change or redact raw data without touching the index, but adds a second round-trip and requires keeping IDs in sync.
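The two round-trips described above can be sketched as follows. This is a minimal illustration, not a real VDB client: `vector_index` and `content_store` are hypothetical in-memory stand-ins for the VDB and the external store, and the brute-force cosine ranking stands in for an actual nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical stand-in for the VDB: IDs and vectors only, no content.
vector_index = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.0, 1.0, 0.0],
    "doc-3": [0.9, 0.1, 0.0],
}

# Hypothetical stand-in for the external store (e.g. a table keyed by ID).
content_store = {
    "doc-1": "Intro to vector search",
    "doc-2": "Baking sourdough bread",
    "doc-3": "Approximate nearest neighbors",
}

def search_ids(query, k=2):
    """Round-trip 1: rank all vectors by similarity and return the top-k IDs."""
    ranked = sorted(
        vector_index,
        key=lambda doc_id: cosine(query, vector_index[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def fetch_content(ids):
    """Round-trip 2: resolve IDs to content, skipping IDs the store no longer has."""
    return [(doc_id, content_store[doc_id]) for doc_id in ids if doc_id in content_store]

ids = search_ids([1.0, 0.0, 0.0])
results = fetch_content(ids)
```

Skipping missing IDs in `fetch_content` is one way to tolerate a store and index that have drifted out of sync; a production system would more likely log the mismatch and repair it.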

Storing metadata (e.g. category, date) alongside each vector allows metadata filtering at query time, so the search covers only a matching subset; many VDBs support this through payloads or columns.

Trade-off: a vectors-only VDB is as small as it can be and can improve privacy because it holds no raw text, but it depends on a reliable external store and a consistent ID mapping. When to use: when you want a clear separation between search and content, or when the content is large or frequently updated.
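Metadata filtering at query time might look like the sketch below. The `points` list, the field names, and `filtered_search` are assumptions made for illustration; a real VDB usually applies the filter inside its index rather than scanning every point as this toy version does.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical points: each vector carries a small metadata record.
points = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"category": "news", "date": date(2024, 1, 5)}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"category": "blog", "date": date(2024, 3, 1)}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"category": "news", "date": date(2024, 6, 9)}},
]

def filtered_search(query, category, since, k=5):
    """Keep only points matching the metadata filter, then rank by similarity."""
    candidates = [
        p for p in points
        if p["meta"]["category"] == category and p["meta"]["date"] >= since
    ]
    ranked = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:k]]

hits = filtered_search([1.0, 0.0], category="news", since=date(2024, 1, 1))
```

Point "b" is the second-closest vector to the query but never appears in `hits`, because the category filter excludes it before ranking.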

Storing raw data in the VDB

Storing full raw data (e.g. the original text) in the VDB as a payload is convenient for returning snippets or titles directly from the search response and simplifies the architecture, at the cost of higher storage and potentially larger responses. See storing payloads alongside vectors for implementation details.

Practical tip: keep payloads lean if you need low latency; put large content in object storage and store only a reference to it in the metadata. Many VDBs let you attach a payload (e.g. JSON) to each point, so you can store title, snippet, doc_id, and optionally the full text, subject to size limits. If you need to redact or update stored text, update the payload (many VDBs support update by ID). If you store only vectors and keep the text elsewhere, update the external store instead. In either case the existing vector remains usable for minor edits; if the content changes substantively, re-embed it so the vector matches the new text.
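A lean payload with a reference to external content could be sketched like this. The names `points`, `object_store`, `content_ref`, and `SNIPPET_CHARS` are all hypothetical, and the dictionaries stand in for a VDB and an object store; the update-by-ID step mirrors what many VDBs offer but does not use any particular product's API.

```python
SNIPPET_CHARS = 80  # assumed snippet length; tune to your latency budget

points = {}        # hypothetical VDB: point ID -> {"vector": ..., "payload": ...}
object_store = {}  # hypothetical object storage: key -> full text

def index_document(doc_id, vector, title, full_text):
    """Store the full text externally and keep only a lean payload with a reference."""
    key = f"docs/{doc_id}.txt"
    object_store[key] = full_text
    points[doc_id] = {
        "vector": vector,
        "payload": {
            "doc_id": doc_id,
            "title": title,
            "snippet": full_text[:SNIPPET_CHARS],
            "content_ref": key,
        },
    }

def redact(doc_id, replacement_text):
    """Update stored text by ID in both places; the vector is left unchanged."""
    payload = points[doc_id]["payload"]
    object_store[payload["content_ref"]] = replacement_text
    payload["snippet"] = replacement_text[:SNIPPET_CHARS]

index_document("doc-1", [0.1, 0.2], "Quarterly report", "Revenue grew. " * 50)
redact("doc-1", "[redacted]")
```

Note that `redact` has to touch the snippet as well as the object store: any copy of the text kept in the payload must be updated along with the source.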

Common pattern for RAG

For RAG and semantic search, a common pattern is: store vectors + minimal metadata (e.g. doc_id, chunk_id) in the VDB, and keep full documents in a separate store keyed by ID so you can fetch and pass only the needed chunks to the LLM. That keeps the VDB focused on search while the document store holds the source content.

For compliance, keeping only vectors in the VDB reduces how much raw content it exposes; see privacy and reconstruction. You still need to protect the external store and control who can resolve IDs to content.

Pipeline: the index pipeline writes vectors (and optionally payloads) to the VDB and may write full documents to another store. The query pipeline runs an approximate nearest-neighbor (ANN) search, gets back IDs (and optional payloads), then fetches the full content if needed.
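The index and query pipelines for this RAG pattern can be sketched end to end as below. Everything here is a stand-in chosen for illustration: `embed` is a toy word-count embedder over a fixed `VOCAB` (a real system would call an embedding model), `vdb` and `doc_store` are in-memory substitutes for the two stores, and the exhaustive ranking replaces a true ANN index.

```python
import math

VOCAB = ["vector", "database", "bread", "oven"]  # toy vocabulary for the stand-in embedder

def embed(text):
    """Hypothetical embedder: counts of VOCAB words in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vdb = []        # vectors + minimal metadata: {"vector", "doc_id", "chunk_id"}
doc_store = {}  # separate store: (doc_id, chunk_id) -> chunk text

def index_pipeline(doc_id, chunks):
    """Write each chunk's vector and IDs to the VDB and its text to the document store."""
    for chunk_id, text in enumerate(chunks):
        vdb.append({"vector": embed(text), "doc_id": doc_id, "chunk_id": chunk_id})
        doc_store[(doc_id, chunk_id)] = text

def query_pipeline(question, k=1):
    """Rank points by similarity, then fetch only the top-k chunks by ID."""
    q = embed(question)
    ranked = sorted(vdb, key=lambda p: cosine(q, p["vector"]), reverse=True)
    return [doc_store[(p["doc_id"], p["chunk_id"])] for p in ranked[:k]]

index_pipeline("guide", ["a vector database stores embeddings", "search runs over each vector"])
index_pipeline("recipes", ["bake bread in a hot oven"])
context = query_pipeline("how hot should the oven be for bread")
```

The list returned by `query_pipeline` is what would be passed to the LLM as context; the VDB itself never holds the chunk text.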

Frequently Asked Questions

What if I need to redact or update the stored text?

If the raw text is in the VDB payload, update the payload (many VDBs support update by ID). If you store only vectors and keep the text elsewhere, update the external store. In either case the existing vector remains usable for minor edits; if the content changes substantively, re-embed it so the vector matches the new text.

Does storing large payloads slow down search?

The approximate nearest-neighbor (ANN) search runs over the vectors, not the payloads, so payload size mainly affects storage and the amount of data sent over the network when results are returned. Some systems lazy-load payloads. Keep payloads lean if you need low latency; put large content in object storage and store only a reference to it in the metadata. See measuring latency.
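To get a rough feel for the effect on response size, the sketch below serializes a top-10 result set with a lean payload and with a payload that embeds the full text. The payload fields and the `response_bytes` helper are illustrative assumptions, and JSON size is only a proxy for what a given VDB actually sends over the wire.

```python
import json

full_text = "lorem ipsum " * 2000  # about 24 KB of placeholder text

# Lean payload: identifiers and a reference to content held elsewhere.
lean = {"doc_id": "doc-1", "title": "Example", "content_ref": "docs/doc-1.txt"}

# Heavy payload: the same fields plus the full text inline.
heavy = {**lean, "text": full_text}

def response_bytes(payload, top_k=10):
    """Size in bytes of a JSON-encoded result list carrying this payload per hit."""
    hits = [{"id": i, "score": 0.0, "payload": payload} for i in range(top_k)]
    return len(json.dumps(hits).encode("utf-8"))

lean_size = response_bytes(lean)
heavy_size = response_bytes(heavy)
```

Because every hit carries its own copy of the payload, the gap between `lean_size` and `heavy_size` grows with `top_k`.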

Can I store both vectors and full text in the same point?

Yes. Many VDBs let you attach a payload (e.g. JSON) with each point. You can store title, snippet, doc_id, and optionally full text, subject to size limits.

For compliance, is it safer to store only vectors?

Storing only vectors reduces exposure of raw content in the VDB; see privacy and reconstruction. You still need to protect the external store and control who can resolve IDs to content.