Embeddings & Data Prep · Topic 35

Privacy concerns: Can you reconstruct data from a vector?

Embeddings are a compressed, lossy representation of the original data. A natural question for compliance and privacy is whether someone with access to a vector (e.g. in a vector database) can reconstruct the source text or image. In practice, exact reconstruction is generally not feasible, but embeddings can still leak partial or statistical information, so access control and design choices matter.

Summary

  • Exact reconstruction from a single vector is not feasible; mapping is many-to-one, so storing only vectors reduces raw content exposure.
  • Embeddings can leak information: similar inputs → similar vectors, and membership inference or model inversion attacks can recover partial or statistical information.
  • For strict privacy, use privacy-preserving vector search (e.g. homomorphic encryption) or avoid indexing highly sensitive content; access control and audit logging matter either way.
  • Embeddings may be considered personal data under GDPR if they allow inference about individuals; restrict access and avoid storing sensitive plaintext in payloads.
  • For full erasure, delete both the document and the corresponding vectors (and any payload).

Can you reconstruct the original data?

In practice, exact reconstruction of the original input from a single embedding is not feasible: the mapping from text to vector is many-to-one and discards most of the bit-level information. You cannot simply run the transformer forward pass in reverse to recover the sentence. Storing only vectors is therefore often treated as a form of abstraction that reduces exposure of raw content, which is useful for privacy-sensitive applications where you want similarity search without storing plaintext in the vector database.

Pipeline implication: if you need to minimize exposure, store only vectors (and IDs) in the vector database (VDB) and keep raw content in a separate, access-controlled store. When to worry: if you attach large payloads or full text to points, anyone with VDB access can read that content; see the topics on storing raw data vs. storing only vectors and on the importance of IDs and metadata.
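The split described above can be sketched in a few lines. This is a minimal in-memory illustration, not a real vector database client: `VectorIndex`, `ContentStore`, `ingest`, and the `"reviewer"` role are hypothetical names chosen for the example, and the vector is assumed to come from whatever embedding model you already use.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical stand-in for a VDB: holds only IDs and vectors."""
    points: dict = field(default_factory=dict)

    def upsert(self, point_id, vector):
        # No text and no payload: only the ID and the numbers are stored.
        self.points[point_id] = list(vector)


@dataclass
class ContentStore:
    """Hypothetical stand-in for a separate, access-controlled store."""
    docs: dict = field(default_factory=dict)
    allowed_roles: frozenset = frozenset({"reviewer"})

    def put(self, doc_id, text):
        self.docs[doc_id] = text

    def get(self, doc_id, role):
        # Raw content is returned only to explicitly allowed roles.
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not read raw content")
        return self.docs[doc_id]


def ingest(text, vector, index, store):
    """Write raw text and its vector to different stores under one shared ID."""
    doc_id = str(uuid.uuid4())
    store.put(doc_id, text)
    index.upsert(doc_id, vector)
    return doc_id
```

The shared ID is the only link between the two stores, so a reader of the index sees numbers and opaque IDs, while resolving an ID back to text goes through the access check.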

Leakage and inference risks

Embeddings can still leak information: similar inputs produce similar vectors, so an attacker with query access might infer that two documents are about the same topic or person. Research on membership inference (was a given record embedded?) and model inversion (what kind of input produced this vector?) shows that, in some settings, partial or statistical information can be inferred, particularly when the attacker can query the same embedding model.
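To make the similarity leak concrete, here is a small sketch of how an attacker who can compare vectors might test a guess. The three vectors are hand-written toy values, not output from any real embedding model, and the `0.9` threshold and the function name `likely_related` are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def likely_related(stored, probe, threshold=0.9):
    """Attacker's guess: a high similarity suggests the same topic or person."""
    return cosine(stored, probe) >= threshold


# Toy values: `stored` is a vector the attacker obtained from the VDB;
# the probes are embeddings of texts the attacker wrote to test a hypothesis.
stored = [0.82, 0.10, 0.56]
probe_same_topic = [0.80, 0.15, 0.58]
probe_other_topic = [0.05, 0.95, 0.10]

print(likely_related(stored, probe_same_topic))   # True
print(likely_related(stored, probe_other_topic))  # False
```

The attacker never recovers the stored text here; they only learn that it is close to a probe they chose, which is the kind of partial information the paragraph above describes.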

For strict privacy, consider privacy-preserving vector search (e.g. homomorphic encryption or secure multi-party computation, MPC), or avoid storing embeddings for highly sensitive content. For most use cases, vectors alone are not sufficient to reconstruct the original data, but access control and audit logging remain important. Trade-off: privacy-preserving search adds substantial compute and latency; use it when the privacy requirement justifies the cost.

If you delete the source document, the vector remains in the VDB until you delete it, and it can still be matched by similarity queries. For full erasure, delete both the document and the corresponding vectors (and any payload).
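The erasure rule can be sketched as a single routine that touches every store, plus a check for vectors left behind. Plain dictionaries keyed by document ID stand in for the document store, the VDB, and the payload store; `erase_everywhere` and `orphaned_vector_ids` are hypothetical helper names, and a real deployment would call each system's own delete operation instead.

```python
def erase_everywhere(doc_id, documents, vectors, payloads):
    """Delete doc_id from every store; return the stores that held it."""
    removed = []
    stores = (("documents", documents), ("vectors", vectors), ("payloads", payloads))
    for name, store in stores:
        # pop() with a default avoids a KeyError when a store has no entry.
        if store.pop(doc_id, None) is not None:
            removed.append(name)
    return removed


def orphaned_vector_ids(documents, vectors):
    """IDs that still have a vector although the source document is gone."""
    return sorted(set(vectors) - set(documents))


documents = {"a": "contract text", "b": "meeting notes"}
vectors = {"a": [0.1, 0.9], "b": [0.7, 0.3]}
payloads = {"a": {"title": "Contract"}}

# Deleting only the source document leaves a searchable vector behind.
del documents["b"]
print(orphaned_vector_ids(documents, vectors))  # ['b']

# Full erasure removes the document, its vector, and any payload.
print(erase_everywhere("a", documents, vectors, payloads))
# ['documents', 'vectors', 'payloads']
```

Running a check like `orphaned_vector_ids` periodically is one way to catch deletions that reached the content store but not the VDB.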

Frequently Asked Questions

Are embeddings considered personal data under GDPR?

They can be, if they allow inference about individuals. Storing only vectors (no raw text) may reduce exposure but doesn’t automatically make the data non-personal. Consult legal/compliance; access control and purpose limitation still apply.

Can someone with VDB access steal my documents?

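Generally not from vectors alone: exact text cannot be reconstructed from a vector in most settings, although partial information can still leak. If you store raw text in metadata or payload, however, that content is exposed to anyone with VDB access. Restrict access and avoid storing sensitive plaintext in the VDB if possible.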

Does homomorphic encryption slow down search?

Yes. Privacy-preserving vector search (e.g. homomorphic encryption or secure MPC) adds substantial compute and, often, latency. Use it when the privacy requirement justifies the cost. See the topic on privacy-preserving vector search.

If I delete the source document, is the vector still a risk?

The vector remains in the VDB until you delete it. It can still be used for similarity and might leak that “something similar to X existed.” For full erasure, delete both the document and the corresponding vectors (and any payload).