Dimensionality reduction techniques (PCA, t-SNE) for visualization
Embeddings from modern models live in hundreds or thousands of dimensions. To visualize how documents or queries sit in latent space—e.g. to debug clustering or check for drift—we use dimensionality reduction to project vectors down to 2D or 3D. PCA and t-SNE are two common techniques. Keep full-dimensional vectors for search; use these methods only for analysis and visualization.
Summary
- PCA: linear, preserves variance, fast and deterministic; good for broad clusters and main directions.
- t-SNE: non-linear, emphasizes local structure; clusters more visually separated; stochastic and slower; use for exploration, not for ANN.
- Keep full-dimensional vectors for search; use these for analysis and visualization only.
- 2D projection does not preserve nearest neighbors exactly; neither PCA nor t-SNE is a substitute for full-dimensional search in the VDB.
- UMAP is another option that often preserves structure better than t-SNE and can be faster; use for visualization, not for building indexes.
PCA
PCA (Principal Component Analysis) finds linear projections that preserve the most variance. It’s fast, deterministic, and good for a quick look at the main directions of variation in your embedding set. The first two principal components often capture a large fraction of variance, so a 2D PCA plot can reveal broad clusters or outliers.
PCA does not preserve local neighborhoods as strongly as t-SNE does. When to use: for a fast, reproducible overview of your embedding distribution, or to check that different batches or model versions have similar global structure. Pipeline tip: run PCA on a sample of vectors (e.g. 10k) for speed. You can build a lower-dimensional index on PCA-reduced vectors, but you lose information; it is often better to keep full dimension for search and use PCA only for visualization or analysis (see the impact of embedding model dimensionality on VDB performance).
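As a minimal sketch of the idea (not a production implementation), PCA to 2D can be done directly with an SVD on centered data; the names `pca_2d` and the synthetic toy embeddings below are illustrative assumptions, with two dimensions deliberately scaled up so the first two components carry most of the variance:

```python
import numpy as np

def pca_2d(X):
    """Project rows of X to 2D via the top-2 principal components (SVD-based)."""
    Xc = X - X.mean(axis=0)                # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:2].T                      # coordinates on the first two PCs
    var = S**2 / (len(X) - 1)              # per-component variance
    explained = var[:2].sum() / var.sum()  # fraction of variance kept in 2D
    return Y, explained

# toy "embeddings": 1,000 vectors in 256-D with two dominant directions
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256)) * np.r_[20.0, 10.0, np.ones(254)]

Y, frac = pca_2d(X)  # Y has shape (1000, 2), ready for a scatter plot
```

In practice you would scatter-plot `Y` (e.g. colored by document source) and report `frac` so readers know how much variance the 2D view actually captures.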
t-SNE and UMAP
t-SNE (t-distributed Stochastic Neighbor Embedding) emphasizes local structure: points that are close in high dimensions tend to stay close in 2D, so clusters and subclusters are often more visually separated. t-SNE is non-linear and stochastic (different runs give different layouts), and it can be slow on large sets.
Use it for exploratory visualization and quality checks (e.g. “do similar documents sit together?”), not for building indexes: the reduced space is not meant for ANN search. UMAP is another option that often preserves structure better than t-SNE and can be faster. For a vector database pipeline, keep full-dimensional vectors for search and use these methods only for analysis and visualization. Trade-off: t-SNE scales poorly with n (quadratic in its exact form, roughly O(n log n) with Barnes-Hut approximation), so use subsampling or approximate methods for very large corpora; PCA is roughly O(n × d) for a fixed number of components and much faster.
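A typical exploratory check might look like the following sketch, assuming scikit-learn is available; the two synthetic clusters stand in for groups of similar documents, and the parameters (`perplexity`, `init="pca"`) are common defaults rather than tuned values:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two synthetic "topic" clusters in 128-D embedding space
a = rng.normal(0.0, 1.0, size=(200, 128))
b = rng.normal(5.0, 1.0, size=(200, 128))
X = np.vstack([a, b])

# stochastic: different random_state values give different layouts
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)  # (400, 2) layout for a scatter plot
```

If the embeddings are good, points from the same cluster should land near each other in `Y`; for very large corpora you would subsample before calling `fit_transform`.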
Frequently Asked Questions
Can I use PCA-reduced vectors for ANN search?
You can project to lower D and build an index on that, but you lose information; quality may drop. Often better to keep full dimension for search and use PCA only for visualization or analysis.
Why is t-SNE slow on large sets?
Exact t-SNE is quadratic in n (number of points), and even Barnes-Hut approximations (roughly O(n log n)) become slow at millions of points; use subsampling or approximate methods for very large corpora. PCA is roughly O(n × d) for a fixed number of components and much faster.
Does 2D projection preserve nearest neighbors?
Not exactly. PCA preserves global variance; t-SNE emphasizes local neighborhoods but distorts distances. Neither is a substitute for full-dimensional nearest neighbor in the VDB.
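This is easy to check empirically. The sketch below (illustrative, with random vectors standing in for embeddings) compares each point's nearest neighbor in the full 128-D space against its nearest neighbor after a 2D PCA projection; the fraction that agree is typically far below 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))  # 300 random 128-D "embeddings"

# 2D PCA projection of the centered data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T

def nn_indices(Z):
    """Index of each row's nearest neighbor (excluding itself)."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.argmin(axis=1)

# fraction of points whose nearest neighbor survives the 2D projection
recall = (nn_indices(X) == nn_indices(Y)).mean()
```

This is exactly why the full-dimensional vectors stay in the VDB: the 2D view is for eyeballing structure, not for answering nearest-neighbor queries.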
What is UMAP?
UMAP is another dimensionality reduction method that often preserves structure better than t-SNE and can be faster. Like t-SNE, use for visualization and exploration, not for building search indexes.