Embeddings & Data Prep · Topic 22

How are embeddings generated for images (CNNs/ViTs)?

Image embeddings are produced by neural networks—traditionally CNNs (Convolutional Neural Networks) and increasingly Vision Transformers (ViTs)—that map a single image to a vector. Similar images end up close in this space, so you can store them in a vector database and search by visual similarity. These embeddings are the backbone of visual search, duplicate detection, and image-driven recommendation systems.

Summary

  • CNNs: convolutions + pooling; final layer (e.g. before classifier or global average pooling) used as the embedding.
  • ViTs: image → patches → transformer; [CLS] token or mean of patch outputs = image vector; fixed-length and dense, suitable for cosine or L2.
  • For text–image search use multi-modal (e.g. CLIP); image embeddings power visual search, dedup, recommendation in a VDB.
  • Trade-offs: ViTs usually better quality; CNNs can be faster and lighter. Normalize when using cosine or dot product.
  • Practical tips: use consistent input size and preprocessing at index and query time; align model choice with latency and memory constraints.

CNN-based image embeddings

In CNNs, the image passes through convolutional and pooling layers that capture local patterns and hierarchy; the final layer before the classifier (e.g. the last fully connected layer or a global average pooling output) is used as the embedding. That vector is a feature vector summarizing the image content. Models like ResNet, EfficientNet, or older VGG are often used with their classification head removed and the penultimate layer taken as the embedding.

The result is fixed-length and dense, suitable for cosine or L2 similarity in a vector database. CNN embeddings are well understood, efficient on CPU/GPU, and widely supported. Trade-off: they can lag behind ViTs on fine-grained or abstract visual similarity because they were originally trained for classification; for best quality on modern benchmarks, ViT or hybrid backbones are often preferred. When to use CNNs: when you need low latency, smaller models, or compatibility with older pipelines.

Vision Transformers (ViTs)

In ViTs, the image is split into patches, each patch is linearly embedded, and transformer layers process the sequence of patch tokens (plus a [CLS] token). The [CLS] token output or the mean of patch outputs gives the image vector. ViTs often outperform CNNs for representation quality and are the backbone of many modern image encoders.

Both CNN and ViT yield a single vector per image that can be normalized and stored in a collection for nearest neighbor search. ViTs typically need more compute and data for training but offer better transfer and robustness to distribution shift. Practical tip: use the same preprocessing (resize, crop, normalization) at index and query time so that distances are meaningful.
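The patch-tokenization step above can be sketched in plain numpy: split the image into non-overlapping patches, flatten each, apply a linear projection, and mean-pool the tokens. The random projection matrix stands in for learned weights, and the transformer layers are elided; the point is the shapes, not the quality of the vector.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))  # placeholder preprocessed image
patch, dim = 16, 64                          # patch size and (hypothetical) token dim

# Split into non-overlapping 16x16 patches and flatten each: (196, 768)
n = 224 // patch
patches = image.reshape(n, patch, n, patch, 3).swapaxes(1, 2).reshape(n * n, -1)

# Linear patch embedding (random weights stand in for the learned projection)
W = rng.standard_normal((patches.shape[1], dim))
tokens = patches @ W          # (196, 64) patch tokens fed to the transformer

# After the (elided) transformer layers: pool via mean of patch outputs,
# or equivalently take the [CLS] token's output in a real ViT.
embedding = tokens.mean(axis=0)  # (64,) image vector
```

A real ViT also adds position embeddings and a [CLS] token before the transformer layers, but the patchify-project-pool skeleton is the same.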

Joint text–image and use cases

For joint text–image search (e.g. “find images like this description”), multi-modal models like CLIP embed both text and images into the same space so one index can serve both. Image-only embeddings power visual search, duplicate or near-duplicate detection, and recommendation (“images like this”). Once generated, they are indexed like any other vector in a VDB and queried via the same lifecycle.

When to use image-only vs. multi-modal: use image-only encoders when queries are always images (e.g. “find similar product photos”). Use CLIP-style multi-modal when you need to query with text, with images, or both. Pipeline tip: batch image encoding for bulk ingestion and consider dimensionality and index type (e.g. HNSW vs. IVF) based on scale and latency targets; see choosing the right embedding model and impact of embedding model dimensionality for VDB performance.
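The "one index serves both" property reduces to simple geometry: if image and text embeddings live in the same space and are unit-normalized, a dot product against the index is cosine similarity, regardless of which modality produced the query. A minimal sketch with random stand-in vectors (a real pipeline would get them from a CLIP-style encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    # Scale rows to unit length so dot product == cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical shared 512-d space: indexed image embeddings + a text query
index = normalize(rng.standard_normal((1000, 512)))
text_query = normalize(rng.standard_normal(512))

scores = index @ text_query        # cosine similarity against every image
top5 = np.argsort(-scores)[:5]     # ids of the 5 most similar images
```

An image query works identically: embed it with the image tower of the same model, normalize, and reuse the same index.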

Frequently Asked Questions

What input size do image embedding models expect?

Typically 224×224 or 384×384 pixels; the image is resized and normalized. Check the model’s preprocessing (e.g. ImageNet normalization). Consistency between indexing and query is required—use the same resize and normalization everywhere.
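For instance, the standard ImageNet normalization many pretrained encoders expect looks like the following sketch (the `preprocess` helper is illustrative, not a library function; resizing to the model's input size is assumed to happen upstream):

```python
import numpy as np

# ImageNet channel statistics used by most pretrained encoders
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_uint8):
    """Scale uint8 pixels to [0, 1], then normalize per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - MEAN) / STD

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # placeholder resized image
x = preprocess(img)  # same shape, now zero-centered per channel
```

Whatever constants your model was trained with, apply the identical function at both indexing and query time.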

Can I use the same VDB collection for text and image embeddings?

Only if they’re in the same space, e.g. with CLIP. Then one collection can hold both; you can query with text or image. Separate text-only and image-only models produce incompatible spaces.

Do image embeddings need to be normalized?

If your VDB uses cosine similarity or dot product, yes—normalize so that similarity reflects direction. Many image encoders output normalized vectors by default. See normalizing vectors for why it’s necessary.
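L2 normalization is a one-liner; after it, cosine similarity and dot product agree, so either metric ranks results identically. A minimal sketch:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Divide by the vector's L2 norm; eps guards against a zero vector
    return v / max(np.linalg.norm(v), eps)

emb = np.array([3.0, 4.0])
unit = l2_normalize(emb)  # [0.6, 0.8] — unit length, direction preserved
```

If your encoder already outputs unit vectors, re-normalizing is a harmless no-op; doing it unconditionally at ingestion keeps the index consistent.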

How do I choose between CNN and ViT for image search?

ViTs generally give stronger representations; CNNs can be faster and lighter. For multi-modal (text–image), CLIP-style models use ViT or similar. Pick based on quality vs. latency and memory; see measuring latency and memory usage per million vectors for benchmarking.