What is multi-modal embedding (CLIP)?
Multi-modal embedding means mapping different modalities—e.g. text and images—into the same vector space. CLIP (Contrastive Language–Image Pre-training) is a prominent example: it trains a text encoder and an image encoder so that matching text–image pairs are close and non-matching pairs are far. That lets you query an image vector database with text (or the reverse) using a single index. One collection can serve both text and image queries, simplifying architecture and enabling cross-modal semantic search.
Summary
- Multi-modal: text and images (or other modalities) in one latent space; one collection can serve both.
- CLIP: contrastive learning on text–image pairs; at inference, embed the query (text or image) and run nearest-neighbor search over the other modality using the same cosine or dot-product metric.
- Other models (ALIGN, Florence, CoCa) follow similar ideas; simplifies architecture and improves cross-modal semantic retrieval.
- Trade-off: one model and one index for both modalities vs. separate text/image models; ensure the same normalization and dimension for all vectors.
- When switching to a new CLIP model, re-embed and re-index all items—you cannot mix old and new vectors in one collection.
How CLIP works
CLIP uses contrastive learning: many text–image pairs are encoded by a text encoder and an image encoder, and the model is trained to maximize similarity for true pairs and minimize it for all others. Both encoders output vectors of the same dimension, so similarity (e.g. cosine or dot product) is comparable across modalities. At inference, you embed a search query (e.g. "a red car" or an image) and run nearest-neighbor search over the stored embeddings, whether they came from text or images.
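The training objective can be sketched in plain NumPy: embed a batch of matching text–image pairs, build a similarity matrix, and apply a symmetric cross-entropy whose targets sit on the diagonal. This is an illustrative sketch of the contrastive idea, not the real training loop; the temperature value 0.07 is a common choice but an assumption here.

```python
import numpy as np

def clip_contrastive_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of pairs.

    Row i of text_emb and row i of img_emb form a true pair; every other
    combination in the batch acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # True pairs are on the diagonal; push their softmax prob toward 1
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-image and image-to-text directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss is what pulls matching pairs together and pushes non-matching pairs apart in the shared space.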
There is no need for separate text and image indexes: one collection holds all vectors, and you can mix text and image queries over the same query path (see the lifecycle of a vector query). Pipeline: ingest by embedding text and/or images with the same CLIP model, normalize if using cosine or dot-product similarity, then upsert into a single collection; at query time, embed the query (text or image) with the same model and run ANN search. Practical tip: use the same image preprocessing (resize, pixel normalization) and text tokenization as in training or as specified on the model card, to avoid distribution shift.
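A minimal end-to-end sketch of that pipeline follows. Here `embed_text` and `embed_image` are hypothetical stand-ins for a real CLIP model's encoders (they return deterministic toy vectors; in practice you would call the actual model), and a plain dict stands in for the vector collection.

```python
import zlib
import numpy as np

DIM = 8  # toy dimension; real CLIP models typically use e.g. 512 or 768

def _toy_encoder(key: str) -> np.ndarray:
    # Deterministic pseudo-embedding; a stand-in for a real CLIP encoder
    rng = np.random.default_rng(zlib.crc32(key.encode()))
    return rng.normal(size=DIM)

def embed_text(text: str) -> np.ndarray:       # hypothetical text encoder
    return _toy_encoder("text:" + text)

def embed_image(image_id: str) -> np.ndarray:  # hypothetical image encoder
    return _toy_encoder("image:" + image_id)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Ingest: embed with the SAME model, normalize, upsert into ONE collection
collection = {}  # id -> unit vector; stands in for a vector DB collection
for img_id in ["cat.jpg", "car.jpg", "dog.jpg"]:
    collection[img_id] = normalize(embed_image(img_id))

# Query: embed the query with the same model, then nearest-neighbor search
def search(query_text: str, k: int = 2):
    q = normalize(embed_text(query_text))
    scored = [(item, float(q @ vec)) for item, vec in collection.items()]
    return sorted(scored, key=lambda s: -s[1])[:k]
```

The same `search` path would accept an image query by swapping `embed_text` for `embed_image`; a real deployment replaces the dict scan with an ANN index.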
Why use multi-modal in a VDB
A single embedding model and a single index can serve cross-modal search: “find images that match this caption” or “find captions that describe this image.” That simplifies architecture (one collection, one metric) and improves semantic retrieval across text and images. Other multi-modal models (e.g. ALIGN, Florence, CoCa) follow similar ideas—joint encoders and contrastive or aligned training so that text and image (or video) live in one space.
When to use multi-modal: you need to query with both text and images, or your corpus mixes text and images and you want a single search surface. When not to: if you only ever query with one modality over a large text-only or image-only corpus, a specialized single-modal model may give better quality or lower cost. The trade-off is flexibility and simplicity vs. potentially higher compute (CLIP-style models can be heavier than text-only embedders) and the need to keep all vectors in one space.
Practical considerations
When indexing, you can embed only images, only text, or both into the same collection; queries can come from either modality. Ensure that the same CLIP (or other multi-modal) model and the same normalization are used for all vectors. The dimension is fixed (e.g. 512 or 768), as with any fixed-length embedding; see choosing the right embedding model when comparing multi-modal vs. single-modal options.
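A hypothetical pre-upsert check along these lines can catch dimension or normalization mismatches before they reach the collection (the function name and signature are illustrative, not part of any library):

```python
import numpy as np

def validate_for_upsert(vectors: dict, expected_dim: int, atol: float = 1e-5):
    """Reject vectors with the wrong dimension or a non-unit norm.

    Unit norm is what makes cosine and dot-product search interchangeable;
    a wrong dimension usually means a different model produced the vector.
    """
    for vec_id, v in vectors.items():
        v = np.asarray(v, dtype=np.float64)
        if v.shape != (expected_dim,):
            raise ValueError(f"{vec_id}: shape {v.shape}, expected ({expected_dim},)")
        if not np.isclose(np.linalg.norm(v), 1.0, atol=atol):
            raise ValueError(f"{vec_id}: not L2-normalized (norm={np.linalg.norm(v):.4f})")
```

Running this on every batch before upsert is cheap relative to embedding and avoids silently corrupting search results.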
Frequently Asked Questions
Can I mix CLIP image embeddings with non-CLIP text embeddings in one collection?
No. All vectors in a collection must be from the same model and live in the same latent space. Mixing spaces would make distances meaningless.
Does CLIP support languages other than English?
Base CLIP is English-focused. Multilingual variants (e.g. multilingual CLIP) exist for text in multiple languages; images stay in the same space. Check model cards for language support.
Can I use CLIP for video or audio?
CLIP is text–image. For video, some models encode frames or use video-specific encoders that align with text. For audio, other multi-modal models (e.g. speech–text) exist. The idea—one shared space for multiple modalities—extends to other pairs.
How do I update my index when I switch to a new CLIP model?
New model = new latent space. You must re-embed and re-index all items with the new model; you can’t mix old and new CLIP vectors in one collection.
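The migration itself is a full re-embed into a fresh collection, followed by a swap of which collection serves traffic. A minimal sketch, where `new_embed` is a hypothetical encoder for the new model:

```python
import numpy as np

def reindex(item_ids, new_embed):
    """Rebuild a collection for a new embedding model from scratch.

    No old vectors are carried over: the old and new models define
    different latent spaces, so their vectors are not comparable.
    """
    new_collection = {}
    for item_id in item_ids:
        v = np.asarray(new_embed(item_id), dtype=np.float64)
        new_collection[item_id] = v / np.linalg.norm(v)  # re-normalize for cosine
    return new_collection
```

Once `reindex` has covered every item, point the serving alias at the new collection and drop the old one; never upsert new-model vectors into the old collection mid-migration.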