How are embeddings generated for text (Transformers)?
Text embeddings are produced by transformer models that take tokenized input and output a single vector per sentence or passage. The model’s parameters are trained so that semantically similar text maps to nearby vectors, which is why these embeddings power semantic search and ingestion into a vector database. Understanding this pipeline is essential for building retrieval systems, RAG applications, and any system that relies on meaning-based search.
Summary
- Pipeline: raw text → tokenization → transformer forward pass → pooling (e.g. [CLS] or mean) → one embedding vector.
- Models (BERT, Sentence-BERT, E5, BGE, text-embedding-3) output fixed-length dense vectors, often normalized for cosine or dot-product search.
- Model choice determines dimensionality and quality; fine-tuning adapts embeddings to a domain, and batching keeps ingestion efficient at scale.
- Query and corpus must use the same model and normalization so distances are comparable; see handling model version drift when upgrading.
- Practical tips: align max input length with chunk size, use the model’s tokenizer, and batch embedding calls for ingestion at scale.
Pipeline: text to vector
The pipeline is: raw text → tokenization → transformer forward pass → pooling over token outputs → one embedding vector. The tokenizer splits text into subword or word tokens and feeds them to the transformer; the model outputs a representation per token. To get a single vector per sentence or passage, we pool: often the [CLS] token output (in BERT-style models) or mean/max pooling over all token representations. The result is a fixed-length dense vector, typically normalized to unit length for cosine similarity or dot-product search in the vector database.
Each step affects downstream quality and performance. Tokenization determines vocabulary coverage and handling of rare or out-of-vocabulary terms. The choice of pooling (CLS vs. mean vs. max) can favor different aspects of the text—CLS is trained to summarize the sequence in BERT, while mean pooling often gives more stable sentence-level representations. Normalization is required when your vector database uses cosine or inner-product similarity so that scores are comparable across documents of different lengths or styles.
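As a sketch of the pooling and normalization steps above, here is a NumPy-only illustration; the token embeddings and attention mask are toy values standing in for a real transformer's per-token outputs, not any specific library's API:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions marked 0 in the mask."""
    mask = attention_mask[:, :, None].astype(float)      # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)       # (batch, dim)
    counts = mask.sum(axis=1)                            # (batch, 1) non-pad token counts
    return summed / np.clip(counts, 1e-9, None)

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-9, None)

# Toy batch: 1 sequence, 3 token positions (last is padding), dim 4.
tokens = np.array([[[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [9.0, 9.0, 9.0, 9.0]]])   # padding row is masked out below
mask = np.array([[1, 1, 0]])
sentence_vec = l2_normalize(mean_pool(tokens, mask))
```

Masking matters: without it, padding vectors would leak into the average and shift the sentence embedding.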
Model families and training
Models like BERT, Sentence-BERT (SBERT), and modern embedding models (e.g. OpenAI text-embedding-3, Cohere embed, E5, BGE) are trained so that semantically similar sentences have similar vectors. Training is often contrastive or triplet-based: pull together matching pairs, push apart non-matching. That creates the latent space where “close” means “similar in meaning,” which is what makes semantic search and vector query useful.
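The contrastive objective can be sketched as an in-batch InfoNCE loss, where each query's matching passage is the positive and every other passage in the batch is a negative; this is a simplified NumPy illustration of the idea, not any model's exact training code:

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, passages: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive loss: queries[i] should match passages[i];
    all other passages in the batch act as negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    scores = q @ p.T / temperature                    # (batch, batch) cosine scores
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Loss is low when each matching pair outscores all in-batch negatives.
    return float(-np.mean(np.diag(log_softmax)))
```

Minimizing this loss is what pulls matching pairs together and pushes non-matching pairs apart in the latent space.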
Trade-offs between model families include size (speed vs. quality), dimensionality (storage and index build cost), max sequence length (how much text per vector), and language or domain coverage. Encoder-only models (BERT-style) are the standard choice; note that passage retrieval is an asymmetric task (short queries against longer documents), which is why some systems use separate query and document encoders. When to use which: small models for low latency and high throughput, larger models when recall and nuance matter more than cost.
Model choice determines dimensionality and quality, and you can fine-tune for your domain. For ingestion at scale, batch embedding calls rather than embedding one text at a time. Fine-tuning on in-domain pairs (query–passage or passage–passage) usually improves retrieval metrics; make sure your evaluation set reflects real queries to avoid overfitting.
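Batched ingestion can be as simple as the sketch below; `embed_fn` is a placeholder for whatever model or API call you use, and the batch size is an illustrative default you would tune to your provider's limits:

```python
from typing import Callable

def embed_in_batches(texts: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 64) -> list[list[float]]:
    """Embed a corpus in fixed-size batches instead of one call per text."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))   # one model/API call per batch
    return vectors
```

Batching amortizes per-call overhead (network round trips, GPU kernel launches) across many texts, which is usually the difference between hours and minutes for a large corpus.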
Using text embeddings in a VDB
Storing these vectors in a VDB lets you query by meaning rather than keywords—the core of RAG and semantic retrieval. The same model must be used for indexing and for embedding the query so that query and corpus live in the same space. See handling model version drift when you change or update the model.
Practical tips: run the same preprocessing (and tokenizer) at index and query time; align max input length with your chunking strategy so that chunks fit within the model’s context; monitor for distribution shift if your content or query mix changes. For multi-tenant or multi-domain setups, either use one model with broad training or maintain separate collections per model/domain so that distances remain comparable within each space.
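The query-time side can be sketched as a brute-force cosine search; in this NumPy illustration the corpus and query vectors are toy values that stand in for embeddings produced by the same model, already L2-normalized:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k nearest corpus vectors by cosine similarity.
    Assumes query and corpus were embedded by the SAME model and L2-normalized,
    so the dot product equals cosine similarity."""
    scores = corpus_vecs @ query_vec          # (n,) cosine scores
    return np.argsort(-scores)[:k].tolist()   # highest score first

# Toy example: 3 normalized corpus vectors; the query is closest to the second.
corpus = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.70710678, 0.70710678]])
query = np.array([0.1, 0.995])
query /= np.linalg.norm(query)                # normalize the query the same way
ranking = cosine_top_k(query, corpus, k=3)
```

A real vector database replaces the brute-force dot product with an approximate index, but the contract is the same: both sides of the comparison must come from one model and one normalization scheme.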
Frequently Asked Questions
What is the maximum input length for text embeddings?
It depends on the model. Many support 512 or 8192 tokens; some support 32k or more. Inputs longer than the max are typically truncated or split (e.g. with chunking). Check the model’s spec and align with your chunk size strategy (see overlapping chunks vs. fixed-size chunks).
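A token-count-based chunker might look like the sketch below; the whitespace split is a deliberate simplification standing in for the model's real subword tokenizer, which you should use in practice so chunk sizes match what the model actually sees:

```python
def chunk_by_tokens(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with optional
    overlap between consecutive chunks. Whitespace split is a stand-in for
    the model's real subword tokenizer."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break                      # last chunk already covers the tail
    return chunks
```

Overlap trades storage for recall: a sentence that straddles a chunk boundary appears whole in at least one chunk.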
Do I need to normalize text embeddings before storing in a VDB?
Many APIs return normalized vectors by default. If your VDB uses cosine or dot product, vectors should be normalized; check your collection metric and the model’s output. See normalizing vectors and cosine similarity for details.
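A defensive ingestion step is to check whether vectors already have unit length and normalize only when they do not; this NumPy sketch is an illustration, and the tolerance is an assumed value you would adjust:

```python
import numpy as np

def ensure_unit_norm(vecs: np.ndarray, tol: float = 1e-3) -> np.ndarray:
    """Normalize rows to unit length unless they already are (within tol)."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    if np.all(np.abs(norms - 1.0) < tol):
        return vecs                             # model already returned unit vectors
    return vecs / np.clip(norms, 1e-9, None)    # otherwise normalize before storing
```

Applying this once at ingestion (and again to each query vector) keeps cosine and dot-product scores comparable regardless of what the embedding API returns.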
Can I use a different model for query vs. documents?
No. Query and corpus must use the same model (and same normalization) so that distances are comparable in one latent space. Different models produce incompatible spaces and will degrade retrieval quality.
How does tokenization affect embedding quality?
Tokenization determines how text is split into tokens; out-of-vocabulary or rare tokens can hurt quality. See handling OOV terms. Use the tokenizer that comes with the model and keep it consistent at index and query time.