The role of a “Collection” or “Index” in a VDB

In a vector database (VDB), a collection (or index, depending on the product) is a named container that holds points sharing the same vector dimension, distance metric, and often the same metadata schema. You query “within” a collection, so the system searches only that set of vectors. A collection plays a role similar to a table in a relational database, but it is optimized for similarity search.

Summary

A collection (or index) groups points with the same dimension, distance metric, and often metadata schema.
You query within one collection; this isolates use cases and keeps query scope and resource usage predictable.
Collections let you tune index parameters (e.g. HNSW M, efConstruction) per dataset; they’re the unit of indexing in the VDB architecture.
Without collections you’d mix incompatible vectors or search the entire DB on every query.

What a collection defines

A collection defines the dimensionality of the vectors (e.g. 768, 1536), the distance or similarity metric (e.g. cosine, L2), and often the metadata schema (field names and types for filtering). All vectors in a collection must have the same dimension and are compared with the same metric so that nearest neighbor is well-defined. Collections let you isolate different use cases (e.g. one for product embeddings, one for support docs) and tune index parameters per dataset.
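As a rough illustration of these rules, the sketch below models a collection as a small in-memory class. The names `Collection`, `upsert`, and `search` are hypothetical and do not belong to any particular product, and a brute-force scan stands in for a real ANN index.

```python
import math
from dataclasses import dataclass, field


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


METRICS = {"cosine": cosine_distance, "l2": l2_distance}


@dataclass
class Collection:
    """Hypothetical in-memory collection: one dimension, one metric."""
    name: str
    dimension: int
    metric: str
    points: dict = field(default_factory=dict)  # id -> (vector, metadata)

    def __post_init__(self):
        if self.metric not in METRICS:
            raise ValueError(f"unsupported metric: {self.metric}")

    def upsert(self, point_id, vector, metadata=None):
        # Reject vectors that do not match the collection's dimension.
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected dimension {self.dimension}, got {len(vector)}"
            )
        self.points[point_id] = (list(vector), metadata or {})

    def search(self, query, k=3):
        if len(query) != self.dimension:
            raise ValueError(
                f"expected dimension {self.dimension}, got {len(query)}"
            )
        dist = METRICS[self.metric]
        # Brute-force scan; a real system would consult its ANN index here.
        scored = sorted(
            (dist(query, vec), pid) for pid, (vec, _) in self.points.items()
        )
        return [pid for _, pid in scored[:k]]


products = Collection("products", dimension=3, metric="cosine")
products.upsert("p1", [1.0, 0.0, 0.0], {"category": "shoes"})
products.upsert("p2", [0.0, 1.0, 0.0], {"category": "hats"})
print(products.search([0.9, 0.1, 0.0], k=1))  # ['p1']
```

Inserting a 2-dimensional vector into `products` raises `ValueError`, which is the kind of check that keeps nearest neighbor well-defined inside one collection.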

The distance metric is fixed at collection creation in most systems. You cannot mix cosine and L2 in the same collection: the index is built for one metric, and scores from different metrics are not comparable, so a combined ranking would be meaningless. If you need to experiment with metrics, create separate collections, or recreate the collection with the new metric and re-ingest.
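A small worked example shows why the metric has to be fixed: the same two points can rank in opposite order under cosine and L2. The vectors and point names below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def l2_distance(a, b):
    """Euclidean distance (lower means more similar)."""
    return math.dist(a, b)


query = [1.0, 0.0]
points = {
    "far_same_direction": [3.0, 0.0],  # same direction as the query, but far away
    "near_off_axis": [1.0, 0.5],       # close to the query, slightly rotated
}

# Rank by cosine similarity, most similar first.
by_cosine = sorted(
    points, key=lambda pid: cosine_similarity(query, points[pid]), reverse=True
)
# Rank by L2 distance, closest first.
by_l2 = sorted(points, key=lambda pid: l2_distance(query, points[pid]))

print(by_cosine)  # ['far_same_direction', 'near_off_axis']
print(by_l2)      # ['near_off_axis', 'far_same_direction']
```

Cosine ignores vector length and so prefers the point in the same direction, while L2 prefers the point that is physically closer. An index built for one metric therefore cannot serve correct results for the other.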

Why not put everything in one big bucket?

Without this grouping, you’d mix incompatible vectors (different dimensions or metrics) or search the entire database on every query, which hurts latency and makes it impossible to use different embedding models per use case. In the architecture of a VDB, the collection is the unit of indexing: when you insert points, they’re added to a collection, and the underlying index (HNSW, IVF, etc.) is built or updated for that collection. Queries then target a single collection, which keeps query latency and resource usage predictable.

Separate collections also allow different index types or parameters per use case. For example, a high-recall collection might use HNSW with large efConstruction, while a low-latency collection might use IVF with small nprobe. You choose the collection at query time based on the trade-off you need.
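One way to picture per-collection tuning is a registry in which each collection carries its own index configuration. This is only a sketch: `HnswConfig`, `IvfConfig`, `CollectionSpec`, and `Database` are hypothetical names, and the parameter values are illustrative rather than recommendations.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class HnswConfig:
    m: int                # graph connectivity (HNSW M)
    ef_construction: int  # build-time candidate list size


@dataclass(frozen=True)
class IvfConfig:
    nlist: int   # number of clusters built at index time
    nprobe: int  # clusters scanned per query


@dataclass(frozen=True)
class CollectionSpec:
    name: str
    dimension: int
    metric: str
    index: Union[HnswConfig, IvfConfig]


class Database:
    """Hypothetical registry: every collection has its own index settings."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, spec):
        if spec.name in self._collections:
            raise ValueError(f"collection already exists: {spec.name}")
        self._collections[spec.name] = spec

    def get(self, name):
        # Queries always name one collection; unknown names raise KeyError.
        return self._collections[name]


db = Database()
# High-recall use case: HNSW with a large efConstruction (illustrative values).
db.create_collection(
    CollectionSpec("support_docs", 1536, "cosine", HnswConfig(m=32, ef_construction=400))
)
# Low-latency use case: IVF with a small nprobe (illustrative values).
db.create_collection(
    CollectionSpec("products", 768, "l2", IvfConfig(nlist=1024, nprobe=4))
)

print(db.get("support_docs").index)
print(db.get("products").index)
```

Because the configuration is attached to the collection, choosing a collection at query time also selects the index type and parameters that go with it.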

Namespaces and multi-tenancy

Some systems use namespaces or partitions inside a collection for logical separation (e.g. per tenant or per environment). That can reduce the search set and support multi-tenant isolation. The key idea remains: a collection is the scope of one index and one schema, and you always query “within” it.

When using namespaces or partitions, filtering by tenant or partition key at query time restricts the search to a subset of the collection. This avoids building separate indexes per tenant while still isolating data and improving query performance for single-tenant requests.
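The partition idea can be sketched as follows, assuming a hypothetical `PartitionedCollection` class in which a partition key selects the subset of points to search. A brute-force L2 scan stands in for the real index, and real systems differ in how they implement the restriction.

```python
import math
from collections import defaultdict


class PartitionedCollection:
    """Hypothetical collection whose points are grouped by partition key."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._partitions = defaultdict(dict)  # partition key -> {id: vector}

    def upsert(self, partition, point_id, vector):
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected dimension {self.dimension}, got {len(vector)}"
            )
        self._partitions[partition][point_id] = list(vector)

    def search(self, partition, query, k=3):
        # Only points in the requested partition are candidates.
        candidates = self._partitions.get(partition, {})
        scored = sorted(
            (math.dist(query, vec), pid) for pid, vec in candidates.items()
        )
        return [pid for _, pid in scored[:k]]


docs = PartitionedCollection(dimension=2)
docs.upsert("tenant_a", "doc1", [0.0, 0.0])
docs.upsert("tenant_a", "doc2", [1.0, 1.0])
docs.upsert("tenant_b", "doc9", [0.0, 0.1])

print(docs.search("tenant_a", [0.0, 0.0], k=2))  # ['doc1', 'doc2']
print(docs.search("tenant_c", [0.0, 0.0]))       # []
```

Although `doc9` is closer to the query than `doc2`, it belongs to `tenant_b` and never appears in results for `tenant_a`.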

Frequently Asked Questions

Can I have multiple collections with different dimensions?

Yes. Each collection has its own dimension and metric, so you can dedicate one collection to each embedding model or use case. At query time you specify which collection to search.

Is a collection the same as a table?

Conceptually similar: like a table, a collection holds rows of data (here, points) with a fixed schema. Unlike a typical table, it’s optimized for nearest neighbor search: its main index is an approximate nearest neighbor (ANN) index rather than a B-tree.

Can I change the index type or parameters after creation?

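It varies by product. Search-time parameters (e.g. efSearch) can often be tuned per query, while build-time parameters such as HNSW M or efConstruction generally take effect only when the index is rebuilt. Changing the dimension or metric usually requires a new collection and re-ingestion. See dynamic indexing where supported.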

How many collections should I create?

One per logical dataset that shares the same embedding model, dimension, and metric. Separate by use case (e.g. products vs. docs) or by tenant if you need isolation. Too many tiny collections add overhead, since each one maintains its own index; too few mix unrelated data in the same search scope.