What is a vector in the context of machine learning?
In machine learning, a vector is an ordered list of numbers that represents a single data point. These numbers—often called dimensions or components—encode the object’s features so that models and databases can compare, search, and reason about data by similarity, for example via nearest neighbor search.
Summary
- A vector in ML is an ordered list of numbers (dimensions) representing one data point.
- Vectors can be thought of as points in space; high-dimensional vectors (hundreds or thousands of dimensions) are common.
- Feature vectors encode hand-crafted or learned features; embeddings are vectors produced by models that capture semantic or structural information.
- Similar items sit close together in vector space, enabling semantic search and vector database retrieval by similarity rather than exact match.
- Vectors are the foundation for vector database operations: storage, indexing, and query lifecycle all rely on them.
Vectors as ordered lists of numbers
You can think of a vector as a point in space: each number is a coordinate along one axis. For example, a 3-dimensional vector [0.2, −0.5, 0.8] is one point in 3D space. In ML we usually work with high-dimensional vectors (hundreds or thousands of dimensions), where each dimension often corresponds to some learned or hand-crafted feature. That’s why the term feature vector is common: the vector is the list of feature values for one item.
Mathematically, a vector is an element of a vector space—typically R^d for d dimensions—and supports operations like addition, scalar multiplication, and inner product. In practice, we represent them as arrays: [0.1, -0.3, 0.9, ...]. The ordering matters: the first number is the first dimension, the second is the second, and so on. Two vectors are compared using metrics such as cosine similarity, Euclidean (L2) distance, or inner product (dot product)—the same metrics vector databases use for nearest neighbor search.
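These comparison metrics can be computed directly with NumPy; the vector values below are arbitrary illustrations:

```python
import numpy as np

# Two 3-dimensional vectors represented as arrays (values are arbitrary examples).
a = np.array([0.2, -0.5, 0.8])
b = np.array([0.1, -0.3, 0.9])

# Euclidean (L2) distance: straight-line distance between the two points.
l2 = np.linalg.norm(a - b)

# Cosine similarity: angle-based similarity, in [-1, 1] for real-valued vectors.
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Here the two vectors point in nearly the same direction, so the cosine similarity is close to 1 even though the L2 distance is non-zero.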
The dimensionality of a vector affects both representational power and computational cost. Low-dimensional vectors (e.g. 2D or 3D) are easy to visualize but often cannot capture enough structure for real-world tasks. High-dimensional vectors (e.g. 384, 768, or 1536 dimensions from modern embedding models) can encode fine-grained semantic or visual information, at the cost of more storage and computation when comparing or indexing them.
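As a back-of-the-envelope illustration of that storage cost, assuming float32 vectors (4 bytes per dimension) and made-up corpus numbers:

```python
# Rough storage estimate for raw float32 vectors (4 bytes per dimension).
# Assumed numbers for illustration: 1 million vectors at 768 dimensions.
num_vectors = 1_000_000
dims = 768
bytes_per_float = 4

raw_bytes = num_vectors * dims * bytes_per_float
raw_gib = raw_bytes / 2**30  # ~2.9 GiB before any index overhead
```

Index structures and metadata add overhead on top of this raw figure, which is one reason dimensionality is a real cost lever.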
Vectors in machine learning pipelines
Vectors are the input and output of many ML components. An embedding is a vector produced by a model (e.g. a neural network) that captures semantic or structural information about the input—text, image, or user behavior. Those embeddings can be dense (most dimensions non-zero) or sparse (mostly zeros). They live in a latent space where “similar” things sit close together, which is what makes semantic search and vector databases useful: you store vectors and retrieve by similarity rather than exact match.
In a typical pipeline, raw data (e.g. text or images) is first converted into vectors by an embedding model. Those vectors are then inserted into a collection or index in a vector database. At query time, the query (e.g. a search phrase) is also turned into a vector using the same model; the database runs approximate nearest neighbor (ANN) search to return the stored points whose vectors are closest to the query vector. The entire lifecycle of a vector query depends on this representation.
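The pipeline above can be sketched end to end. This is a minimal illustration, not a real system: embed is a stand-in for a trained embedding model (a seeded pseudo-random projection), and the “index” is a brute-force exact search over a matrix rather than an ANN structure:

```python
import numpy as np

def embed(text: str, dims: int = 8) -> np.ndarray:
    # Stand-in for a real embedding model: a deterministic pseudo-random
    # vector seeded by the text. Real pipelines call a trained model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(dims)
    return v / np.linalg.norm(v)  # unit-normalize so dot product == cosine

# "Insert": embed the corpus and stack the vectors into one matrix.
corpus = ["red apple", "green pear", "sports car"]
index = np.stack([embed(doc) for doc in corpus])

# "Query": embed the query with the same model, then search by similarity.
query_vec = embed("red apple")
scores = index @ query_vec            # cosine similarity to each stored vector
best = corpus[int(np.argmax(scores))]
```

Because the query is embedded with the same model as the corpus, the matching document scores highest; a production system swaps the brute-force matrix product for an ANN index.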
Different modalities (text, images, audio) can be embedded into the same or different vector spaces. When the same model is used for both query and corpus, distances in that space reflect semantic or perceptual similarity. When you mix modalities (e.g. search images with text), you need a model that embeds both into a shared space, such as in multi-modal embedding setups.
Why vectors matter for vector databases
In the context of machine learning and vector databases, then, a vector is the numeric representation of an item that enables comparison, clustering, and retrieval by meaning or structure. Traditional B-tree indexes cannot efficiently answer “find items similar to this vector” because they are built for one-dimensional ordering and exact matches. Vector databases instead use indexes designed for high-dimensional similarity (e.g. HNSW, IVF), relying on the fact that vectors from good embedding models cluster in latent space, where proximity implies semantic or structural similarity. That is the foundation for recommendations, semantic search, and retrieval-augmented generation (RAG).
Choosing the right dimensionality and distance metric for your vectors depends on your embedding model and use case. Once vectors are stored in a vector database, all subsequent operations—filtering, indexing, and querying—rely on this single numeric representation. Understanding what a vector is and how it is produced is therefore the first step in building or using any vector-powered application.
Frequently Asked Questions
What is the difference between a vector and an embedding?
An embedding is a specific kind of vector: one produced by a model (e.g. transformer, CNN) so that similar inputs map to nearby vectors in a shared space. So every embedding is a vector, but not every vector is an embedding—e.g. hand-crafted feature vectors are vectors but not necessarily “embeddings” in the ML sense.
How many dimensions do vectors typically have in vector databases?
It depends on the embedding model. Common text models produce 384, 768, or 1536 dimensions; image and multimodal models can be similar or larger. Higher dimensionality usually means more expressive power, but also more memory and compute per vector, which affects index size and query latency in a vector database.
Can vectors have negative values?
Yes. Embeddings from neural networks typically have both positive and negative components, and similarity remains well-defined under cosine similarity or L2 distance.
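A tiny sketch of why sign poses no problem for similarity (the unit vectors here are arbitrary):

```python
import numpy as np

# Embedding components can be negative; similarity metrics still behave sensibly.
u = np.array([-0.6, 0.8])   # a unit vector with a negative component
v = np.array([0.6, -0.8])   # the same vector, pointing the opposite way

cos_same = np.dot(u, u)     # cosine of a unit vector with itself -> 1.0
cos_opp = np.dot(u, v)      # cosine with the opposite direction -> -1.0
```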
Why is “nearest neighbor” the main operation for vectors?
Because the goal is to find items that are similar—in meaning, structure, or behavior—not just equal. In vector space, “nearest” under a distance metric (e.g. L2 or cosine) corresponds to “most similar.” That’s why nearest neighbor search is the core operation in vector databases and ANN indexes.
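A minimal exact nearest neighbor search under L2 distance makes this concrete (the points and query values are arbitrary; real systems use approximate indexes at scale):

```python
import numpy as np

def knn(points: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    # Exact k-nearest-neighbor: rank all stored points by L2 distance
    # to the query and return the indices of the k closest.
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]

pts = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
nearest = knn(pts, np.array([0.9, 1.1]), k=2)  # indices of the 2 closest points
```

Here the point [1.0, 1.0] is nearest to the query, followed by [0.0, 0.0]; ANN indexes approximate exactly this ranking without scanning every stored vector.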