Cloud-native architecture (Separation of Compute and Storage)
Separating compute and storage means storing vector indexes and data on durable, scalable object or block storage (e.g. S3, cloud disks) while running query and indexing compute on stateless or ephemeral nodes. Compute nodes load or stream only what they need, so you can scale query capacity independently of data size and avoid keeping full copies in every node’s RAM.
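To make the idea concrete, here is a minimal sketch of a stateless query node attaching to shared durable storage. `ObjectStore`, `Segment`, and the key layout are hypothetical stand-ins, not any specific vector DB's API:

```python
# Sketch: stateless compute attaching to shared durable storage.
# All class names are illustrative, not a real vector-DB API.
from dataclasses import dataclass

@dataclass
class Segment:
    """An immutable chunk of index + vectors, stored as one object."""
    key: str
    data: bytes

class ObjectStore:
    """Durable shared storage (think S3). Here: an in-memory dict."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

class QueryNode:
    """Stateless compute: holds no data until a query needs a segment."""
    def __init__(self, store):
        self.store = store
        self.loaded = {}  # local cache of segments fetched so far

    def load_segment(self, key):
        if key not in self.loaded:  # fetch only what this query needs
            self.loaded[key] = Segment(key, self.store.get(key))
        return self.loaded[key]

store = ObjectStore()
store.put("seg-001", b"index bytes ...")

# Two nodes scale out against the same storage without copying everything.
node_a, node_b = QueryNode(store), QueryNode(store)
assert node_a.load_segment("seg-001").data == node_b.load_segment("seg-001").data
```

Because the nodes keep no authoritative state, either one can be replaced at any time and a fresh node simply re-fetches from the store.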
Summary
- Indexes and data live on durable storage (e.g. S3, cloud disks); query and indexing run on stateless/ephemeral compute that loads or streams what it needs. Scale query capacity independently of data size.
- Benefits: elasticity (add query nodes without copying data), cost (storage is cheaper than RAM; compute can use spot instances), recovery (new nodes attach to the same storage). Trade-off: latency; mitigate with caching, mmap, or DiskANN-style layouts.
- Segments/index files live in object storage, with metadata in a distributed store; compute (e.g. Kubernetes pods) mounts or fetches segments on demand and auto-scales on CPU, memory, or QPS. Pipeline: writes land on durable storage, compute fetches segments, and queries run on compute.
Benefits and trade-offs
Benefits: (1) Elasticity: add more query nodes when traffic spikes without copying the whole dataset. (2) Cost: storage is cheaper than RAM, and compute can use spot or preemptible instances. (3) Recovery: new nodes can attach to the same storage after a failure. The trade-off is latency: reading from remote or disk storage is slower than in-memory access, so designs use caching, mmap, or layouts optimized for sequential reads (e.g. DiskANN).
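Caching is the main lever against the latency trade-off. A hedged sketch, assuming a slow `fetch_remote` callable standing in for an S3 or disk read, shows how an LRU cache keeps hot segments local:

```python
# Sketch: LRU segment cache in front of a slow remote read.
# fetch_remote is a placeholder for an object-storage or disk fetch.
from collections import OrderedDict

class SegmentCache:
    def __init__(self, fetch_remote, capacity=128):
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self._cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        data = self.fetch_remote(key)     # slow path: remote/disk read
        self._cache[key] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return data

cache = SegmentCache(fetch_remote=lambda k: f"bytes-of-{k}", capacity=2)
cache.get("a"); cache.get("b"); cache.get("a")  # "a" is now hot
cache.get("c")                                  # evicts cold "b"
assert cache.hits == 1 and cache.misses == 3
```

Under skewed query workloads, most reads hit a small set of hot segments, so even a modest cache recovers much of the in-memory latency.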
Typical cloud-native layout
Cloud-native VDBs often store segments or index files in object storage, with metadata in a distributed store. Compute tiers (e.g. Kubernetes pods) mount or fetch data on demand and can be auto-scaled based on CPU, memory, or QPS. This pattern aligns with how many managed vector DB services operate in the cloud.
Pipeline: writes land on durable storage; compute nodes mount or fetch segments on demand; queries run on compute. Trade-off: remote and disk reads add latency, so use caching, mmap, or disk-optimized layouts to keep query latency acceptable.
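The write/query pipeline above can be sketched end to end. This is an illustrative toy (brute-force dot-product scan, no WAL, compaction, or replication), not a production design:

```python
# Toy sketch of the write/query pipeline under compute-storage separation.
durable = {}  # stands in for object storage

def write_path(segment_id, vectors):
    """Writes land on durable storage first; compute stays stateless."""
    durable[segment_id] = vectors

def query_path(segment_ids, query_vec, local_cache=None):
    """Compute fetches needed segments on demand, then scans them."""
    if local_cache is None:
        local_cache = {}
    best_id, best_score = None, float("-inf")
    for seg_id in segment_ids:
        if seg_id not in local_cache:          # fetch on demand
            local_cache[seg_id] = durable[seg_id]
        for vec_id, vec in local_cache[seg_id].items():
            score = sum(a * b for a, b in zip(query_vec, vec))  # dot product
            if score > best_score:
                best_id, best_score = vec_id, score
    return best_id

write_path("seg-1", {"v1": [1.0, 0.0], "v2": [0.0, 1.0]})
assert query_path(["seg-1"], [0.9, 0.1]) == "v1"
```

Passing a persistent `local_cache` between queries is where the caching strategy from the previous section plugs in.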
Frequently Asked Questions
What does “separating compute and storage” mean for a VDB?
Vector indexes and data are stored on durable object or block storage (e.g. S3); query and indexing run on separate, stateless compute nodes that load or stream only what they need. You can scale query nodes without copying the full dataset into each node’s RAM.
What are the main benefits?
Elasticity: add query nodes on demand. Cost: storage is cheaper than RAM, and compute can use spot instances. Recovery: new nodes attach to the same storage after a failure. The trade-off is higher latency than in-memory; caching and disk-optimized layouts help.
How do compute nodes get the data?
They mount object/block storage or fetch segments or index files on demand. With mmap, the OS page-caches hot regions. Layouts like DiskANN favor large sequential reads so that disk latency stays manageable.
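A small sketch of the mmap approach, assuming an illustrative file layout of fixed-size float32 vectors (the filename and record format are hypothetical):

```python
# Sketch: memory-mapping an index file so the OS page cache keeps
# hot regions resident. Layout (packed float32 vectors) is illustrative.
import mmap
import os
import struct
import tempfile

DIM = 4
record = struct.Struct(f"{DIM}f")  # one float32 vector per record

# Write a tiny "index file" of two vectors.
path = os.path.join(tempfile.mkdtemp(), "segment.vec")
with open(path, "wb") as f:
    f.write(record.pack(1.0, 0.0, 0.0, 0.0))
    f.write(record.pack(0.0, 1.0, 0.0, 0.0))

def read_vector(mm, i):
    """Random access: only the touched pages are faulted in and cached."""
    off = i * record.size
    return record.unpack(mm[off:off + record.size])

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    v1 = read_vector(mm, 1)
    mm.close()

assert v1 == (0.0, 1.0, 0.0, 0.0)
```

The first access to a region pays a page fault; repeated accesses hit the OS page cache, which is how mmap-based designs keep hot vectors close to in-memory speed without loading the whole file.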
How does this fit with Kubernetes?
Kubernetes pods run the query/indexing compute; they mount or fetch from object storage and can be auto-scaled (HPA) based on CPU, memory, or QPS. Storage is external (e.g. S3, PVC), so pods can be stateless and replaced without moving data.
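For intuition, the HPA's documented scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`; here it is sketched in Python with QPS as the metric (the metric choice is illustrative; HPA works with any supported metric):

```python
# Kubernetes HPA scaling rule, sketched for intuition:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
import math

def desired_replicas(current_replicas, current_qps_per_pod, target_qps_per_pod):
    return math.ceil(current_replicas * current_qps_per_pod / target_qps_per_pod)

# 3 pods each seeing 200 QPS against a 100 QPS target -> scale out to 6.
assert desired_replicas(3, 200, 100) == 6
# Traffic drops to 40 QPS per pod -> scale in to 2.
assert desired_replicas(3, 40, 100) == 2
```

Because the pods are stateless, scaling out is just starting more replicas against the same external storage; no data rebalancing is required.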