Cost-optimization: Spot instances for vector indexing

Spot instances (AWS Spot, GCP Preemptible/Spot VMs, Azure Spot VMs) offer spare cloud capacity at a large discount in exchange for interruption risk: the provider can reclaim an instance on short notice. They are well-suited to batch vector indexing and other interruptible, non-latency-critical workloads, where they lower infrastructure spend and, in turn, cost per query.

Summary

Spot instances trade a large discount for interruption risk, which makes them a good fit for batch vector indexing and other interruptible, non-latency-critical workloads.
Typical uses are distributed index construction or full reindexing on spot pools, and batch embedding with vectors persisted to durable storage; some systems also run read replicas on spot, backed by replication and checkpointing. Design jobs to be resumable, spread the pool across multiple instance types or availability zones, and keep the critical path on on-demand instances.
With compute-storage separation, indexing compute can run on spot while the durable index lives in object storage, so preemption does not lose data. The pipeline is: run the job on spot, persist to durable storage, and on preemption retry or resume.

Use cases

(1) Index building: run distributed index construction or full reindexing on spot pools. If a node is preempted, the job can be retried, or the remaining nodes can continue and rebuild the missing shard.
(2) Offline embedding: batch-embed documents and write the vectors on spot. Persist results to durable storage so progress is not lost.
(3) Query tier (with care): some systems run read replicas on spot. When a spot node is lost, traffic is shifted to on-demand nodes or to other spot nodes. Replication and checkpointing help tolerate preemption.
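The retry behaviour described for index building can be sketched as follows. This is a minimal simulation under stated assumptions, not a real scheduler: `Preempted`, `build_shard`, and `build_index` are hypothetical names, preemption is simulated with a random draw, and sorting stands in for actual index construction.

```python
import random


class Preempted(Exception):
    """Raised when the (simulated) spot node building a shard is reclaimed."""


def build_shard(shard_id, vectors, rng, preempt_prob):
    # Simulate the provider reclaiming the node during the build.
    if rng.random() < preempt_prob:
        raise Preempted(f"shard {shard_id} lost")
    # Stand-in for real index construction (assumption): sort the vectors.
    return sorted(vectors)


def build_index(shards, max_attempts=5, preempt_prob=0.3, seed=0):
    """Build every shard, retrying only the shards whose node was preempted."""
    rng = random.Random(seed)
    built = {}
    pending = list(shards)
    for _ in range(max_attempts):
        still_pending = []
        for shard_id in pending:
            try:
                built[shard_id] = build_shard(
                    shard_id, shards[shard_id], rng, preempt_prob
                )
            except Preempted:
                # Shards that already finished are kept; only this one is redone.
                still_pending.append(shard_id)
        pending = still_pending
        if not pending:
            break
    if pending:
        raise RuntimeError(f"shards never completed: {pending}")
    return built
```

The point of the design is that a preemption costs one shard's work rather than the whole job. In a real deployment the completed shards would be written to durable storage instead of being held in memory.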

Best practices

Design jobs to be resumable (checkpoints, idempotent writes). Use multiple instance types or availability zones to reduce the chance that the whole pool is reclaimed at once. Mix spot with on-demand for the critical path (e.g. coordinators on on-demand, workers on spot). With compute-storage separation, indexing compute can be spot-based while the durable index lives in object storage, so preemption does not lose data. The resulting pipeline is: run the job on spot, persist to durable storage, and on preemption retry or resume from the last checkpoint.
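As an illustration of a resumable job with checkpoints and idempotent writes, here is a minimal sketch. It assumes a local directory stands in for durable object storage, and `embed`, `atomic_write_json`, and `run_job` are hypothetical names; the `stop_after_batches` parameter exists only to simulate a preemption.

```python
import json
import os
import tempfile


def embed(text):
    # Hypothetical stand-in for a real embedding model.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def atomic_write_json(path, obj):
    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(obj, f)
    os.replace(tmp, path)


def run_job(docs, out_dir, batch_size=2, stop_after_batches=None):
    """Embed docs in batches; return True when finished, False if interrupted."""
    os.makedirs(out_dir, exist_ok=True)
    ckpt_path = os.path.join(out_dir, "checkpoint.json")

    # Resume from the last checkpoint if one exists.
    next_batch = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            next_batch = json.load(f)["next_batch"]

    n_batches = (len(docs) + batch_size - 1) // batch_size
    done = 0
    for b in range(next_batch, n_batches):
        if stop_after_batches is not None and done >= stop_after_batches:
            return False  # simulated preemption
        batch = docs[b * batch_size:(b + 1) * batch_size]
        vectors = [embed(d) for d in batch]
        # Deterministic file name per batch: rewriting it after a retry
        # produces the same result, which makes the write idempotent.
        atomic_write_json(os.path.join(out_dir, f"batch-{b:05d}.json"), vectors)
        # Advance the checkpoint only after the batch output is persisted.
        atomic_write_json(ckpt_path, {"next_batch": b + 1})
        done += 1
    return True
```

Writing the batch output before advancing the checkpoint means an interruption between the two steps only causes that batch to be recomputed and overwritten, never skipped.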

Frequently Asked Questions

What are spot instances, and when should you use them for vector databases (VDBs)?

Spot instances (AWS Spot, GCP Preemptible, Azure Spot) offer spare capacity at a large discount; the provider can reclaim them on short notice. They are well-suited to batch index building, offline embedding, and other interruptible workloads where the goal is to lower cost.
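One rough way to reason about when the discount outweighs the interruption risk is to charge the spot job for the work it has to redo after preemptions. The sketch below is a simplified model, not a provider formula: the function names are hypothetical, `rework_fraction` is an assumed estimate of extra compute caused by interruptions, and any prices passed in are illustrative rather than real quotes.

```python
def spot_job_cost(base_hours, spot_price_per_hour, rework_fraction):
    """Cost of a job on spot when a fraction of the work must be repeated."""
    return base_hours * (1.0 + rework_fraction) * spot_price_per_hour


def on_demand_job_cost(base_hours, on_demand_price_per_hour):
    """Cost of the same job on on-demand instances, with no rework."""
    return base_hours * on_demand_price_per_hour


def break_even_rework_fraction(on_demand_price_per_hour, spot_price_per_hour):
    """Rework fraction at which spot and on-demand cost the same."""
    return on_demand_price_per_hour / spot_price_per_hour - 1.0
```

Under this model, spot stays cheaper as long as the observed rework fraction is below the break-even value. The model ignores the longer wall-clock time that rework adds, which is why it fits batch jobs better than latency-critical ones.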

Can I run index building on spot?

Yes. Run distributed index construction or full reindexing on spot pools; if a node is preempted, retry or let remaining nodes continue and rebuild the missing shard. Persist results to durable storage so progress is not lost.

What are best practices for spot?

Design jobs to be resumable (checkpoints, idempotent writes); use multiple instance types or availability zones; mix spot with on-demand for the critical path (e.g. coordinators on on-demand, workers on spot). Pair this with compute-storage separation, keeping the durable index in object storage, so preemption does not lose data.

Can I run read replicas on spot?

Some systems run read replicas on spot; when a spot node is lost, traffic shifts to on-demand nodes or to other spot nodes. Replication and checkpointing help tolerate preemption. Use with care for latency-critical query tiers.
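The traffic shift can be pictured with a small routing sketch. It assumes an in-process round-robin router and a health flag set by some external preemption signal; `Replica`, `Router`, `mark_preempted`, and `pick` are hypothetical names, not the API of any particular vector database.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    is_spot: bool
    healthy: bool = True


class Router:
    """Round-robin over healthy replicas, skipping preempted spot nodes."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._counter = itertools.count()

    def mark_preempted(self, name):
        # Called when a preemption notice or failed health check is observed.
        for replica in self.replicas:
            if replica.name == name:
                replica.healthy = False

    def pick(self):
        healthy = [r for r in self.replicas if r.healthy]
        if not healthy:
            raise RuntimeError("no healthy replicas")
        return healthy[next(self._counter) % len(healthy)]


router = Router([
    Replica("spot-a", is_spot=True),
    Replica("spot-b", is_spot=True),
    Replica("od-a", is_spot=False),
])
router.mark_preempted("spot-a")
# Subsequent picks rotate between spot-b and od-a only.
```

Keeping at least one on-demand replica in the pool means queries still have somewhere to go if every spot node is reclaimed at the same time.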