Snapshotting and Backup Strategies

Snapshotting creates a point-in-time, consistent copy of the vector database state (vectors, indexes, and metadata) so you can restore to that state or clone it for testing. Backup strategies combine snapshots with an archived write-ahead log (WAL), and optionally an off-site copy, to meet recovery point and recovery time objectives (RPO/RTO) and disaster recovery goals. This topic covers implementation, distributed snapshots, and restore.

Summary

Snapshot: point-in-time, consistent copy of vectors, indexes, and metadata for restore or clone. Backup: snapshots + WAL (and off-site copy) for RPO/RTO and disaster recovery.
Implementation: a snapshot records the set of immutable segments and index files (e.g. after a WAL flush), then copies or references them; copy-on-write or filesystem/volume snapshots (LVM, EBS) make this efficient. Backups are full or incremental; an archived WAL allows recovery to a timestamp.
Distributed: snapshots are per-shard or coordinated; the consistency level defines what “point-in-time” means. Restore: load the snapshot, then replay the WAL to the desired point. Both are essential for durability and compliance.
Trade-off: frequent snapshots improve RPO but increase storage and possibly write impact; WAL archive enables point-in-time recovery between snapshots.
Practical tip: schedule snapshots and WAL archive to meet RPO; test restore regularly; for tiered storage, include all tiers in backup.

How snapshots and backups are implemented

A snapshot is often implemented by recording the set of immutable segments and index files at a moment in time (e.g. after flushing the WAL), then copying or referencing those files. Because the files are never modified in place, the recorded set stays valid while new writes go to new segments. Copy-on-write or filesystem/volume snapshots (e.g. LVM, EBS) can make this efficient.
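As an illustration of "recording the set of immutable files and referencing them", here is a minimal sketch that hard-links segment and index files into a snapshot directory and writes a manifest. The file suffixes (.seg, .idx), the MANIFEST.json name, and the wal_lsn field are assumptions made for this example and do not reflect any particular product's layout. Hard links also require the snapshot directory to be on the same filesystem as the data directory.

```python
import json
import os
from pathlib import Path


def create_snapshot(data_dir: Path, snapshot_dir: Path, wal_lsn: int) -> dict:
    """Reference all immutable segment/index files and record them in a manifest.

    Assumes the caller has already flushed the WAL up to `wal_lsn`
    (hypothetical field) and that segment files are never modified in place.
    """
    snapshot_dir.mkdir(parents=True, exist_ok=False)

    # Record the file set at this moment; the WAL itself is not part of the snapshot.
    files = sorted(p.name for p in data_dir.iterdir() if p.suffix in {".seg", ".idx"})

    for name in files:
        # A hard link shares the underlying data, so no bytes are copied.
        os.link(data_dir / name, snapshot_dir / name)

    manifest = {"wal_lsn": wal_lsn, "files": files}
    (snapshot_dir / "MANIFEST.json").write_text(json.dumps(manifest))
    return manifest
```

If a later compaction deletes a segment from the data directory, the snapshot's link keeps the data readable, which is the property that lets writes continue while the snapshot exists.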

Backups may be full (all data) or incremental (only changes since the last backup), with the WAL archived so you can recover to a specific timestamp. Backup pipeline: flush the WAL → record the segment set and index files → create the snapshot (copy or copy-on-write) → optionally copy it off-site. Restore pipeline: load the snapshot → replay the WAL up to the target time.
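To make the full versus incremental distinction concrete, the sketch below plans an incremental backup by comparing content hashes against the previous backup's manifest. The function names and the manifest shape (file name mapped to SHA-256 digest) are assumptions for illustration; real systems may track changes by segment ID or sequence number instead.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content hash used to detect new or changed files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_incremental(data_dir: Path, previous: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return (current manifest, files to copy).

    `previous` maps file name -> digest from the last backup.
    Passing an empty dict yields a full backup plan.
    """
    current = {p.name: file_digest(p) for p in sorted(data_dir.iterdir()) if p.is_file()}
    to_copy = [name for name, digest in current.items() if previous.get(name) != digest]
    return current, to_copy
```

With immutable segments, an incremental backup mostly consists of segments created since the last run, so the plan tends to stay small between compactions.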

Distributed snapshots and restore

For distributed vector databases (VDBs), snapshots may be per-shard or coordinated across nodes; the consistency level used during the snapshot determines what “point-in-time” means. Restore typically involves loading the snapshot and replaying the WAL up to the desired point. Good snapshot and backup design is essential for durability and compliance.

Trade-off: coordinated cross-shard snapshots give a global point-in-time but are harder to implement; per-shard snapshots are simpler but capture each shard at a slightly different moment, so a batch of writes spanning shards may appear in one shard's snapshot and not in another's. Practical tip: align backup strategy with RPO/RTO; for tiered VDBs, ensure all tiers are included in backup and restore procedures.
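The difference between coordinated and per-shard snapshots can be shown with a toy model. This sketch assumes every write carries a globally comparable commit timestamp, which is a simplification; the Shard class and both functions are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    name: str
    # Each entry is (commit_ts, vector_id); commit_ts is assumed globally comparable.
    log: list[tuple[int, str]] = field(default_factory=list)

    def snapshot_at(self, cut_ts: int) -> set[str]:
        """Vector IDs committed on this shard at or before cut_ts."""
        return {vid for ts, vid in self.log if ts <= cut_ts}


def coordinated_snapshot(shards: list[Shard], cut_ts: int) -> dict[str, set[str]]:
    """Every shard uses the same cut, giving one global point in time."""
    return {s.name: s.snapshot_at(cut_ts) for s in shards}


def per_shard_snapshot(shards: list[Shard], local_cuts: dict[str, int]) -> dict[str, set[str]]:
    """Each shard snapshots at its own moment; cuts may differ between shards."""
    return {s.name: s.snapshot_at(local_cuts[s.name]) for s in shards}
```

If a batch commits on two shards at the same timestamp and the per-shard cuts fall on either side of it, the restored state contains only half of the batch. A shared cut avoids this at the cost of cross-node coordination.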

Frequently Asked Questions

What is the difference between snapshot and backup?

A snapshot is a point-in-time copy of the DB state, often local or in the same region. A backup usually implies a durable, restorable copy, often combining snapshots with archived WAL and optionally off-site storage for disaster recovery.

Can I restore to any point in time?

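If the WAL is archived, you can restore the most recent snapshot taken at or before the chosen timestamp and replay the WAL up to that timestamp (point-in-time recovery). Without a WAL archive, you can only restore to a snapshot time. The achievable RPO (recovery point objective) depends on how often snapshots are taken and how promptly the WAL is archived; retention determines how far back you can restore.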
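Point-in-time recovery can be sketched as "pick the newest snapshot at or before the target, then replay WAL entries in between". The in-memory data shapes below (snapshots keyed by timestamp, WAL entries as tuples sorted by timestamp, "upsert" and "delete" operations) are assumptions chosen to keep the example small.

```python
from bisect import bisect_right
from typing import Optional

Vector = list[float]
# (timestamp, op, vector_id, vector); op is "upsert" or "delete" in this sketch.
WalEntry = tuple[int, str, str, Optional[Vector]]


def pick_snapshot(snapshot_times: list[int], target: int) -> int:
    """Newest snapshot time that is <= target; snapshot_times must be sorted."""
    i = bisect_right(snapshot_times, target)
    if i == 0:
        raise ValueError("no snapshot at or before the target time")
    return snapshot_times[i - 1]


def restore(snapshots: dict[int, dict[str, Vector]], wal: list[WalEntry], target: int) -> dict[str, Vector]:
    """Load the base snapshot, then replay WAL entries in (base_ts, target]."""
    base_ts = pick_snapshot(sorted(snapshots), target)
    state = dict(snapshots[base_ts])
    for ts, op, vid, vec in wal:  # assumes wal is sorted by timestamp
        if ts <= base_ts:
            continue  # already reflected in the snapshot
        if ts > target:
            break  # stop at the requested point in time
        if op == "upsert":
            state[vid] = vec
        elif op == "delete":
            state.pop(vid, None)
    return state
```

The ValueError case corresponds to asking for a time older than the retained snapshots, which is why retention bounds how far back a restore can go.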

Do snapshots block writes?

It depends. Copy-on-write or filesystem snapshots often allow writes during the snapshot (the snapshot sees the state from before any copy-on-write changes). Some VDBs briefly quiesce or flush before recording the segment set. Check your vendor’s behavior for write impact.

How do I back up a tiered or multi-tier VDB?

All tiers (fast local storage plus object storage such as S3) must be covered by the same snapshot and backup procedure so that a restore is consistent. Segment references and WAL positions across tiers need to be coordinated; see your vendor’s tiered storage and backup docs.
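One simple check that a tiered backup is complete is to verify that every segment named in the snapshot manifest exists in at least one backed-up tier. The sketch below assumes a manifest that is a list of segment names and a per-tier listing of stored segments; both shapes and the tier names are hypothetical.

```python
def missing_segments(manifest: list[str], tiers: dict[str, set[str]]) -> list[str]:
    """Segments referenced by the manifest but absent from every tier."""
    present = set().union(*tiers.values())
    return sorted(set(manifest) - present)
```

For example, `missing_segments(["seg-1", "seg-2"], {"local": {"seg-1"}, "object_store": set()})` reports `seg-2` as missing, which would indicate that the object-storage tier was left out of the backup.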