Immutable segments in VDB storage
Immutable segments are read-only chunks of data (vectors, metadata, or index pieces) that are never modified after they are written; the store as a whole grows append-only. New inserts and updates go into a new segment, or into a mutable “tail” that is later sealed into a segment. This simplifies concurrency, because readers never block writers, and makes snapshots and compaction easier. This topic covers the segment lifecycle, query merge and compaction, and the use of segments in vector indexes.
Summary
- Immutable segments are read-only chunks (vectors, metadata, or index pieces) that are never modified once written; new inserts and updates go to a new segment or to a mutable “tail” that is later sealed.
- A query merges results from all visible segments (and the tail); background compaction merges segments and reclaims space from deletes. This is the same pattern as LSM-trees.
- Immutability simplifies concurrency (readers don’t block writers), snapshots, and crash recovery. Because nothing is updated in place, segments can be cached or paged in from disk safely, which also suits tiered storage.
- Trade-off: simple write path and snapshots vs. read merge cost when segment count is high; compaction keeps segment count bounded.
- Practical tip: tune segment size and compaction so that typical queries don’t merge too many segments; use tiered storage for cold segments.
Query and compaction
At query time, the engine merges results from multiple segments (and possibly a mutable buffer). Over time, background compaction merges small or old segments into larger ones, reclaiming space from deletes and reducing the number of segments to scan. This pattern is familiar from LSM-trees and log-structured storage: the hot write path is append-only, and merging happens asynchronously.
Pipeline: new writes → mutable tail; when the tail is full or a time threshold elapses, it is sealed into a new immutable segment; a query searches all visible segments (and the tail) and merges/deduplicates the results; compaction merges segments and drops obsolete data. Trade-off: fewer segments mean lower query merge cost, but each segment is larger and compaction has to rewrite more data to get there. The compaction policy (when and what to merge) therefore affects both write and read amplification.
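The pipeline above can be sketched in a few dozen lines. This is a minimal illustration, not any particular engine's implementation: the names `SegmentStore`, `Segment`, and `tail_limit` are hypothetical, the search is brute force rather than an ANN index, and compaction merges every sealed segment at once (which is what makes it safe to drop tombstones here).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass(frozen=True)
class Segment:
    """Sealed, read-only batch of rows. A None vector is a tombstone."""
    seq: int  # higher seq = newer segment
    rows: Tuple[Tuple[str, Optional[Vector]], ...]


@dataclass
class SegmentStore:
    tail_limit: int = 2  # hypothetical seal threshold (rows in the tail)
    tail: Dict[str, Optional[Vector]] = field(default_factory=dict)
    segments: List[Segment] = field(default_factory=list)
    next_seq: int = 0

    def upsert(self, key: str, vec: Vector) -> None:
        self._write(key, vec)

    def delete(self, key: str) -> None:
        self._write(key, None)  # soft delete: write a tombstone

    def _write(self, key: str, vec: Optional[Vector]) -> None:
        self.tail[key] = vec
        if len(self.tail) >= self.tail_limit:
            self.seal()

    def seal(self) -> None:
        """Freeze the tail into a new immutable segment and start a new tail."""
        if self.tail:
            rows = tuple(self.tail.items())
            self.segments.append(Segment(self.next_seq, rows))
            self.next_seq += 1
            self.tail = {}

    def _latest(self) -> Dict[str, Optional[Vector]]:
        """Newest version per id: replay segments oldest to newest, then the tail."""
        latest: Dict[str, Optional[Vector]] = {}
        for seg in sorted(self.segments, key=lambda s: s.seq):
            latest.update(seg.rows)
        latest.update(self.tail)
        return latest

    def search(self, query: Vector, k: int) -> List[str]:
        """Brute-force top-k over the deduplicated, non-deleted rows."""
        live = {key: v for key, v in self._latest().items() if v is not None}
        return sorted(live, key=lambda key: sq_dist(query, live[key]))[:k]

    def compact(self) -> None:
        """Merge all sealed segments into one, dropping stale rows and tombstones."""
        merged: Dict[str, Optional[Vector]] = {}
        for seg in sorted(self.segments, key=lambda s: s.seq):
            merged.update(seg.rows)
        rows = tuple((key, v) for key, v in merged.items() if v is not None)
        self.segments = [Segment(self.next_seq, rows)] if rows else []
        self.next_seq += 1


store = SegmentStore(tail_limit=2)
store.upsert("a", (0.0, 0.0))
store.upsert("b", (1.0, 1.0))   # tail full: sealed as segment 0
store.upsert("a", (5.0, 5.0))   # newer version of "a"
store.delete("b")               # tail full again: sealed as segment 1
store.upsert("c", (0.1, 0.0))   # stays in the tail
print(store.search((0.0, 0.0), k=2))  # ['c', 'a']
store.compact()
print(len(store.segments))            # 1
```

Old segments are never edited: the update to "a" and the delete of "b" are both new rows in a later segment, and the query resolves them by preferring the newest version. Compaction is the only step that physically removes the stale copies.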
Use in vector indexes
For vector indexes, segments might hold a slice of the graph (e.g. in disk-based layouts) or a batch of vectors for one collection. Immutability also helps with loading from disk and tiered storage: segments can be cached or paged in without worrying about in-place updates.
Practical tip: align segment boundaries with collection or shard boundaries when possible; use compaction to merge small segments and reclaim space from soft deletes so that query merge cost stays low.
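As one way to make this tip concrete, a compaction policy can be as simple as a function that looks at per-segment statistics and returns the segments worth rewriting. The sketch below is an assumption-laden illustration: `SegmentStats`, `pick_for_compaction`, and the thresholds (`small_rows`, `max_deleted_ratio`, `min_small`) are hypothetical names and values, not settings of any specific engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    name: str
    rows: int     # total rows, including soft-deleted ones
    deleted: int  # rows currently marked as deleted


def pick_for_compaction(
    stats: List[SegmentStats],
    small_rows: int = 1_000,         # hypothetical: below this a segment is "small"
    max_deleted_ratio: float = 0.2,  # hypothetical: above this a segment is "dirty"
    min_small: int = 2,              # merging one small segment alone gains nothing
) -> List[str]:
    """Return names of segments to rewrite, in their original order."""
    dirty = {s.name for s in stats
             if s.rows and s.deleted / s.rows >= max_deleted_ratio}
    small = {s.name for s in stats if s.rows < small_rows}
    chosen = set(dirty)
    if len(small) >= min_small:
        chosen |= small
    return [s.name for s in stats if s.name in chosen]


stats = [
    SegmentStats("s0", rows=50_000, deleted=1_000),  # large and mostly live
    SegmentStats("s1", rows=300, deleted=0),         # small
    SegmentStats("s2", rows=450, deleted=10),        # small
    SegmentStats("s3", rows=20_000, deleted=9_000),  # 45% soft-deleted
]
print(pick_for_compaction(stats))  # ['s1', 's2', 's3']
```

Lowering the thresholds compacts more eagerly, which reduces query merge cost at the price of more rewriting in the background.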
Frequently Asked Questions
What is the “mutable tail”?
The in-memory or writable buffer that receives new writes. When it reaches a size or time threshold, it is sealed into an immutable segment and a new tail is started.
How many segments does a query see?
All visible segments plus the current tail. The engine searches each (or merges candidate lists) and then merges/deduplicates results. Fewer segments mean less merge cost.
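When each segment returns its own sorted candidate list, the final step is a k-way merge. A minimal sketch using the standard library's `heapq.merge` follows; `merge_top_k` is a hypothetical helper, and it assumes stale versions of an id have already been filtered out per segment, so any remaining duplicate is the same row seen twice.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (distance, id); each input list is sorted ascending


def merge_top_k(per_segment_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Merge sorted per-segment candidate lists into one global top-k.

    Duplicated ids keep their first (closest) occurrence.
    """
    out: List[Hit] = []
    seen = set()
    for dist, key in heapq.merge(*per_segment_hits):
        if key in seen:
            continue
        seen.add(key)
        out.append((dist, key))
        if len(out) == k:
            break
    return out


hits = [
    [(0.1, "a"), (0.4, "b")],              # segment 0
    [(0.2, "c"), (0.4, "b"), (0.9, "d")],  # segment 1 (also holds "b")
    [(0.3, "e")],                          # mutable tail
]
print(merge_top_k(hits, k=4))
# [(0.1, 'a'), (0.2, 'c'), (0.3, 'e'), (0.4, 'b')]
```

The merge is lazy, so it stops as soon as k results are collected, but every additional segment still adds one more list to search and one more input to the merge.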
Can a segment be deleted?
Only when it’s obsolete (e.g. after compaction merged it and no reader references it). Until then, segments are immutable and may be shared by snapshots.
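One common way to implement "obsolete and unreferenced" is reference counting: readers and snapshots pin the segments they use, and a segment is physically removed only once it is both marked obsolete and unpinned. The sketch below is single-threaded and uses hypothetical names (`SegmentRegistry`, `acquire`, `release`, `mark_obsolete`); a real engine would guard these counters with a lock or atomics and delete actual files.

```python
from typing import Dict, Iterable, List, Set


class SegmentRegistry:
    def __init__(self) -> None:
        self._pins: Dict[str, int] = {}  # segment name -> number of holders
        self._obsolete: Set[str] = set()
        self.deleted: List[str] = []     # stand-in for real file removal

    def add(self, name: str) -> None:
        self._pins[name] = 0

    def acquire(self, names: Iterable[str]) -> List[str]:
        """Pin segments for a reader or snapshot; returns the pinned names."""
        pinned = list(names)
        for name in pinned:
            self._pins[name] += 1
        return pinned

    def release(self, names: Iterable[str]) -> None:
        for name in names:
            self._pins[name] -= 1
            self._maybe_delete(name)

    def mark_obsolete(self, name: str) -> None:
        """Called after compaction has written a replacement segment."""
        self._obsolete.add(name)
        self._maybe_delete(name)

    def _maybe_delete(self, name: str) -> None:
        if name in self._obsolete and self._pins.get(name) == 0:
            del self._pins[name]
            self._obsolete.discard(name)
            self.deleted.append(name)


reg = SegmentRegistry()
reg.add("s0")
reg.add("s1")
snapshot = reg.acquire(["s0", "s1"])  # a long-running reader
reg.add("s2")                         # compaction output replacing s0 and s1
reg.mark_obsolete("s0")
reg.mark_obsolete("s1")
print(reg.deleted)                    # [] because the snapshot still pins them
reg.release(snapshot)
print(reg.deleted)                    # ['s0', 's1']
```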
Why not update segments in place?
In-place updates require locking and complicate concurrent reads, snapshots, and crash recovery. Append-only writes plus background merging keep the write path simple and enable efficient snapshots.