Garbage collection of deleted vectors
When you delete a vector, the vector database must stop returning it in search results and eventually reclaim space. Soft deletes mark the point as deleted (e.g. in a bitmap); garbage collection (GC) is the process of physically removing those points from the index and storage so that memory and disk don’t grow forever. This topic covers why GC is needed and how to tune it.
Summary
- GC physically removes soft-deleted points from the index and storage so memory and disk don’t grow forever. After a soft delete (e.g. marking the ID in a bitmap), a background merge/compaction builds a new segment from live vectors only, completing the hard delete.
- In HNSW, removing a node in place leaves orphaned links, so systems soft-delete first and filter at query time, then GC via merge. Until GC runs, deleted vectors still consume space; queries exclude them via the delete bitmap.
- GC policy is a trade-off: frequent (aggressive) compaction reclaims space quickly but costs CPU and I/O; lazy compaction batches reclamation and leaves space occupied longer. Many systems expose a “compact” or “optimize” API so you can trigger GC during low traffic.
- GC matters most for delete-heavy workloads (e.g. queues, time-windowed data); it is less critical for append-only data.
- Practical tip: for delete-heavy workloads, enable and tune GC; run compact/optimize during off-peak hours; monitor delete ratio and segment count.
Why GC is needed after soft deletes
In graph-based indexes like HNSW, deleting a node is tricky: you can’t simply remove it without leaving orphaned links. So many systems do soft delete first (mark as deleted, filter at query time via a bitmap) and then run a background merge or compaction that builds a new segment from live vectors only, effectively doing a hard delete. Until that runs, deleted vectors still consume space and may still be present in the in-memory graph; queries exclude them using the delete bitmap.
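The soft-delete-then-filter pattern can be sketched as follows. This is a minimal illustration, not any particular database’s API: the `Segment` class, and a brute-force scan standing in for graph traversal, are assumptions for clarity.

```python
# Sketch of soft delete + query-time filtering via a delete "bitmap"
# (modeled here as a set of IDs). The graph structure is untouched on
# delete; only the search results change.

class Segment:
    def __init__(self, vectors):
        self.vectors = dict(vectors)   # id -> vector (HNSW graph omitted)
        self.deleted = set()           # delete bitmap, as a set of IDs

    def soft_delete(self, point_id):
        # Mark only; the graph node and its links stay intact, so no
        # orphaned links are created.
        self.deleted.add(point_id)

    def search(self, query, k, distance):
        # In a real index, candidates come from graph traversal; here we
        # scan all vectors for simplicity.
        candidates = sorted(self.vectors,
                            key=lambda i: distance(query, self.vectors[i]))
        # Filter at query time: soft-deleted points never reach the caller.
        return [i for i in candidates if i not in self.deleted][:k]
```

For example, after `soft_delete(2)`, a nearest-neighbor query that would have returned point 2 skips it even though its vector and graph node still exist in memory.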
Pipeline: soft delete marks ID in bitmap → queries filter out deleted IDs → background GC/compaction selects segments with high delete ratio → builds new segment with only live vectors → drops old segment and reclaims space.
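The compaction step of that pipeline can be sketched like this. The function name, the dict-based segment layout, and the 20% delete-ratio threshold are all illustrative assumptions, not vendor defaults.

```python
# Sketch of segment compaction: if the delete ratio is high enough,
# rebuild the segment from live vectors only, producing a fresh segment
# with an empty delete set (the hard delete).

def compact(segment, max_delete_ratio=0.2):
    """Rebuild the segment when its delete ratio exceeds the threshold."""
    total = len(segment["vectors"])
    ratio = len(segment["deleted"]) / total if total else 0.0
    if ratio <= max_delete_ratio:
        return segment                 # not worth the rebuild cost yet
    live = {i: v for i, v in segment["vectors"].items()
            if i not in segment["deleted"]}
    # In a real index, live vectors are re-inserted into a fresh graph,
    # then the old segment is dropped and its space reclaimed.
    return {"vectors": live, "deleted": set()}
```

The threshold check is why GC tends to pick segments with a high delete ratio first: rebuilding a segment where almost everything is live reclaims little space for the same CPU and I/O cost.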
GC policy and when to run it
GC policy trades off space vs. write amplification: frequent compaction reclaims space quickly but costs CPU and I/O; lazy compaction keeps deletes “logical” longer and batches reclamation. Some VDBs expose a “compact” or “optimize” API so you can trigger GC during low traffic.
For append-heavy, delete-rare workloads, GC is less critical; for queues or time-windowed data with many deletes, tuning compaction and GC is important to avoid memory and disk bloat. Practical tip: schedule GC during low QPS or use a compact API after bulk deletes; monitor segment count and delete bitmap size.
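A trigger policy based on the two signals mentioned above (delete ratio and segment count) might look like the following sketch; the thresholds are made-up examples, and real systems expose their own knobs.

```python
# Illustrative GC trigger: fire when the delete ratio or the segment
# count crosses a threshold. Thresholds here are arbitrary examples.

def should_trigger_gc(live_count, deleted_count, segment_count,
                      max_delete_ratio=0.2, max_segments=32):
    total = live_count + deleted_count
    delete_ratio = deleted_count / total if total else 0.0
    return delete_ratio > max_delete_ratio or segment_count > max_segments
```

In practice you would evaluate such a check on a schedule (or after bulk deletes) and call the database’s compact/optimize API when it returns true, preferably during low QPS.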
Frequently Asked Questions
When does GC run?
Typically in the background: when segment count or delete ratio exceeds a threshold, or on a schedule. Some VDBs also let you call “compact” or “optimize” to trigger it on demand. GC is often the same process as compaction, or a closely related one, since deleted points are dropped during segment merge.
Does GC block queries?
Usually no. GC produces new segments without deleted points; once ready, visibility switches so queries see the compacted view. Old segments are dropped when no longer referenced. Heavy GC can compete for I/O and CPU and raise tail latency.
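The non-blocking visibility switch can be sketched as an atomic reference swap: the compacted segment is built off to the side, and queries that already took a snapshot keep reading the old segment list. This is a simplified illustration (class and method names are assumptions); real systems also reference-count old segments before dropping them.

```python
# Sketch of the visibility switch: queries read an immutable snapshot of
# the segment list; GC swaps in the compacted segment under a short lock.

import threading

class SegmentView:
    def __init__(self, segments):
        self._segments = tuple(segments)   # immutable snapshot
        self._lock = threading.Lock()

    def snapshot(self):
        # Queries call this; it never waits on a running compaction.
        return self._segments

    def swap(self, old, new):
        # Replace `old` with `new` in one short critical section, so a
        # query sees either the old or the new segment, never both.
        with self._lock:
            self._segments = tuple(new if s is old else s
                                   for s in self._segments)
```

An in-flight query holding the old snapshot still sees the pre-compaction segment, which is why the old segment can only be dropped once no query references it.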
What if I never run GC?
Soft-deleted vectors keep consuming space; over time memory and disk grow and query cost can increase (more segments, larger delete bitmap). For delete-heavy workloads, enable and tune GC; for append-only, it’s less critical.
Is GC the same as compaction?
Often overlapping. Compaction merges segments and can drop obsolete/deleted entries; that reclaims space (GC). Some systems use “compaction” for segment merge and “GC” specifically for reclaiming deleted vector space—same idea, different naming.