Indexing Algorithms - HNSW · Topic 77

Deleting a node in an HNSW graph (the “Orphan” problem)

In HNSW, nodes are connected by bi-directional edges. Deleting a node by simply removing it would leave other nodes pointing to it—those references become orphans (dangling pointers). Traversal that follows such a link can fail or return invalid results, and the graph can become disconnected.

Summary

  • Strategies: (1) Soft delete—mark deleted, skip at query time; reclaim with garbage collection or rebuild. (2) Remove and repair—update every neighbor’s list (expensive). (3) Rebuild—periodic full rebuild. Many systems use soft deletes + background compaction.
  • Deletion is one of the main limitations of graph-based indexes.
  • Pipeline: delete request → soft-delete flag set (or hard delete + repair of neighbor lists); at query time skip deleted nodes when expanding.
  • Trade-off: soft delete avoids orphans and is fast but delays space reclaim; repair is exact but O(M × layers × log n) per delete.
  • Related concerns: how vector DBs handle deletes (soft vs. hard), and how a new entry point is chosen when the entry point itself is deleted.

Why orphans occur

Each node stores up to M neighbor IDs per layer, and each of those neighbors stores a link back. If you remove a node and free its memory, any neighbor that still holds that node’s ID has a dangling pointer. Following that link during search leads to invalid memory or a missing node. So deletion cannot be “remove and forget”—you must either keep the node (soft delete) or fix all neighbor lists that point to it (repair).
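The failure mode is easy to demonstrate on a toy adjacency structure (a minimal single-layer sketch; real HNSW stores up to M neighbor IDs per node per layer):

```python
# Toy single-layer graph: node ID -> list of neighbor IDs (bi-directional links).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
}

del graph["b"]  # naive "remove and forget"

# "a" and "c" still hold "b" in their neighbor lists: dangling links.
dangling = [(node, n) for node, nbrs in graph.items() for n in nbrs if n not in graph]
print(dangling)  # every pair here points at the removed node
```

Any traversal that follows one of these pairs would look up a node that no longer exists.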

Strategies

Three common strategies:

  • Soft delete — mark the node as deleted (e.g. a bit in a bitmap or a per-node flag) but leave it in the graph; at query time, skip deleted nodes when expanding neighbors. No orphans, but memory is not reclaimed until garbage collection or a rebuild.
  • Remove and repair — delete the node and update every neighbor’s list to remove the link. This is expensive when the node appears on many lists: the deleted node can appear in up to M × (number of layers) neighbor lists, so the cost is O(M × layers × log n) per delete.
  • Rebuild — periodically rebuild the index from the current set of vectors, so deleted data is gone and the graph is clean.
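The first two strategies can be sketched on a dict-of-lists graph (a hypothetical structure; a real index also stores per-layer lists and the vectors themselves):

```python
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
deleted = set()  # stand-in for a soft-delete bitmap

def soft_delete(node_id):
    """Strategy 1: flag the node; all links stay intact, so no orphans."""
    deleted.add(node_id)

def hard_delete_and_repair(node_id):
    """Strategy 2: remove the node and scrub its ID from every neighbor list."""
    for nbr in graph.pop(node_id, []):
        if nbr in graph:
            graph[nbr] = [n for n in graph[nbr] if n != node_id]

soft_delete("b")             # "b" stays in the graph; queries must skip it
hard_delete_and_repair("c")  # "c" is gone and no neighbor list references it
```

Note the asymmetry: `soft_delete` is O(1), while `hard_delete_and_repair` touches every neighbor list the node appears on.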

Many production vector DBs use soft deletes for simplicity and performance, with background compaction or full rebuild when the fraction of deleted nodes grows too large. Batch deletes: soft-delete many, then run one compaction or rebuild to avoid repeated repair cost.
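Compaction itself can be as simple as rebuilding from the live vectors and swapping the result in. A sketch, where `build_index` is a hypothetical construction function (not a specific library API):

```python
def compact(vectors, deleted, build_index):
    """Drop soft-deleted vectors and build a fresh, orphan-free graph."""
    live = {vid: vec for vid, vec in vectors.items() if vid not in deleted}
    return build_index(live), set()  # new graph + reset deleted set

# Usage with a stub builder that just creates empty neighbor lists:
new_graph, new_deleted = compact(
    {1: [0.1], 2: [0.9]}, deleted={2},
    build_index=lambda live: dict.fromkeys(live, []),
)
```

The swap-in step matters operationally: the old graph keeps serving queries until the new one is ready, so deletes never block reads.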

Pipeline: delete and query

Delete pipeline: (1) receive delete by ID; (2) either set soft-delete flag or remove node and repair all neighbor lists; (3) optionally trigger compaction. Query pipeline: when expanding neighbors, skip any node that is marked deleted (or follow only non-deleted links if repair was used). So query logic must be aware of the delete strategy.
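The query-side rule can be sketched as follows, assuming a soft-delete set: traverse *through* deleted nodes (so the graph stays connected) but never return them. For brevity this explores all reachable nodes instead of doing HNSW’s layered greedy search:

```python
import heapq

def knn_skip_deleted(graph, deleted, dist, entry, k):
    visited, stack, found = {entry}, [entry], []
    while stack:
        node = stack.pop()
        if node not in deleted:          # flagged nodes never enter results...
            found.append(node)
        for nbr in graph.get(node, []):  # ...but are still expanded, keeping
            if nbr not in visited:       # the rest of the graph reachable
                visited.add(nbr)
                stack.append(nbr)
    return heapq.nsmallest(k, found, key=dist)

# Toy 1-D vectors on a chain 1-2-3-4; node 2 is soft-deleted.
# Hard-deleting 2 without repair would cut 3 and 4 off from the entry point.
vectors = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = knn_skip_deleted(graph, {2}, dist=lambda n: abs(vectors[n] - 1.1),
                          entry=1, k=2)
```

The chain layout shows why deleted nodes are traversed rather than avoided: node 2 is the only bridge to nodes 3 and 4.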

Trade-offs and practical tips

Trade-off: soft delete is fast and avoids orphans but delays space reclaim and can slightly increase query cost (checking flags). Repair is exact and reclaims space immediately but is expensive per delete. Practical tip: use soft deletes and schedule compaction or rebuild when delete fraction exceeds a threshold; see how VDBs handle deletes for broader context.
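The threshold rule fits in a few lines (the 20% default below is an illustrative assumption; real systems tune this per workload):

```python
def should_compact(num_total, num_deleted, max_deleted_fraction=0.2):
    """Trigger compaction/rebuild once the soft-deleted fraction grows too large."""
    return num_total > 0 and num_deleted / num_total > max_deleted_fraction

should_compact(1_000, 50)   # 5% deleted  -> keep serving as-is
should_compact(1_000, 300)  # 30% deleted -> schedule compaction
```

A background job would poll this check and run the compaction off the query path.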

Frequently Asked Questions

Why not just null out the node?

Neighbors would still hold the node’s ID; following that link would point to invalid or freed memory. You must either keep the node (soft delete) or fix all neighbor lists.

How do I reclaim space after soft deletes?

Run a full rebuild, or a compaction that builds a new graph from non-deleted vectors and swaps it in. See how VDBs handle deletes.

Does repair scale to large M?

Removing one node requires updating up to M × (number of layers) neighbor lists, since each of those neighbors holds the deleted node’s ID in its list. Cost is O(M × layers × log n) per delete.

Can I batch deletes?

Yes. Soft-delete many nodes, then run a single compaction or rebuild; this avoids paying the repair cost once per delete.