Database Internals & Storage · Topic 122

Compression algorithms for metadata (Zstd, Snappy)

Vectors themselves are often stored in a compact form (e.g. quantization), but metadata and payloads—JSON, strings, categorical fields—can dominate storage. Compression (e.g. Zstd, Snappy) reduces disk and memory use and can lower I/O and cost. This topic compares codecs and when to use each.

Summary

  • Metadata and payloads can dominate storage; compression (Zstd, Snappy, LZ4) reduces disk, memory, and I/O.
  • Snappy: fast compress/decompress with a moderate ratio—good for hot reads and low CPU. Zstd: tunable—better ratio than Snappy at default levels; higher levels trade more CPU for smaller blobs. Either way, compression is applied per segment or per block so filtering decodes only the needed chunks.
  • Hot path (e.g. filter on every query) → Snappy or LZ4. Cold or tiered storage → Zstd at higher levels. Some VDBs compress index structures (e.g. graph adjacency) to reduce memory.
  • Trade-off: smaller storage and less I/O vs. CPU for decompression on read; choose codec by hot vs. cold path.
  • Practical tip: use Snappy or LZ4 for filter columns; use Zstd for cold or backup tiers; benchmark decompress cost if latency is critical.

Snappy vs. Zstd vs. LZ4

Snappy prioritizes speed: fast compress and decompress with moderate ratio. It’s a good fit when metadata is read often and you want low CPU overhead. Zstd (Zstandard) offers a tunable trade-off: at default levels it compresses better than Snappy with comparable decode speed; at higher levels you get smaller blobs at the cost of more CPU. Many VDBs use one of these (or LZ4) for compressing payload blocks or columnar metadata segments so that more data fits in RAM or cache.

Pipeline: write path compresses blocks or segments before persisting; read path decompresses only the chunks needed for the query (e.g. one column or one segment). That keeps hot-path CPU low while still saving space.
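The write/read split above can be sketched in a few lines. This is a minimal illustration, not any particular VDB's API; it uses stdlib `zlib` at level 1 as a stand-in for a fast codec like Snappy so the sketch stays self-contained (real systems would call Snappy or Zstd bindings).

```python
import zlib

def write_segments(payloads, level=1):
    """Write path: compress each metadata segment independently.

    level=1 approximates a fast codec (Snappy/LZ4-like behavior)."""
    return [zlib.compress(seg, level) for seg in payloads]

def read_segment(compressed_segments, i):
    """Read path: decompress only the one segment the query needs."""
    return zlib.decompress(compressed_segments[i])

# Two metadata segments; repetitive JSON compresses well.
segments = [b'{"tag": "news"}' * 100, b'{"tag": "sports"}' * 100]
blocks = write_segments(segments)

# A filter touching segment 1 decodes only that block.
assert read_segment(blocks, 1) == segments[1]
```

Because each segment is a separate compressed blob, a query that filters on one column or segment pays the decompress cost only for that chunk, never for the whole dataset.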

Where and when to compress

Compression is typically applied per segment or per block, not per vector, so that filtering can decode only the needed chunks. The choice affects latency: decompression adds CPU work on read. For hot paths (e.g. filter evaluation on every query), Snappy or LZ4 is common; for colder or bulk storage, Zstd at higher levels can shrink tiered or backup data. Some systems also compress vector indexes (e.g. graph adjacency) with similar algorithms to reduce memory footprint.

Trade-off: smaller storage and lower I/O vs. CPU cost on read. Practical tip: profile decompress time under load; if filter latency is high, consider faster codecs or less compression on the hottest columns.
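Profiling that trade-off is straightforward. The sketch below uses stdlib `zlib` levels 1 and 9 as rough proxies for a fast codec versus a high-ratio codec (the absolute numbers will differ from real Snappy/Zstd, but the ratio-vs-decompress-time comparison is the same shape you would measure):

```python
import time
import zlib

# Repetitive JSON payload, similar in spirit to metadata columns.
payload = b'{"user": "u123", "tags": ["a", "b"], "score": 0.97}' * 2000

for level in (1, 9):  # 1 ~ fast codec, 9 ~ high-ratio codec
    blob = zlib.compress(payload, level)
    t0 = time.perf_counter()
    for _ in range(100):
        zlib.decompress(blob)
    elapsed = time.perf_counter() - t0
    print(f"level {level}: ratio {len(payload) / len(blob):.1f}x, "
          f"decompress {elapsed * 10:.3f} ms/op")
```

Run the same harness against your actual payloads and candidate codecs; if per-query decompress time eats into the latency budget, drop to a faster codec or a lower level on the hottest columns.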

Frequently Asked Questions

Does compression slow down queries?

Decompression adds CPU on read. For hot metadata (e.g. filter columns), use fast codecs (Snappy, LZ4) so latency stays low. For cold or rarely read data, Zstd at higher levels is fine.

Can I compress vectors too?

Vectors are usually compressed via quantization (e.g. float32→int8) or product quantization (PQ), not general-purpose codecs. Metadata and payloads benefit from Zstd/Snappy; some systems additionally apply general-purpose codecs to vector blobs in cold storage.
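To make the contrast concrete, here is a minimal scalar-quantization sketch (float32→int8 with one scale per vector), in plain stdlib Python. This is an illustration of the idea, not any specific system's quantizer:

```python
import struct

def quantize_int8(vec):
    """Scalar quantization: float vector -> int8 codes plus one float scale."""
    # Scale so the largest magnitude maps to 127; guard all-zero vectors.
    scale = (max(abs(x) for x in vec) / 127) or 1.0
    codes = struct.pack(f"{len(vec)}b", *(round(x / scale) for x in vec))
    return scale, codes

def dequantize_int8(scale, codes):
    """Recover an approximation of the original vector."""
    return [c * scale for c in struct.unpack(f"{len(codes)}b", codes)]

vec = [0.12, -0.98, 0.5, 0.0]
scale, codes = quantize_int8(vec)
approx = dequantize_int8(scale, codes)
# codes is 4 bytes vs. 16 bytes as float32; each element is recovered
# to within half a quantization step (scale / 2).
```

The 4x size reduction is exact and deterministic, which is why quantization, rather than a general-purpose codec, is the default tool for vector data: codecs like Zstd find little redundancy in dense float arrays.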

Why per-segment or per-block compression?

So the engine can decode only the chunks needed for a query (e.g. one column or one segment). Compressing the whole dataset as one blob would force full decompress on every read.

What about LZ4?

LZ4 is very fast (often faster than Snappy) with a comparable ratio, and it is common on hot paths in VDBs. Choose between Snappy and LZ4 by benchmarking on your own payloads and hardware; Zstd gives a better ratio when you can afford more CPU.
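The codec guidance throughout this topic reduces to a simple tier-to-codec policy. The function and codec labels below are hypothetical (the string `"zstd-3"` etc. is just notation for "Zstd at level 3"), but the mapping mirrors the hot/warm/cold recommendations above:

```python
def pick_codec(tier: str) -> str:
    """Hypothetical policy: map a data tier to a codec choice."""
    return {
        "hot": "lz4",       # filter columns read on every query: fastest decode
        "warm": "zstd-3",   # default Zstd: better ratio, still fast to decode
        "cold": "zstd-19",  # tiered/backup data: smallest blobs, most CPU
    }[tier]

assert pick_codec("hot") == "lz4"
```

Treat the table as a starting point and let benchmarks on your payloads decide the final assignment per column or segment.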