
Measuring Latency (p50, p99)

Latency is the time from when a client sends a vector query until it receives the response. Reporting p50 (median), p95, and p99 percentiles gives a better picture than averages, because a few slow queries can skew the mean and hide tail latency that affects user experience.
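As a small illustration of why the mean can mislead, the sketch below compares the mean with nearest-rank percentiles on a made-up sample. The latency values and the percentile helper are hypothetical and chosen for illustration only; real systems may use a different percentile definition (for example, interpolation between samples).

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [10.0] * 98 + [500.0, 900.0]

mean_ms = statistics.mean(latencies_ms)  # pulled upward by the two outliers
p50_ms = percentile(latencies_ms, 50)    # what a typical request sees
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)    # exposes the slow tail

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

In this sample the mean sits well above what a typical request experiences and well below what the slowest requests experience, so it describes neither group.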

Summary

Latency is the time from when a client sends a vector query until it receives the response. Report p50 (median), p95, and p99 percentiles rather than averages, because tail latency is what users notice. A p99 much higher than p50 indicates variability, for example cache misses, hot shards, or GC pauses.
p50 is the time within which half of requests complete; p99 is the time within which 99% complete. For interactive workloads (search, RAG), p99 or p95 is often the SLA metric (e.g. p99 < 100 ms). To measure, record the duration of each request and compute percentiles over a window; Prometheus and Grafana are commonly used for vector database (VDB) monitoring. In distributed setups, include network round-trips and coordinator wait time, and compare latency with throughput (QPS). Practical tip: alert on p99 and break latency down by component (coordinator, shard, network).

Percentiles and SLAs

p50 (median): half of requests complete within this time. p99: 99% of requests complete within this time; the slowest 1% take longer. High p99 relative to p50 indicates variability—e.g. cache misses, hot shards, or GC pauses. For interactive applications (search, RAG), p99 or p95 is often the SLA metric (e.g. “p99 < 100 ms”).

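The measurement pipeline has three steps: timestamp each request when it starts, timestamp it again when it ends, and compute percentiles over a window of the resulting durations. Practical tip: alert on p99 and break down latency by component (coordinator, shard, network) to find bottlenecks.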
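The alerting tip can be sketched as follows, assuming each request already carries a per-component timing record. The field names, the 100 ms threshold, and the sample data are hypothetical; a real system would pull these numbers from its tracing or metrics pipeline.

```python
import statistics

SLA_P99_MS = 100.0  # hypothetical SLA threshold
COMPONENTS = ("coordinator", "shard", "network")

def p99(samples):
    # statistics.quantiles with n=100 returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[98]

def breakdown(requests):
    """Return the p99 of each component plus the end-to-end p99."""
    report = {name: p99([r[name] for r in requests]) for name in COMPONENTS}
    report["total"] = p99([sum(r[name] for name in COMPONENTS) for r in requests])
    return report

# Hypothetical timings in ms: 95 normal requests and 5 that hit a slow shard.
requests = (
    [{"coordinator": 2.0, "shard": 15.0, "network": 3.0}] * 95
    + [{"coordinator": 2.0, "shard": 180.0, "network": 3.0}] * 5
)

report = breakdown(requests)
should_alert = report["total"] > SLA_P99_MS
worst = max(COMPONENTS, key=lambda name: report[name])

if should_alert:
    print(f"p99 {report['total']:.1f} ms exceeds SLA; largest component p99: {worst}")
```

Per-component p99 values do not add up to the end-to-end p99, since the slowest 1% of each component may fall on different requests. The breakdown is a guide to where to look, not an exact decomposition.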

Measurement and tooling

Record per-request duration, either at the client (which includes network time) or at the server (which does not), then compute percentiles over a window (e.g. last 1 minute). Tools like Prometheus (e.g. histogram_quantile, which estimates a quantile from histogram buckets) and Grafana are commonly used for VDB monitoring. In distributed setups, latency includes network round-trips and coordinator wait time; breaking down by component helps identify bottlenecks. Compare with throughput (QPS) to understand behavior under load.
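A minimal client-side sketch of this approach, using only the Python standard library, might look like the following. The LatencyWindow class and the timed_query helper are hypothetical names invented for this example, and run_query stands in for whatever client call actually issues the vector query. A production setup would more likely export a histogram to a monitoring system than keep raw samples in memory.

```python
import math
import time
from collections import deque

class LatencyWindow:
    """Keeps (timestamp, duration_ms) samples from the last `window_s` seconds."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.samples = deque()  # oldest sample first

    def record(self, duration_ms, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, duration_ms))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def percentile(self, pct, now=None):
        """Nearest-rank percentile over the current window, or None if it is empty."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.samples:
            return None
        ordered = sorted(duration for _, duration in self.samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

def timed_query(window, run_query):
    """Time one call to run_query and record its duration, even if it raises."""
    start = time.perf_counter()
    try:
        return run_query()
    finally:
        window.record((time.perf_counter() - start) * 1000.0)
```

Usage would be along the lines of `timed_query(window, lambda: client.search(vector))` followed by `window.percentile(99)`, where `client.search` is a placeholder for the real query call. Sorting on every read is fine for a sketch; at high request rates, bucketed histograms are the usual choice.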

Frequently Asked Questions

Why report p50, p95, p99 instead of average?

A few slow queries skew the mean and hide tail latency. p50 (median): half of requests complete within this time. p99: 99% complete within this time; the slowest 1% take longer. For interactive apps (search, RAG), p99 or p95 is often the SLA (e.g. p99 < 100 ms).

What causes high p99 relative to p50?

Variability, such as cache misses, hot shards, or GC pauses. In distributed setups, network round-trips and coordinator wait add to latency. Breaking down by component helps identify bottlenecks. Compare with throughput (QPS) to see how latency behaves under load.

How do I measure latency?

Record per-request duration (client or server), then compute percentiles over a window (e.g. last 1 minute). Tools like Prometheus (histogram_quantile) and Grafana are common for VDB monitoring.

How does distributed setup affect latency?

Latency includes network round-trips and coordinator wait time. Breaking down by component (coordinator, shard, network) helps identify bottlenecks. Compare latency with throughput (QPS) to understand behavior under load.
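To illustrate how coordinator wait can inflate tail latency, here is a small simulation. It assumes a scatter-gather design in which the coordinator sends each query to every shard and waits for the slowest one; not every vector database works this way. The shard timings, the 0.2% slow-response probability, and the function names are all made up for the example.

```python
import math
import random

def percentile(samples, pct):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def simulate_query(rng, num_shards, network_ms=2.0, slow_prob=0.002):
    """One fan-out query: total time is network time plus the slowest shard."""
    shard_ms = [100.0 if rng.random() < slow_prob else 10.0 for _ in range(num_shards)]
    return network_ms + max(shard_ms)

def p99_for(num_shards, num_queries=10_000, seed=0):
    # Seeded generator so repeated runs produce the same result.
    rng = random.Random(seed)
    latencies = [simulate_query(rng, num_shards) for _ in range(num_queries)]
    return percentile(latencies, 99)

p99_single = p99_for(1)   # one shard: slow responses are too rare to reach p99
p99_fanout = p99_for(16)  # 16 shards: some shard is slow in roughly 3% of queries

print(f"p99 with 1 shard: {p99_single} ms, with 16 shards: {p99_fanout} ms")
```

Under these assumptions each shard is slow only 0.2% of the time, yet with 16 shards the chance that at least one is slow on a given query is about 3%, which is enough to move the end-to-end p99 into the slow range.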