
Ecosystem & Advanced Topics · Topic 187

Vector Databases for LLM Caching (GPTCache)

LLM caching stores past prompt–response pairs so that when a new prompt is semantically similar to a cached one, the system can return the cached response (or a refined version) instead of calling the LLM again. Projects like GPTCache use a vector database to index embeddings of prompts and look up the nearest cached entry by similarity.

Summary

  • LLM caching stores past prompt–response pairs; when a new prompt is semantically similar to a cached one, the cached response is returned instead of calling the LLM again. GPTCache and similar projects use a vector database (VDB) to index prompt embeddings and look up the nearest cached entry. See RAG and nearest-neighbor.
  • Flow: embed the prompt → nearest-neighbor search over cached prompt embeddings → if the best match is above a similarity threshold, return the cached response; otherwise call the LLM and insert the new pair. Benefits: lower latency and cost. Considerations: threshold choice, consistent embedding model and normalization, and cache consistency.
  • Practical tip: set the threshold high enough to avoid reusing a response for the wrong intent, and use the same embedding model and normalization at lookup as at insert.

How LLM caching with a VDB works

Flow: (1) The incoming prompt is embedded. (2) The VDB performs a nearest-neighbor search over cached prompt embeddings. (3) If the best match is above a similarity threshold, the associated response is returned (or reused); otherwise the LLM is called and the new pair is inserted into the cache. (4) Cache eviction (LRU, TTL, or size-based) can be applied at the VDB layer or in an outer layer. Because prompts are matched by meaning rather than by exact string, “What is the capital of France?” and “Capital of France?” can hit the same cache entry.
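The flow above can be sketched in a few lines. This is a toy illustration, not GPTCache's API: it uses a brute-force in-process list in place of a real VDB, and `embed` and `call_llm` are caller-supplied placeholder functions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # embedding function, fixed at construction
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # (embedding, response) pairs; stands in for a VDB

    def lookup(self, prompt):
        """Return the cached response for the nearest prompt, or None on a miss."""
        q = self.embed(prompt)
        best_resp, best_sim = None, -1.0
        for emb, resp in self.entries:  # brute-force nearest-neighbor scan
            sim = cosine(q, emb)
            if sim > best_sim:
                best_resp, best_sim = resp, sim
        return best_resp if best_sim >= self.threshold else None

    def insert(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

def get_answer(cache, prompt, call_llm):
    """Embed -> lookup -> threshold check -> return cached or call the LLM."""
    hit = cache.lookup(prompt)
    if hit is not None:
        return hit
    response = call_llm(prompt)
    cache.insert(prompt, response)
    return response
```

With this sketch, a paraphrase whose embedding is close enough to a cached prompt is served from the cache, and the LLM is only called once for the pair.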

Trade-off: a low similarity threshold increases hit rate but risks returning a cached response that does not quite match the user intent; a high threshold keeps quality but reduces cache benefit. Use the same embedding model at insert and lookup so scores are comparable.

Benefits and considerations

Benefits: lower latency and cost for repeated or near-duplicate queries; reduced load on the LLM API. Considerations: choice of similarity threshold (too low risks wrong reuse; too high reduces hit rate); embedding model and normalization must be consistent; and cache consistency if the LLM or context changes. A lightweight VDB or in-process index is often enough for GPTCache-style workloads.
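Cache consistency and size bounds are often handled in an outer eviction layer combining LRU with a TTL. A minimal sketch of such a layer (hypothetical; a real setup would also delete the evicted entry's vector from the VDB):

```python
import time
from collections import OrderedDict

class EvictingStore:
    """Size-bounded LRU store with a per-entry TTL, keyed by entry id."""

    def __init__(self, max_entries=1000, ttl_seconds=3600.0):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self.data = OrderedDict()  # entry_id -> (response, inserted_at)

    def put(self, entry_id, response, now=None):
        now = time.time() if now is None else now
        self.data[entry_id] = (response, now)
        self.data.move_to_end(entry_id)
        while len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently used

    def get(self, entry_id, now=None):
        now = time.time() if now is None else now
        item = self.data.get(entry_id)
        if item is None:
            return None
        response, inserted_at = item
        if now - inserted_at > self.ttl:   # expired: drop entry, report a miss
            del self.data[entry_id]
            return None
        self.data.move_to_end(entry_id)    # refresh recency on hit
        return response
```

The TTL also bounds staleness: when the LLM, system prompt, or retrieval context changes, old responses age out instead of being served forever.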


Frequently Asked Questions

What is LLM caching with a vector database?

Storing past prompt–response pairs so that when a new prompt is semantically similar to a cached one, the system returns the cached response instead of calling the LLM. Projects like GPTCache use a VDB to index prompt embeddings and do nearest-neighbor lookup. See RAG and semantic search.

How does the cache lookup work?

The incoming prompt is embedded, and the VDB performs a nearest-neighbor search over cached prompt embeddings. If the best match is above a similarity threshold, the associated response is returned; otherwise the LLM is called and the new pair is inserted. Prompts are matched by meaning, so paraphrases can hit the same entry.

What are the main considerations?

Choice of similarity threshold (too low risks wrong reuse; too high reduces hit rate); embedding model and normalization must be consistent; cache consistency when the LLM or context changes. A lightweight VDB or in-process index is often enough for GPTCache-style workloads. See semantic search.

When is LLM caching useful?

For repeated or near-duplicate queries, where it lowers latency and cost and reduces load on the LLM API. It fits well with RAG and chatbot applications where many users ask similar questions. See use cases.