What Is KV Cache?
The KV Cache is the optimization that stores and reuses Key and Value tensors computed during Transformer LLM inference, so each new token only has to compute attention for itself rather than for the entire prior sequence. Without it, every new token recomputes attention over all preceding tokens — a quadratic cost. With it, the cost stays linear in sequence length. Important: this is the single most impactful inference-time optimization in modern LLM serving, and you should keep it in mind whenever you reason about throughput or memory.
A useful analogy: the KV Cache is a scratchpad the model fills in once and consults thereafter. After the first forward pass, it has noted the Key and Value vectors for every token; subsequent passes only add a single new row. Hugging Face’s measured numbers make the impact concrete: generating 1,000 tokens takes roughly 11.9 seconds with KV caching versus 56.2 seconds without — a 4.7x gap that widens as sequences grow longer. Note that the trade-off is memory: each cached attention layer consumes GPU VRAM proportional to context length, which is why KV Cache size is the dominant constraint on serving concurrency for long-context models.
How to Pronounce KV Cache
K-V cache (/keɪ viː kæʃ/)
kay-vee cache (/keɪ viː kæʃ/)
How KV Cache Works
Self-attention has every token attend to every other token. Decoders are causal: token t’s attention depends only on tokens 1 through t. The KV Cache exploits this by retaining past Key and Value vectors and computing fresh attention only for the new token. Important: this is what turns a naively quadratic operation into a linear one for autoregressive generation.
(Figure: a single KV Cache decode step)
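To make the mechanism concrete, here is a minimal single-head decode step in plain PyTorch. It is an illustration only (no batching, no multi-head logic, random weights): each step appends one Key/Value row to the cache and computes attention for the new token’s Query alone.
# Toy single-head cached decode step (illustration only, not a production kernel)
import torch

def cached_decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """x_new: (1, d_model) hidden state of the newest token only."""
    q = x_new @ W_q                                      # (1, d_head) Query for the new token
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)   # append one Key row
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)   # append one Value row
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5  # (1, t): linear in current length t
    out = torch.softmax(scores, dim=-1) @ v_cache        # (1, d_head)
    return out, k_cache, v_cache

d_model, d_head = 16, 16
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
k_cache = torch.empty(0, d_head)
v_cache = torch.empty(0, d_head)
for _ in range(5):                                       # five decode steps, cache grows by one row each
    x_new = torch.randn(1, d_model)
    out, k_cache, v_cache = cached_decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape)                                     # torch.Size([5, 16])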
Properties and memory cost
| Property | Value / Notes |
|---|---|
| Target architecture | Decoder-only Transformers (GPT, Claude, Llama, Gemini) |
| What is cached | Per-layer Key and Value tensors (Query is not cached) |
| Memory footprint | 2 × layers × kv_heads × head_dim × tokens × bytes_per_element, per request |
| Compute complexity | With KV cache: O(n); without: O(n²) |
| Common optimizations | PagedAttention (vLLM), FlashAttention, Multi-Query Attention |
| Adjacent concepts | Prompt Caching (API-level), KV cache quantization |
| Typical bottleneck | VRAM at long context length and concurrent-request capacity |
Why complexity drops to O(n)
Without a KV Cache, generating token n re-runs full self-attention over all n tokens of the sequence: every Key and Value vector is recomputed and roughly n² attention scores are formed for that single step. With the cache, you skip the recomputation and only compute attention between the new token’s Query and the n cached Key vectors, about n operations per step. You should keep in mind that this is the analytical reason engineers always enable KV caching during inference, even when the implementation cost is non-trivial.
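The same structure shows up at the library level. The sketch below is a manual greedy decode loop with Hugging Face Transformers, assuming model and tokenizer are loaded as in the quick-start example later in this article: the prompt is processed once, and every subsequent step feeds only the newest token plus the returned past_key_values.
# Manual decode loop: one token per step, reusing past_key_values
# (a sketch; assumes `model` and `tokenizer` are already loaded, greedy decoding)
inputs = tokenizer("The KV cache", return_tensors="pt").to(model.device)
out = model(**inputs, use_cache=True)              # prefill: the whole prompt, once
past = out.past_key_values
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
for _ in range(20):                                # each step costs O(current length), not O(length²)
    out = model(input_ids=next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)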
Memory versus speed trade-off
The KV Cache trades VRAM for compute. Important: a Llama 3 70B model serving a 32,000-token context holds roughly 10 GB of KV Cache per request even with grouped-query attention, and architectures that cache every attention head need several times that. Stacking concurrent requests therefore quickly saturates GPU memory, which is why optimizations like PagedAttention, KV quantization, and Multi-Query Attention exist: they reduce the per-request memory footprint so more requests can share the same hardware.
Relationship to PagedAttention and vLLM
vLLM’s PagedAttention treats the KV Cache like an operating system’s virtual memory, managing it in fixed-size pages to eliminate fragmentation. Compared with naive contiguous allocation, this typically increases serving concurrency by a factor of two to four. Note that PagedAttention is the single largest reason vLLM has become the de facto open-source LLM serving stack; without it, throughput would hit the GPU memory ceiling far earlier than it actually does.
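The paging idea itself is easy to sketch. The toy bookkeeping below is purely illustrative (it has nothing to do with vLLM’s actual CUDA kernels): fixed-size blocks let each request grow its cache without reserving one large contiguous region up front.
# Toy KV block allocator in the spirit of PagedAttention (illustration only)
BLOCK_TOKENS = 16                              # each block holds K/V for 16 tokens

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))    # physical block ids available on the GPU
        self.tables = {}                       # request id -> list of physical block ids

    def append_token(self, request_id, token_index):
        table = self.tables.setdefault(request_id, [])
        if token_index % BLOCK_TOKENS == 0:    # current block full (or first token): grab a new one
            table.append(self.free.pop())
        return table[-1], token_index % BLOCK_TOKENS   # (physical block, slot within block)

alloc = BlockAllocator(num_blocks=1024)
for t in range(40):                            # request "a" grows one token at a time
    block, slot = alloc.append_token("a", t)
print(alloc.tables["a"])                       # the 40-token cache is backed by three blocks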
FlashAttention and the broader context
FlashAttention is a different optimization aimed at attention compute, not at the KV Cache itself, but the two interact closely. FlashAttention restructures the attention computation to fit in fast on-chip memory (SRAM), reducing GPU memory bandwidth usage. KV Cache reduces the amount of work that needs to be done in the first place. Production serving stacks typically use both: KV Cache to skip redundant work, FlashAttention to make the remaining work cheaper. You should keep in mind that they are complementary, not alternatives.
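In Hugging Face Transformers the two are combined simply by choosing the attention backend at load time; the KV Cache stays on by default. A sketch, assuming the flash-attn package is installed and an Ampere-or-newer GPU is available:
# FlashAttention for the attention kernel, KV Cache for skipping recomputation
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # swap in the FlashAttention kernel
).to("cuda")
# use_cache remains True by default, so generate() still reuses cached K/V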
KV Cache Usage and Examples
Quick start
Major inference libraries enable the KV Cache by default. In Hugging Face Transformers, all you need is the standard generate() call with use_cache=True (the default).
# Minimal Hugging Face example with KV Cache
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
).to("cuda")
inputs = tokenizer("Explain the KV cache", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    use_cache=True,  # default
)
print(tokenizer.decode(outputs[0]))
Common Implementation Patterns
Pattern A: vLLM with PagedAttention plus KV Cache
# Production-grade serving with vLLM
# pip install vllm  (shell command, run once before the script)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)
prompts = ["Hello", "How does KV cache work?", "Talk about GPU memory"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))
for o in outputs:
    print(o.outputs[0].text)
When to use it: self-hosted GPU serving with high concurrency requirements; latency-sensitive chatbots. You should keep in mind that vLLM is the standard choice for production open-weight model deployment in 2026.
When to avoid it: single-shot research where the vLLM startup cost outweighs its serving benefits.
Pattern B: KV cache quantization for memory savings
# Quantized KV cache via Transformers' quantized cache backend
# (a sketch; assumes `model`/`inputs` from the quick-start above, a recent
#  transformers version, and the optimum-quanto package installed)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    cache_implementation="quantized",                # store K/V in low precision
    cache_config={"backend": "quanto", "nbits": 4},  # INT4 cache: ~4x smaller than FP16
)
# Note: weight quantization (e.g. bitsandbytes load_in_8bit) shrinks the weights,
# not the KV cache; the cache needs its own quantization setting as above.
When to use it: long contexts (32K+) where memory pressure constrains concurrency. Important: the small accuracy hit from KV quantization is acceptable for most chat workloads.
Anti-pattern: disabling the KV Cache for long generation
# Don't do this in production
outputs = model.generate(**inputs, max_new_tokens=2000, use_cache=False)
# A one-second job stretches to over a minute
Outside of intentional research, never disable the KV Cache. Important: a common debugging mistake is to leave use_cache=False from a unit test and ship it; check this whenever you observe unexplained slowdowns at long context.
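A quick way to catch it is to time both paths directly. A minimal sketch, reusing the model and inputs from the quick-start example; the absolute numbers depend entirely on hardware and model size.
# Time the same generation with and without the KV Cache
import time
import torch

for use_cache in (True, False):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=200, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.1f}s")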
Implementation Pattern: streaming output that writes the cache incrementally
# Streaming token-by-token while the cache grows
# (assumes `model`, `tokenizer`, and `inputs` from the quick-start example above)
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer)
generation_kwargs = {**inputs, "max_new_tokens": 200, "streamer": streamer, "use_cache": True}
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for token in streamer:
    print(token, end="", flush=True)
Note that streaming and KV caching are complementary — the streamer surfaces tokens as the cache grows incrementally, giving users early feedback at no extra compute cost.
Advantages and Disadvantages of KV Cache
Advantages
- Compute drops to O(n): long-form generation stays tractable. Important: this is the property that makes long-context chatbots feasible.
- Default-on across libraries: Hugging Face, vLLM, TensorRT-LLM, llama.cpp all enable it out of the box.
- Composes with streaming: incremental cache growth aligns naturally with token-by-token output.
- Compatible with quantization and PagedAttention: a deep optimization toolbox is available.
Disadvantages
- VRAM scales with context: long contexts and large models hit memory limits quickly.
- Concurrency constraint: large per-request caches reduce how many requests fit on one GPU.
- Fragmentation in naive implementations: contiguous allocation wastes memory; PagedAttention solves this but adds complexity.
- Distributed serving complexity: sharding the cache across GPUs requires careful synchronization design.
KV Cache vs Prompt Caching vs Embedding Cache
Three “cache” concepts in modern LLM stacks are commonly confused. The table below distinguishes them.
| Aspect | KV Cache | Prompt Caching | Embedding Cache |
|---|---|---|---|
| Layer | Inside the model (attention) | API / serving layer | Application layer |
| What is stored | Key/Value tensors | Pre-processed prompt prefix | Vector embeddings |
| Scope | Within one request, token-by-token | Shared across requests | Persistent (DB) |
| Primary effect | Inference speed (n² to n) | Reduces API input token billing | Avoids re-embedding work |
| Configuration | Library default ON | API-level cache_control | Application code |
| Typical savings | Several to 10x faster | Up to 90% input cost on hits | Seconds to minutes of compute |
Mental model: KV Cache speeds up the model itself, Prompt Caching reduces API billing for repeated prompts, and Embedding Cache is an application-side optimization for RAG. Important: in production all three are typically used together — they are complementary rather than alternatives, and you should keep this in mind when designing the cost and latency strategy for an LLM app.
Common Misconceptions
Misconception 1: “KV Cache and Prompt Caching are the same thing”
Why people get confused: both share the words “cache” and “prompt” and both live somewhere in the LLM stack, so engineers reasonably assume they are the same mechanism. The shared vocabulary creates a false equivalence.
Reality: KV Cache is an internal model optimization that speeds up attention computation, automatically enabled by every modern inference library. Prompt Caching is an API-level feature from vendors like Anthropic and OpenAI that reuses pre-processed prompt prefixes across requests to reduce token billing. They live at different layers and complement each other.
Misconception 2: “KV Cache saves GPU memory”
Why people get confused: in general-purpose computing a cache usually exists to save resources, so the word carries an implicit “memory efficiency” connotation. That intuition does not transfer to the model-internal mechanics here.
Reality: the KV Cache increases GPU memory usage. It saves compute, paying with VRAM. A Llama 3 70B serving a 32K context holds roughly 10 GB of VRAM per request just for the cache (several times more without grouped-query attention), which is why concurrent-request capacity is so often the binding constraint in production serving.
Misconception 3: “KV Cache is a recent invention”
Why people get confused: PagedAttention, FlashAttention, and vLLM are all recent, so the entire ecosystem feels new. Recency gets mistaken for novelty.
Reality: the KV Cache concept has existed since the original Transformer (2017) and was used in early GPT-2 inference implementations. The recent advances are in how to manage the cache efficiently — PagedAttention, KV quantization, Multi-Query Attention, Grouped-Query Attention — not in the cache idea itself.
Real-World Use Cases
The scenarios where KV Cache delivers the biggest measurable wins are below. Important: each pattern below assumes you are running a Transformer LLM either in-house or via a hosted API.
Long-form generation
Summarization, code generation, and report drafting are token-heavy workloads where the quadratic-to-linear improvement is most visible. Without a KV Cache, a 2,000-token report draft can be 30x slower; with it, the latency is dominated by the per-token forward pass alone. You should keep in mind that user-perceived performance on long generations lives or dies by this optimization.
Multi-turn chat
Each new turn appends to the cache rather than recomputing from scratch. This is what makes chatbots feel responsive even after dozens of turns. Note that some serving stacks evict idle conversations from VRAM to manage concurrency, so understanding the eviction policy matters when you size GPU memory for a chat product.
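With Hugging Face Transformers you can carry the cache across turns yourself. The sketch below assumes a recent transformers version in which generate() returns past_key_values when return_dict_in_generate=True, and reuses the model and tokenizer from the quick-start example; treat it as an outline rather than the library’s canonical multi-turn API.
# Carrying the KV cache across chat turns (sketch; details vary by transformers version)
import torch

turn_1 = tokenizer("User: What is a KV cache?\nAssistant:", return_tensors="pt").to(model.device)
out_1 = model.generate(**turn_1, max_new_tokens=100,
                       use_cache=True, return_dict_in_generate=True)
past = out_1.past_key_values                       # K/V for everything seen or generated so far

# Second turn: pass the full token history plus the cached K/V; generate() only
# does fresh compute for the tokens not already covered by the cache.
follow_up = tokenizer("\nUser: Why does it use so much VRAM?\nAssistant:",
                      return_tensors="pt", add_special_tokens=False).to(model.device)
turn_2_ids = torch.cat([out_1.sequences, follow_up.input_ids], dim=-1)
out_2 = model.generate(turn_2_ids, max_new_tokens=100,
                       past_key_values=past, return_dict_in_generate=True)
print(tokenizer.decode(out_2.sequences[0][turn_2_ids.shape[1]:]))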
Long-context RAG inference
RAG pipelines often inject 4K-32K tokens of retrieved context before the model starts generating. Without a KV Cache, every output token would re-scan that context — utterly impractical at scale. The KV Cache is what makes long-context RAG affordable in production.
Streaming responses
Token-by-token streaming UX naturally maps to incremental cache growth. The user sees the first token quickly because only that token’s compute matters, while the cache continues building during the rest of the response.
High-throughput serving on owned GPUs
Self-hosted serving stacks like vLLM combine the KV Cache with PagedAttention to maximize concurrent requests per GPU. You should keep in mind that this combination is the reason vLLM has become the de facto open-weight serving stack — naive KV management leaves significant throughput on the table.
Sizing the KV Cache for Production
Knowing the formula for KV Cache memory is essential for capacity planning. The size for one request is approximately: 2 × num_layers × num_kv_heads × head_dim × sequence_length × bytes_per_element. The leading factor of 2 accounts for both Key and Value tensors, and num_kv_heads is the number of Key/Value heads the model actually caches, which for grouped-query models is far smaller than the number of query heads. For Llama 3 70B (80 layers, 8 KV heads under grouped-query attention, 128 head dim, FP16), one request at 32K tokens consumes roughly 10 GB just for KV; a comparable model caching all 64 query heads would need around 80 GB. Important: if your GPU has 80 GB of VRAM and the model weights occupy 40 GB, even a handful of concurrent 32K-context requests exhausts the headroom, which is the kind of capacity constraint that drives architectural decisions.
Strategies to reduce per-request memory include Multi-Query Attention (a single shared K/V head for all query heads, used by PaLM), Grouped-Query Attention (a middle ground used by Llama 2 70B and Llama 3), KV cache quantization to INT8 or even INT4, and PagedAttention to eliminate fragmentation. You should keep in mind that these strategies stack, as the sketch below illustrates: combining GQA with INT8 KV and PagedAttention is what lets vLLM serve 4-8x more concurrent users than a naive Hugging Face Transformers deployment.
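The formula is easy to turn into a capacity-planning helper that also shows how much each strategy buys. A minimal sketch; the configuration numbers are illustrative, so read num_key_value_heads and head_dim from the model’s own config rather than trusting these.
# Back-of-the-envelope KV Cache sizing per request
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem  # 2 = Key + Value

GIB = 1024 ** 3
# Hypothetical 70B-class model caching all 64 heads, FP16, 32K tokens
print(kv_cache_bytes(80, 64, 128, 32_768) / GIB)   # ~80 GiB
# Grouped-query attention with 8 KV heads, as Llama 3 70B uses
print(kv_cache_bytes(80, 8, 128, 32_768) / GIB)    # ~10 GiB
# Add INT8 KV quantization on top (1 byte per element)
print(kv_cache_bytes(80, 8, 128, 32_768, bytes_per_elem=1) / GIB)  # ~5 GiB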
KV Cache and Prompt Caching Combined
The KV Cache is internal to the model and reset between requests. Prompt Caching, in contrast, persists pre-computed activations across requests at the API or serving layer. The two work beautifully together: when a Prompt Cache hit happens, the API server hands the model a pre-warmed KV state for the cached prefix, so the model only does work for the new tokens of the current request. Important: this is the architectural reason Anthropic and OpenAI Prompt Caching delivers such dramatic input-token cost reductions on repeated prompts — the underlying compute (the KV Cache fill for the prefix) only happens once.
For application designers, the practical implication is to maximize Prompt Cache hits by structuring prompts with stable prefixes (system instructions, knowledge base content, few-shot examples) followed by variable suffixes (the user’s current question). You should keep in mind that the same structure also benefits human readability and prompt versioning, so it is a win on multiple axes.
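On the API side, a hedged sketch with the Anthropic Python SDK shows what that structuring looks like; the model name and knowledge-base string are placeholders. The stable prefix carries a cache_control marker, and only the final user message changes between requests.
# Stable prefix marked for Prompt Caching; only the user question varies per request
import anthropic

STABLE_PREFIX = "You are a support assistant. <thousands of tokens of policy and KB content>"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",                         # placeholder model name
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,                     # stable prefix: instructions + knowledge base
            "cache_control": {"type": "ephemeral"},    # reuse the server-side prefix state across requests
        }
    ],
    messages=[{"role": "user", "content": "What does our refund policy say?"}],
)
print(response.content[0].text)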
Looking Ahead: Where KV Cache Optimization Goes Next
Three frontiers are being actively pushed in 2026. First, more aggressive quantization — INT4 KV is now production-viable for several model families, and INT2 research is underway. Second, smarter eviction policies — research suggests that not all tokens need full-precision K/V retention, and “important” tokens can be kept while less-important ones are evicted or compressed. Third, hierarchical KV stores that span GPU VRAM, CPU DRAM, and even fast storage, allowing very long contexts (1M+ tokens) at acceptable cost.
For practitioners, the takeaway is that KV Cache optimization remains a moving target. You should keep in mind that what is state-of-the-art today may be obsolete in 12 months, so building infrastructure that can absorb new techniques (e.g., abstracting the inference layer behind vLLM rather than depending on a specific Hugging Face version) pays off. Important: this is the same story as compiler and database optimization — the abstractions matter as much as the raw mechanism.
Frequently Asked Questions (FAQ)
Q1. Can I save GPU memory by disabling the KV Cache?
Disabling it reduces per-request VRAM but makes generation cost quadratic in sequence length — practically unusable for anything beyond very short outputs. Generating 2,000 tokens with a KV Cache is roughly 24 seconds on a typical setup; without it, 2+ minutes. The right approach is to keep the cache enabled and use quantization or PagedAttention to control memory.
Q2. How do I decide between KV Cache and Prompt Caching?
There is no choice — they operate at different layers. Inference libraries handle KV Cache automatically. Prompt Caching is a separate, vendor-side feature you enable to reduce billing for repeated prompts. Use both: Prompt Caching reduces API cost on repeated prefixes, while KV Cache makes the underlying compute linear.
Q3. If I use vLLM, do I need to think about KV Cache details?
vLLM handles PagedAttention and KV Cache management automatically. You don’t need to implement the low-level details. However, understanding why long contexts reduce concurrency, or what KV cache quantization does, helps with capacity planning and operational debugging.
Q4. How do KV Cache, Attention Sinks, and StreamingLLM relate?
Attention Sinks is the observation that retaining the first few tokens’ KV indefinitely preserves quality even when other entries are evicted. StreamingLLM builds on this idea to handle effectively infinite context with constant-size KV cache. As of 2026 these have moved from research to production-grade implementations.
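The eviction policy itself is simple to state in code. A toy sketch, operating on token positions rather than real per-layer K/V tensors: keep a few initial sink positions plus a sliding window of recent positions, and drop everything in between.
# Toy StreamingLLM-style eviction: keep sink tokens plus a recent window (illustration only)
def evict(cache_positions, num_sinks=4, window=1024):
    if len(cache_positions) <= num_sinks + window:
        return cache_positions
    return cache_positions[:num_sinks] + cache_positions[-window:]

positions = list(range(5000))            # pretend K/V for 5,000 tokens is cached
kept = evict(positions)
print(len(kept), kept[:5], kept[-2:])    # 1028 [0, 1, 2, 3, 3976] [4998, 4999]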
Q5. If I use a cloud API like Claude or GPT-5, do I need to care about KV Cache?
API users don’t configure KV Cache directly. However, knowing it exists helps you understand why long prompts have higher latency, what Prompt Caching is actually doing under the hood, and how to architect prompts for cost and speed efficiency.
Conclusion
- KV Cache is the essential optimization that makes Transformer LLM inference linear instead of quadratic.
- It stores per-layer Key and Value tensors and reuses them for every subsequent token.
- VRAM consumption is significant — long-context, large models bottleneck on concurrency.
- PagedAttention (vLLM) eliminates fragmentation; KV quantization (INT8/INT4) reduces footprint further.
- Operates at the model layer, distinct from Prompt Caching (API) and Embedding Cache (application).
- Default-on in every modern inference library; only disable for research or debugging.
References
- Hugging Face, “KV Caching Explained”, https://huggingface.co/blog/not-lain/kv-caching
- Sebastian Raschka, “Understanding and Coding the KV Cache in LLMs from Scratch”, https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
- NVIDIA, “Mastering LLM Techniques: Inference Optimization”, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Pierre Lienhart, “LLM Inference Series: KV caching explained”, https://medium.com/@plienhar/llm-inference-series-3-kv-caching-unveiled-048152e461c8
- Morph, “KV Cache Explained”, https://www.morphllm.com/kv-cache-explained