What Is Prompt Caching?
Prompt Caching is a Claude API feature from Anthropic that caches a reusable portion of a prompt on the server side so subsequent API calls with the same prefix cost dramatically less and run much faster. According to Anthropic, caching can reduce input costs by up to 90% and latency by up to 85% for long-context, repeated requests — a game changer for chatbots, RAG systems, and agent workflows built on Claude.
The mental model is simple: imagine you send a 10,000-token product manual with every question your chatbot answers. Without caching, you pay the full input-token price every time. With Prompt Caching, the manual is processed once, stored server-side, and then referenced at roughly one-tenth the normal input price on every follow-up call within the cache’s time-to-live. This is what makes AI assistants backed by long system prompts affordable to run at scale.
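To make the economics concrete, here is a back-of-envelope calculation for that scenario. The multipliers are illustrative assumptions (roughly 1.25x for cache writes and 0.1x for cache reads), not Anthropic's current price list:

```python
# Rough cost comparison: a 10,000-token manual asked 50 questions
# within the cache TTL. Costs are in "prefix token-price units".
MANUAL_TOKENS = 10_000
CALLS = 50

def relative_cost_without_cache() -> float:
    # Every call pays the full input price for the manual.
    return MANUAL_TOKENS * CALLS

def relative_cost_with_cache() -> float:
    write = MANUAL_TOKENS * 1.25                 # first call writes the cache
    reads = MANUAL_TOKENS * 0.10 * (CALLS - 1)   # later calls read it
    return write + reads

savings = 1 - relative_cost_with_cache() / relative_cost_without_cache()
print(f"approximate input-cost savings: {savings:.0%}")  # ~88%
```

Under these assumptions the manual's input cost drops by close to 90%, which is where the headline figure comes from.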
How to Pronounce Prompt Caching
prompt kash-ing (/prɒmpt ˈkæʃɪŋ/)
prompt cache (/prɒmpt kæʃ/)
How Prompt Caching Works
Anthropic launched Prompt Caching in 2024 with an explicit breakpoint model: you mark positions inside your prompt where caching should occur, and Claude stores the processed internal state for reuse. The next request that begins with the same prefix skips the heavy processing and pays a reduced cache-read rate instead of the full input rate.
Cache Flow
First request: the prefix up to the breakpoint is processed and stored (cache-write price). The cached state lives for the TTL (5 minutes or 1 hour). Follow-up requests with the same prefix skip reprocessing and are billed at the cache-read price (~10% of normal input).
Cache Breakpoints
You enable caching by attaching cache_control: {"type": "ephemeral"} to specific content blocks in the request. Up to four breakpoints can be placed along the prompt, typically one after the system persona, one after a long document, and one after a tool definition block. Note that cache matching is byte-exact: changing even a single character invalidates the cached prefix.
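A sketch of that breakpoint placement, as a raw request body. The field names follow the Claude Messages API; the `search_docs` tool is a hypothetical example:

```python
# Two breakpoints: one after the tool definitions, one after the
# long document in the system array. Each cached prefix runs from
# the start of the request up to its breakpoint.
request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_docs",  # hypothetical tool
            "description": "Search internal documentation.",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: tools
        }
    ],
    "system": [
        {"type": "text", "text": "You are a support agent."},
        {
            "type": "text",
            "text": "<long product manual goes here>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: document
        },
    ],
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```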
TTL
The standard TTL is five minutes, refreshed automatically on each cache hit. A longer one-hour TTL is available via explicit opt-in. If no request touches the cache within the TTL, it expires. This short window makes Prompt Caching best suited to interactive sessions and active agent workflows, not infrequently accessed archival use cases.
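The one-hour opt-in is expressed per content block. Based on Anthropic's documented syntax, a "ttl" field is added to cache_control (verify the exact form against the current API docs):

```python
# Default TTL (5 minutes): no "ttl" field needed.
default_block = {
    "type": "text",
    "text": "<long reference document>",
    "cache_control": {"type": "ephemeral"},
}

# Extended TTL opt-in: request a one-hour cache lifetime instead.
long_lived_block = {
    "type": "text",
    "text": "<long reference document>",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```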
Prompt Caching Usage and Examples
Prompt Caching works through the standard Claude Messages API. Below is an idiomatic Python example using Anthropic’s official SDK.
```python
from anthropic import Anthropic

client = Anthropic()

LONG_MANUAL = "..."  # ~10k tokens of internal documentation

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful support agent."},
        {
            "type": "text",
            "text": LONG_MANUAL,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "When is the expense report deadline?"}
    ],
)

print(response.content)
# usage.cache_creation_input_tokens is populated on the first request,
# usage.cache_read_input_tokens on subsequent requests.
```
After each call, check cache_creation_input_tokens and cache_read_input_tokens in the response's usage field. These fields are the definitive signal that caching is actually in effect.
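A small helper can turn those two fields into a readable diagnostic. This is a sketch: it only assumes the two usage field names shown above, and uses getattr defaults so it degrades gracefully if a field is absent:

```python
def describe_cache_activity(usage) -> str:
    """Classify a response's cache behaviour from its usage fields.

    Pass the usage object from client.messages.create(); getattr with
    a default of 0 keeps this safe if a field is missing.
    """
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read at the discounted rate"
    if wrote:
        return f"cache write: {wrote} tokens stored (first request)"
    return "no caching: prefix too small, mismatched, or expired"
```

Calling describe_cache_activity(response.usage) after each request makes regressions (for example, an accidental prefix change) visible immediately in logs.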
Advantages and Disadvantages of Prompt Caching
Advantages
The headline benefit is cost reduction. Cache reads are billed at roughly 10% of normal input tokens (exact ratio varies by model). The second benefit is lower latency: reusing computed state eliminates redundant processing on large prompts. The third benefit is more efficient multi-turn conversations, since long chat histories can be cached instead of being reprocessed from scratch on every turn.
Disadvantages
There are real trade-offs. First, cache writes cost more than normal input tokens (up to 25% more on some models), so caching a one-off prompt is actively wasteful. Second, the short TTL makes caching unsuitable for rarely accessed workloads. Third, byte-exact matching means dynamic fields such as timestamps or user IDs placed inside a cached block will kill your hit rate; structure your prompts with static content first and dynamic content last.
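The third pitfall is worth illustrating. In this sketch, the anti-pattern embeds today's date inside the cached block, so the byte-exact prefix changes daily; the fix moves the dynamic text into a separate, uncached block placed after the breakpoint:

```python
import datetime

# Anti-pattern: the timestamp lives inside the cached block, so the
# prefix differs on every change and every request is a cache miss.
def bad_system_blocks(manual: str) -> list:
    return [{
        "type": "text",
        "text": f"Today is {datetime.date.today()}.\n{manual}",
        "cache_control": {"type": "ephemeral"},
    }]

# Better: the cached block stays byte-stable; dynamic context comes
# after it in an uncached block.
def good_system_blocks(manual: str) -> list:
    return [
        {"type": "text", "text": manual,
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": f"Today is {datetime.date.today()}."},
    ]
```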
Prompt Caching vs RAG
| Aspect | Prompt Caching | RAG |
|---|---|---|
| Primary goal | Cost and latency optimization | Dynamic access to large corpora |
| Doc volume | Tens to hundreds of thousands of tokens | Effectively unlimited (retrieved on demand) |
| Update cadence | Best for slow-changing content | Comfortable with frequent updates |
| Accuracy | Full-document reasoning | Dependent on retrieval quality |
The two techniques are complementary, not competing. A common production pattern is to use RAG to narrow the candidate set and Prompt Caching to reuse the narrowed context across multiple turns in the same session.
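That combined pattern can be sketched as follows. Here retrieve() stands in for whatever RAG retrieval you use (a hypothetical function, not part of any SDK); only the cache_control placement follows the Claude Messages API:

```python
# Hypothetical session flow: RAG narrows the corpus once at the start
# of a session, then Prompt Caching reuses that narrowed context
# across every turn in the session.
def answer_session(client, retrieve, session_questions):
    # One retrieval for the session; joined chunks form the cached context.
    context = "\n\n".join(retrieve(session_questions[0]))
    system = [
        {"type": "text", "text": "Answer using only the provided context."},
        {"type": "text", "text": context,
         "cache_control": {"type": "ephemeral"}},  # reused across turns
    ]
    answers = []
    for question in session_questions:  # turn 1 writes, turns 2+ read
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=system,
            messages=[{"role": "user", "content": question}],
        )
        answers.append(response.content)
    return answers
```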
Common Misconceptions
Misconception 1: Prompt Caching always reduces costs
It only pays off when the cached prefix is reused multiple times within the TTL. Caching a single-use prompt increases costs because of the elevated write price.
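The break-even point falls out of simple arithmetic. Using the same illustrative multipliers as before (~1.25x for writes, ~0.10x for reads; check current pricing per model):

```python
# Cost per prefix-token-unit after N uses of the same cached prefix
# within the TTL, versus paying full input price every time.
def cached_cost(uses: int, write_mult=1.25, read_mult=0.10) -> float:
    return write_mult + read_mult * (uses - 1)

def uncached_cost(uses: int) -> float:
    return float(uses)

assert cached_cost(1) > uncached_cost(1)  # 1.25 > 1.00: single use loses
assert cached_cost(2) < uncached_cost(2)  # 1.35 < 2.00: second use wins
```

Under these assumptions caching pays for itself from the second reuse onward; the loss case is precisely the prefix that is never reused within the TTL.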
Misconception 2: Cached prompts are persistent
The Anthropic cache is explicitly ephemeral. Once the TTL expires without a hit, the cached state is discarded — it is an optimization layer, not a storage tier.
Misconception 3: Caches are shared globally
Caches are scoped to your organization and to a specific model. Different API keys in the same org can share caches, but different models or orgs cannot.
Real-World Use Cases
In practice, Prompt Caching pays off in scenarios where the same large context is reused frequently within a short window: enterprise RAG assistants grounded in product manuals, legal and contract analyzers, code-aware agents that load an entire repository, documentation QA widgets, and internal help desks. ROI peaks when the same prefix is referenced ten or more times within five minutes, so monitor your traffic patterns before rolling it out broadly.
Frequently Asked Questions (FAQ)
Q1. Which models support Prompt Caching?
Caching is available on current Claude generations including Opus 4.6, Sonnet 4.6, and Haiku 4.5. Older models may or may not be supported; check the Anthropic docs for the current compatibility matrix.
Q2. What is the minimum cacheable size?
Minimum sizes vary by model — typically around 1,024 tokens. Smaller prompts ignore the cache marker because the overhead of caching outweighs the benefit.
Q3. How do I verify cache hits?
Inspect the usage object in the API response. The two fields cache_creation_input_tokens and cache_read_input_tokens tell you whether a write or a read occurred.
Q4. How does this differ from OpenAI Prompt Caching?
OpenAI caches the beginning of prompts automatically without user control. Anthropic uses explicit breakpoints, giving developers more precise control over what gets cached and where.
Q5. Can I use Prompt Caching with tool use?
Yes. Tool definitions are a common caching target because they tend to stay stable across many requests and consume meaningful token budget.
Conclusion
- Prompt Caching is Claude’s prompt-reuse optimization, delivering up to 90% cost reduction and 85% latency improvement.
- Developers control what gets cached via explicit cache_control breakpoints (up to four per request).
- TTLs are ephemeral — five minutes by default with a one-hour option for more persistent workflows.
- Cache writes cost more than normal input; only cache content you know will be reused.
- Combining Prompt Caching with RAG yields the best balance of cost, speed, and factual grounding.
- Typical breakpoints are placed after system persona, large documents, and tool definitions.
- Observe cache_read_input_tokens to confirm your caching strategy is actually paying off.
References
- Anthropic, “Prompt Caching” Documentation: https://docs.claude.com/
- Anthropic News, “Prompt caching with Claude”: https://www.anthropic.com/news/prompt-caching
- Anthropic API Reference: https://docs.claude.com/en/api/