What Is Prompt Caching? Claude’s API Feature for Reducing Cost and Latency

What Is Prompt Caching?

Prompt Caching is a Claude API feature from Anthropic that caches a reusable portion of a prompt on the server side so subsequent API calls with the same prefix cost dramatically less and run much faster. According to Anthropic, caching can reduce input costs by up to 90% and latency by up to 85% for long-context, repeated requests — a game changer for chatbots, RAG systems, and agent workflows built on Claude.

The mental model is simple: imagine you send a 10,000-token product manual with every question your chatbot answers. Without caching, you pay the full input-token price every time. With Prompt Caching, the manual is processed once, stored server-side, and then referenced at roughly one-tenth the normal input price on every follow-up call within the cache’s time-to-live. This is what makes AI assistants backed by long system prompts affordable to run at scale.

How to Pronounce Prompt Caching

prompt kash-ing (/prɒmpt ˈkæʃɪŋ/)

prompt cache (/prɒmpt kæʃ/)

How Prompt Caching Works

Anthropic launched Prompt Caching in 2024 with an explicit breakpoint model: you mark positions inside your prompt where caching should occur, and Claude stores the processed internal state for reuse. The next request that begins with the same prefix skips the heavy processing and pays a reduced cache-read rate instead of the full input rate.

Cache Flow

First request (cache-write price) → Server stores state (TTL 5 min or 1 h) → Subsequent requests (cache-read price, ~10% of the normal input rate)

Cache Breakpoints

You enable caching by attaching cache_control: {"type": "ephemeral"} to specific content blocks in the request. Up to four breakpoints can be placed along the prompt — typically one for the system persona, one for a long document, and one for a tool definition block. Note that cache matching is byte-exact: even a single-character change invalidates the cached prefix.
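To make breakpoint placement concrete, the sketch below builds the request body as a plain dict rather than calling the API. The cache_control shape mirrors the Messages API blocks shown later in this article; build_request itself is a made-up helper, and the ordering rule (static first, dynamic last) is the point of the example.

```python
def build_request(persona: str, manual: str, question: str) -> dict:
    """Place breakpoints after the stable blocks; keep dynamic text last."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            # Breakpoint 1: short, stable persona
            {"type": "text", "text": persona,
             "cache_control": {"type": "ephemeral"}},
            # Breakpoint 2: long, rarely-changing document
            {"type": "text", "text": manual,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic content goes last so it never invalidates the cached prefix
        "messages": [{"role": "user", "content": question}],
    }

req = build_request("You are a support agent.", "...manual...", "Deadline?")
breakpoints = sum("cache_control" in block for block in req["system"])
print(breakpoints)  # 2, well under the four-breakpoint limit
```

Because matching is byte-exact, anything that changes per request (the user's question here) belongs after the last breakpoint, never inside a cached block.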

TTL

The standard TTL is five minutes with auto-renewal on each hit. A longer one-hour TTL is available via explicit opt-in. If no request touches the cache within the TTL, it expires. This short window makes Prompt Caching best-suited to interactive sessions and active agent workflows rather than infrequently-hit archival use cases.
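The one-hour opt-in is requested per content block. The ttl field below follows the shape Anthropic documents for the extended cache, but verify against the current API reference before relying on it; the document text is a placeholder.

```python
# A content block opting into the one-hour cache; the default
# (no "ttl" key) is the five-minute ephemeral cache.
long_lived_block = {
    "type": "text",
    "text": "...large reference document...",
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
print(long_lived_block["cache_control"]["ttl"])  # 1h
```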

Prompt Caching Usage and Examples

Prompt Caching works through the standard Claude Messages API. Below is an idiomatic Python example using Anthropic’s official SDK.

from anthropic import Anthropic

client = Anthropic()

LONG_MANUAL = "..."  # ~10k tokens of internal documentation

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful support agent."},
        {
            "type": "text",
            "text": LONG_MANUAL,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "When is the expense report deadline?"}
    ]
)

print(response.content)
# usage.cache_creation_input_tokens (first request)
# usage.cache_read_input_tokens     (subsequent requests)

After each call, check cache_creation_input_tokens and cache_read_input_tokens in the response’s usage field. Nonzero values in these fields are the definitive signal that caching is actually in effect.
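A small helper makes the three possible outcomes explicit. The usage field names match those in the example above; describe_cache_activity is our own convenience wrapper, not part of the SDK, and it takes a plain dict so it can be exercised without an API call.

```python
def describe_cache_activity(usage: dict) -> str:
    """Interpret the cache-related counters from a Messages API response."""
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if read:
        return f"cache hit: {read} tokens read at the discounted rate"
    if created:
        return f"cache write: {created} tokens stored at the write rate"
    return "no cache activity: prompt below minimum size or no breakpoint set"

# First request in a session writes; follow-ups within the TTL read:
print(describe_cache_activity({"cache_creation_input_tokens": 10000}))
print(describe_cache_activity({"cache_read_input_tokens": 10000}))
```

The third branch is worth keeping: a silent zero in both fields usually means the block fell under the model's minimum cacheable size or the breakpoint was omitted.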

Advantages and Disadvantages of Prompt Caching

Advantages

The headline benefit is cost: cache reads are billed at roughly 10% of the normal input-token rate (the exact ratio varies by model). The second is lower latency, since reusing computed state eliminates redundant processing on large prompts. The third is more efficient multi-turn conversations, because long chat histories can be cached instead of being reprocessed from scratch on every turn.

Disadvantages

There are real trade-offs. First, cache writes cost more than normal input tokens (up to 25% more on some models), so caching a one-off prompt is actively wasteful. Second, the short TTL makes caching unsuitable for rarely-accessed workloads. Third, byte-exact matching means dynamic fields like timestamps or user IDs placed inside a cached block will kill your hit rate. Structure your prompts with static content first and dynamic content last.
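The break-even point falls out of simple arithmetic. The multipliers below (1.25x for writes, 0.10x for reads) are the illustrative ratios used in this article; actual ratios vary by model, so treat this as a back-of-envelope sketch.

```python
WRITE_MULT = 1.25  # cache write vs. normal input price (illustrative)
READ_MULT = 0.10   # cache read vs. normal input price (illustrative)

def cached_cost(n_requests: int) -> float:
    """Relative cost of n requests sharing one cached prefix (base = 1.0 per request)."""
    return WRITE_MULT + READ_MULT * (n_requests - 1)

def uncached_cost(n_requests: int) -> float:
    return float(n_requests)

for n in (1, 2, 10):
    print(n, round(cached_cost(n), 2), uncached_cost(n))
# n=1:  1.25 vs 1.0  -> one-off caching is a net loss
# n=2:  1.35 vs 2.0  -> already cheaper on the second request
# n=10: 2.15 vs 10.0 -> roughly 78% saved
```

Under these assumptions caching pays for itself from the second reuse onward; the "ten or more reuses" heuristic later in this article is where the savings approach their asymptotic maximum.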

Prompt Caching vs RAG

Aspect          | Prompt Caching                           | RAG
Primary goal    | Cost and latency optimization            | Dynamic access to large corpora
Document volume | Tens to hundreds of thousands of tokens  | Effectively unlimited (retrieved on demand)
Update cadence  | Best for slow-changing content           | Comfortable with frequent updates
Accuracy        | Full-document reasoning                  | Dependent on retrieval quality

The two techniques are complementary, not competing. A common production pattern is to use RAG to narrow the candidate set and Prompt Caching to reuse the narrowed context across multiple turns in the same session.
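The combined pattern can be sketched as request construction: retrieve once per session, pin the retrieved context behind a breakpoint, and append only the new question each turn. retrieve() is a stand-in for your retrieval layer, and session_request is a hypothetical helper; the cache_control shape is the one used throughout this article.

```python
def retrieve(query: str) -> str:
    """Stand-in for a real retrieval layer (vector search, BM25, etc.)."""
    return "...top-k passages about the session topic..."

def session_request(context: str, history: list, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            # Stable for the whole session, so it is cached across turns
            {"type": "text", "text": context,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": history + [{"role": "user", "content": question}],
    }

ctx = retrieve("expense policy")
turn1 = session_request(ctx, [], "When is the deadline?")
history = [
    {"role": "user", "content": "When is the deadline?"},
    {"role": "assistant", "content": "The 5th of each month."},
]
turn2 = session_request(ctx, history, "Is there a grace period?")
print(turn1["system"] == turn2["system"])  # True: byte-identical prefix, turn 2 reads the cache
```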

Common Misconceptions

Misconception 1: Prompt Caching always reduces costs

It only pays off when the cached prefix is reused multiple times within the TTL. Caching a single-use prompt increases costs because of the elevated write price.

Misconception 2: Cached prompts are persistent

The Anthropic cache is explicitly ephemeral. Once the TTL expires without a hit, the cached state is discarded — it is an optimization layer, not a storage tier.

Misconception 3: Caches are shared globally

Caches are scoped to your organization and to a specific model. Different API keys in the same org can share caches, but different models or orgs cannot.

Real-World Use Cases

In practice, Prompt Caching pays off in scenarios where the same large context is reused frequently within a short window: enterprise RAG assistants grounded in product manuals, legal and contract analyzers, code-aware agents that load an entire repository, documentation QA widgets, and internal help desks. Keep in mind that the ROI peaks when the same prefix is referenced ten or more times within five minutes, so monitoring your traffic patterns is an important first step before rolling it out broadly.

Frequently Asked Questions (FAQ)

Q1. Which models support Prompt Caching?

Caching is available on current Claude generations including Opus 4.6, Sonnet 4.6, and Haiku 4.5. Older models may or may not be supported; check the Anthropic docs for the current compatibility matrix.

Q2. What is the minimum cacheable size?

Minimum sizes vary by model — typically around 1,024 tokens. Smaller prompts ignore the cache marker because the overhead of caching outweighs the benefit.

Q3. How do I verify cache hits?

Inspect the usage object in the API response. The two fields cache_creation_input_tokens and cache_read_input_tokens tell you whether a write or a read occurred.

Q4. How does this differ from OpenAI Prompt Caching?

OpenAI caches the beginning of prompts automatically without user control. Anthropic uses explicit breakpoints, giving developers more precise control over what gets cached and where.

Q5. Can I use Prompt Caching with tool use?

Yes. Tool definitions are a common caching target because they tend to stay stable across many requests and consume meaningful token budget.
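Applying the breakpoint model to tools means marking the final tool definition, which covers the whole tools array up to that point. The get_weather tool below is invented for illustration, and the input_schema shape follows the Messages API tool format; confirm the exact placement rule against the current docs.

```python
tools = [
    {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
]
# Mark the last tool so the full, stable tool block sits inside the cached prefix:
tools[-1]["cache_control"] = {"type": "ephemeral"}
print("cache_control" in tools[-1])  # True
```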

Conclusion

  • Prompt Caching is Claude’s prompt-reuse optimization, delivering up to 90% cost reduction and 85% latency improvement.
  • Developers control what gets cached via explicit cache_control breakpoints (up to four per request).
  • TTLs are ephemeral — five minutes by default with a one-hour option for more persistent workflows.
  • Cache writes cost more than normal input; only cache content you know will be reused.
  • Combining Prompt Caching with RAG yields the best balance of cost, speed, and factual grounding.
  • Typical breakpoints are placed after system persona, large documents, and tool definitions.
  • Observe cache_read_input_tokens to confirm your caching strategy is actually paying off.
