What Is the Message Batches API?
The Message Batches API is Anthropic’s asynchronous batch processing endpoint for Claude. It lets you submit up to 100,000 individual Messages-API-style requests in a single job and receive every response within 24 hours, at a flat 50% discount on both input and output tokens compared with the standard Messages API. If your workload doesn’t need millisecond-latency answers, the Batches API is the cheapest legitimate way to run Claude at scale.
A useful analogy: the standard Messages API is like checking out one item at a time at a convenience-store register, whereas the Batches API is more like booking a freight courier — slower, but much cheaper per package. Anthropic’s positioning is that “real-time pricing should reflect real-time scheduling cost,” which is why deferred work gets a structural discount. OpenAI ships an equivalent Batch API and Google offers a similar batch endpoint for Gemini, so familiarity with this pattern is now table stakes for production LLM teams.
How to Pronounce Message Batches API
Message Batches API (/ˈmɛs.ɪdʒ ˈbæ.tʃɪz eɪ.piːˈaɪ/)
Batches API (short form)
How the Message Batches API Works
The Message Batches API became generally available in October 2024 and is implemented as an asynchronous queue on top of the same Claude inference platform that serves the synchronous Messages API. Submitting a request returns immediately with a batch identifier; results are delivered in JSONL once the job has been scheduled, executed, and verified.
Processing Pipeline
(Figure: Message Batches API workflow)
Every request inside a batch must carry a custom_id. This identifier is the only thing that ties an output back to its input, so it must be unique within the batch and meaningful in your domain (document_id, user_id, eval_case_id, etc.). It is important to choose this naming scheme up front — refactoring it later when results have already been written to a downstream store is painful.
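For illustration only, here is one way to construct and sanity-check such identifiers before submission; the "summarize-v2" prefix and the doc_meta dictionary are placeholders for whatever keys your own system uses, not part of the API:

```python
# Sketch: derive custom_ids from your own primary keys and verify uniqueness
# before submission. Only custom_id and params are part of the batch schema;
# the prefix and doc_meta are hypothetical.
def build_requests(doc_meta: dict[str, str]) -> list[dict]:
    requests = [
        {
            "custom_id": f"summarize-v2-{doc_id}",  # stable, domain-meaningful key
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 512,
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            },
        }
        for doc_id, text in doc_meta.items()
    ]
    ids = [r["custom_id"] for r in requests]
    assert len(ids) == len(set(ids)), "custom_id must be unique within a batch"
    return requests
```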
Quotas and Limits
| Property | Limit |
|---|---|
| Requests per batch | 100,000 |
| Total payload per batch | 256 MB |
| Maximum completion time | 24 hours (most batches finish < 1 hour) |
| Result retention | 29 days after completion |
| Supported models | Opus 4.6, Sonnet 4.6, Haiku 4.5 and other current Claude models |
| Extended output | Up to 300K output tokens per request via beta header |
Message Batches API Usage and Examples
Basic Quick Start
import anthropic
client = anthropic.Anthropic()
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-001",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Summarize: ..."}
                ]
            }
        },
        {
            "custom_id": "doc-002",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": "Summarize: ..."}
                ]
            }
        }
    ]
)
print(batch.id, batch.processing_status)
Polling and Streaming Results
import time
while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    time.sleep(60)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        text = result.result.message.content[0].text
        print(result.custom_id, text)
    elif result.result.type == "errored":
        print(result.custom_id, "ERROR", result.result.error)
Common Implementation Patterns
Pattern A: Bulk document summarization
requests = []
for doc_id, doc_text in corpus.items():
    requests.append({
        "custom_id": doc_id,
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": f"Summarize in 3 lines: {doc_text}"}]
        }
    })
batch = client.messages.batches.create(requests=requests)
Good fit: Tens of thousands of documents to classify, summarize, or label offline. The cost halves directly, which often turns a $10K/month workload into a $5K one.
Bad fit: Anything in front of a chat user. Asking someone to wait up to 24 hours for an answer is not a UX you should ship.
Pattern B: Batching with Prompt Caching
requests = [{
"custom_id": f"q-{i}",
"params": {
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"system": [{
"type": "text",
"text": "(huge shared system prompt, 10K tokens)",
"cache_control": {"type": "ephemeral"}
}],
"messages": [{"role": "user", "content": q}]
}
} for i, q in enumerate(questions)]
Good fit: Evaluation jobs that re-run the same massive system prompt across thousands of test cases. Stacking the 50% batch discount with the 90% prompt cache read discount can drop your effective rate to ~5% of the standard Messages API price.
Bad fit: When the system prompt varies per request — there is nothing to cache and the extra complexity buys nothing.
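To see why the numbers work out, here is the back-of-the-envelope arithmetic, assuming Claude Sonnet's list input price of $3 per million tokens and treating the two discounts as multiplicative, which is how this article presents them:

```python
# Illustrative arithmetic: effective price of the *cached* input tokens when the
# prompt-cache read discount (90% off) and the batch discount (50% off) stack.
list_input_price = 3.00                        # $/MTok, Claude Sonnet list price
cache_read_price = list_input_price * 0.10     # cache read: 90% off -> $0.30
batched_cache_read = cache_read_price * 0.50   # batch: 50% off     -> $0.15
print(batched_cache_read / list_input_price)   # 0.05 -> roughly 5% of list price
```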
Anti-pattern: Polling status every second
# DO NOT DO THIS
import time
while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    time.sleep(1)  # 1-second polling -> wasted requests, possible rate limits
Batches typically take minutes to hours to complete, so poll every 1–10 minutes at most. In production, schedule a periodic checker (EventBridge, Cloud Scheduler, GitHub Actions cron) rather than holding a long-lived loop in your application code; a sketch of that pattern follows below. Hammering retrieval endpoints is a textbook misuse of an asynchronous API.
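A minimal sketch of the cron-driven pattern, assuming the batch ID is stored out-of-band (an environment variable here); in a real deployment the print calls would be replaced by writes to durable storage:

```python
import os
import anthropic

client = anthropic.Anthropic()

def check_batch_once() -> bool:
    """Invoked by an external scheduler every few minutes; returns True when done."""
    batch_id = os.environ["BATCH_ID"]       # assumption: ID stored out-of-band
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return False                         # not finished; exit and let the next run check
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            print(result.custom_id, result.result.message.content[0].text)
        else:
            print(result.custom_id, "not succeeded:", result.result.type)
    return True

if __name__ == "__main__":
    check_batch_once()
```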
Advantages and Disadvantages of the Message Batches API
Advantages
- Flat 50% discount on both input and output tokens, applied to every supported model. Factor it in when budgeting for synthetic data generation, classification, and offline evals.
- Massive jobs in a single API call (up to 100K requests). The orchestration code on your side basically disappears.
- Stacks with Prompt Caching for additional cost savings, sometimes pushing effective cost below 5% of the on-demand rate.
- Independent rate-limit pool — batch traffic does not consume the RPM/TPM quotas your synchronous production workloads rely on.
- Simple result format (JSONL with custom_id) that is easy to load into BigQuery, Snowflake, or any data lake.
Disadvantages
- No real-time delivery. The 24-hour SLA makes the API unusable for chat UIs or interactive agents — keep that in mind before reaching for it.
- Result retention is only 29 days; persist outputs to your own storage immediately after retrieval.
- No streaming; you cannot tail tokens as they are produced, which rules out any code-generation flow that wants progressive output.
- Cancellation is best-effort. Once the job is processing, you cannot cancel mid-flight without paying for already-consumed tokens.
Message Batches API vs Standard Messages API
You should think of these two APIs as siblings rather than alternatives: same model, same prompt format, same answers, but different scheduling guarantees. The table below summarizes when each one fits.
| Aspect | Message Batches API | Standard Messages API |
|---|---|---|
| Mode | Asynchronous batch | Synchronous |
| Pricing | 50% off list price | List price |
| Latency | Up to 24 hours (often < 1 hour) | Seconds to tens of seconds |
| Requests per call | 100,000 / 256 MB | 1 |
| Streaming | Not supported | Server-sent events |
| Typical use case | Bulk summarization, evals, dataset generation | Chat UIs, agents, IDE assistants |
| Rate-limit pool | Independent | Shared production RPM/TPM |
The simple rule: if you can wait, batch it. Modern Claude operations push every workload that doesn’t need real-time response into Batches by default — that single decision frequently halves the inference bill.
Common Misconceptions
Misconception 1: “Batch results are lower quality than synchronous results.”
Why people get confused: When humans hear “50% cheaper” they instinctively assume something must be worse. Cloud computing reinforces this with services like spot instances, where the price drop comes at the cost of availability, and that analogy seeps into LLM thinking.
The reality: Anthropic’s documentation is explicit that batch and synchronous responses come from the same model with the same sampling parameters. There is no degradation in quality, accuracy, or determinism — only in delivery time.
Misconception 2: “Batch means unlimited.”
Why people get confused: The naming, plus the fact that batch traffic doesn’t share the RPM/TPM quotas of the standard API, leads newcomers to believe there are no caps. Combined with the headline 50% discount, this creates a misleading mental model of an “unlimited cheap pipe.”
The reality: Each batch is capped at 100K requests / 256 MB, and your account has a maximum number of concurrent in-flight batches. Sharding and back-pressure are real concerns at scale.
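A minimal sharding sketch under the request-count cap; a production version would also track the 256 MB payload limit and the number of concurrent in-flight batches your account allows:

```python
import anthropic

client = anthropic.Anthropic()
MAX_REQUESTS_PER_BATCH = 100_000  # per-batch request cap

def submit_in_shards(requests: list[dict]) -> list[str]:
    """Split a large request list into consecutive shards and submit each one."""
    batch_ids = []
    for start in range(0, len(requests), MAX_REQUESTS_PER_BATCH):
        shard = requests[start:start + MAX_REQUESTS_PER_BATCH]
        batch = client.messages.batches.create(requests=shard)
        batch_ids.append(batch.id)
    return batch_ids
```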
Misconception 3: “Results always arrive at the 24-hour mark.”
Why people get confused: The 24-hour figure is a hard SLA, but many readers misread it as a fixed delivery time rather than an upper bound, perhaps because retail shipping companies brand “24-hour delivery” as a promise about timing rather than a worst case.
The reality: Anthropic publishes 24 hours as the maximum. In production most batches complete in under one hour. Design for the 24-hour worst case, but don’t hard-code a one-day delay into your pipeline.
Real-World Use Cases
Bulk content classification
Hundreds of thousands of customer reviews can be tagged for sentiment, intent, and topic in a few hours of batch time. Cost halves and your synchronous quota stays clean for the live product. This is one of the highest-ROI applications of the Batches API in production today, and many SaaS companies have moved their nightly classification jobs to this endpoint specifically for the discount. The pattern is straightforward: enqueue every record updated in the last 24 hours, kick off the batch from a cron job, and persist the resulting labels back into the data warehouse before the next morning’s analytics jobs start.
Evaluation harnesses
Re-running 5,000 prompt regression tests after every prompt change becomes financially trivial when the bill is half the list price. This is the canonical use case behind Anthropic Workbench’s eval features. Teams running disciplined prompt engineering typically maintain a “golden set” of 1,000–10,000 representative inputs and re-evaluate the entire set whenever a prompt is changed. Without batching, that single regression run could cost tens of dollars per iteration; with batching it is pocket change, which means engineers actually run it. Beyond the cost savings, the faster iteration on prompt quality is a second, arguably more important, benefit.
Synthetic data generation
Producing fine-tuning datasets, instruction-tuning examples, or before/after refactor pairs at industrial scale. This matters most for organizations training their own models or distilling behavior into smaller open-weight models. A common workflow is to use Claude Opus 4.6 (the most capable model) in batch mode to generate a high-quality synthetic dataset, which is then used to fine-tune a smaller, cheaper model that can be deployed for real-time inference. The 50% discount makes the up-front data generation cost manageable, and the resulting model serves traffic at a fraction of the cost of running Opus directly.
Multilingual news pipelines
News organizations submit thousands of articles for translation each evening and retrieve results before the next morning’s editorial cycle. The asynchronous SLA aligns naturally with the publishing rhythm. Batch processing pairs well with prompt caching here: the translation system prompt (style guide, glossary, brand voice rules) often runs to several thousand tokens and is identical across every article in the batch. With caching the per-article token cost can be reduced by an order of magnitude beyond the batch discount itself.
RAG corpus enrichment
Vector database pipelines often need each document chunk to be enriched with summaries, generated questions, or topic labels before being indexed. The Batches API is a perfect fit because the enrichment is offline, the volume is large, and the work is parallelizable. The custom_id naturally maps to your chunk_id, making the round-trip integration trivial. Many teams report that adding a Claude-generated “hypothetical question” field to each chunk improves retrieval recall significantly, and the Batches API makes the cost of doing so for millions of chunks reasonable.
Customer support triage backfill
When a support team rolls out a new triage taxonomy, every historical ticket needs to be re-classified to keep trend reports comparable. Run this as a single, well-defined batch rather than as ad-hoc backfill jobs that consume the live API quota and risk impacting customer-facing chat workloads. A 100K-ticket backfill in a single batch is a one-line decision rather than a multi-week engineering project.
Compliance and content moderation re-scoring
Trust and safety teams periodically need to re-score a corpus of historical content against an updated policy. The asynchronous nature of the Batches API matches the typical workflow: run a re-score job over a quarter’s worth of data, review the results, ship the policy update. The 50% discount makes it financially viable to run these jobs more often, which in turn keeps policies more current.
Operational Best Practices
Running the Message Batches API in production requires more discipline than calling the synchronous Messages API. With synchronous calls, errors surface immediately and are easy to retry; with batches, the entire feedback loop runs over hours, which means failures are slower to discover and harder to recover from. The following practices are what experienced teams converge on after a few months of running batched workloads at scale.
Persist results immediately on retrieval
Anthropic only retains batch results for 29 days after the job ends. Write retrieved results to durable storage (S3, GCS, a database) as soon as they arrive, ideally inside the same retrieval loop, and treat the API as a delivery channel, not a storage system. A common production failure mode is “we retrieved the results but lost them because someone restarted the worker before the data was committed downstream.” Use idempotent writes keyed by custom_id so that retrieval is safe to retry, as in the sketch below.
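A minimal sketch of that idempotent pattern, using SQLite purely as a stand-in for your real warehouse or object store; the table name and schema are illustrative, and the sketch assumes the SDK’s result objects expose Pydantic’s model_dump(), which current versions do:

```python
import json
import sqlite3
import anthropic

client = anthropic.Anthropic()
db = sqlite3.connect("batch_results.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS results (custom_id TEXT PRIMARY KEY, payload TEXT)"
)

def persist_results(batch_id: str) -> None:
    """Safe to re-run: the PRIMARY KEY upsert makes each write idempotent."""
    for result in client.messages.batches.results(batch_id):
        payload = json.dumps(result.result.model_dump(), default=str)
        db.execute(
            "INSERT INTO results (custom_id, payload) VALUES (?, ?) "
            "ON CONFLICT(custom_id) DO UPDATE SET payload = excluded.payload",
            (result.custom_id, payload),
        )
    db.commit()
```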
Validate request payloads before submission
You cannot fix a malformed request after a batch starts processing: that request will fail, you will not be charged for it, but you will need to resubmit it as part of a new batch. The small upfront cost of running a Pydantic or JSON Schema validation pass on every request before submission saves significant operational pain when even one request out of 100,000 is malformed. Define the schema once in code and reuse it everywhere requests are constructed; a sketch follows below.
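A sketch of what a Pydantic pre-flight check might look like; the models below cover only the fields used in this article, whereas the real request schema accepts more (system, temperature, tools, and so on):

```python
from pydantic import BaseModel, Field, ValidationError

class Message(BaseModel):
    role: str
    content: str

class Params(BaseModel):
    model: str
    max_tokens: int = Field(gt=0)
    messages: list[Message] = Field(min_length=1)

class BatchRequest(BaseModel):
    custom_id: str = Field(min_length=1)
    params: Params

def validate_requests(raw_requests: list[dict]) -> list[dict]:
    """Raise before submission instead of discovering bad rows hours later."""
    bad = []
    for i, req in enumerate(raw_requests):
        try:
            BatchRequest.model_validate(req)
        except ValidationError as exc:
            bad.append((i, str(exc)))
    if bad:
        raise ValueError(f"{len(bad)} malformed request(s): {bad[:3]}")
    return raw_requests
```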
Use exponential backoff for retrieval
If the API returns 429 or 5xx during result retrieval, retry with exponential backoff. Retrieval is a read-only operation, so retries are always safe. Production-grade implementations cap the maximum retry duration so that a stuck batch doesn’t stall the entire pipeline indefinitely; pair this with an alarm that fires when retrieval has been retrying for longer than the SLA window. A minimal backoff sketch follows.
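The sketch below uses anthropic.APIStatusError and its status_code attribute, which match the current Python SDK; treat the exact exception handling as an assumption to adjust for your SDK version:

```python
import time
import anthropic

client = anthropic.Anthropic()

def retrieve_with_backoff(batch_id: str, max_attempts: int = 6):
    """Retry status retrieval with exponential backoff; reads are safe to repeat."""
    delay = 5.0
    for attempt in range(max_attempts):
        try:
            return client.messages.batches.retrieve(batch_id)
        except anthropic.APIStatusError as exc:   # assumption: 429/5xx surface here
            if exc.status_code != 429 and exc.status_code < 500:
                raise                              # don't retry other client errors
            time.sleep(delay)
            delay = min(delay * 2, 300)            # cap the wait at 5 minutes
    raise RuntimeError(f"retrieval still failing after {max_attempts} attempts")
```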
Monitor batch-level metrics
Track total batches submitted, in-progress count, average completion time, and failure rate per day. Alert when the failure rate exceeds 1% or when the in-progress count grows without bound; both are signals that something is wrong upstream. Also monitor cost per result over time to catch unexpected token-usage growth.
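A small sketch of how the failure-rate piece could be derived straight from the results stream; wiring the number into your alerting system is left out:

```python
from collections import Counter
import anthropic

client = anthropic.Anthropic()

def batch_failure_rate(batch_id: str) -> float:
    """Tally result types for a finished batch and return the failure rate."""
    counts = Counter(
        result.result.type
        for result in client.messages.batches.results(batch_id)
    )
    total = sum(counts.values())
    failures = total - counts.get("succeeded", 0)
    print(dict(counts))  # e.g. {'succeeded': 9950, 'errored': 50}
    return failures / total if total else 0.0
```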
Plan for the 24-hour worst case
The SLA is 24 hours, not “usually a few minutes.” Production pipelines that depend on batch results should be designed so that nothing critical happens until results are confirmed retrieved. Never schedule a downstream consumer with a hard deadline tighter than 24 hours after batch submission unless you have explicit fallback logic for the late-result case. This matters most when batches are part of a customer-facing SLA: bake in the worst case from day one.
Frequently Asked Questions (FAQ)
Q1. How much cheaper is the Message Batches API compared with the standard Messages API?
Both input and output tokens are billed at exactly 50% off the standard list price. For Claude Sonnet 4.6 that means $1.50/$7.50 per million tokens instead of $3/$15. For very large offline jobs the cost savings translate directly to half the invoice.
Q2. When do batch results come back?
The official maximum is 24 hours, but in practice most batches finish in under one hour. Because the SLA is 24 hours and not, say, 4 hours, you should always design clients that tolerate the worst case.
Q3. How many requests can fit in a single batch?
Up to 100,000 requests or 256 MB of payload — whichever you hit first. Larger workloads need to be sharded into multiple batches.
Q4. Is the answer quality lower than the standard API?
No. Same model and same parameters return the same response distribution. The discount comes from scheduling flexibility, not from a degraded model.
Q5. Can I combine batching with Prompt Caching?
Yes. The batch 50% discount stacks on top of Prompt Caching’s 90% read discount. For evaluation jobs that share a huge system prompt this can drop the effective price to less than 5% of the original Messages API rate.
Conclusion
- The Message Batches API is Anthropic’s asynchronous, 50%-discounted way to call Claude at scale.
- Up to 100,000 requests or 256 MB per batch, completed within 24 hours (often much sooner).
- Best for offline workloads where latency is irrelevant: bulk summarization, evals, synthetic data, multilingual content pipelines.
- Output quality is identical to the synchronous Messages API; only delivery time differs.
- Stacks with Prompt Caching for further cost reductions exceeding 90% in some workloads.
- Not for chat UIs, agents, or anything requiring streaming — the 24-hour SLA disqualifies those use cases.
References
- Anthropic, “Batch processing” https://platform.claude.com/docs/en/build-with-claude/batch-processing
- Anthropic, “Pricing” https://platform.claude.com/docs/en/about-claude/pricing
- Anthropic, “Plans & Pricing” https://claude.com/pricing
- Finout, “Anthropic API Pricing in 2026” https://www.finout.io/blog/anthropic-api-pricing