What Is Speculative Decoding?
Speculative Decoding is an inference acceleration technique for large language models that delivers 2–3x throughput improvements without any change to the output distribution. Standard autoregressive generation produces one token per forward pass, and each pass reloads the full model weights from VRAM, which makes inference memory-bandwidth bound. Speculative Decoding pairs the target model with a small, fast draft model that proposes several next tokens; the target model verifies them all in a single forward pass, accepts the longest correct prefix, and continues. Quality is mathematically identical to running the target model alone, but wall-clock latency falls dramatically.
A useful analogy: speculative decoding is like a junior assistant drafting several sentences while a senior reviewer reads them all at once and accepts the matching prefix. If the assistant's draft is correct, you get many sentences for the price of one review pass; if the draft diverges, the reviewer catches it and the system continues from there. Every major LLM serving framework (vLLM, TGI, TensorRT-LLM, SGLang) ships speculative decoding as a built-in optimization, and some modern pretrained models, such as DeepSeek-V3 and recent Qwen releases, even bake speculative-friendly Multi-Token Prediction heads into their architecture.
How to Pronounce Speculative Decoding
Speculative Decoding (/ˈspɛk.jʊ.lə.tɪv diːˈkoʊ.dɪŋ/)
Spec Decoding (informal short form)
How Speculative Decoding Works
Speculative Decoding originated in research from Google and DeepMind around 2022–2023 and quickly became a default optimization across the LLM inference stack. The core insight is that LLM inference is bottlenecked by memory bandwidth — loading the model weights from VRAM into GPU caches dominates the time per token, while the actual matrix-multiply compute is comparatively cheap. If you can verify several proposed tokens in a single forward pass, you get more useful work for the same memory transfer cost.
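A back-of-envelope calculation shows how hard this ceiling is. The numbers below are illustrative assumptions (a 70B-parameter model in FP16 on a GPU with roughly 3.3 TB/s of memory bandwidth), not measurements:

# Rough per-token latency ceiling for memory-bound autoregressive decoding.
# Assumed, illustrative numbers: 70B parameters at FP16 (2 bytes/param)
# on a GPU with ~3.3 TB/s of HBM bandwidth.
params = 70e9
bytes_per_param = 2            # FP16
hbm_bandwidth = 3.3e12         # bytes/s, approximate

weight_bytes = params * bytes_per_param            # ~140 GB of weights
seconds_per_token = weight_bytes / hbm_bandwidth   # every token re-reads the weights
print(f"ceiling: {1 / seconds_per_token:.0f} tokens/s")   # ~24 tokens/s

Verifying k draft tokens in one target pass amortizes that weight read across several tokens, which is exactly the headroom speculative decoding exploits.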
Pipeline overview
The speculative decoding pipeline repeats four steps:
1. The draft model proposes k candidate tokens autoregressively.
2. The target model scores all k positions in a single forward pass.
3. The verification step accepts the longest prefix consistent with the target's distribution.
4. The target supplies a corrected token at the first rejection (or a bonus token if all k are accepted), and the loop repeats.
Crucially, Speculative Decoding guarantees an identical output distribution to running the target model alone. This is a mathematical property, not an engineering approximation; it follows from a careful application of Rejection Sampling theory. The system "speculates" about future tokens, but every accepted token still comes from the target model's distribution. This lossless property is what makes Speculative Decoding so widely deployed: it is one of the rare optimizations in machine learning that costs nothing in quality.
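For intuition, here is a self-contained toy sketch of the draft-verify loop for the greedy-decoding case, with tiny numpy "models" standing in for real LLMs (everything here is illustrative). With greedy decoding, accepting the longest prefix on which draft and target argmax agree reproduces the target's output exactly:

import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def logits(weights, ctx):
    # Toy stand-in for a model forward pass: deterministic logits from context.
    return np.sum(weights[ctx], axis=0) if ctx else np.zeros(VOCAB)

W_target = rng.normal(size=(VOCAB, VOCAB))
W_draft = W_target + 0.3 * rng.normal(size=(VOCAB, VOCAB))  # imperfect copy

def greedy_speculative(ctx, k=4, rounds=8):
    out = list(ctx)
    for _ in range(rounds):
        # 1) Draft proposes k tokens autoregressively (the cheap model).
        draft = []
        for _ in range(k):
            draft.append(int(np.argmax(logits(W_draft, out + draft))))
        # 2) Target scores every position (one batched pass in a real system).
        verified = [int(np.argmax(logits(W_target, out + draft[:i])))
                    for i in range(k)]
        # 3) Accept the longest prefix where draft and target agree...
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        out.extend(draft[:n])
        # 4) ...then take the target's token at the first mismatch,
        #    or a bonus token if everything was accepted.
        out.append(verified[n] if n < k else int(np.argmax(logits(W_target, out))))
    return out

print(greedy_speculative([1, 2, 3]))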
Key variants
| Variant | Distinguishing feature |
|---|---|
| Vanilla Speculative Decoding | Small draft model + target model (canonical) |
| Medusa | Multiple prediction heads on the target itself |
| EAGLE / EAGLE-2 | Drafts using intermediate target features for higher acceptance |
| Multi-Token Prediction (MTP) | Pretraining-time MTP heads (DeepSeek-V3, recent Qwen) |
| Lookahead Decoding | Draft-free; uses N-gram lookahead |
| Cascade Speculative Drafting | Multi-tier drafts for further speedup |
Speculative Decoding Usage and Examples
Quick Start with vLLM
from vllm import LLM, SamplingParams
# Target: Llama 3 70B, draft: Llama 3 8B (same family, same tokenizer)
# Note: keyword names vary across vLLM releases; newer versions take a
# speculative_config dict instead. Check the docs for your version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens=5,
)
prompt = "Explain LLM inference optimization techniques."
out = llm.generate([prompt], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)
Hugging Face transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assisted generation: pass the draft as assistant_model to generate().
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B", torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

inputs = tok("Hello", return_tensors="pt").to(target.device)
out = target.generate(
    **inputs,
    assistant_model=draft,   # enables speculative (assisted) decoding
    max_new_tokens=200,
)
print(tok.decode(out[0], skip_special_tokens=True))
Common Implementation Patterns
Pattern A: Paired models (vanilla)
# Pick a draft 1/10 to 1/100 the size of the target
# Same family, same tokenizer
# e.g. Llama 3 70B + Llama 3 8B, Qwen2 72B + Qwen2 0.5B
Good fit: Model families that publish multiple sizes with shared tokenizers. The simplest setup with the fewest moving parts.
Bad fit: Mixing tokenizers across families. Token mismatch produces zero acceptance rate, which is worse than no speculation.
Pattern B: Medusa or EAGLE heads
# Add prediction heads to the target model itself; no second model required.
# Import path follows the FasterDecoding/Medusa repository; check your
# installed version, as the package API has changed over time.
from medusa.model.medusa_model import MedusaModel

medusa = MedusaModel.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")
Good fit: Teams that prefer not to manage a second model and want to use pre-trained Medusa or EAGLE heads. Operationally simpler.
Bad fit: Custom fine-tuned models without matching pre-trained heads — you must train the heads yourself, which adds significant overhead.
Anti-pattern: Mismatched tokenizers
# DO NOT DO THIS
target = "meta-llama/Llama-3-70B" # Llama tokenizer
draft = "Qwen/Qwen2-0.5B" # Qwen tokenizer (different!)
# Tokenizations diverge, every draft is rejected, system slows down
The draft and target models must share the same tokenizer. If tokenization splits differ, the target cannot meaningfully verify the draft's tokens, the acceptance rate goes to zero, and you pay all the cost of speculation with none of the benefit. Always pick the draft from the same model family as the target.
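A cheap pre-deployment sanity check is to compare how both tokenizers split representative text. A minimal sketch with Hugging Face tokenizers (the model IDs are examples):

from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

sample = "Speculative decoding verifies draft tokens in one forward pass."
t_ids, d_ids = target_tok.encode(sample), draft_tok.encode(sample)

# Identical vocabularies and identical token streams are the requirement.
assert target_tok.vocab_size == draft_tok.vocab_size, "vocab size mismatch"
assert t_ids == d_ids, "tokenizations diverge -- do not pair these models"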
Advantages and Disadvantages of Speculative Decoding
Advantages
- 2–3x speedup, often more. Typical chat and summarization workloads see 2–3x; long-form generation can reach 5x.
- Zero quality cost. Output distribution is mathematically identical to the target model run alone.
- Built into all major inference servers. vLLM, TGI, TensorRT-LLM, and SGLang all ship support out of the box.
- Best returns on long outputs. The longer the generation, the larger the absolute time saved.
- Modest memory overhead. The draft model is small enough to fit on the same GPU as the target in most setups.
Disadvantages
- Draft selection matters. A poorly chosen draft (low acceptance rate) can make inference slower than no speculation.
- Extra VRAM. The draft model occupies additional memory, typically 7–14 GB for a 7B-class draft depending on precision.
- Diminishing returns at high batch sizes. When the system is compute-bound rather than bandwidth-bound, the speedup shrinks.
- Limited applicability outside Transformers. State-space models like Mamba need different optimization techniques.
Speculative Decoding vs Continuous Batching
Both Speculative Decoding and Continuous Batching are inference optimizations widely deployed in production LLM serving, but they target different bottlenecks. The table below clarifies how to think about them.
| Aspect | Speculative Decoding | Continuous Batching |
|---|---|---|
| Optimizes | Single-request latency | Multi-request throughput |
| Requires | Draft model or specialized heads | Just scheduler logic |
| Best when | Batch size small to medium | Many concurrent requests |
| Quality impact | None (mathematically guaranteed) | None (just scheduling) |
| Extra memory | Yes (draft model) | None |
| Stackable | Yes, with continuous batching | Yes, with speculation |
These techniques are complementary and almost always deployed together: vLLM and TGI combine speculative decoding with continuous batching and PagedAttention, and that stack is what delivers the throughput-per-dollar of modern LLM serving.
Common Misconceptions
Misconception 1: “Speculative decoding trades quality for speed.”
Why people get confused: The word “speculative” suggests guessing or approximation, and other LLM acceleration techniques like quantization and pruning genuinely do trade quality for speed. The reason this conflation persists is that all these techniques live in the same “make LLMs faster” mental category, even though they are mechanically very different.
The reality: Speculative decoding produces output drawn from exactly the same distribution as the target model running alone. This is a property of Rejection Sampling — the math guarantees it. The only thing that varies is wall-clock latency, not the answer.
Misconception 2: “Bigger draft models give bigger speedups.”
Why people get confused: There’s an intuitive belief that a smarter draft would predict more correctly, so making the draft bigger should make the system faster. The reason this is misleading is that the cost of running the draft model is itself part of the total compute budget.
The reality: A draft that is too large becomes its own bottleneck. The empirical sweet spot is a draft 1/10 to 1/100 the size of the target. Smaller drafts are cheap to run and predict well enough on the easy parts of the sequence, which is where most of the speedup comes from.
Misconception 3: “It works equally well on all workloads.”
Why people get confused: The headline “2-3x speedup” number from research papers gets quoted as if it applies universally. The reason this is misleading is that the speedup depends heavily on workload characteristics — batch size, output length, task domain — that papers usually report only for narrow benchmarks.
The reality: Code generation typically gets less speedup than natural-language chat because code tokens are harder for small drafts to predict. High batch sizes also reduce the speedup because the system shifts from being memory-bound to compute-bound. Benchmark on your own workload before committing.
Real-World Use Cases
Chat UIs and conversational agents
Latency dominates user experience in chat interfaces, and cutting time-to-completion in half can fundamentally change how users perceive the product. This is the most common reason teams enable speculative decoding in production.
Long-form generation
Article writing, code synthesis, and document summarization all benefit disproportionately from speculation because the absolute time saved scales with the output length. A 5x speedup on a 30-second generation saves more wall-clock time than a 5x speedup on a 1-second generation.
Self-hosted inference cost reduction
If your GPU fleet handles N requests per second today, speculation lets the same fleet handle 2–3N requests per second. That often translates directly into halving your inference infrastructure bill, or doubling the number of users you can serve from the same hardware.
On-device LLM inference
Phones, laptops, and edge inference appliances are all memory-bandwidth constrained. Speculative decoding (often paired with quantization) is one of the key enabling techniques for the local-LLM movement, including features like Apple Intelligence and on-device assistants.
Batch evaluation and ranking workloads
Even though batch evaluation is throughput-oriented, individual rerank or scoring requests still benefit when the batch size is moderate. Test in your specific configuration; the speedup curve as a function of batch size is workload-dependent.
Pretraining-aligned MTP serving
Models pretrained with Multi-Token Prediction (DeepSeek-V3, some recent Qwen releases, and certain experimental Llama variants) ship with built-in speculative-friendly heads. These achieve higher acceptance rates than retrofitted draft pairs, often 70–80% versus 50–60% for vanilla setups.
Tuning Speculative Decoding in Production
Getting the most out of speculative decoding requires careful tuning of a handful of parameters. The default values shipped by inference frameworks are reasonable starting points but rarely optimal for any specific workload.
Number of speculative tokens (k)
This controls how many tokens the draft proposes per round. Too few and you waste opportunity; too many and you waste draft compute on tokens that get rejected. k=4 to k=8 is a common sweet spot for chat workloads, while code generation often benefits from smaller k due to lower acceptance rates.
Draft model fine-tuning
If you fine-tune your target model on a domain corpus, fine-tuning the draft on the same corpus dramatically increases the acceptance rate. The value of speculation comes from the draft accurately predicting what the target will say, so alignment between the two models matters as much as their absolute sizes.
Adaptive speculation
Some implementations dynamically adjust k based on recent acceptance rates. This matters when serving heterogeneous traffic: easy queries can use a higher k for maximum speedup, while harder queries automatically fall back to a lower k. This is a relatively recent area of optimization and the best policies are still being developed; a minimal controller sketch follows below.
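A minimal sketch of such a controller (the policy, thresholds, and smoothing are illustrative assumptions, not taken from any particular framework):

class AdaptiveK:
    """Adjust speculation depth k from a moving average of acceptance."""

    def __init__(self, k=5, k_min=1, k_max=8, smoothing=0.9):
        self.k, self.k_min, self.k_max = k, k_min, k_max
        self.smoothing = smoothing
        self.avg_acceptance = 0.7  # optimistic prior

    def update(self, accepted: int, proposed: int) -> int:
        rate = accepted / max(proposed, 1)
        self.avg_acceptance = (self.smoothing * self.avg_acceptance
                               + (1 - self.smoothing) * rate)
        # Speculate deeper when drafts are landing, shallower when not.
        if self.avg_acceptance > 0.8 and self.k < self.k_max:
            self.k += 1
        elif self.avg_acceptance < 0.5 and self.k > self.k_min:
            self.k -= 1
        return self.k

# Usage per verification round:
ctrl = AdaptiveK()
next_k = ctrl.update(accepted=3, proposed=5)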
Monitoring acceptance rate
Production deployments should expose the average acceptance rate as a metric. A sudden drop in acceptance rate often signals a distribution shift in the input traffic, which is information you would otherwise miss. Around 50% acceptance is roughly the break-even point; below that, speculation is hurting more than helping.
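As a sketch, the metric can be exported with the prometheus_client library (the metric name and the helper are illustrative):

from prometheus_client import Gauge

ACCEPTANCE = Gauge("specdec_acceptance_rate",
                   "Draft-token acceptance rate of the latest round")

def record_round(accepted: int, proposed: int) -> None:
    # Alert when this sits below ~0.5, the approximate break-even point.
    ACCEPTANCE.set(accepted / max(proposed, 1))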
Mathematical Foundation: Why Speculative Decoding Is Lossless
The lossless property of Speculative Decoding is not an empirical observation; it is a theorem. Understanding the underlying math at least at a high level explains why the technique can confidently be deployed in production without any quality verification step.
The Rejection Sampling argument
For each draft token x with draft probability q(x) and target probability p(x), the verification step accepts x with probability min(1, p(x)/q(x)). If accepted, the system proceeds; if rejected, the system samples a replacement from a corrected distribution proportional to max(0, p(x) - q(x)). This construction guarantees that every accepted (or replacement) token comes from exactly the target distribution p, no matter what q looks like. It is a textbook application of Rejection Sampling, dating back decades in statistical computing.
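A small numpy sketch of one verification step makes the rule concrete. The two distributions below are toy examples; the point is that the emitted token is distributed according to p regardless of how poor q is:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # target distribution
q = np.array([0.25, 0.25, 0.20, 0.15, 0.15])  # draft distribution

def verify_one(p, q):
    x = rng.choice(len(q), p=q)                # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)          # corrected distribution
    return rng.choice(len(p), p=residual / residual.sum())

# Empirically the output matches p, not q:
samples = [verify_one(p, q) for _ in range(200_000)]
print(np.bincount(samples) / len(samples))     # ~ [0.40 0.30 0.15 0.10 0.05]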
Why bigger drafts do not always help
If q gets closer to p, the acceptance rate goes up, which is good. But running a larger q costs more compute per step. The expected total time per generated token is approximately (draft_cost * k + target_cost) / accepted_per_round, and the optimal k and draft size depend on the ratio between draft_cost and target_cost. This is why a tiny draft (1/100 the size) often outperforms a medium draft (1/10 the size): the tiny draft contributes far less to the per-step cost.
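Under the simplifying assumption that each draft token is accepted independently with probability alpha, a round yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation (the accepted prefix plus the target's corrected or bonus token), and the trade-off can be explored numerically. A sketch with illustrative costs:

def tokens_per_round(alpha: float, k: int) -> float:
    # Expected tokens per round under i.i.d. acceptance with prob alpha
    # (each round produces between 1 and k+1 tokens).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def time_per_token(alpha: float, k: int,
                   draft_cost: float, target_cost: float) -> float:
    return (draft_cost * k + target_cost) / tokens_per_round(alpha, k)

# Illustrative: draft pass at 1/50 the target's cost, 70% acceptance.
for k in (2, 4, 6, 8):
    t = time_per_token(alpha=0.7, k=k, draft_cost=0.02, target_cost=1.0)
    print(f"k={k}: {1.0 / t:.2f}x speedup vs. plain decoding")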
What can go wrong
The math assumes faithful sampling. Any divergence from true sampling, for example aggressive temperature, top-p, or top-k truncation, needs to be applied identically to both the draft and the target, or the lossless property breaks. Mature inference servers handle this correctly, but custom implementations sometimes get it wrong, which is one of the more subtle bugs in the space.
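One way to enforce this discipline is to factor the warping into a single helper that both models' logits pass through. A minimal numpy sketch (the helper and its defaults are illustrative):

import numpy as np

def warp(logits, temperature=0.8, top_p=0.95):
    """Apply the SAME temperature + top-p warping to draft and target logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))   # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cut = np.searchsorted(cum, top_p) + 1     # smallest nucleus covering top_p
    keep = order[:cut]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()
    return out

# The p and q used in the acceptance test must BOTH come from this helper;
# warping only one side silently breaks the lossless guarantee.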
Production Deployment Notes
Beyond the theory, deploying Speculative Decoding in real systems requires attention to several operational details. The notes below capture lessons learned from teams running speculation in production at scale.
GPU memory budgeting
Reserve memory for both target and draft up front, including the KV cache for both models; running out mid-batch causes ungraceful failures. vLLM and TGI both expose configuration knobs for this; pick conservative values during initial rollout and tighten them once you understand your traffic patterns.
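In vLLM, for example, the gpu_memory_utilization argument caps how much of each device the engine claims for weights plus KV cache. A sketch reusing the earlier pair (0.85 is a conservative illustrative starting point, not a recommendation; spec-decoding keyword names vary by vLLM version):

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens=5,
    gpu_memory_utilization=0.85,  # leave headroom; tighten after observing traffic
)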
Cold-start latency
Loading two models lengthens startup time, which matters when you scale horizontally: autoscaling decisions need to account for the longer warm-up period. Some deployments hold pre-warmed standby instances in reserve for traffic spikes specifically because of this cold-start cost.
Per-request speculation toggling
Some workloads benefit from speculation while others do not. Many production systems expose a per-request flag to disable speculation for queries known to have low acceptance rates (e.g., highly novel domain content). The per-request override gives operators a fast lever when a speculation regression is detected.
Cost-benefit thresholds
Monitor whether speculation is actually paying off in production. The rule of thumb is to disable speculation if measured acceptance rates fall consistently below 50%, because below that threshold the wasted draft compute outweighs the saved target compute. Production teams typically wire this into automated alerting, so degraded acceptance triggers an investigation rather than slowly bleeding cost. This monitoring is what separates "we deployed speculation once and forgot about it" from "we run speculation reliably as part of our LLM serving stack": visibility into whether the optimization is still earning its keep against the current traffic mix.
Inter-model contract stability
As soon as you fine-tune the target model, the draft model becomes less aligned with it and acceptance rates drop. Mature deployments either fine-tune draft and target together, or retrain the draft after every meaningful target change. The draft is not a passive participant; it is part of the production model contract and needs to be versioned and tested alongside the target. The maintenance cost of keeping draft and target in sync is real and ongoing, part of the total cost of ownership for any speculative-decoding deployment, and underestimating it leads to gradually decaying inference performance over months. The most disciplined teams rebuild their drafts on the same cadence as their target releases.
Frequently Asked Questions (FAQ)
Q1. Does Speculative Decoding change the output?
No. Rejection Sampling theory guarantees that the output distribution is identical to running the target model alone. Speculation accelerates inference but does not change the answer.
Q2. How much faster is it in practice?
Typically 2-3x for chat and summarization workloads, and 3-5x for long-form generation. The actual speedup depends on batch size, task domain, and the chosen draft model.
Q3. How do I pick a draft model?
Choose a model from the same family with the same tokenizer, sized roughly 1/10 to 1/100 of the target. Examples: Llama 3 70B + Llama 3 8B, Qwen2 72B + Qwen2 0.5B.
Q4. How does Speculative Decoding compare with EAGLE and Medusa?
EAGLE and Medusa attach prediction heads to the target itself instead of using a separate draft model. They are operationally simpler and often achieve higher acceptance rates, but the heads must be trained. Vanilla speculation needs no training but requires a second model.
Q5. Do ChatGPT and Claude use Speculative Decoding?
Vendors do not publish the details of their internal serving stacks, but it is widely assumed across the industry that all major LLM providers use some form of speculative decoding. Models pretrained with MTP (such as DeepSeek-V3) are particularly well-suited to it.
Conclusion
- Speculative Decoding is a lossless 2-3x inference acceleration for LLMs.
- A draft model proposes tokens; the target model verifies them in batch.
- Output distribution is mathematically guaranteed to be identical to running the target alone.
- Built into vLLM, TGI, TensorRT-LLM, SGLang, and other modern inference servers.
- Variants include Medusa, EAGLE, Multi-Token Prediction, and Lookahead Decoding.
- Stacks naturally with Continuous Batching and PagedAttention for production-grade serving.