What Is Speculative Decoding? A Complete Guide to the Lossless 2-3x LLM Inference Acceleration Technique, EAGLE / Medusa / Multi-Token Prediction Variants, and How It Differs from Continuous Batching

What Is Speculative Decoding?

Speculative Decoding is an inference acceleration technique for large language models that delivers 2–3x throughput improvements without any change to the output distribution. Standard autoregressive generation produces one token per forward pass — and reloads the entire model weights from VRAM each time, which makes inference memory-bandwidth bound. Speculative Decoding pairs the target model with a small, fast draft model that proposes several next tokens; the target model verifies them all in a single forward pass, accepts the longest correct prefix, and continues. Quality is mathematically identical to running the target model alone, but wall-clock latency falls dramatically.

A useful analogy: speculative decoding is like a junior assistant drafting several sentences while a senior reviewer reads them all at once and accepts the matching prefix. If the assistant’s draft is correct, you get many sentences for the price of one review pass; if the draft diverges, the reviewer catches it and the system continues from there. Every major LLM serving framework — vLLM, TGI, TensorRT-LLM, SGLang — ships speculative decoding as a built-in optimization, and modern pretrained models like DeepSeek-V3 and Qwen3-Next even bake speculative-friendly Multi-Token Prediction heads into their architecture.

How to Pronounce Speculative Decoding

Speculative Decoding (/ˈspɛk.jʊ.lə.tɪv diːˈkoʊ.dɪŋ/)

Spec Decoding (informal short form)

How Speculative Decoding Works

Speculative Decoding originated in research from Google and DeepMind around 2022–2023 and quickly became a default optimization across the LLM inference stack. The core insight is that LLM inference is bottlenecked by memory bandwidth — loading the model weights from VRAM into GPU caches dominates the time per token, while the actual matrix-multiply compute is comparatively cheap. If you can verify several proposed tokens in a single forward pass, you get more useful work for the same memory transfer cost.

Pipeline overview

Speculative Decoding pipeline

1. Draft model proposes k tokens
2. Target verifies all k in one pass
3. Accept longest matching prefix
4. Restart from rejected position
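The steps above can be sketched in code. This toy version uses greedy matching for clarity (a real system verifies all positions in one batched target pass and applies the rejection-sampling rule when sampling); `draft_next` and `target_next` are hypothetical stand-ins for a model's next-token call:

```python
def speculative_step(tokens, draft_next, target_next, k=4):
    """One round: draft proposes k tokens, target verifies them (greedy variant)."""
    # 1. Draft model proposes k tokens autoregressively (cheap per call).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2. Target verifies every proposed position (one batched pass in practice).
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        expected = target_next(proposal[:i])  # what the target would emit here
        if proposal[i] == expected:
            accepted.append(proposal[i])      # 3. extend the matching prefix
        else:
            accepted.append(expected)         # 4. correct at first mismatch, stop
            break
    else:
        # All k accepted: the verification pass yields one bonus target token.
        accepted.append(target_next(accepted))
    return accepted
```

With a perfectly aligned draft, one round emits k+1 tokens for a single target pass; with a diverging draft, it still emits at least one correct token.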

Crucially, Speculative Decoding guarantees an identical output distribution to running the target model alone. Important: this is a mathematical property, not an engineering approximation — it follows from a careful application of Rejection Sampling theory. The system “speculates” about future tokens, but every accepted token still comes from the target model’s distribution. Note that this lossless property is what makes Speculative Decoding so widely deployed; it is one of the rare optimizations in machine learning that costs nothing in quality.

Key variants

Variant                        Distinguishing feature
-----------------------------  ------------------------------------------------------------
Vanilla Speculative Decoding   Small draft model + target model (canonical)
Medusa                         Multiple prediction heads on the target itself
EAGLE / EAGLE-2                Drafts using intermediate target features for higher acceptance
Multi-Token Prediction (MTP)   Pretraining-time MTP heads (DeepSeek-V3, Qwen3-Next)
Lookahead Decoding             Draft-free; uses n-gram lookahead
Cascade Speculative Drafting   Multi-tier drafts for further speedup

Speculative Decoding Usage and Examples

Quick Start with vLLM

from vllm import LLM, SamplingParams

# Target: Llama 3 70B, draft: Llama 3 8B (same family, same tokenizer)
# Exact keyword arguments vary across vLLM versions; check your version's docs.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens=5,
)

prompt = "Explain LLM inference optimization techniques."
out = llm.generate([prompt], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)

Hugging Face transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

target = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

inputs = tok("Hello", return_tensors="pt")
out = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=200
)

Common Implementation Patterns

Pattern A: Paired models (vanilla)

# Pick a draft 1/10 to 1/100 the size of the target
# Same family, same tokenizer
# e.g. Llama 3 70B + Llama 3 8B, Qwen2.5 72B + Qwen2.5 0.5B

Good fit: Model families that publish multiple sizes with shared tokenizers. The simplest setup with the fewest moving parts.

Bad fit: Mixing tokenizers across families. A token mismatch drives the acceptance rate to zero, which is worse than no speculation at all.

Pattern B: Medusa or EAGLE heads

# Add prediction heads to the target model itself
# No second model required (heads from the FasterDecoding/Medusa repo)
from medusa.model.medusa_model import MedusaModel

medusa = MedusaModel.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")

Good fit: Teams that prefer not to manage a second model and want to use pre-trained Medusa or EAGLE heads. Operationally simpler.

Bad fit: Custom fine-tuned models without matching pre-trained heads — you must train the heads yourself, which adds significant overhead.

Anti-pattern: Mismatched tokenizers

# DO NOT DO THIS
target = "meta-llama/Meta-Llama-3-70B"   # Llama tokenizer
draft = "Qwen/Qwen2-0.5B"                # Qwen tokenizer (different!)
# Tokenizations diverge, every draft is rejected, system slows down

Important: the draft and target models must share the same tokenizer. If tokenization splits differ, the target cannot meaningfully verify the draft’s tokens, the acceptance rate goes to zero, and you pay all the cost of speculation with none of the benefit. Always pick a draft from the same model family as the target.
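A cheap pre-flight check catches this misconfiguration at deploy time. The sketch below assumes tokenizer objects exposing the Hugging Face `get_vocab()` method; the function name is illustrative:

```python
def tokenizers_compatible(target_tok, draft_tok) -> bool:
    """Return True if the draft's tokens are verifiable by the target.

    The minimal requirement is an identical vocabulary: same tokens
    mapped to the same ids, so draft token ids mean the same thing
    to the target during verification.
    """
    return target_tok.get_vocab() == draft_tok.get_vocab()
```

With Hugging Face tokenizers, this can run once at startup, e.g. `tokenizers_compatible(AutoTokenizer.from_pretrained(target_id), AutoTokenizer.from_pretrained(draft_id))`, and refuse to enable speculation on a mismatch.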

Advantages and Disadvantages of Speculative Decoding

Advantages

  • 2-3x speedup, often more. Typical chat and summarization workloads see 2-3x; long-form generation can reach 5x.
  • Zero quality cost. Output distribution is mathematically identical to the target model run alone.
  • Built into all major inference servers. vLLM, TGI, TensorRT-LLM, and SGLang all ship support out of the box.
  • Best returns on long outputs. The longer the generation, the larger the absolute time saved.
  • Modest memory overhead. The draft model is small enough to fit on the same GPU as the target in most setups.

Disadvantages

  • Draft selection matters. A poorly chosen draft (low acceptance rate) can make inference slower than no speculation.
  • Extra VRAM. The draft model occupies additional memory — typically 7–14 GB for a 7B-class draft.
  • Diminishing returns at high batch sizes. When the system is compute-bound rather than bandwidth-bound, the speedup shrinks.
  • Limited applicability outside Transformers. State-space models like Mamba need different optimization techniques.

Speculative Decoding vs Continuous Batching

Both Speculative Decoding and Continuous Batching are inference optimizations widely deployed in production LLM serving, but they target different bottlenecks. The table below clarifies how to think about them.

Aspect           Speculative Decoding               Continuous Batching
---------------  ---------------------------------  ------------------------
Optimizes        Single-request latency             Multi-request throughput
Requires         Draft model or specialized heads   Just scheduler logic
Best when        Batch size is small to medium      Many concurrent requests
Quality impact   None (mathematically guaranteed)   None (just scheduling)
Extra memory     Yes (draft model)                  None
Stackable        Yes, with continuous batching      Yes, with speculation

Important to remember: these techniques are complementary and almost always deployed together. vLLM and TGI combine speculative decoding with continuous batching plus PagedAttention to deliver the best-in-class throughput-per-dollar that defines modern LLM serving.

Common Misconceptions

Misconception 1: “Speculative decoding trades quality for speed.”

Why people get confused: The word “speculative” suggests guessing or approximation, and other LLM acceleration techniques like quantization and pruning genuinely do trade quality for speed. The reason this conflation persists is that all these techniques live in the same “make LLMs faster” mental category, even though they are mechanically very different.

The reality: Speculative decoding produces output drawn from exactly the same distribution as the target model running alone. This is a property of Rejection Sampling — the math guarantees it. The only thing that varies is wall-clock latency, not the answer.

Misconception 2: “Bigger draft models give bigger speedups.”

Why people get confused: There’s an intuitive belief that a smarter draft would predict more correctly, so making the draft bigger should make the system faster. The reason this is misleading is that the cost of running the draft model is itself part of the total compute budget.

The reality: A draft that is too large becomes its own bottleneck. The empirical sweet spot is a draft 1/10 to 1/100 the size of the target. Smaller drafts are cheap to run and predict well enough on the easy parts of the sequence, which is where most of the speedup comes from.

Misconception 3: “It works equally well on all workloads.”

Why people get confused: The headline “2-3x speedup” number from research papers gets quoted as if it applies universally. The reason this is misleading is that the speedup depends heavily on workload characteristics — batch size, output length, task domain — that papers usually report only for narrow benchmarks.

The reality: Code generation typically gets less speedup than natural-language chat because code tokens are harder for small drafts to predict. High batch sizes also reduce the speedup because the system shifts from being memory-bound to compute-bound. Important to benchmark on your own workload before committing.

Real-World Use Cases

Chat UIs and conversational agents

Latency dominates user experience in chat interfaces. Cutting time-to-completion in half can fundamentally change how users perceive the product. Important: this is the most common reason teams enable speculative decoding in production.

Long-form generation

Article writing, code synthesis, and document summarization all benefit disproportionately from speculation because the absolute time saved scales with the output length. A 5x speedup on a 30-second generation saves more wall-clock time than a 5x speedup on a 1-second generation.

Self-hosted inference cost reduction

If your GPU fleet handles N requests per second today, speculation can let the same fleet handle roughly 2-3N, workload permitting. Important to remember that this often translates directly into halving your inference infrastructure bill, or doubling the number of users you can serve from the same hardware.

On-device LLM inference

Phones, laptops, and edge inference appliances are all memory-bandwidth constrained. Speculative decoding (often paired with quantization) is one of the key enabling techniques for the local-LLM movement, including features like Apple Intelligence and on-device assistants.

Batch evaluation and ranking workloads

Even though batch evaluation is throughput-oriented, individual rerank or scoring requests still benefit when the batch size is moderate. Important to test in your specific configuration; the speedup curve as a function of batch size is workload-dependent.

Pretraining-aligned MTP serving

Models pretrained with Multi-Token Prediction (DeepSeek-V3, Qwen3-Next, certain experimental Llama variants) ship with built-in speculative-friendly heads. Note that these achieve higher acceptance rates than retrofitted draft pairs, often 70-80% versus 50-60% for vanilla setups.

Tuning Speculative Decoding in Production

Getting the most out of speculative decoding requires careful tuning of a handful of parameters. Important to recognize that the default values shipped by inference frameworks are reasonable starting points but rarely optimal for any specific workload.

Number of speculative tokens (k)

This controls how many tokens the draft proposes per round. Important: too few and you waste opportunity; too many and you waste draft compute on tokens that get rejected. Note that k=4 to k=8 is a common sweet spot for chat workloads, while code generation often benefits from smaller k due to lower acceptance rates.
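As a back-of-envelope model for choosing k: if each draft token is accepted independently with probability alpha (a simplification; real acceptance is position-dependent), the expected yield per verification round has a closed form:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k draft tokens.

    Yield = accepted prefix + 1 (the bonus or corrected token), i.e.
    sum over i = 0..k of alpha**i = (1 - alpha**(k+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

With alpha = 0.7, k = 5 yields about 2.94 tokens per round while k = 8 yields about 3.20, so raising k past the mid single digits buys little extra, consistent with the k=4 to k=8 guidance above.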

Draft model fine-tuning

If you fine-tune your target model on a domain corpus, fine-tuning the draft on the same corpus dramatically increases acceptance rate. Important to remember that the value of speculation comes from the draft accurately predicting what the target will say — alignment between the two models matters as much as their absolute size.

Adaptive speculation

Some implementations dynamically adjust k based on recent acceptance rates. Important when serving heterogeneous traffic — easy queries can use higher k for maximum speedup, while harder queries automatically fall back to lower k. Note that this is a relatively recent area of optimization and the best policies are still being developed.

Monitoring acceptance rate

Production deployments should expose the average acceptance rate as a metric. Important because a sudden drop in acceptance rate often signals a distribution shift in the input traffic, which is information you would otherwise miss. Note that 50% acceptance is roughly the break-even point; below that, speculation is hurting more than helping.
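The break-even logic above can be wired into a simple rolling monitor; the class name and defaults here are illustrative:

```python
from collections import deque

class AcceptanceMonitor:
    """Rolling acceptance-rate tracker with a break-even threshold."""

    def __init__(self, window: int = 1000, threshold: float = 0.5):
        self.window = deque(maxlen=window)  # recent (proposed, accepted) pairs
        self.threshold = threshold

    def record(self, proposed: int, accepted: int) -> None:
        """Record one speculation round: k tokens proposed, n accepted."""
        self.window.append((proposed, accepted))

    @property
    def rate(self) -> float:
        proposed = sum(p for p, _ in self.window)
        return sum(a for _, a in self.window) / proposed if proposed else 1.0

    def should_disable(self) -> bool:
        """True once the window is full and acceptance sits below break-even."""
        return len(self.window) == self.window.maxlen and self.rate < self.threshold
```

Exporting `rate` as a gauge to your metrics system and alerting on `should_disable()` gives exactly the distribution-shift signal described above.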

Mathematical Foundation: Why Speculative Decoding Is Lossless

The lossless property of Speculative Decoding is not an empirical observation — it is a theorem. Important to understand the underlying math at least at a high level, because it explains why the technique can confidently be deployed in production without any quality verification step.

The Rejection Sampling argument

For each draft token x with draft probability q(x) and target probability p(x), the verification step accepts x with probability min(1, p(x)/q(x)). If accepted, the system proceeds; if rejected, the system samples a replacement from a corrected distribution proportional to max(0, p(x) – q(x)). Important: this construction guarantees that every accepted (or replacement) token comes from exactly the target distribution p, no matter what q looks like. Note that this is a textbook application of Rejection Sampling, dating back decades in statistical computing.
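This claim is easy to verify numerically: construct the per-token output distribution of the accept/replace procedure for arbitrary p and q and check that it equals p. The function below is an illustrative sketch using NumPy:

```python
import numpy as np

def spec_output_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Distribution of the emitted token given draft dist q and target dist p.

    Assumes q > 0 everywhere. P(emit x) combines two paths:
    draft proposes x and it is accepted, or some proposal is rejected
    and x is sampled from the corrected residual distribution.
    """
    accept = q * np.minimum(1.0, p / q)   # = min(p, q): propose-and-accept mass
    residual = np.maximum(0.0, p - q)     # unnormalized replacement distribution
    reject_prob = 1.0 - accept.sum()      # equals residual.sum() since p, q sum to 1
    if reject_prob > 0:
        accept = accept + reject_prob * residual / residual.sum()
    return accept                          # = min(p, q) + max(0, p - q) = p
```

Running this for random p and q confirms the identity min(p, q) + max(0, p − q) = p holds term by term, which is the whole losslessness theorem in one line.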

Why bigger drafts do not always help

If q gets closer to p, acceptance rate goes up — which is good. But running a larger q costs more compute per step. The expected total time per generated token is approximately (draft_cost * k + target_cost) / accepted_per_round, and the optimal k and draft size depend on the ratio between draft_cost and target_cost. Important to recognize that this is why a tiny draft (1/100 the size) often outperforms a medium draft (1/10 the size) — the draft contributes less to the per-step cost.
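Plugging illustrative numbers into this cost model shows the effect. Costs below are relative to one target forward pass, and the acceptance rates assumed for each draft size are made up for the example:

```python
def time_per_token(draft_cost: float, target_cost: float,
                   k: int, alpha: float) -> float:
    """Expected time per emitted token: one round's cost over its expected yield.

    Yield per round = accepted prefix + 1 = (1 - alpha**(k+1)) / (1 - alpha),
    assuming independent per-token acceptance probability alpha.
    """
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (draft_cost * k + target_cost) / expected

# Hypothetical numbers: a 100x-smaller draft with modest acceptance vs a
# 10x-smaller draft with better acceptance, target cost normalized to 1.
tiny = time_per_token(draft_cost=0.01, target_cost=1.0, k=5, alpha=0.60)
medium = time_per_token(draft_cost=0.10, target_cost=1.0, k=5, alpha=0.75)
```

With these made-up numbers the tiny draft narrowly beats the medium one despite its lower acceptance rate, and both come in well under the baseline cost of 1.0 per token, which is the pattern the paragraph above describes.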

What can go wrong

The math assumes faithful sampling. Important: any divergence from true sampling — for example aggressive temperature, top-p, or top-k truncation — needs to be applied identically to both the draft and the target, or the lossless property breaks. Note that mature inference servers handle this correctly, but custom implementations sometimes get it wrong, which is one of the more subtle bugs in the space.

Production Deployment Notes

Beyond the theory, deploying Speculative Decoding in real systems requires attention to several operational details. The notes below capture lessons learned from teams running speculation in production at scale.

GPU memory budgeting

Reserve memory for both target and draft up front. Important to allocate the KV cache for both models — running out mid-batch causes ungraceful failures. Note that vLLM and TGI both expose configuration knobs for this; pick conservative values during initial rollout and tighten them once you understand your traffic patterns.
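As a very rough sizing sketch (the bytes-per-parameter and the KV-cache fraction below are illustrative assumptions, not measurements; real KV-cache needs depend on batch size, context length, and attention layout):

```python
def vram_estimate_gb(params_b: float, bytes_per_param: int = 2,
                     kv_cache_frac: float = 0.3) -> float:
    """Crude VRAM estimate: weights plus a proportional KV-cache reservation.

    params_b is the parameter count in billions; bytes_per_param = 2
    corresponds to fp16/bf16 weights. kv_cache_frac is a rough headroom
    factor, not a derived quantity.
    """
    weights_gb = params_b * bytes_per_param
    return weights_gb * (1 + kv_cache_frac)

# Budget the target and the draft together before picking hardware.
total_gb = vram_estimate_gb(70) + vram_estimate_gb(8)
```

Even a crude estimate like this makes the point that a 70B + 8B pair needs planning at the node level, not an afterthought allocation.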

Cold-start latency

Loading two models doubles startup time. Important when you scale horizontally — autoscaling decisions need to account for the longer warm-up period. Note that some deployments use canary instances pre-warmed and held in reserve for traffic spikes specifically because the cold-start cost is annoying.

Per-request speculation toggling

Some workloads benefit from speculation while others do not. Important: many production systems expose a per-request flag to disable speculation for queries known to have low acceptance rates (e.g., highly novel domain content). Note that the per-request override gives operators a fast lever when speculation regression is detected.

Cost-benefit thresholds

Important to monitor whether speculation is actually paying off in production. The rule of thumb: disable speculation if measured acceptance rates fall consistently below 50%, because below that threshold the wasted draft compute outweighs the saved target compute. Production teams typically wire this into automated alerting, so degraded acceptance triggers an investigation rather than slowly bleeding cost. This monitoring is what separates "we deployed speculation once and forgot about it" from "we run speculation reliably as part of our LLM serving stack": the difference is visibility into whether the optimization is still earning its keep against the current traffic mix.

Inter-model contract stability

Important to recognize that as soon as you fine-tune the target model, the draft becomes less aligned with it and acceptance rates drop. Mature deployments either fine-tune draft and target together, or retrain the draft after every meaningful target change. The draft is not a passive participant: it is part of the production model contract and needs to be versioned and tested alongside the target. Keeping the pair in sync is a real, ongoing maintenance cost, part of the total cost of ownership for any speculative-decoding deployment, and underestimating it leads to gradually decaying inference performance over months. The most disciplined teams rebuild their drafts on the same cadence as their target releases.

Frequently Asked Questions (FAQ)

Q1. Does Speculative Decoding change the output?

No. Rejection Sampling theory guarantees that the output distribution is identical to running the target model alone. Speculation accelerates inference but does not change the answer.

Q2. How much faster is it in practice?

Typically 2-3x for chat and summarization workloads, and 3-5x for long-form generation. The actual speedup depends on batch size, task domain, and the chosen draft model.

Q3. How do I pick a draft model?

Choose a model from the same family with the same tokenizer, sized roughly 1/10 to 1/100 of the target. Examples: Llama 3 70B + Llama 3 8B, Qwen2.5 72B + Qwen2.5 0.5B.

Q4. How does Speculative Decoding compare with EAGLE and Medusa?

EAGLE and Medusa attach prediction heads to the target itself instead of using a separate draft model. Operationally simpler and often higher acceptance rates, but requires training those heads. Vanilla speculation needs no training but requires a second model.

Q5. Do ChatGPT and Claude use Speculative Decoding?

Vendors do not publish the details of their internal serving stacks, but it is widely assumed across the industry that all major LLM providers use some form of speculative decoding. Models pretrained with MTP (DeepSeek-V3, Qwen3-Next) are particularly well-suited to it.

Conclusion

  • Speculative Decoding is a lossless 2-3x inference acceleration for LLMs.
  • A draft model proposes tokens; the target model verifies them in batch.
  • Output distribution is mathematically guaranteed to be identical to running the target alone.
  • Built into vLLM, TGI, TensorRT-LLM, SGLang, and other modern inference servers.
  • Variants include Medusa, EAGLE, Multi-Token Prediction, and Lookahead Decoding.
  • Stacks naturally with Continuous Batching and PagedAttention for production-grade serving.
