What Is In-context Learning?
In-context Learning (ICL) is a Large Language Model’s ability to learn how to do a new task from a small number of examples shown directly in the prompt, without any weight updates. The phenomenon was popularized by the GPT-3 paper (Brown et al., 2020) and is now considered one of the defining capabilities that makes LLMs general-purpose.
Think of onboarding a new colleague by showing them three sample tickets and how to respond, then handing them the next ticket. They mimic the pattern of the three samples and answer correctly. LLMs do something similar: a few input/output pairs in the prompt are enough for the model to infer the pattern and apply it to a new input — all without modifying its weights.
How to Pronounce In-context Learning
in CON-text LURN-ing (/ɪn ˈkɒn.tɛkst ˈlɜː(r).nɪŋ/)
I-C-L (/aɪ siː ɛl/) — common acronym
How In-context Learning Works
The exact mechanism by which ICL works inside a transformer is still under active research, but the leading hypothesis is that the attention mechanism implicitly performs a form of gradient-descent-like adaptation across the prompt. The model effectively constructs an on-the-fly learner from the demonstration examples, then applies it to the new input. Important: nothing is stored — the “lesson” disappears at the end of the request.
Variants by example count
- Zero-shot: no examples; just a task description.
- One-shot: a single example.
- Few-shot: 2–10 examples; the most common setting.
- Many-shot: dozens to hundreds of examples, enabled by long-context models.
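These four settings differ only in how many demonstrations the prompt carries. A minimal builder sketch (the sentiment data reuses the prompt shown in the next subsection):

# Sketch: one task rendered as zero-, one-, or few-shot by varying n_shots.
TASK = "Classify the sentiment of the following text as positive, negative, or neutral."
EXAMPLES = [
    ("The movie was incredible.", "positive"),
    ("I really regretted buying this.", "negative"),
    ("It was okay, nothing special.", "neutral"),
]

def build_prompt(query: str, n_shots: int) -> str:
    """n_shots=0 -> zero-shot, 1 -> one-shot, 2+ -> few-shot."""
    parts = [TASK]
    for text, label in EXAMPLES[:n_shots]:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

print(build_prompt("This coffee is amazing!", n_shots=3))  # few-shot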
A typical ICL prompt
Task: Classify the sentiment of the following text as positive, negative, or neutral.
Examples:
Input: The movie was incredible.
Output: positive
Input: I really regretted buying this.
Output: negative
Input: It was okay, nothing special.
Output: neutral
Input: This coffee is amazing!
Output:
Background: why ICL was a breakthrough
Before 2020, adapting a model to a new NLP task usually required collecting labeled data and running a fine-tuning job. ICL collapsed that pipeline into “edit the prompt.” Time-to-prototype dropped from weeks to minutes. You should think of ICL as the change that turned LLMs from research artifacts into general-purpose tools.
In-context Learning Usage and Examples
Quick Start
import anthropic
client = anthropic.Anthropic()
prompt = '''Classify each email as complaint, inquiry, or sales.
Example 1:
Email: My package never arrived; please refund.
Class: complaint
Example 2:
Email: Do you have this in stock?
Class: inquiry
Example 3:
Email: Hi, I'd like to introduce our service.
Class: sales
Email: Could you tell me the status of order #12345?
Class: '''
resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text) # expected: inquiry
Common Implementation Patterns
Pattern A: Dynamic few-shot via RAG
similar_examples = vector_db.search(user_query, top_k=3)  # vector_db: your retrieval client
prompt = "Use these examples as a guide.\n\n"
for ex in similar_examples:
    prompt += f"Input: {ex.input}\nOutput: {ex.label}\n\n"
prompt += f"Input: {user_query}\nOutput: "
Use it for: large pools of labeled examples where you want the most relevant ones per query. Important: this typically beats static few-shot by several accuracy points.
Avoid it for: small or biased example pools where retrieval quality is shaky.
Pattern B: Chain-of-Thought (CoT)
prompt = '''Solve the math problem and show your reasoning.
Example:
Q: There are 3 apples. You add 4 more. How many do you have?
A: Start with 3, add 4, total is 3+4=7. Answer: 7.
Q: Taro has 5 books. He gives 3 to a friend, then buys 8 more. How many does he have now?
A: '''
Use it for: math, multi-step reasoning, complex classification — anywhere stepwise thinking helps.
Avoid it for: simple lookups or format conversions; CoT inflates output length and cost.
Pattern C: Many-shot ICL
prompt = "Examples:\n\n"
for ex in training_examples[:100]:  # training_examples: your labeled pool
    prompt += f"In: {ex.input}\nOut: {ex.label}\n\n"
prompt += f"In: {query}\nOut: "
Use it for: long-context models such as Claude Sonnet and Gemini 1.5/2.5. Google DeepMind’s 2024 work showed that many-shot ICL can rival fine-tuning for several tasks.
Anti-pattern: contradictory examples
# Anti-pattern
Example 1: "Today is a beautiful day." -> positive
Example 2: "Today is a beautiful day." -> neutral # contradicts Ex 1
Inconsistent examples destroy the very pattern the model is supposed to infer. Run a quick consistency audit on your few-shot examples before shipping the prompt; a minimal version is sketched below.
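A minimal audit sketch, assuming examples are (input, label) pairs:

from collections import defaultdict

def find_contradictions(examples):
    """Return inputs that appear with more than one label."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text.strip().lower()].add(label)
    return {t: sorted(ls) for t, ls in labels_by_input.items() if len(ls) > 1}

few_shot = [
    ("Today is a beautiful day.", "positive"),
    ("Today is a beautiful day.", "neutral"),  # the contradiction above
]
print(find_contradictions(few_shot))
# {'today is a beautiful day.': ['neutral', 'positive']}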
Advantages and Disadvantages of In-context Learning
Advantages
- No training pipeline, dataset curation, or compute
- Switch tasks instantly by editing the prompt
- One model can perform many tasks
- Cheap to iterate — change examples and re-run
- Sensitive data does not need to enter a training set
Disadvantages
- Examples are paid for on every request, increasing token cost
- Context window is the upper bound on examples
- Order and choice of examples matter and can swing accuracy substantially
- Small models do not benefit much from ICL
- For very large or specialized data, fine-tuning still wins
In-context Learning vs Fine-tuning vs RAG (Difference)
The three approaches solve overlapping problems but differ in where adaptation happens.
| Aspect | In-context Learning | Fine-tuning | RAG |
|---|---|---|---|
| Weight update | No | Yes | No |
| Data required | A handful to a few hundred | Thousands to millions | A document corpus (no labels) |
| Setup time | Minutes | Hours to days | Days (index build) |
| Inference cost | Higher (examples in every prompt) | Lower (prompt can be terse) | Medium (retrieval payload) |
| Adaptation flexibility | High (instant) | Low (re-train) | High (re-index) |
| Best for | PoCs, sparse data, multi-task | Domain specialization, style | Fresh facts, internal docs |
Important: the canonical playbook is “start with ICL, escalate to RAG, and only fine-tune when ICL/RAG hit a hard wall.”
Common Misconceptions about In-context Learning
Misconception 1: “ICL means the model is actually learning”
Why this is confused: The word “learning” suggests something is stored. Even academic papers say “learn,” but they mean it metaphorically, and mainstream coverage of “AI that learns from a few examples” sounds like persistent memory.
The reality: Weights never change. The “knowledge” disappears the moment the prompt ends. ICL is sometimes described in research as “implicit gradient descent” or “meta-learning at inference time” — useful framings, but still not weight updates.
Misconception 2: “More examples always help”
Why this is confused: People extrapolate from the ML adage that more data is better. Many primers state only that few-shot beats zero-shot, a binary comparison rather than a curve, so readers infer that more is always better.
The reality: Accuracy typically saturates around 5–10 examples and can degrade beyond that. Lu et al. (2022) showed example order alone can swing accuracy by tens of points. Quality and ordering matter more than count once you pass a small threshold.
Misconception 3: “If ICL works, fine-tuning is obsolete”
Why this is confused: ICL is so easy that practitioners assume fine-tuning is no longer worth the effort; the dramatic time-to-prototype advantage makes the overgeneralization tempting.
The reality: ICL hits walls — context window cap, per-request token cost, accuracy plateau. For high-volume production traffic, fine-tuning often wins on both quality and economics. The two approaches are complementary, not competing.
Real-World Use Cases
1. New-task PoCs
When a stakeholder asks “can the model do X?”, few-shot ICL produces a credible answer in minutes. After validating value, teams often graduate to RAG or fine-tuning.
2. Long-tail class handling
For imbalanced classification with rare classes that have only a handful of labeled examples, ICL avoids the overfitting risk fine-tuning would incur. The model leans on its prior knowledge plus the few examples.
3. Personalized prompt construction
RAG plus ICL composes naturally: retrieve the most relevant examples per query and inject them. This is the dominant pattern for personalized recommendation copy and customer-specific responses.
4. Cross-lingual tasks
Three English-to-Japanese examples are often enough to elicit translation behavior on a fourth pair, even without dedicated multilingual fine-tuning. ICL turns out to be a remarkably efficient cross-lingual bridge.
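A minimal illustration of that prompt shape (the translation pairs here are illustrative):

prompt = '''Translate English to Japanese.

English: Good morning.
Japanese: おはようございます。

English: Thank you very much.
Japanese: どうもありがとうございます。

English: Where is the station?
Japanese: 駅はどこですか。

English: I would like to book a room for two nights.
Japanese: '''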
Mechanism: Why Does ICL Work?
The mechanism behind ICL is one of the most studied open questions in modern NLP research. Several complementary explanations are emerging.
Implicit gradient descent
One line of work argues that attention layers implement a meta-learning algorithm akin to one or two steps of gradient descent over the demonstrations. This “learn-to-learn at inference time” framing has gathered empirical support but is still being refined.
Pattern matching and induction heads
Anthropic’s interpretability research identified “induction heads” — attention heads that look back to find similar tokens and copy or transform from them. Induction heads are believed to underlie much of ICL’s pattern-completion behavior.
Prior elicitation
A complementary view is that demonstrations do not teach the model anything new; they instead select a subset of the model’s pre-trained capabilities. The prompt acts as a “behavior selector” rather than a learner. This explains why ICL is bounded by the model’s prior knowledge.
Emergence with scale
Few-shot ICL improves dramatically as model size increases; the GPT-3 paper made this striking at 175B parameters. Smaller models struggle to leverage examples. The phenomenon is widely cited as a canonical example of “emergent ability.”
Practical Tips for Better ICL
A small collection of techniques that consistently improve ICL accuracy in practice.
Order matters
Place strong, unambiguous examples first; place edge cases later. Many models are biased toward the first or last example seen, so treat ordering as a design parameter, not an afterthought. Important: when in doubt, randomize and average over runs to estimate variance.
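A sketch of that randomize-and-average check; evaluate_prompt is a placeholder for whatever eval harness you already run:

import random
import statistics

def order_sensitivity(examples, eval_set, evaluate_prompt, n_trials=10):
    """Estimate how much accuracy swings with example order alone.
    evaluate_prompt(ordered_examples, eval_set) -> accuracy (placeholder)."""
    scores = []
    for _ in range(n_trials):
        order = random.sample(examples, k=len(examples))  # fresh order each trial
        scores.append(evaluate_prompt(order, eval_set))
    return statistics.mean(scores), statistics.stdev(scores)

A large standard deviation here is the same "wild variance" symptom discussed under limitations below.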
Use diverse examples
Examples that look too similar to each other under-cover the input space. Pick examples spanning different lengths, registers, and surface forms.
Match the format exactly
If the input contains markdown, your examples should too. Format mismatch confuses the model and degrades accuracy. The closer the examples look to the live input, the better.
Include a clear task description
A one-line task definition above the examples helps the model anchor the task. Examples without a task statement leave the pattern ambiguous.
Calibrate against zero-shot
Always compare your few-shot prompt against zero-shot. If few-shot does not improve over zero-shot, you may be wasting tokens. Track both accuracy and cost when making this comparison.
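One way to make the comparison concrete; run_eval and count_tokens are placeholders for your own harness and tokenizer:

def compare_zero_vs_few(zero_shot, few_shot, eval_set, run_eval, count_tokens):
    """run_eval(prompt, eval_set) -> accuracy; count_tokens(prompt) -> int."""
    acc_zero = run_eval(zero_shot, eval_set)
    acc_few = run_eval(few_shot, eval_set)
    extra = count_tokens(few_shot) - count_tokens(zero_shot)
    print(f"zero-shot {acc_zero:.3f} vs few-shot {acc_few:.3f} "
          f"({acc_few - acc_zero:+.3f} for +{extra} tokens per request)")
    return acc_few - acc_zero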
ICL in Production Systems
Moving ICL from a notebook PoC to a production system uncovers a set of operational concerns that toy examples never touch. Below is the playbook successful teams converge on.
Treat the prompt as code
Production prompts contain examples, instructions, formatting hints, and version markers. Treat the entire prompt as a versioned artifact, not a string buried inside application code. Use a prompt registry (or a dedicated prompts/ directory) with code review, change history, and a clear rollback path. Important: silent prompt changes have caused as many incidents as silent code changes; the same rigor applies.
Shadow new prompts before promoting
When you change a few-shot prompt, run it in shadow mode against live traffic and compare outputs against the current production prompt. Promote only when shadow metrics meet your threshold. This catches regressions that offline evals miss because real traffic is more diverse than any test set.
Cache wisely
If the example set is stable, take advantage of Anthropic Prompt Caching (or equivalent) to slash the cost of the repeated prefix. ICL is uniquely well suited to caching because the demonstrations are constant across calls, and only the trailing user input varies. You should benchmark with caching enabled — the cost-equivalent point at which fine-tuning becomes cheaper shifts dramatically.
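With the Anthropic SDK, a stable demonstration prefix can be marked cacheable via cache_control. A sketch reusing the Quick Start examples:

import anthropic

client = anthropic.Anthropic()

DEMONSTRATIONS = (
    "Classify each email as complaint, inquiry, or sales.\n\n"
    "Email: My package never arrived; please refund.\nClass: complaint\n\n"
    "Email: Do you have this in stock?\nClass: inquiry\n\n"
    "Email: Hi, I'd like to introduce our service.\nClass: sales"
)

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=10,
    system=[
        # Constant prefix is cached; only the user turn below varies per call.
        {"type": "text", "text": DEMONSTRATIONS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Email: Could you tell me the status of order #12345?\nClass:"}],
)
print(resp.content[0].text)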
Watch the example pool
Static example sets become stale. Re-pick examples periodically based on production failures. A simple loop: collect cases the model gets wrong, label them, add the best ones as new examples, retire less informative examples. This keeps the prompt fresh without retraining anything.
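In code, that loop might look like this sketch; label_fn stands in for your human or automated labeling step:

def refresh_example_pool(pool, failures, label_fn, max_size=10):
    """Fold newly labeled production failures into the few-shot pool.
    pool: list of (input, label) pairs; label_fn(case) -> gold label."""
    for case in failures:
        pool.append((case, label_fn(case)))
    # Retire oldest-first; scoring examples by informativeness is a natural upgrade.
    return pool[-max_size:]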
Instrument for tail latency
Long ICL prompts hurt time-to-first-token. Measure p95 and p99 latency, not just the average. If long examples are pushing tail latency, consider RAG-style dynamic example selection or move to a fine-tuned model with a shorter prompt. Make this measurement before scaling traffic; latency is harder to fix later.
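A minimal way to pull those tail numbers from per-request latency samples:

import statistics

def latency_report(samples_ms):
    """p50/p95/p99 from per-request latencies in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"mean": statistics.mean(samples_ms),
            "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}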
ICL Limitations and How to Spot Them
ICL is powerful but bounded. Recognizing where it falls short prevents wasted effort.
Symptom: accuracy plateaus despite more examples
If your accuracy curve flattens after 5–10 examples and adding more does not help, you are at the ICL ceiling for your task. Move to a fine-tuned model or to RAG with a stronger retriever. Important: this plateau is not a prompting failure; it is the structural limit of in-context adaptation.
Symptom: wild variance across example orderings
If shuffling the example order changes accuracy by more than a few points, the model is partially relying on positional cues rather than semantic patterns. This often signals that the task is too hard for ICL alone. You should consider chain-of-thought prompting or a fine-tuned classifier.
Symptom: regressions on rare classes
Few-shot examples typically cannot cover the long tail. Rare classes often need either RAG (to surface domain examples on demand) or a fine-tuned head trained on the long tail. Evaluate per-class accuracy, not just the overall number.
Symptom: degraded performance on small models
Small models (under ~1B parameters) often ignore in-prompt examples. If your deployment requires a small model for cost or latency reasons, fine-tuning is usually the right move; ICL alone will not deliver.
Symptom: token cost dominates per-call cost
If each request pays for thousands of demonstration tokens, fine-tuning becomes attractive purely on economics. Calculate the breakeven: at what daily volume does fine-tuning amortize? For many high-traffic APIs, the breakeven is just a few weeks.
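A back-of-envelope version of that calculation; every number below is an illustrative placeholder, not a real price:

# Illustrative placeholders -- substitute your provider's actual rates.
demo_tokens = 2_000        # demonstration tokens repeated in every request
price_per_mtok = 3.00      # USD per million input tokens
requests_per_day = 5_000
finetune_cost = 500.00     # one-off training spend

daily_overhead = demo_tokens * requests_per_day / 1e6 * price_per_mtok
print(f"demo overhead ${daily_overhead:.2f}/day; "
      f"fine-tune amortizes in {finetune_cost / daily_overhead:.0f} days")
# Ignores any per-token premium on the fine-tuned model's inference.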
The 2026 State of ICL Research
ICL remains an active research area. Notable directions as of 2026:
Many-shot scaling
With 1M+ token context windows now common, researchers are pushing ICL to hundreds or thousands of examples. Google DeepMind’s 2024 paper showed many-shot ICL can match fine-tuning on classification and translation. The 2026 follow-up work is exploring which tasks benefit most and where the diminishing-returns wall lives.
Demonstration selection
Choosing which examples to include is itself an optimization problem. Methods range from nearest-neighbor retrieval to gradient-based selection to bandit-style adaptive picking. Several recent papers report multi-point accuracy gains over random selection. Do not assume “more diverse” is always better; task-similar examples often beat diverse ones.
Compositional ICL
Researchers are exploring whether models can learn multiple sub-skills from a single mixed prompt. Early results are promising, suggesting “few-shot learning” can blur into “few-shot multitasking.” Important: production teams should treat this as still experimental.
Theoretical grounding
The mechanistic-interpretability community continues to refine the “induction heads” hypothesis. New work in 2026 connects ICL behavior to specific attention circuits, opening the door to engineering models that are explicitly better at ICL. The downstream effect for application engineers is faster, more sample-efficient ICL in next-generation models.
Connections to memory architectures
Some 2026 architectures borrow ICL ideas to design “scratchpad” or “working memory” modules that persist across requests. These hybrids blur the line between ICL and persistent learning, and they may eventually obsolete the strict no-weight-update definition that currently anchors ICL.
Frequently Asked Questions (FAQ)
Q1. How is In-context Learning different from fine-tuning?
Fine-tuning updates the model’s weights using a training pipeline. In-context Learning leaves weights unchanged and adapts the model purely through examples in the prompt at inference time. The trade-off is cost vs flexibility.
Q2. Are few-shot learning and in-context learning the same?
Few-shot is a sub-type of ICL. ICL is the umbrella term covering zero-shot, one-shot, and few-shot — distinguished by how many examples appear in the prompt.
Q3. Does adding more examples always improve accuracy?
No. Performance typically saturates around 5–10 examples and can degrade beyond that. Extra examples also consume context window and cost.
Q4. Is Chain-of-Thought a form of in-context learning?
Yes. Chain-of-Thought (CoT) demonstrates reasoning steps as examples, which is a type of ICL. It significantly improves multi-step reasoning.
Q5. Does ICL work on small models?
Only weakly. Research shows ICL is an emergent ability that becomes reliable in models above roughly 1B–10B parameters. Smaller models often fail to leverage examples.
Conclusion
- In-context Learning lets an LLM learn a task from a handful of in-prompt examples without weight updates.
- Variants are zero-shot, one-shot, few-shot, and many-shot, distinguished by example count.
- Chain-of-Thought is a form of ICL that demonstrates reasoning steps.
- Compared to fine-tuning, ICL is faster to set up but more expensive per request.
- Quality and order of examples often matter more than the count once past a small threshold.
- Mechanism is debated; leading hypotheses involve implicit gradient descent and induction heads.
- Used widely for PoCs, rare-class handling, personalization, and cross-lingual tasks.
References
- Brown et al. (2020). “Language Models are Few-Shot Learners.” arxiv.org/abs/2005.14165
- Wei et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arxiv.org/abs/2201.11903
- Google DeepMind (2024). “Many-Shot In-Context Learning.” arxiv.org/abs/2404.11018
- Lu et al. (2022). “Fantastically Ordered Prompts and Where to Find Them.” arxiv.org/abs/2104.08786
- Olsson et al., Anthropic (2022). “In-context Learning and Induction Heads.” transformer-circuits.pub