What Is Chain of Thought (CoT)?
Chain of Thought (CoT) is a prompting technique that asks a large language model to produce a step-by-step reasoning trace before its final answer, dramatically improving accuracy on arithmetic, logic, symbolic, and multi-step tasks. The concept was formalized by Jason Wei and colleagues at Google in the January 2022 paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” and it quickly became one of the most consequential prompting ideas in the LLM era.
A simple analogy: in a math class, students who write down intermediate steps tend to score higher than students who try to produce the final answer in one go. The same is true for sufficiently large language models. By asking the model to externalize its thinking into tokens that then condition the rest of the generation, we turn a single-shot answer into a structured multi-step reasoning process. You should note that this emergent behavior tends to show up only in models above roughly 60–100 billion parameters.
How to Pronounce Chain of Thought
chayn uhv thawt (/tʃeɪn əv θɔːt/)
see-oh-tee (common CoT abbreviation)
How Chain of Thought Works
A normal LLM prompt is answered in one shot: the model reads the question and emits the answer. CoT inserts an intermediate reasoning surface between the question and the answer. Because autoregressive LLMs condition each new token on everything that came before, producing reasoning tokens first yields a richer, more structured context for the final answer tokens.
Three Major Variants
1. Few-shot CoT
Include two or three in-context examples of “question → reasoning → answer” before the real question. Very effective but consumes prompt space.
2. Zero-shot CoT
Proposed by Kojima et al. (May 2022). Simply adding “Let’s think step by step” (or equivalent) elicits chain-of-thought reasoning without any exemplars. It is the cheapest and most general form.
3. Self-Consistency CoT
Proposed by Wang et al. (2022). Sample multiple CoT chains for the same problem and take a majority vote of the final answers. Significantly improves robustness on tasks with high reasoning variance.
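Mechanically, Self-Consistency is just sampling plus a majority vote. A minimal sketch, where `sample_chain` stands in for any call that samples one CoT completion at nonzero temperature and returns its extracted final answer (stubbed here for illustration):

```python
from collections import Counter

def self_consistency(sample_chain, question, n_samples=5):
    """Sample several CoT chains for the same question and
    majority-vote the final answers extracted from each chain."""
    answers = [sample_chain(question) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples  # winning answer and its vote share

# Stub sampler standing in for a temperature > 0 LLM call:
samples = iter(["3", "5", "3", "3", "6"])
answer, share = self_consistency(lambda q: next(samples), "apple problem", n_samples=5)
# answer == "3", share == 0.6
```

The vote share doubles as a cheap confidence signal: low agreement across samples is a good trigger for escalation to a stronger model or a human.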
Normal Prompt vs Chain of Thought
- Normal prompt: Question → Answer
- CoT prompt: Question → Reasoning → Answer
Relationship to Reasoning Models (o1, Extended Thinking, Thinking)
Starting in late 2024, OpenAI’s o1 series, Anthropic’s Extended Thinking, and Google Gemini’s Thinking mode began automating CoT inside the model itself. These “reasoning models” perform an extended internal chain-of-thought before responding. Think of them as productizing the CoT idea into the model, rather than asking the user to prompt for it.
Chain of Thought Usage and Examples
The canonical Zero-shot CoT example:
# Without CoT (often wrong)
prompt = """Taro has 5 apples. He eats 2 of them, receives 3 from Hanako,
then gives half to his sister. How many apples does Taro have left?"""
# The model tends to jump straight to a number, which is often wrong.

# With Zero-shot CoT
prompt = """Taro has 5 apples. He eats 2 of them, receives 3 from Hanako,
then gives half to his sister. How many apples does Taro have left?
Let's think step by step and show your reasoning before the final answer."""
# Typical model output:
# Step 1: 5 - 2 (eaten) = 3
# Step 2: 3 + 3 (received) = 6
# Step 3: 6 / 2 = 3 (after giving half away)
# Answer: 3
Few-shot CoT Example
prompt = '''Use the examples as a template. Think step by step, then answer.
Example 1:
Q: A bottle holds 12 cups. Three people share it equally. How many cups each?
A: 12 / 3 = 4. Answer: 4 cups.
Example 2:
Q: A 500-yen item is 20% off. What is the final price?
A: Discount = 500 * 0.2 = 100. 500 - 100 = 400. Answer: 400 yen.
Now:
Q: A part-time job pays 1200 yen per hour. Last month you worked 50 hours.
If income tax is 10%, what is the net take-home?
A:'''
Advantages and Disadvantages of Chain of Thought
Advantages
- Significant accuracy gains on arithmetic, logic, and symbolic reasoning.
- Transparent intermediate steps make it easier to debug errors.
- Zero-shot CoT is literally one extra sentence.
- No fine-tuning required — unlocks latent capability in base models.
- Natural support for task decomposition.
Disadvantages
- More output tokens mean more cost and higher latency.
- Small base models (below roughly 60–100B parameters) see little benefit; CoT is an emergent ability.
- Reasoning chains can be wrong while still arriving at the right answer (coincidence).
- CoT can hallucinate plausible-sounding but fabricated reasoning.
- Overkill for short, factual questions — hurts UX there.
Chain of Thought vs Other Reasoning Techniques
Note how CoT relates to adjacent prompting patterns:
| Technique | Idea | Best for |
|---|---|---|
| Plain prompt | One-shot Q→A | Simple QA, summaries |
| CoT | Q→reasoning→A | Math, logic, multi-step |
| Self-Consistency | Majority vote across CoT samples | High-variance tasks |
| Tree of Thoughts | Tree search over reasoning | Planning, optimization |
| Extended Thinking | Built-in CoT inside the model | Frontier-difficulty problems |
Common Misconceptions
Misconception 1: “Think step by step” always helps
Not for short factual lookups or simple translations. Forcing CoT there just adds length and latency without improving accuracy. Match the technique to the task.
Misconception 2: If the reasoning looks right, the answer is right
The “faithfulness problem” shows that CoT traces can be post-hoc rationalizations rather than the true cause of the answer. Do not trust a chain of thought blindly, especially in high-stakes applications.
Misconception 3: Reasoning models like o1 make CoT obsolete
Reasoning models automate CoT internally, but explicit CoT instructions in user prompts can still improve performance. And for non-reasoning models, CoT remains indispensable.
Real-World Use Cases
- Text-to-SQL: reason about schema before emitting the query.
- Legal and contract analysis: step through clauses to detect conflicts.
- Clinical decision support: symptoms → differentials → recommended tests.
- Math and physics tutoring: show work so students can learn, not just see answers.
- Debugging: stack trace → hypothesis → verification steps.
- Agent planning: decompose a high-level goal into tool-call sequences.
Frequently Asked Questions (FAQ)
Q1. Does CoT work in non-English languages?
Yes. Equivalent phrases in Japanese, Spanish, Chinese, etc. work similarly. The effect is a property of large models, not a property of English.
Q2. How long should the reasoning be?
Let the model decide based on task complexity. Nudge it with “concise steps” or “a very detailed breakdown” as needed.
Q3. Can I combine CoT with RAG?
Absolutely — they complement each other. RAG brings the facts; CoT integrates them into a coherent answer. This is a common production pattern in enterprise LLM apps.
Q4. Should I show the reasoning to end users?
It depends on the domain. For education, medicine, legal, and other evidence-heavy contexts, showing reasoning builds trust. For customer support chatbots, consider hiding it for brevity.
Advanced Chain of Thought Variants
Beyond the three core forms, several important CoT variants have emerged in the research literature. Keep in mind that the right variant depends on the problem structure.
Tree of Thoughts (ToT)
Introduced by Yao et al. (2023), Tree of Thoughts generalizes CoT from a single linear chain to a tree of candidate reasoning paths. At each step, the model proposes multiple continuations, evaluates them, and expands only the most promising branches. ToT is effective for planning and search problems like the Game of 24, crossword puzzles, and creative writing.
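The propose-evaluate-expand loop can be sketched as a beam search. In a real system `propose` and `score` would both be LLM calls; the toy instance below uses plain arithmetic purely to show the control flow:

```python
def tree_of_thoughts(propose, score, root, depth=2, beam=2):
    """Breadth-first beam search over reasoning paths.
    propose(path) -> list of candidate next thoughts
    score(path)   -> heuristic value of a partial path (higher is better)
    Both would be LLM calls in practice; here they are plain callables."""
    frontier = [[root]]
    for _ in range(depth):
        candidates = [path + [t] for path in frontier for t in propose(path)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]  # keep only the most promising branches
    return frontier[0]

# Toy instance: "thoughts" are numbers, the evaluator prefers large sums.
best = tree_of_thoughts(
    propose=lambda path: [path[-1] + 1, path[-1] * 2],
    score=sum,
    root=1,
)
# best == [1, 2, 4]
```

The `beam` and `depth` knobs are where ToT's cost explodes; keep both small unless the task genuinely needs search.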
Graph of Thoughts (GoT)
A further generalization where nodes in the reasoning graph can be combined and revised, not just expanded. Besta et al. (2023) showed improvements on tasks like sorting and set operations where multiple intermediate results need to be merged.
Program-Aided Language Models (PAL)
Instead of reasoning in natural language, PAL asks the model to emit executable Python that, when run, yields the answer. This is especially effective for arithmetic and symbolic manipulation, where language tokens are a poor substitute for real computation. Important to note: PAL requires a secure Python executor and input sanitization.
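A minimal sketch of the execution side, assuming the model has already emitted a code string that assigns its result to `answer`; the restricted namespace below is illustrative only and is not a real security boundary:

```python
def run_pal_snippet(code, result_var="answer"):
    """Execute a model-emitted arithmetic snippet and read back the result.
    NOT a sandbox: production PAL systems must isolate untrusted code
    (containers, seccomp, a separate interpreter process, etc.)."""
    namespace = {"__builtins__": {}}  # deny the snippet access to builtins
    exec(code, namespace)
    return namespace[result_var]

# Code a model might emit for the apple word problem from earlier:
print(run_pal_snippet("answer = (5 - 2 + 3) // 2"))  # → 3
```

The key property is that the number comes from the interpreter, so arithmetic reliability no longer depends on the model's token-level math.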
Least-to-Most Prompting
Zhou et al. (2022) showed that asking the model to break a problem into subproblems first, then solve each, can help on compositional tasks where single-shot CoT struggles. It is closely related to task decomposition in agent frameworks.
Reflexion
Reflexion-style prompting has the model self-critique its answer after each attempt, then retry. It turns CoT into a mini reinforcement loop at inference time, trading extra tokens for better accuracy.
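The critique-and-retry loop looks roughly like this; `solve` and `critique` are placeholders for two LLM calls, and the stubs below exist only to demonstrate the control flow:

```python
def reflexion_loop(solve, critique, task, max_tries=3):
    """Attempt the task, self-critique, and retry with the feedback.
    critique returns None when it finds no flaw in the answer."""
    answer, feedback = None, None
    for _ in range(max_tries):
        answer = solve(task, feedback)
        feedback = critique(task, answer)
        if feedback is None:
            return answer  # the critic is satisfied
    return answer  # best effort after exhausting the retry budget

# Stubs: the first attempt is flawed, the retry with feedback fixes it.
def solve(task, feedback):
    return "right" if feedback else "wrong"

def critique(task, answer):
    return None if answer == "right" else "step 2 is off"

print(reflexion_loop(solve, critique, "toy puzzle"))  # → right
```

The `max_tries` budget is the cost lever: each retry spends a full solve-plus-critique round of tokens.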
Why CoT Works: Mechanistic Insights
Researchers have proposed several complementary explanations for why CoT helps. You should think of these as partial views of the same underlying phenomenon:
- Computation unrolling: each reasoning token provides additional compute, effectively turning a single forward pass into a longer sequential computation.
- Curriculum effect: intermediate steps anchor the model in subproblems it is good at, letting it compose small reliable steps into a complex answer.
- Distributional matching: human-written reasoning data is abundant on the internet; CoT prompts steer the model into that distribution where its training signal is strongest.
- Mode shifting: CoT shifts the model from “pattern completion” mode to “structured reasoning” mode via prompt conditioning.
The Faithfulness Problem
Important caveat: a CoT chain that looks reasonable is not the same as a chain that caused the final answer. Lanham et al. (2023) and follow-ups demonstrated that models sometimes produce post-hoc rationalizations — the answer is determined by internal processes that the verbalized chain does not faithfully describe. You should be cautious about treating CoT output as explanation in high-stakes settings.
Ongoing research aims to increase faithfulness through fine-tuning, activation probing, and interpretability techniques. In the meantime, practical advice: validate CoT traces against ground truth whenever you can, and do not equate a good-looking chain with a correct chain.
Benchmark Results and Empirical Effects
On classic reasoning benchmarks, CoT dramatically improves performance at sufficient model scale. The original Wei et al. paper reported an improvement from 18% to 57% on GSM8K (grade-school math word problems) when moving from standard prompting to CoT on PaLM 540B. Similar gains appeared on MultiArith, AQuA, and StrategyQA. On smaller models (below roughly 60B parameters), CoT often provides little benefit or can even hurt — a classic example of emergent capability.
Reasoning-specialized models in 2025–2026 (OpenAI’s o-series, Anthropic’s Extended Thinking, Gemini Thinking) internalize CoT and can reach near-human-expert scores on benchmarks like AIME, GPQA, and competition-math. Note that this means explicit CoT prompts on these models can be redundant, though sometimes additional guidance still helps.
When Not to Use CoT
You should resist applying CoT everywhere. Cases where it hurts include:
- Short factual lookups (adds latency with no quality gain).
- Real-time chat where response time dominates UX.
- Simple translation or rewording tasks.
- Workloads where the model is below the scale at which CoT emerges.
- Situations where extra tokens meaningfully increase cost.
Combining CoT with Retrieval and Tools
In modern agent systems CoT is almost always combined with retrieval and tool use. The canonical pattern is: (1) retrieve relevant documents or call a search tool; (2) reason step by step over the retrieved context; (3) optionally call additional tools based on intermediate conclusions; (4) emit the final answer. Important: keep the reasoning visible in the message log, so downstream evaluators can audit the chain and fix prompts when failures surface.
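Steps (1), (2), and (4) of that pattern can be sketched as follows, omitting the optional tool-calling step (3); `retrieve` and `llm` are placeholders for your search backend and model client:

```python
def answer_with_rag_cot(retrieve, llm, question):
    """Retrieve context, then ask for step-by-step reasoning over it."""
    docs = retrieve(question)  # (1) gather facts
    prompt = (
        "Context:\n" + "\n".join(docs) + "\n\n"
        f"Question: {question}\n"
        "Think step by step over the context only, then write "
        "'Final answer:' followed by the answer."  # (2) reason, (4) answer
    )
    return llm(prompt)

# Stubbed backends, just to show the wiring:
reply = answer_with_rag_cot(
    retrieve=lambda q: ["Paris is the capital of France."],
    llm=lambda p: "Step 1: the context names Paris. Final answer: Paris",
    question="What is the capital of France?",
)
# reply ends with "Final answer: Paris"
```

Instructing the model to reason "over the context only" is what keeps the chain grounded in the retrieved facts rather than parametric memory.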
Practical Prompt Engineering Tips
- Be explicit about format. “Think step by step. Number each step.” yields more consistent traces than a vague “think about this.”
- Anchor the final answer. Ask for the final answer after the steps, with a specific delimiter like “Final answer:” so you can parse it.
- Use self-critique. After the first answer, ask “Check your work — are any of the steps wrong?” This is cheap and often catches errors.
- Combine with examples. A single Few-shot CoT example often outperforms a page of instructions.
- Reserve CoT for the hard parts. For multi-step workflows, apply CoT only at the hardest subtask to balance cost and accuracy.
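The first two tips (explicit format plus a parseable delimiter) can be wired together like this; the `Final answer:` convention is the one suggested above, not a standard:

```python
import re

COT_TEMPLATE = (
    "{question}\n"
    "Think step by step. Number each step.\n"
    "Then write 'Final answer:' followed by only the answer."
)

def parse_final_answer(completion):
    """Extract the answer after the delimiter; None means the model
    ignored the format and the call should be retried."""
    match = re.search(r"Final answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

completion = "1. 5 - 2 = 3\n2. 3 + 3 = 6\n3. 6 / 2 = 3\nFinal answer: 3"
print(parse_final_answer(completion))  # → 3
```

Returning `None` instead of raising makes the "model ignored the format" case an explicit, retryable failure in the calling code.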
Outlook
CoT was the prompting idea that started the reasoning era. Its descendants (Self-Consistency, Tree of Thoughts, Reflexion, and the built-in reasoning modes of modern models) continue to extend it. Expect hybrid systems where explicit CoT prompting, internal reasoning, and tool use blend seamlessly. You should keep CoT in your toolbox: it remains one of the highest-ROI prompt engineering techniques available, and an essential mental model for understanding modern AI.
Integration with Modern Reasoning Models
Important evolution: OpenAI’s o1, Anthropic’s Extended Thinking, and Google Gemini’s Thinking mode all embed extensive internal chain-of-thought as a built-in model behavior. The model generates a private reasoning trace before emitting the visible answer. Users see a summarized or hidden version, not the raw chain.
Practical implications for prompt engineers:
- You usually do not need to write “think step by step” to a reasoning model — it already does.
- Many reasoning models expose a thinking-budget control (a cap on reasoning tokens); tuning it trades cost against quality.
- Explicit CoT prompts can still help guide the style or structure of reasoning.
- For non-reasoning models, explicit CoT remains the primary accuracy lever.
Keep in mind that reasoning models are more expensive per call but cheaper per correct answer on hard problems. You should measure cost-per-solved-problem, not just cost-per-token.
CoT in Multi-Agent Systems
Important pattern in complex agent workflows: a “planner” agent produces a high-level CoT, and subordinate “worker” agents execute each step. The planner’s chain is often long and deliberative; the workers’ chains are short and focused. This separation of concerns scales better than trying to do everything in one monolithic chain.
You should pay attention to how the planner’s reasoning is passed to workers. Naive handoffs lose nuance; structured handoffs (with explicit subtask definitions, constraints, and acceptance criteria) preserve the planner’s intent. Keep in mind that an LLM’s “plan” is only as useful as the execution that follows it.
Prompt Engineering Patterns That Enhance CoT
Important patterns that reliably improve CoT quality in practice:
- Role and expertise framing: “You are an experienced mathematician…” can nudge the model into more rigorous reasoning.
- Explicit format control: numbered steps, delimiters around the final answer, and explicit “Show your work” directives.
- Self-verification: after producing an answer, ask the model to verify it or identify potential flaws.
- Counterfactual prompting: “What if the answer were different? Work through that case too.”
- Decomposition templates: “First identify the relevant information. Then outline the approach. Then execute the steps.”
You should build a small library of reliable CoT prompt templates for your domain and treat them like reusable code. Keep in mind that prompt engineering is increasingly a software engineering discipline, with version control, testing, and performance evaluation.
Measuring CoT Success
Important evaluation practices for CoT-based systems:
- Accuracy on held-out problem sets (the fundamental metric).
- Step-level correctness (are individual reasoning steps valid?).
- Latency and token budget (how many tokens does an average success require?).
- Cost per correct answer (a better metric than cost per token).
- Robustness under paraphrase (does the same problem in different words still succeed?).
- Failure mode analysis (where and why does CoT break down?).
Keep in mind that CoT can look impressive in anecdote and brittle in aggregate. You should maintain automated evaluations and regenerate metrics whenever you tweak prompts or upgrade models.
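Cost per correct answer is simple to compute but easy to forget. A minimal sketch over a list of `(tokens_used, was_correct)` records from one evaluation run:

```python
def cost_per_correct(results):
    """results: list of (tokens_used, was_correct) pairs.
    Returns average tokens spent per solved problem (None if none solved)."""
    total_tokens = sum(tokens for tokens, _ in results)
    solved = sum(1 for _, correct in results if correct)
    return total_tokens / solved if solved else None

# CoT spends more tokens per call but solves more problems:
plain = [(50, False), (55, True), (48, False), (52, False)]
cot = [(400, True), (380, True), (420, False), (390, True)]
print(cost_per_correct(plain))  # → 205.0
print(cost_per_correct(cot))    # → 530.0
```

Multiply by your per-token price to get cost in currency; the comparison between prompting strategies is the same either way.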
Evaluating CoT Quality
Not all chains of thought are equally valuable. You should develop the skill of evaluating CoT outputs because the apparent quality of reasoning does not always correlate with answer correctness. In practice, four dimensions matter most.
First, logical validity: do the intermediate steps actually follow from each other, or does the model make leaps? Important: fluent-sounding chains can hide invalid inferences. Check that each step’s conclusion is supported by the prior step’s premises.
Second, factual grounding: are the facts cited in the chain actually true? Models can produce confident-sounding chains that include hallucinated dates, formulas, or citations. Note that for high-stakes domains, every factual claim in the chain should be verifiable.
Third, completeness: does the chain consider edge cases and alternatives, or does it commit to the first plausible direction? Strong reasoners explore multiple hypotheses before committing. You should push models toward completeness by explicitly requesting alternatives.
Fourth, faithfulness: does the chain reflect the actual computation that led to the answer, or is it a post-hoc rationalization? This is the hardest dimension to evaluate and an active research area.
CoT in Multi-Step Agent Systems
CoT becomes particularly important in agent systems that execute multiple actions. Keep in mind that in these systems, reasoning is interleaved with tool use, and the quality of each reasoning step determines whether the next tool call succeeds.
- ReAct pattern: Yao et al. (2023) introduced the Thought-Action-Observation loop where the model explicitly reasons about what to do, acts via a tool, observes the result, and reasons about the next step
- Plan-and-Execute: The model first produces a multi-step plan using CoT, then executes each step. This separates strategic reasoning from tactical execution
- Reflexion loops: The agent critiques its own CoT after each action, identifying errors and adjusting strategy. Important: this increases cost but improves success rates on hard tasks
- Multi-agent deliberation: Two or more agents exchange CoT-style arguments, and a judge agent selects the best answer. Used in research systems for complex decisions
Note that in all these patterns, CoT is not just a prompting trick—it is the coordination mechanism that makes complex agent behavior possible. Important: as agents become more capable, investment in CoT quality pays compounding dividends.
Conclusion
- Chain of Thought is the prompting technique that asks LLMs to reason step by step before answering.
- Formalized by Google Research in a landmark January 2022 paper.
- Zero-shot CoT needs just one extra sentence like “Let’s think step by step.”
- Self-Consistency and Tree of Thoughts are useful extensions.
- Reasoning models (o1, Extended Thinking, Thinking mode) internalize CoT.
- Best for arithmetic, logic, and multi-step reasoning tasks.
- Watch out for the faithfulness problem and higher token costs.
References
- Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903. https://arxiv.org/abs/2201.11903
- Kojima, T. et al. (2022). "Large Language Models are Zero-Shot Reasoners." arXiv:2205.11916. https://arxiv.org/abs/2205.11916
- Wang, X. et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. https://arxiv.org/abs/2203.11171
- Google Research Blog. "Language Models Perform Reasoning via Chain of Thought." https://research.google/blog/language-models-perform-reasoning-via-chain-of-thought/