Unlocking the Black Box: The Theory Behind Chain-of-Thought in LLMs
If you’ve used modern large language models (LLMs) on hard problems, you know the trick: append a prompt like “Let’s think step-by-step” and—often—the model produces intermediate reasoning and gets the answer right. That simple change, called Chain-of-Thought (CoT) prompting, has become a practical staple for eliciting better performance on math, logic, and reasoning tasks.
But why does CoT help so much? Is it just coaxing the model to reveal what it already knows, or does it fundamentally change what the model can compute?
The paper “Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective” (Feng et al.) gives a clear and surprising answer: CoT does more than nudge the model—it fundamentally changes its computational mode. In short, autoregressive Transformers that generate a step-by-step derivation can solve important classes of problems that the same model cannot solve directly (i.e., in one forward pass), unless the model grows to impractical sizes.
This article walks through the main ideas, explains the intuition (without getting lost in technicalities), and connects the theory to experiments that validate the claims.
Roadmap
- Quick primer: autoregressive Transformers and Chain-of-Thought
- The core theoretical tension: parallel shallow computation vs. sequential depth
- Two concrete math tasks: arithmetic expressions and linear systems
- Why direct prediction is provably hard
- How CoT lets a constant-size Transformer succeed
- A broader class: dynamic programming (DP) and why CoT generalizes
- Experiments: training with CoT vs. direct supervision
- Practical takeaways and open questions
A short primer: autoregressive Transformers and Chain-of-Thought
Autoregressive Transformers (the GPT family) generate tokens one at a time. At each step the Transformer attends to all previous tokens (with a causal mask) and produces the next token. Repeating this generates full answers or derivations.
Two elements are relevant for this discussion:
- Transformer architecture: stacked self-attention blocks + position information + feed-forward networks (FFNs). Each forward pass is a shallow—but wide—computation over the entire input sequence.
- Autoregression loop: after producing one token, the model sees that token as input for the next step. Repeating this loop turns many identical forward passes into a sequential (iterative) computation (sketched in code after this list).
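To make the loop concrete, here is a minimal sketch of greedy autoregressive decoding. The `model` callable and its interface (token ids in, next-token logits out) are illustrative assumptions, not the paper's or any particular library's API.

```python
from typing import Callable, List

def generate(model: Callable[[List[int]], List[float]],
             prompt_ids: List[int],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Greedy autoregressive decoding: one forward pass per generated token.

    `model` is assumed to map a token-id sequence to next-token logits.
    Each generated token is appended to the context, so the same fixed
    network is applied again and again; depth is traded for generation steps.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                                         # one shallow forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)   # greedy argmax
        ids.append(next_id)                                         # feed the output back in
        if next_id == eos_id:
            break
    return ids
```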
Figure: The Transformer block (self-attention + FFN + residuals) is the basic computation unit in autoregressive LLMs.
Chain-of-Thought is a usage pattern: instead of expecting the model to leap to the final answer, we ask it to write out intermediate steps (human-like derivations). The key observation of the paper: those intermediate outputs are not just explanations—they’re computational steps. When the model feeds them back and continues generating, it effectively implements a deeper, sequential algorithm using a fixed shallow block.
The core theoretical tension: shallow-parallel vs. deep-sequential computation
To reason about what a Transformer can or cannot do, the paper uses tools from circuit complexity. Two classes are important:
- \( \mathsf{TC}^0 \): constant-depth, polynomial-size circuits built from threshold (majority-like) gates. These capture what bounded-depth, log-precision Transformers can compute in a single forward pass.
- \( \mathsf{NC}^1 \): circuits of logarithmic depth with bounded fan-in. This class is widely believed to be strictly larger than \( \mathsf{TC}^0 \), and it contains problems that appear to require genuinely sequential depth.
Many nontrivial arithmetic and reasoning tasks are known (or shown) to be \( \mathsf{NC}^1 \)-hard. If a bounded-depth Transformer (a \( \mathsf{TC}^0 \)-like device) could solve them in a single pass, it would imply \( \mathsf{TC}^0 = \mathsf{NC}^1 \), a widely disbelieved collapse in complexity theory. Thus, direct single-pass prediction of many reasoning problems is fundamentally limited for Transformers of bounded depth and reasonable size.
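Spelled out, the argument is a standard conditional lower bound:
\[ \text{task is } \mathsf{NC}^1\text{-hard} \;\wedge\; \text{task is solvable in } \mathsf{TC}^0 \;\Longrightarrow\; \mathsf{NC}^1 \subseteq \mathsf{TC}^0, \]
which, combined with the known inclusion \( \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \), would force the two classes to coincide.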
Figure: Complexity hierarchy (informal). Bounded-depth parallel models like log-precision Transformers lie in \( \mathsf{TC}^0 \); many reasoning problems need more depth (e.g., belong to \( \mathsf{NC}^1 \) or beyond).
But the autoregressive generation loop can reuse the same shallow block over and over, each time taking the previously generated token(s) as new input. The effective depth therefore grows in proportion to the number of generated steps, turning a shallow circuit into a deeper one that can simulate algorithms requiring sequential reasoning.
That observation is the heart of the theoretical story: CoT lets a fixed-size Transformer simulate deeper computations by using the generated chain as tape / intermediate memory.
Two concrete math problems: where CoT matters
The paper studies two fundamental tasks to illustrate the point:
- Evaluating arithmetic expressions (with operators and parentheses).
- Solving systems of linear equations (Gaussian elimination style).
Both tasks are common subroutines in more complex mathematical reasoning and make for clear CoT-friendly derivations.
Why direct prediction is hard (the impossibility results)
Under reasonable and practical assumptions on arithmetic precision (the paper assumes “log-precision”, i.e., neuron values carry \(O(\log n)\) bits of precision, where \(n\) is the input length), the authors prove:
- For any fixed Transformer depth \(L\) and any polynomial-size network of that depth, there exist input lengths for which the Transformer cannot reliably compute the correct value of an arithmetic expression, unless \( \mathsf{TC}^0 = \mathsf{NC}^1 \). (Informally: arithmetic expression evaluation is \( \mathsf{NC}^1 \)-hard, but bounded-depth log-precision Transformers sit in \( \mathsf{TC}^0 \).)
- The same impossibility holds for solving linear systems: a bounded-depth log-precision Transformer of polynomial size cannot solve general linear equation instances of sufficiently large size.
These impossibility results rest on reductions from classic \( \mathsf{NC}^1 \)-complete problems (e.g., Boolean formula evaluation) and the mapping between Transformer forward passes and shallow circuit classes.
Intuitively: asking a shallow, highly parallel block to produce the answer to an inherently sequential problem in one shot is like expecting a constant-depth circuit to execute a whole multi-step algorithm at once; without more depth (or unrealistically high precision), it fails.
How CoT lets a constant-size Transformer succeed (the constructive results)
Surprisingly, the authors also show the flip side:
- For any input size, there exists an autoregressive Transformer of constant architectural size (constant depth and constant hidden dimension, both independent of the input length) that, when used autoregressively to generate a CoT sequence, produces a correct step-by-step derivation for every arithmetic expression up to that input size (a small worked example follows this list).
- Similarly, a constant-size Transformer can generate correct CoT derivations for systems of linear equations (Gaussian elimination steps), for any number of variables (up to the problem size under consideration).
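To make the arithmetic case concrete, the kind of CoT derivation the construction generates reduces one operation per step, for example
\[ (2+3)\times(4-1) \;\to\; 5\times(4-1) \;\to\; 5\times 3 \;\to\; 15 . \]
(The paper works over a finite field and uses its own token format; this rendering is only illustrative.)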
How is that possible? The construction relies on two insights:
- Softmax attention + multi-heads + FFNs can be arranged to implement a small set of primitives: conditional copies, conditional averages (means), local arithmetic operations, conditional selections, and lookup tables. These primitives let a shallow block implement one “update” step in a parallel algorithm.
- The iterative autoregressive loop uses the same Transformer block repeatedly. Each generated CoT token serves as input to the next step, so a constant block can build a long sequential computation purely through generation. The effective computation depth grows linearly with the number of CoT tokens generated.
A useful way to think about it: the Transformer is a tiny, fixed “processor” and the CoT is a program written as tokens. Each generation step applies the same processor to advance the program state. That converts shallow hardware into deep computation via time (generation steps).
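The processor-and-program picture can be sketched in a few lines: a fixed `step` function (standing in for one forward pass of a small, fixed network) performs exactly one reduction, and the outer loop plays the role of autoregressive generation. The tokenization and the reduction rule below are illustrative assumptions, not the paper's actual construction.

```python
import re

# Innermost fully parenthesized binary operation, e.g. "(2+3)" inside a larger expression.
_INNERMOST = re.compile(r"\((\d+)([+*])(\d+)\)")

def step(expr: str) -> str:
    """One 'forward pass' of the fixed processor: rewrite a single innermost operation."""
    match = _INNERMOST.search(expr)
    if match is None:
        return expr  # fully reduced; nothing left to rewrite
    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
    value = a + b if op == "+" else a * b
    return expr[:match.start()] + str(value) + expr[match.end():]

def chain_of_thought(expr: str) -> list[str]:
    """Apply the same fixed step until only a number remains; the trace is the CoT."""
    trace = [expr]
    while _INNERMOST.search(trace[-1]):
        trace.append(step(trace[-1]))
    return trace

print(chain_of_thought("((2+3)*(4+1))"))
# ['((2+3)*(4+1))', '(5*(4+1))', '(5*5)', '25']
```

The "hardware" never changes; only the number of times it is applied grows, and the trace of intermediate expressions is the deep computation.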
The authors provide careful constructions and error analyses showing the parameter magnitudes remain polynomial in input size, so the constructions are realistic under the log-precision model.
Generalizing: Chain-of-Thought implements Dynamic Programming
The arithmetic and linear algebra examples highlight CoT for math. The paper goes broader: dynamic programming (DP).
Dynamic programming solves complex problems by decomposing them into an ordered set of overlapping subproblems and solving them in topological order: compute dp for some base cases, then use those results to compute dependent states, and finally aggregate.
Formally, many DP formulations can be written as
\[ \mathsf{dp}(i) = f\bigl(n,\ i,\ s_{g_1(n,i)},\dots,s_{g_J(n,i)},\ \mathsf{dp}(h_1(n,i)),\dots,\mathsf{dp}(h_K(n,i))\bigr), \]
where \(s_1,\dots,s_n\) is the input sequence, so each state depends only on a few input tokens and a few previously computed states; a later aggregation step collects the needed dp values into the final answer.
The main theorem in the DP section says, roughly:
- Under mild and practical assumptions (the state space size is polynomial, the dp transition and aggregation functions can be approximated efficiently by small MLPs, etc.), a constant-size autoregressive Transformer can generate the entire CoT sequence of dp states in topological order and thereby solve the DP problem for all inputs up to a given length.
This is a strong and general statement. It implies CoT enables Transformers to emulate large families of classical algorithmic solutions—LIS (longest increasing subsequence), edit distance, and many other DP algorithms fall into this framework.
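As a concrete instance of the template above, here is longest increasing subsequence written so that each state \( \mathsf{dp}(i) \) depends only on input tokens and previously computed states, with every state emitted in topological order as a CoT-style trace. The trace format is an illustrative choice, not the paper's tokenization.

```python
def lis_with_trace(s: list[int]) -> tuple[int, list[str]]:
    """LIS in dp-template form: dp(i) = 1 + max over j < i with s[j] < s[i] of dp(j).

    Emitting every dp state in topological (left-to-right) order is exactly the
    kind of CoT sequence the theory says a constant-size autoregressive
    Transformer can generate; the final aggregation is a max over all states.
    """
    dp: list[int] = []
    trace: list[str] = []
    for i in range(len(s)):
        # transition f: reads input token s[i] and previously emitted dp values
        dp_i = 1 + max((dp[j] for j in range(i) if s[j] < s[i]), default=0)
        dp.append(dp_i)
        trace.append(f"dp({i}) = {dp_i}")
    answer = max(dp, default=0)            # aggregation step
    trace.append(f"answer = {answer}")
    return answer, trace

answer, trace = lis_with_trace([3, 1, 4, 1, 5, 9, 2, 6])
print(trace)   # ['dp(0) = 1', 'dp(1) = 1', ..., 'answer = 4']
print(answer)  # 4
```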
At the same time, the authors show matching impossibility results: some problems in this family (e.g., context-free grammar membership testing, which is P-complete) remain out of reach for direct prediction by a bounded-depth Transformer (again under standard complexity conjectures), yet are solvable in the autoregressive CoT sense by repeated application of the model.
Together, these results position CoT as an architectural mechanism for implementing sequential algorithmic processes using autoregressive generation.
Figure: A DP state transition typically looks up a small number of input tokens and a small number of previously computed dp states to compute the current dp.
Experiments: can models learn CoT solutions from data?
The theoretical constructions show what is possible. But can practical Transformers learn these CoT procedures from data? The paper runs careful experiments on four tasks:
- Arithmetic expression evaluation (finite-field arithmetic)
- Linear equation solving (Gaussian elimination over a finite field)
- Longest Increasing Subsequence (LIS)
- Edit Distance (ED)
For each task the authors create two dataset types, sketched after this list:
- Direct datasets: (problem → final answer)
- CoT datasets: (problem → full step-by-step derivation + final answer)
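For intuition, the two supervision formats might look like the following for the arithmetic task. The token format here is an illustrative assumption (the paper's datasets use finite-field arithmetic and their own formatting):

```python
# Direct supervision: the model must jump from problem to final answer.
direct_example = {
    "input":  "((2+3)*(4+1))=",
    "target": "25",
}

# CoT supervision: the target spells out every intermediate reduction.
cot_example = {
    "input":  "((2+3)*(4+1))=",
    "target": "(5*(4+1))=(5*5)=25",
}
```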
They train standard (small) Transformers (hidden size 256, 4 attention heads) at different depths (3, 4, and 5 layers), using the same training budget and tokenization.
The experimental results are striking:
- Models trained on CoT datasets (even shallow 3-layer Transformers) achieve near-perfect accuracy across all tasks and difficulty levels.
- Models trained on direct prediction perform much worse, often failing dramatically as problem size increases. Increasing depth helps some, but not enough to match CoT-trained models.
- CoT-trained models generalize well to longer inputs than seen at training time (length extrapolation), indicating they learned algorithmic structure rather than merely memorizing input-output pairs.
- CoT training is robust: even when parts of CoT steps are missing or corrupted in the training data, models still learn to solve the tasks with high accuracy.
Below is a compact visualization summarizing the main experimental comparison across the four algorithmic tasks. The purple bars are CoT-trained models (3 layers) and the blue bars are directly trained models at various depths.
Figure: CoT training consistently produces near-perfect performance, while direct prediction degrades with problem complexity and usually fails even for deeper models.
The paper also includes an extrapolation test: trained on arithmetic problems with up to 15 operators, the CoT model was tested on 16–18 operators and retained very high accuracy—evidence it learned the underlying recursive algorithm.
Figure: Length extrapolation results for arithmetic expression evaluation. A CoT-trained model generalizes to longer expressions.
Putting the pieces together: what this explains about LLM behavior
Why does the simple instruction “Let’s think step-by-step” frequently turn a wrong answer into a correct one?
- By asking for intermediate reasoning, we convert the task from a single-shot computation (which shallow forward passes struggle with) into a sequential, iterative process the model can implement through autoregressive generation.
- The autoregressive loop converts a fixed shallow block into a deeper computation performed over time. Each generated step is both an output and an input for the next step.
- Empirically, exposing correct intermediate steps during training (or prompting with a few CoT examples) helps the model learn the update rule for the DP-like algorithm; once learned, the model can reproduce multi-step derivations and arrive at correct answers—even on longer instances.
In other words, CoT is not merely a prompting trick to make LLMs more persuasive; it unlocks a different computational regime that is better suited for algorithmic, stepwise reasoning.
Practical takeaways
- When the task has an inherently sequential structure (math derivations, algorithmic reasoning, many DP-style problems), prefer training or prompting that produces intermediate steps. CoT is not just interpretable—it’s computationally enabling.
- Small, shallow Transformers can implement nontrivial algorithms if they learn to generate CoT sequences. You don’t necessarily need a huge depth for each forward pass; depth can be amortized across autoregressive steps.
- If you’re training models to explain or show calculations, adding CoT-style supervision can substantially improve final-answer correctness (not just interpretability).
- CoT training helps models generalize algorithmically (length extrapolation), so it can be a practical strategy when you care about robustness to larger inputs.
Limitations, caveats, and open questions
The paper answers key expressivity questions but leaves important directions open:
- Triggering CoT: The theory assumes a model generates CoT steps; it doesn’t explain why certain prompts (e.g., “Let’s think step-by-step”) reliably coax this behavior in practice for large models. Understanding how prompts interact with learned internal dynamics remains an open problem.
- Model scaling: Empirically, larger models show more emergent CoT ability. How does scale change the ability to learn CoT strategies? The paper focuses on expressivity given an appropriate construction, not on how scale interacts with learning.
- Learning from limited CoT: The experiments used abundant CoT data. In practice CoT annotations are costly. How few step-by-step examples suffice for learning generalizable algorithms?
- Human-level fidelity: The paper shows the model can simulate algorithmic steps. Whether the generated CoT is faithful and meaningful to humans, or merely a serialized trace of internal computation, is an empirical question requiring further study.
Conclusion
This work gives a clear theoretical and empirical account of why Chain-of-Thought prompting is powerful: it converts an otherwise shallow-parallel computation into a deep-sequential computation through autoregressive generation. Autoregressive Transformers, when used to generate intermediate steps, become capable of implementing classical algorithmic processes (like dynamic programming) and solving problems that are provably infeasible for the same model to solve directly under realistic precision and size constraints.
If you ask a model to “think step-by-step,” you’re not only asking for an explanation—you are enabling it to compute differently. That reframing shifts how we should think about model design, training data, and the role of intermediate supervision when we want models to reason reliably.
If you’re exploring tasks that look algorithmic (math, planning, DP, parsing), this paper gives both a theoretical rationale and practical encouragement to use and study CoT-style supervision.
References and further reading
- For an accessible primer on circuit complexity classes mentioned here, see Arora & Barak, “Computational Complexity: A Modern Approach.”
- For background on Chain-of-Thought prompting and empirical studies of CoT, see Wei et al. (2022), “Chain of Thought Prompting Elicits Reasoning in Large Language Models.”
- For the full technical development, proofs, and detailed experiments discussed in this post, see the original paper: “Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective” (Feng et al.).