When you encounter a particularly tricky math word problem or a convoluted logic puzzle, what is the first thing you do? If you are like most humans, you read it again. You scan the text, identify the core question, and then re-read the details to understand how they fit together. This simple cognitive strategy—re-reading—is fundamental to human comprehension.
However, Large Language Models (LLMs) like GPT-4 or LLaMA typically don’t do this. They process text linearly, reading from left to right, token by token. Once they pass a word, they generally don’t “look back” in the same way a human does when reconsidering the context of a whole sentence.
In the paper “Re-Reading Improves Reasoning in Large Language Models,” researchers from the Institute of Information Engineering (CAS), Beihang University, and Microsoft introduce a method called RE2 (Re-Reading). Their finding is surprisingly simple yet profound: merely prompting an LLM to read the input question twice significantly boosts its reasoning capabilities.
In this post, we will explore why standard LLMs struggle with “global” understanding, how the RE2 method fixes this by simulating bidirectional attention, and the impressive performance gains this simple trick unlocks.
The Problem: The Limits of Unidirectional Attention
To understand why re-reading is necessary, we first need to understand how most modern LLMs process text. Models like GPT-3, ChatGPT, and LLaMA are built on decoder-only Transformer architectures.
These models are autoregressive and use unidirectional attention. This means that when the model is encoding a specific token (word or part of a word), it can only “see” the tokens that came before it. It cannot see the tokens that come after it.
Imagine trying to understand a sentence where the most important context clue is the very last word.
- Sentence: “The bank, which had steep muddy slopes and was covered in lush grass, was difficult to climb.”
- Processing: When the model reads the word “bank” at the start, it doesn’t yet know if this is a financial bank or a river bank. It only figures that out when it hits “muddy slopes” later.
Because the attention mechanism can’t look forward, the model creates a representation of the early words without the full context of the later words. This limitation hinders the model’s ability to perform complex reasoning tasks where understanding the relationship between the beginning and the end of the premise is crucial.
The Solution: RE2 (Re-Reading)
The researchers propose a method called RE2. It does not require retraining the model or changing its architecture. Instead, it is a prompting strategy that forces the model to process the input twice.
How It Works
The concept is straightforward. Instead of feeding the question once, the prompt repeats the question.
Standard Prompt:
Q: [Input Query] A: Let’s think step by step…
RE2 Prompt:
Q: [Input Query] Read the question again: [Input Query] A: Let’s think step by step…
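To make the templates concrete, here is a minimal sketch of how the two prompts can be assembled in Python. The helper names (`build_cot_prompt`, `build_re2_prompt`) are mine for illustration; only the template strings come from the paper.

```python
def build_cot_prompt(question: str) -> str:
    """Standard zero-shot Chain-of-Thought prompt."""
    return f"Q: {question}\nA: Let's think step by step."


def build_re2_prompt(question: str) -> str:
    """RE2 prompt: the question is stated twice before answering."""
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        f"A: Let's think step by step."
    )


if __name__ == "__main__":
    q = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
         "How many tennis balls does he have now?")
    print(build_re2_prompt(q))
```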
By repeating the question, the researchers artificially create a mechanism for bidirectional attention.

As illustrated in the image above:
- Top (Standard CoT): The model processes the question in one pass. The token “Roger” cannot attend to “How many” because “How many” appears later.
- Bottom (RE2): The model processes the question twice. During the Second Pass, when the model processes “Roger” (again), it can attend back to the First Pass. Since the First Pass contains the entire question (including the end), the model now effectively has “future” knowledge of the sentence structure.
The “Bidirectional” Effect
This technique effectively patches the unidirectional limitation of decoder-only models. During the second reading, every token has access to the “global” information provided by the first reading.
The researchers visualized this effect using attention heatmaps from the LLaMA-2 model.

In this heatmap:
- The vertical axis represents the query tokens (what the model is reading now).
- The horizontal axis represents the key tokens (what the model is looking back at).
- The Red Triangle: This area shows the attention of the Second Pass looking back at the First Pass.
The dark spots in that upper triangle prove that when the model reads the question the second time, it is heavily relying on information from the end of the first pass. The model is effectively saying, “Now that I’ve seen the whole question once, I can re-read the beginning with a better understanding of the ending.”
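You can reproduce a rough version of this heatmap yourself with Hugging Face Transformers. The sketch below uses GPT-2 as a small, freely available decoder-only stand-in for LLaMA-2 (which requires gated access) and simply averages the attention heads of the last layer; the paper's exact layer and head selection may differ.

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for LLaMA-2 here; we only care about the qualitative
# pattern of the second pass attending back to the first pass.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

question = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
            "How many tennis balls does he have now?")
re2_prompt = (f"Q: {question}\nRead the question again: {question}\n"
              f"A: Let's think step by step.")

inputs = tokenizer(re2_prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Average the heads of the last layer to get a single seq x seq map.
attn = outputs.attentions[-1][0].mean(dim=0)

plt.imshow(attn.numpy(), cmap="viridis")
plt.xlabel("Key tokens (what the model looks back at)")
plt.ylabel("Query tokens (what the model is reading now)")
plt.title("Self-attention over an RE2 prompt (last layer, head-averaged)")
plt.show()
```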
Formalizing RE2
Mathematically, standard Chain-of-Thought (CoT) reasoning generates a rationale \(z\) and then an answer \(y\) conditioned on the input \(x\). Roughly, the model samples

\[ z \sim p_{\mathrm{LLM}}\big(z \mid \mathrm{prompt}(x)\big), \qquad y \sim p_{\mathrm{LLM}}\big(y \mid \mathrm{prompt}(x), z\big). \]

RE2 changes the input conditioning. Instead of conditioning on \(x\) directly, the input becomes a re-reading operation \(\mathrm{re2}(x)\):

\[ z \sim p_{\mathrm{LLM}}\big(z \mid \mathrm{prompt}(\mathrm{re2}(x))\big), \qquad y \sim p_{\mathrm{LLM}}\big(y \mid \mathrm{prompt}(\mathrm{re2}(x)), z\big), \]

where \(\mathrm{re2}(x)\) simply means "Q: \(x\) Read the question again: \(x\)".

This slight modification allows the rationale generation (\(z\)) to be conditioned on a much richer, "bidirectional" representation of the input.
Experiments and Results
The researchers tested RE2 across a broad suite of benchmarks: 14 datasets spanning arithmetic, commonsense, and symbolic reasoning. They ran these evaluations on several models, including ChatGPT (GPT-3.5), Davinci-003, and LLaMA-2-70B.
Arithmetic Reasoning
The primary test bed was math word problems (using datasets like GSM8K and SVAMP), which require precise understanding of dependencies between variables.

The table above highlights the consistency of the method:
- Vanilla vs. Vanilla+RE2: Adding re-reading improves performance almost universally.
- CoT vs. CoT+RE2: Even when using Chain-of-Thought (which is already a powerful reasoning method), adding RE2 provides a further boost. For example, on the GSM8K benchmark with Davinci-003, performance jumped from 58.98 to 61.64.
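As a rough picture of how such accuracy numbers are produced, here is a minimal evaluation-loop sketch. The `generate` callable is a placeholder for whatever LLM call you use (it is not from the paper), and the answer extraction is deliberately simplistic.

```python
import re


def extract_final_number(text: str) -> str | None:
    """Very crude heuristic: take the last number appearing in the output."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def evaluate(items, make_prompt, generate):
    """items: list of (question, gold_answer) pairs.
    make_prompt: e.g. build_cot_prompt or build_re2_prompt from earlier.
    generate: any callable mapping a prompt string to model output text."""
    correct = 0
    for question, gold in items:
        prediction = extract_final_number(generate(make_prompt(question)))
        correct += int(prediction == str(gold))
    return correct / len(items)


# Usage sketch (plug in a real dataset and model call):
# acc_cot = evaluate(gsm8k_items, build_cot_prompt, generate)
# acc_re2 = evaluate(gsm8k_items, build_re2_prompt, generate)
```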
Does Re-Reading More Help?
If reading twice is good, is reading three times better? The researchers analyzed the “Times of Reading” to find the optimal point.

The data suggests a “sweet spot.”
- Performance peaks at 2 readings.
- Reading 3, 4, or 5 times leads to diminishing returns or even performance degradation.
- Why? The researchers suggest that excessive repetition might confuse the model, causing it to mimic the repetition in its output rather than solving the problem. It also deviates too far from the training data distribution (where questions are rarely repeated 5 times).
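If you want to replicate this ablation, the prompt template generalizes naturally to \(k\) readings. The formatting below is an assumption about how repeated readings might be phrased; the paper's exact wording for more than two readings may differ.

```python
def build_k_read_prompt(question: str, k: int = 2) -> str:
    """Repeat the question k times before answering (k=2 corresponds to RE2)."""
    parts = [f"Q: {question}"]
    # Each additional reading restates the question once more.
    parts += [f"Read the question again: {question}" for _ in range(k - 1)]
    parts.append("A: Let's think step by step.")
    return "\n".join(parts)
```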
Handling Complexity
One of the most interesting findings is that RE2 is particularly effective for complex questions.

- Left Chart: The bars show that as question complexity (number of reasoning steps) increases, the gap between CoT (light beige) and CoT+RE2 (blue striped) often widens or remains robust.
- Right Chart: This shows “n-gram recall,” which measures how much the generated explanation references the original question. RE2 has higher recall, indicating that the model pays closer attention to the specific details of the question when generating its answer.
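The n-gram recall idea can be approximated as the fraction of the question's n-grams that reappear in the generated rationale. The implementation below is a simplified illustration; the paper's exact tokenization and choice of n may differ.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_recall(question: str, rationale: str, n: int = 2) -> float:
    """Fraction of the question's n-grams that also appear in the rationale."""
    q_grams = ngrams(question.lower().split(), n)
    r_grams = ngrams(rationale.lower().split(), n)
    if not q_grams:
        return 0.0
    return len(q_grams & r_grams) / len(q_grams)
```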
Compatibility with Other Methods
RE2 is an “input-side” enhancement, meaning it is compatible with almost any “output-side” prompting strategy.
Thought-Eliciting Prompts: The researchers tested RE2 with other advanced prompts like Plan-and-Solve (PS) and Program-Aided Language (PAL).

As shown above, RE2 boosts Plan-and-Solve and PAL just as it boosts standard CoT.
Few-Shot Learning: RE2 also works when providing examples (few-shot prompting).

Self-Consistency: A popular technique for boosting LLM performance is “Self-Consistency,” where the model generates multiple answers, and the most common one is selected. RE2 can be combined with this.

The combination of CoT + RE2 + Self-Consistency (SC) yielded the highest results (87.70 on SVAMP), proving that RE2 adds value even to the strongest existing inference pipelines.
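Combining the two is mechanically simple: sample several reasoning paths from the RE2 prompt at a non-zero temperature and majority-vote the extracted answers. The sketch below uses the OpenAI Python client with gpt-3.5-turbo purely as an example and reuses `build_re2_prompt` and `extract_final_number` from the earlier sketches; the paper's exact sampling settings are not reproduced here.

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def self_consistent_answer(question: str, n_samples: int = 5) -> str | None:
    prompt = build_re2_prompt(question)  # RE2 prompt from the earlier sketch
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # diversity is required for self-consistency
        )
        prediction = extract_final_number(response.choices[0].message.content)
        if prediction is not None:
            answers.append(prediction)
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0] if answers else None
```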
What about Efficiency?
A valid concern is that doubling the input text doubles the cost and time of processing. The researchers measured the impact on inference time and memory usage.

- Inference Time (Top Chart): There is a slight increase in time, but it is far from double. Because the generation phase (producing the answer) dominates the total runtime, processing a longer input adds relatively little to the overall latency.
- Memory Usage (Bottom Chart): The memory footprint remains almost identical.
This makes RE2 a very “cheap” way to gain accuracy compared to using a larger model or generating dozens of self-consistency paths.
Deep Dive: Visualizing the Attention Shift
To conclusively prove that RE2 changes how the model thinks, the paper provides a detailed visualization of attention matrices.

- Triangle (a), CoT: the Chain-of-Thought block can only attend back to a single pass over the question.
- Triangle (b), CoT+RE2: the visual structure changes. There is a distinct "Second Pass" block, and the connecting lines show that the Chain-of-Thought generation phase pays significantly more attention to the question tokens when RE2 is used.
Quantitatively, the attention weight assigned to the question tokens during generation increased from 0.32 to 0.40 when using RE2. This confirms that re-reading makes the model “respect” the source text more during its reasoning process.
Conclusion
The RE2 method is a testament to the fact that we have not yet hit the ceiling of what current Large Language Models can do with simple prompting changes.
By acknowledging the architectural limitation of unidirectional attention—the inability to “look ahead”—and fixing it with a human-inspired strategy of re-reading, we can unlock better reasoning performance.
Key Takeaways:
- Simplicity: RE2 requires no training, just a prompt modification: Q: {Input} Read the question again: {Input}
- Bidirectional Understanding: It allows decoder-only models to process the start of a sentence with full knowledge of the end of the sentence.
- Universality: It improves performance across math, commonsense, and symbolic reasoning, and works on models from ChatGPT to LLaMA-2.
- Efficiency: It offers significant gains for negligible increases in compute time.
For students and practitioners working with LLMs, RE2 serves as a reminder: sometimes the best way to improve artificial intelligence is to teach it to mimic the basic study habits of human intelligence. When in doubt, read it again.