Large Language Models (LLMs) like GPT-4 and LLaMA-2 are linguistic wizards, capable of writing poetry, code, and essays with ease. Yet, ask them a multi-step grade school math problem, and they often stumble.
The standard solution to this problem is Chain-of-Thought (CoT) prompting—giving the model a few examples of how to solve similar problems step-by-step before asking it to solve a new one. This is known as few-shot learning. Intuitively, the more examples you show the model, the better it should perform. But there is a hard ceiling: the context window.
If you stuff too many examples into the prompt, you hit the token limit. If you try to compress those examples using standard methods, you often lose crucial numbers or operators, rendering the examples useless for math.
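To make the constraint concrete, here is a toy sketch of how a few-shot CoT prompt gets packed into a fixed window and why the shot count is capped. The limit and budget numbers are illustrative rather than from the paper, and whitespace splitting is only a crude stand-in for real tokenization.

```python
# Toy illustration: packing CoT examples into a fixed context window.
# CONTEXT_LIMIT and ANSWER_BUDGET are illustrative values, not the paper's.
CONTEXT_LIMIT = 4096   # e.g., LLaMA-2's 4K-token window
ANSWER_BUDGET = 512    # room the model needs to write its own step-by-step answer

def build_few_shot_prompt(question: str, cot_examples: list[str]) -> str:
    used = ANSWER_BUDGET + len(question.split())   # crude token count via whitespace split
    shots = []
    for example in cot_examples:                   # each example = problem + worked solution
        cost = len(example.split())
        if used + cost > CONTEXT_LIMIT:            # one more shot would overflow the window
            break
        shots.append(example)
        used += cost
    return "\n\n".join(shots + [question])
```

Once the budget is exhausted, every additional example either gets dropped or crowds out the space the model needs to generate its answer.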
In the paper “Fewer is More: Boosting Math Reasoning with Reinforced Context Pruning,” researchers from HKUST and Microsoft Research Asia introduce CoT-Influx. This novel approach uses a coarse-to-fine pruning mechanism to squeeze more high-quality reasoning examples into the context window by removing unhelpful examples and redundant words. The result? A significant boost in math reasoning capabilities, allowing open-source models like LLaMA2-70B to outperform GPT-3.5 without any fine-tuning.
The Pilot Study: Why We Need Better Pruning
Before building their system, the researchers conducted a pilot study to understand the relationship between the number of examples (shots) and reasoning performance. They tested LLaMA2-7B on the GSM8K math dataset.
Observation 1: More is Better (Usually)
The standard practice is to use 8 manually designed CoT examples. The researchers found that increasing the number of examples generally improves accuracy.

As shown in the top chart of the image above (Figure 1), accuracy climbs as we go from 0 to 16 examples. However, once we hit 20 examples, we run into the context window limit (indicated by the red box). The model doesn’t have enough space left to generate the answer, causing performance to crash.
Observation 2: Quality Matters
Quantity isn’t everything. Simply dumping random examples into the prompt actually hurts performance compared to carefully curated ones.

Table 1 illustrates that random 16-shot examples perform worse than the standard manual 8-shot examples. This implies we need a smart way to select which examples to include.
Observation 3: Redundancy is Rife
Finally, the researchers looked at the text within the examples. Natural language is redundant. We use many filler words (“Let’s think step by step,” “The answer is,” “There are”) that don’t strictly contribute to the mathematical logic.
However, existing compression tools are dangerous for math. Look at the bottom half of the first image (Figure 2). It shows an example compressed by LLMLingua. The tool pruned “unimportant” tokens, but it accidentally removed the number “15” (colored in red). In math, if you delete the numbers, the reasoning becomes hallucination.
The Hypothesis: If we can select the most helpful examples (Observation 2) and prune only the truly redundant words while keeping the math intact (Observation 3), we can fit significantly more examples into the context window (Observation 1), thereby boosting performance.
Enter CoT-Influx: The Method
CoT-Influx is a plug-and-play module that sits between the user and the LLM. It doesn’t require fine-tuning the massive LLM itself. Instead, it processes the input prompt to maximize information density.

As illustrated in Figure 3, the system follows a Coarse-to-Fine pruning strategy:
- Shot Pruner: From a large pool of retrieved examples (e.g., the top-k most relevant problems), it decides which entire examples are worth keeping.
- Token Pruner: Within the examples that survive step 1, it prunes individual redundant tokens (filler words, spaces). A rough sketch of this two-stage pipeline follows below.
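As a hypothetical sketch of that pipeline (the helper names and the whitespace-based token counting are mine, not the paper's):

```python
def cot_influx_prune(candidate_shots, keep_shot, keep_token, budget):
    """Coarse-to-fine pruning sketch. `keep_shot` and `keep_token` stand in for the
    learned pruner policies and return 1 (keep) or 0 (prune)."""
    # Coarse stage: drop entire examples that the shot pruner deems unhelpful.
    kept = [shot for shot in candidate_shots if keep_shot(shot) == 1]
    # Fine stage: within the survivors, drop redundant words/tokens.
    compressed = [
        " ".join(tok for tok in shot.split() if keep_token(tok, shot) == 1)
        for shot in kept
    ]
    prompt = "\n\n".join(compressed)
    # The compressed shots must still leave room in the context window (crude count).
    assert len(prompt.split()) <= budget, "pruned prompt still exceeds the token budget"
    return prompt
```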
The Mathematical Formulation
The goal is to select a subset of examples (\(\hat{\mathcal{D}}\)) and compress them (\(\hat{x}\)) such that the LLM’s accuracy is maximized, while the total token count (\(t\)) remains under the context limit (\(T\)).
The transformation process looks like this:

The optimization objective is twofold: minimize the LLM’s loss (perplexity) and maximize the reasoning accuracy (\(R_{Acc}\)), subject to the length constraint:

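Since the figure with the original equation is not reproduced here, the following is a schematic restatement of that objective based on the description above; \(\lambda\) is a hypothetical trade-off weight and the paper's exact notation may differ:

$$
\min_{\theta}\; \mathcal{L}_{LLM}\big(\hat{x} \,\|\, q\big)\;-\;\lambda\, R_{Acc}\big(\mathrm{LLM}(\hat{x} \,\|\, q)\big)
\qquad \text{s.t.}\quad t(\hat{x}) \le T,
$$

where \(\hat{x}\) is the pruned concatenation of the selected examples \(\hat{\mathcal{D}}\), \(q\) is the target question, and \(\theta\) parameterizes the pruner.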
The Architecture
Since CoT-Influx is an external module, it needs its own way to “read” the text. The authors use BERT-Large as a feature extractor.
- Input: A batch of potential CoT examples.
- Embedding: BERT creates vector representations (\(H_{s}\)) of these examples.
- Policy Networks: Two lightweight Multi-Layer Perceptrons (MLPs) act as the brains of the operation.
The Shot Pruner calculates the probability of keeping each example (\(a_{shot}\)):

The Token Pruner calculates the probability of keeping each token (\(a_{token}\)) within the retained examples:

The outputs are simple binary decisions: 1 (keep) or 0 (prune).
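A minimal PyTorch sketch of this architecture is below; the hidden sizes, pooling, and head shapes are assumptions for illustration rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CoTInfluxPruner(nn.Module):
    """Frozen BERT-Large features + two lightweight MLP policy heads (sketch)."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-large-uncased")
        self.encoder.requires_grad_(False)  # used purely as a feature extractor
        self.shot_head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 2))
        self.token_head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, input_ids, attention_mask):
        # H_s: contextual embeddings for every token of every candidate example.
        H = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, L, 1024)
        shot_probs = self.shot_head(H.mean(dim=1)).softmax(-1)   # one keep/prune distribution per example
        token_probs = self.token_head(H).softmax(-1)             # one keep/prune distribution per token
        # Sample binary actions: 1 = keep, 0 = prune (argmax can be used at inference).
        a_shot = torch.distributions.Categorical(shot_probs).sample()
        a_token = torch.distributions.Categorical(token_probs).sample()
        return a_shot, a_token, shot_probs, token_probs
```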
Training with Reinforcement Learning
Here lies the main technical challenge. Pruning is a discrete operation: a token is either kept or dropped. These hard 0/1 decisions break the computational graph, so the LLM's loss cannot be backpropagated through them to train the pruner.
To solve this, the authors employ Reinforcement Learning (RL), specifically the REINFORCE algorithm.
The Reward Function
The agent (the pruner) needs to know if it did a good job. The reward function (\(R\)) is a composite of three factors:
- Effectiveness: How well does the LLM predict the answer? Measured by \(\frac{1}{1 + L_{LLM}}\), where \(L_{LLM}\) is the LLM's loss.
- Accuracy: Did the LLM get the final math answer right? (\(R_{Acc}\)).
- Length Constraint: Did the prompt fit in the window? (The term \([\frac{t}{T}]^w\)).

If the pruner cuts crucial numbers (causing the LLM to fail) or leaves too much fluff (exceeding the window), the reward drops.
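The screenshot of the exact reward equation is not reproduced above, so here is one plausible way to combine the three factors in code; how they are composed, the exponent \(w\), and the handling of prompts that overflow the window are assumptions, not the paper's exact definition:

```python
def composite_reward(llm_loss: float, answer_correct: bool,
                     prompt_tokens: int, window: int, w: float = 1.0) -> float:
    """Sketch of the composite reward described above (composition is an assumption)."""
    effectiveness = 1.0 / (1.0 + llm_loss)        # 1 / (1 + L_LLM): lower loss -> higher reward
    accuracy = 1.0 if answer_correct else 0.0     # R_Acc: did the final answer match?
    if prompt_tokens > window:                    # overflow handling assumed here: no reward
        return 0.0
    length_term = (prompt_tokens / window) ** w   # [t/T]^w: rewards filling, not exceeding, the window
    return (effectiveness + accuracy) * length_term
```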
The parameters (\(\theta\)) of the policy networks are updated using the policy gradient:

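Since the gradient figure is not shown above, the update can be read as the standard single-sample REINFORCE estimator over both pruners' keep/prune decisions (variance-reduction terms such as a reward baseline are omitted here):

$$
\nabla_{\theta} J(\theta)\;\approx\; R\cdot\nabla_{\theta}\Big[\sum_{i}\log \pi_{\theta}\big(a^{shot}_{i}\mid H_{s}\big)\;+\;\sum_{j}\log \pi_{\theta}\big(a^{token}_{j}\mid H_{s}\big)\Big]
$$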
Stabilizing the Training
RL training with an LLM in the loop is notoriously unstable. The authors use two tricks to stabilize it (sketched in code after this list):
- Difficulty-Aware Filtering: They used GPT-4 to create a dataset (MRD³) with varied difficulty levels. During training, they filter for easier questions. If a question is too hard, the LLM will get it wrong regardless of the prompt quality, providing no useful signal to the pruner.
- Anchor Shots: In the early stages of training, the pruner might randomly delete everything. To prevent this, they append a few “safe,” manually designed examples to ensure a baseline level of performance.
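As a hypothetical sketch of how these two tricks might look when assembling a training batch (field names such as "difficulty", the threshold, and the pool size are illustrative, not from the paper):

```python
from random import sample

def make_training_pool(mrd3_examples, anchor_shots, max_difficulty=3, pool_size=16):
    """Difficulty-aware filtering + anchor shots (sketch; thresholds are assumptions)."""
    # 1) Keep only questions easy enough that the LLM's success or failure
    #    actually reflects prompt quality rather than problem difficulty.
    easy = [ex for ex in mrd3_examples if ex["difficulty"] <= max_difficulty]
    # 2) Draw a candidate pool for the pruner, then append safe, manually written
    #    anchor shots so the prompt never collapses to nothing early in training.
    return sample(easy, k=min(pool_size, len(easy))) + anchor_shots
```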
Experimental Results
The researchers evaluated CoT-Influx on various LLaMA2 models (7B, 13B, 70B) across multiple math datasets, including GSM8K.
Pushing the Boundary of Few-Shot Learning
Does CoT-Influx actually allow for more examples? Yes.

Figure 4 shows the accuracy of CoT-Influx (blue line) compared to standard TopK retrieval (grey line) and fixed few-shot (black line).
- Performance Peak: LLaMA2-13B achieves peak performance at 48 shots.
- Efficiency: Even with 48 examples, the token pruning keeps the input manageable.
- Impact: The accuracy gain is substantial compared to the baselines.
Comparison with State-of-the-Art
Table 2 compares CoT-Influx against other methods, including aggressive compression techniques like LLMLingua and Selective Context.

Key Takeaways:
- Compression Baselines Fail: Notice that methods like TopK+Selective Context result in abysmal accuracy (0.45% for LLaMA2-7B). This confirms that generic compression destroys mathematical reasoning logic.
- CoT-Influx Dominates: It achieves 59.59% accuracy on LLaMA2-70B, significantly higher than the standard few-shot approach (55.42%).
Beating the Giants
Perhaps the most impressive result is how this method allows open-source models to punch above their weight class.

As shown in Table 4, LLaMA2-70B equipped with CoT-Influx (59.6%) outperforms GPT-3.5 (57.1%) and comes remarkably close to the Google Minerva 540B model. This is achieved without fine-tuning the model weights—purely by optimizing the context.
What about Long-Context Models?
One might argue: “Why do we need pruning? Newer models have context windows of 32k or 200k tokens. Just put all the examples in!”
The authors tested this hypothesis on Mistral-7B-32K and Yi-6B-200K.

Figure 5 reveals a counter-intuitive finding:
- Diminishing Returns: Simply increasing the context (orange lines) does not linearly increase accuracy; it often fluctuates or degrades.
- Efficiency Wins: CoT-Influx (blue line) achieves higher accuracy with a fraction of the tokens. For Yi-6B-200K, it achieves ~2.5% higher accuracy with 15x fewer input tokens. This translates to massive savings in inference cost and latency.
How Much is Pruned?
The system is aggressive but selective.

Figure 6 shows the pruning ratio. The Shot Pruner does the heavy lifting (3.87x compression), filtering out less relevant examples. The Token Pruner refines the rest (bringing total compression to 4.28x).
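If the two ratios compose multiplicatively (a reasonable reading of the reported numbers), the token pruner's own contribution works out to roughly

$$
\frac{4.28}{3.87}\;\approx\;1.11\times,
$$

i.e., the shot pruner discards about three quarters of the candidate text, and the token pruner trims roughly a further 10% of what survives.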
An ablation study (Table 9) confirms that both stages are necessary. Removing the token pruner hurts performance significantly, proving that word-level redundancy is a major bottleneck.

Why Does It Work? The Importance of Dataset
A crucial component of this success is the dataset used to train the pruner: MRD³ (Math Reasoning Dataset with Diverse Difficulty).
Using GPT-4, the authors evolved standard math problems to create variations with different constraints and reasoning depths.

This diversity is vital. Interestingly, the researchers found that different LLMs prefer different examples.
- Small Models (7B): Prefer simpler, easier examples.
- Large Models (70B): Benefit more from complex, multi-step examples.

CoT-Influx automatically learns these preferences, selecting the “Goldilocks” examples for the specific model it is serving.
Conclusion
CoT-Influx demonstrates that when it comes to in-context learning, quality and density beat quantity. By intelligently selecting helpful examples and surgically removing redundant words, we can feed LLMs “super-prompts” that drastically improve their math reasoning skills.
This approach offers a practical path forward for deploying LLMs. Instead of relying on expensive fine-tuning or massive, slow-context windows, we can use lightweight, reinforced pruning modules to unlock the latent reasoning potential of existing models.
As the paper title suggests, in the world of efficient AI, fewer tokens—if they are the right tokens—really is more.