Large Language Models (LLMs) have become astonishingly capable, solving problems once reserved for human experts—in mathematics, coding, and scientific reasoning. Traditionally, we improve these models by scaling them up and retraining on ever-larger datasets, a process that demands immense computational resources. But what if we could make an existing model think better without retraining at all?
That idea lies at the heart of inference-time computation. Inspired by human reasoning, these techniques give a model more time and computation at inference, letting it "pause and deliberate" before answering. They can significantly boost reasoning performance without touching a single model parameter.
The field, though, is still chaotic: different methods, inconsistent setups, varying success across tasks. The paper “Bag of Tricks for Inference-time Computation of LLM Reasoning” by Fan Liu et al. brings much-needed structure. Using over 20,000 GPU-hours, the authors systematically explored simple but impactful “tricks” that determine how well LLMs reason during inference. Let’s unpack their insights and what they mean for improving reasoning efficiency.
The Proposer–Verifier Pipeline: Teaching Models to Think in Two Steps
Most inference-time strategies follow a two-phase workflow—a proposer–verifier pipeline:
- Propose: The model generates several candidate reasoning paths or answers.
- Verify: A separate component evaluates those candidates and selects the best one.
This approach mirrors how humans think: brainstorm ideas first, then test and select the soundest solution.
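In code, the whole pipeline is just a loop over candidates plus a selection rule. Here is a minimal Python sketch; the `generate_candidates` and `score_candidate` callables are hypothetical stand-ins for whatever proposer (the LLM) and verifier (a reward model, voting, etc.) you plug in, not APIs from the paper.

```python
from typing import Callable, List

def propose_and_verify(
    question: str,
    generate_candidates: Callable[[str, int], List[str]],  # proposer: returns N candidate solutions
    score_candidate: Callable[[str, str], float],           # verifier: scores a single candidate
    n_candidates: int = 8,
) -> str:
    """Generic proposer-verifier loop: sample N candidates, keep the highest-scoring one."""
    candidates = generate_candidates(question, n_candidates)
    scores = [score_candidate(question, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```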
Figure 1: Overview of inference-time computation. The pipeline connects instruction prompts, reasoning tasks, inference models, reward models, and computational strategies such as Best-of-N Sampling, Self-Consistency, MCTS, and Self-Refine.
The paper dissects both stages—how to generate better candidate solutions and how to choose among them more effectively. Each stage hides subtle yet powerful levers.
Part 1: The Art of Generating Better Candidate Solutions
A model’s final answer can be only as good as the best candidate it produces. The authors examined three crucial factors that govern the diversity and quality of those candidates.
1. Instruction Prompt Type: Guiding the Thinking Process
The way we ask a question defines how the model reasons. The study evaluated three common prompt styles:
- Input–Output (IO): Directly asks for the final answer.
- Chain-of-Thought (CoT): Encourages step-by-step reasoning.
- Reflect CoT: Expands CoT by asking the model to reflect and verify each step.
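To make the three styles concrete, here are illustrative templates (paraphrased for this post, not the exact wording used in the paper):

```python
IO_PROMPT = (
    "Question: {question}\n"
    "Give only the final answer."
)

COT_PROMPT = (
    "Question: {question}\n"
    "Think step by step, then state the final answer."
)

REFLECT_COT_PROMPT = (
    "Question: {question}\n"
    "Think step by step. After each step, reflect on whether it is correct, "
    "revise if needed, then state the final answer."
)

# Example: build a CoT prompt for a single question.
prompt = COT_PROMPT.format(question="What is 17 * 24?")
```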
Figure 2: Accuracy across reasoning tasks with different prompt types. CoT consistently outperforms direct IO prompting, while Reflect CoT yields mixed results.
The findings are decisive. CoT prompting outperforms IO by a wide margin—models do better when they explicitly outline their reasoning chain. The reflective variant, however, didn’t guarantee improvement. LLMs currently struggle with self-correction: “reflecting on mistakes” often amplifies them instead.
Takeaway: Use Chain-of-Thought prompting—it delivers a reliable lift in reasoning accuracy without added complexity.
2. Temperature: Balancing Randomness and Precision
In LLMs, temperature (τ) controls randomness in output sampling:
- Low τ (e.g., 0.2): highly deterministic, minimal exploration.
- High τ (e.g., 1.0): more creative, but less coherent.
When generating multiple candidates, diversity matters—but excessive randomness undermines reasoning quality.
Figure 3: Accuracy versus temperature. Most models across tasks peak around τ = 0.8.
Across the tested tasks, accuracy generally peaked around τ = 0.8, the best balance between confidence and exploration. Smaller or larger values reduced performance, confirming that moderate diversity is crucial for producing useful candidate solutions.
Takeaway: Temperature ≈ 0.8 should be your default for effective candidate diversity.
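Under the hood, temperature simply rescales the model's logits before sampling. A minimal NumPy sketch of that mechanic (not tied to any particular model):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, tau: float = 0.8, rng=None) -> int:
    """Sample a token index from logits rescaled by temperature tau."""
    rng = rng or np.random.default_rng()
    scaled = logits / tau                   # tau < 1 sharpens, tau > 1 flattens the distribution
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```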
3. Top-p: Controlling Vocabulary Breadth for Coherence
Top-p (nucleus sampling) selects tokens from the smallest subset whose cumulative probability exceeds a threshold p—restricting the model’s vocabulary dynamically.
- Low p (e.g., 0.6): more focused but rigid.
- High p (e.g., 1.0): diverse but may include implausible tokens.
Figure 4: Accuracy versus Top-p parameter. Performance stabilizes around Top-p = 0.9 across models and tasks.
Performance rose steadily until p ≈ 0.9, after which improvements flattened. This value preserves diversity while maintaining the coherence of reasoning.
Takeaway: Set Top-p ≈ 0.9 for best overall results.
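A matching sketch of the nucleus step itself: keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, and sample only from that set.

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Zero out tokens outside the smallest nucleus whose cumulative probability exceeds p."""
    order = np.argsort(probs)[::-1]              # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # keep just enough tokens to pass the threshold
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalize over the kept nucleus
```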
Part 2: Selecting the Optimal Solution
Now that we have multiple candidates, how do we identify the best one? This “verifier” phase determines how effectively the model’s deliberation translates into accurate answers. The authors explore two core approaches.
1. Self-Evaluation: Can Models Grade Their Own Reasoning?
A natural idea: ask the LLM to assess its own work. For example, “Review the following solutions and decide which is most likely correct.” The authors tested this approach against random selection and majority voting.
Figure 5: Comparison of self-evaluation and external selection strategies. Self-evaluation often fails to improve accuracy.
The results are humbling. LLMs are poor judges of their own answers. Self-evaluation methods often perform no better—and sometimes worse—than random selection. Models tend to repeat the same reasoning errors rather than recognizing them.
Takeaway: Avoid self-evaluation for critical verification. Use external evaluators or structured heuristics instead.
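One such structured heuristic is plain majority voting over the candidates' final answers (the idea behind Self-Consistency), which needs no extra model at all. A minimal sketch, assuming each candidate's final answer has already been extracted as a string:

```python
from collections import Counter
from typing import List

def majority_vote(final_answers: List[str]) -> str:
    """Pick the final answer that appears most often among the candidates."""
    normalized = [a.strip().lower() for a in final_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

print(majority_vote(["408", "408", "406", "408 "]))  # -> "408"
```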
2. Reward Models: External Judges that Score Reasoning Quality
If the model cannot self-assess, bring in an external reward model (RM)—a separate system trained to score candidate outputs. The study compared several types:
- LLM-as-Judge: A large model asked to verify reasoning step-by-step.
- RLHF Reward: Trained on human preference data.
- Proof-Critical Reward: Tailored for formal mathematical proofs.
Figure 6: Different reward models show varied effectiveness across reasoning tasks.
Findings varied by domain:
- For knowledge-based reasoning, the RLHF reward model excelled.
- For math and code, the LLM-as-Judge that assessed process correctness offered the strongest gains.
Evaluating how an answer was derived often proved more valuable than judging just the final result.
Takeaway: For complex reasoning tasks, use process-based reward models—they better capture true logical quality.
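Plugging an external reward model into the verifier slot of the earlier pipeline amounts to scoring each candidate's full reasoning trace and taking the argmax. The `reward_model` callable below is a placeholder for whichever judge you use (RLHF reward, LLM-as-Judge, and so on); it is not an API from the paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    question: str,
    candidates: List[str],                      # full reasoning traces, not just final answers
    reward_model: Callable[[str, str], float],  # external judge: (question, trace) -> score
) -> Tuple[str, float]:
    """Score every candidate with an external reward model and return the best one with its score."""
    scored = [(c, reward_model(question, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```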
3. The Generalization Gap: More Isn’t Always Better
Intuitively, generating more candidates (larger N in Best-of-N) should improve results—but it doesn’t always. In some cases, performance declined as N increased.
Figure 7: Test-time scaling with different reward models. More candidates do not guarantee better outcomes—especially for hard reasoning tasks.
The reason? Reward-model generalization limits. As the candidate space grows, RMs sometimes mistake plausible-but-wrong answers for correct ones and inflate their scores. The result is a "performance illusion": spending more computation can lower true accuracy.
Takeaway: Reward models themselves can become bottlenecks; their ability to generalize across reasoning patterns remains an open challenge.
Part 3: Benchmarking Inference-Time Computation Methods
Armed with these insights, the authors benchmarked six prominent strategies under fair, fixed token budgets: Best-of-N, Step-level Best-of-N, Beam Search, MCTS, Self-Consistency, and Self-Refine.
Figure 8: Optimal combinations of inference-time tricks are task-dependent. Improvements are not always additive.
The benchmark provides critical lessons:
- No Universal Winner: Best-of-N and Self-Consistency excelled for factual reasoning (e.g., QA), while math-heavy tasks favored larger and more specialized models.
- More Tokens ≠ More Accuracy: Beam Search consumed more tokens but yielded minimal improvement—underscoring inefficiency.
- Efficiency Counts: Self-Consistency and Self-Refine delivered higher accuracy at the same computational cost.
Figure 15: Efficiency comparison across inference-time strategies. Self-Consistency and Self-Refine achieve higher accuracy at moderate token budgets.
These benchmarks set a standard: future methods must be judged not just by raw accuracy but also by token efficiency—how much computation each gain costs.
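A fair comparison along these lines just means tracking tokens alongside accuracy. A small sketch of the bookkeeping; the numbers are hypothetical placeholders for illustration, not results from the paper.

```python
def accuracy_per_kilotoken(correct: int, total: int, tokens_used: int) -> float:
    """Accuracy gained per 1,000 generated tokens: a simple token-efficiency score."""
    return (correct / total) / (tokens_used / 1000)

# Hypothetical run logs, for illustration only.
runs = {
    "best_of_n":        {"correct": 71, "total": 100, "tokens": 220_000},
    "self_consistency": {"correct": 74, "total": 100, "tokens": 180_000},
    "beam_search":      {"correct": 72, "total": 100, "tokens": 310_000},
}
for name, r in runs.items():
    print(name, round(accuracy_per_kilotoken(r["correct"], r["total"], r["tokens"]), 4))
```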
Core Takeaways: A Cheat Sheet for Smarter Inference
From thousands of experiments across multiple models and tasks, several practical insights stand out:
- Prompts Matter: Chain-of-Thought is essential; self-reflective variants can misfire.
- Tune for Balance: Use τ ≈ 0.8 and Top-p ≈ 0.9 for robust reasoning diversity.
- External Verification Wins: Reward models outperform self-evaluation—especially process-focused judges for math and code.
- Beware Generalization Gaps: Reward models can overfit or misjudge plausible wrong answers.
- Benchmark Fairly: Evaluate computation methods under controlled budgets, not just absolute scores.
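Put together, the sampling side of this cheat sheet maps directly onto standard decoding settings. A sketch using Hugging Face Transformers; the model name is just a placeholder, and any instruction-tuned model is configured the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-favorite-instruct-model"   # placeholder, not a recommendation from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 17 * 24?\nThink step by step, then state the final answer."
inputs = tokenizer(prompt, return_tensors="pt")

# Settings reflecting the recommendations above: CoT prompt, tau ~= 0.8, top-p ~= 0.9,
# and several samples so a verifier (reward model or majority vote) can pick the best one.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    num_return_sequences=8,
    max_new_tokens=512,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```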
The Bigger Picture
This research reframes LLM optimization: improving reasoning doesn’t require retraining or larger models—it can emerge from smarter inference-time configurations. Adjusting sampling parameters, prompt design, and reward mechanisms can elevate a model’s reasoning capabilities significantly.
By mastering these “bag of tricks,” practitioners can make existing models more thoughtful, efficient, and reliable—pushing LLM reasoning closer to human-level deliberation without adding a single parameter.
In short: Smarter thinking at inference is a genuine alternative to endless retraining.