Large Language Models (LLMs) have become astonishingly good at sounding human, but when it comes to complex, multi-step reasoning—say, solving a tricky math problem or debugging a program—they often stumble. One common fix is to simply give the model more “thinking time” during inference: let it generate multiple answers and choose the best. The problem? If the model isn’t good at judging its own work, it will just produce lots of wrong answers faster. Like asking a student who doesn’t understand algebra to solve a hundred equations—they’ll make the same mistakes over and over.

That’s the challenge tackled by researchers from Fudan University and Meituan in their paper “Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision.” Their central insight: to make LLMs truly effective reasoners, we must teach them not just to think, but to reflect—and not through self-reflection alone. They propose a two-player system. One model, the actor, does the reasoning. Another, the critique model, serves as an expert coach, providing step-by-step feedback so the actor can identify and fix mistakes.

This work introduces an end-to-end framework to build such critique models: from automatically generating critique data without human labeling, to training dedicated critics, to integrating their supervision during both inference and training. The results are impressive—the actor-critic synergy not only boosts accuracy but reshapes how models explore, learn, and improve, especially on the most challenging problems.


The Two-Player Game: Actor, Critique, and Refinement

Before diving into how this system works, let’s clarify the role of each player:

  1. The Actor (\(\pi_\theta\)) — The problem-solver LLM. It receives an input \(x\), such as a math question, and generates a reasoning sequence \(y\).
  2. The Critique Model (\(\pi_\phi\)) — The analytic coach. Given the same input and the actor’s output, it produces step-level feedback \(c\), judging the correctness of each step and explaining errors.
  3. Refinement — The actor takes the critique and revises its answer, producing a new, improved response \(y'\).

This process can repeat multiple times: the critic reviews, the actor refines, and both iterate toward a more reliable solution—a miniature feedback loop of thought and correction.
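
To make this loop concrete, here is a minimal Python sketch of the response → critique → refinement cycle. The callables actor, critic, and refiner are hypothetical stand-ins for calls to the two models, not the paper’s actual code.

```python
from typing import Callable, Tuple

def solve_with_critique(
    question: str,
    actor: Callable[[str], str],                      # x -> y
    critic: Callable[[str, str], Tuple[bool, str]],   # (x, y) -> (all steps correct?, feedback c)
    refiner: Callable[[str, str, str], str],          # (x, y, c) -> y'
    max_rounds: int = 3,
) -> str:
    """Iterate response -> critique -> refinement until the critic finds no errors."""
    solution = actor(question)                        # actor's first attempt
    for _ in range(max_rounds):
        all_correct, feedback = critic(question, solution)
        if all_correct:                               # critic judges every step correct
            break
        solution = refiner(question, solution, feedback)  # actor revises using the critique
    return solution
```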

An example of the response, critique, and refinement process. The critique model provides step-by-step feedback that helps the actor correct its errors and arrive at the right answer.

Figure: Example of the actor receiving feedback from the critique model and producing a refined correct solution.


AutoMathCritique: A Factory for Generating High-Quality Feedback

Training a strong critic depends on data—a large volume of flawed solutions paired with detailed corrective feedback. Manually producing this data is expensive and slow, so the authors created AutoMathCritique, a three-stage automated pipeline for synthesizing precise critique data at scale.

The three-stage overview of the AutoMathCritique framework: constructing flawed paths, generating critiques, and filtering the data for quality.

Figure: The AutoMathCritique workflow comprising flawed path construction, critique generation, and data filtering.

Stage 1: Constructing Flawed Reasoning Paths

Critique models learn by seeing mistakes. To ensure diverse and informative errors, AutoMathCritique generates flawed reasoning paths using three complementary strategies:

  1. RG1 – Sampling from Scratch: Let the actor (Llama3-8B) solve problems freely. Some attempts will be wrong—simple, spontaneous errors that offer authentic examples.
  2. RG2 – Error-Location-Aware Responses: Begin with a correct solution, then deliberately introduce randomness in later steps to produce errors at known points. This yields traceable fault locations useful for critique alignment.
  3. RG3 – Adding Detailed Mistakes: Start with a correct solution and intentionally instruct the model to inject a specific type of error—such as a miscalculation or logic slip—at a particular step. This produces data with labeled error types and positions.

Together, these strategies create rich flawed reasoning paths, enabling the critic to learn where, how, and why reasoning goes wrong.
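
As a rough illustration of how these strategies could be implemented, the sketch below assumes a generic sample(prompt, temperature) text-generation function; the prompt wording and temperature values are illustrative guesses, not the paper’s exact recipe.

```python
import random
from typing import Callable, List

def rg1_from_scratch(sample: Callable[[str, float], str], question: str) -> str:
    # RG1: let the actor attempt the problem freely; some samples will simply be wrong.
    return sample(f"Solve step by step:\n{question}", 1.0)

def rg2_error_location_aware(sample: Callable[[str, float], str],
                             question: str, correct_steps: List[str]) -> str:
    # RG2: keep a prefix of a known-correct solution, then continue generation with a
    # higher temperature so that any error appears after a known step.
    cut = random.randint(1, max(1, len(correct_steps) - 1))
    prefix = "\n".join(correct_steps[:cut])
    continuation = sample(f"Solve step by step:\n{question}\n{prefix}", 1.2)
    return f"{prefix}\n{continuation}"

def rg3_inject_mistake(sample: Callable[[str, float], str], question: str,
                       correct_solution: str, step: int, error_type: str) -> str:
    # RG3: explicitly ask the model to rewrite a correct solution with a chosen error
    # type at a chosen step, yielding data labeled with error type and position.
    prompt = (f"Rewrite this solution to the problem below, introducing a {error_type} "
              f"error at step {step} and keeping earlier steps unchanged.\n"
              f"Problem: {question}\nSolution:\n{correct_solution}")
    return sample(prompt, 0.7)
```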

Stage 2: Generating Critiques

Next, an annotator LLM (GPT-4o) reviews each flawed solution and provides structured critiques. Depending on the available information from stage one, it may receive error hints or reference answers to make its step-by-step judgments. The model labels each step as Correct or Wrong, explaining the reasons in natural language—essentially emulating a human grader checking a student’s logic.

Stage 3: Filtering for Quality

Not every critique is useful. To ensure quality, the authors employ a clever filtering step based on Monte Carlo sampling. Each (query, response, critique) triplet is passed back to the actor model 10 times for refinement. If at least 30% of these refinements yield correct solutions, the critique is considered helpful and retained. Otherwise, it’s discarded. This probabilistic filtering acts as an automatic check for critique effectiveness.
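
A sketch of this filter is shown below, assuming hypothetical refine and is_correct helpers; the 10-sample count and 30% threshold come from the description above.

```python
from typing import Callable

def keep_critique(question: str, response: str, critique: str, gold_answer: str,
                  refine: Callable[[str, str, str], str],   # actor refining with the critique
                  is_correct: Callable[[str, str], bool],   # checks a refinement against the gold answer
                  n_samples: int = 10, threshold: float = 0.3) -> bool:
    """Retain a critique only if it reliably steers the actor to a correct answer."""
    successes = sum(
        is_correct(refine(question, response, critique), gold_answer)
        for _ in range(n_samples)
    )
    return successes / n_samples >= threshold   # keep if at least 30% of refinements succeed
```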

Applying this pipeline to the GSM8K and MATH datasets produced the MathCritique-76k dataset: 76,321 high-quality annotated examples of mathematical reasoning chains supplemented with constructive feedback.

Statistics of the MathCritique-76k dataset, showing the number of unique queries, golden reasoning paths, and generated critiques.

Figure: Statistics of the MathCritique-76k dataset used for training critique models.


Putting the Critic to Work: Supervision at Test-Time and Training-Time

Armed with MathCritique-76k, the authors fine-tuned new Llama3-based critique models and experimented with two major use cases: test-time supervision and training-time supervision.

An illustration of how critique models can provide supervision during test-time (left) through parallel or sequential refinement, and during training-time (right) by improving data quality for self-improvement.

Figure: Critique models providing feedback at test-time (left) and supervision during training-time (right).

Test-Time Supervision: A Coach During the Game

At inference, the actor first produces a solution. The critique model then reviews it, highlights flaws, and the actor refines accordingly. This simple loop leads to large gains.

As shown below, the fine-tuned 8B critique model outperforms GPT-3.5-Turbo as a critic, while the 70B model achieves results on par with GPT-4. They excel not only at identifying errors (discriminability) but also at generating actionable feedback that helps fix mistakes (helpfulness).

A comparison of different critique models on the GSM8K and MATH datasets. The authors’ fine-tuned critique models show strong performance in accuracy, discriminability, and helpfulness.

Figure: Evaluation of critique model performance compared with GPT-3.5 and GPT-4 series critics.

Where Critique Helps the Most: Hard Problems

Figure 5 reveals that the largest performance improvements occur on difficult questions—where the actor would otherwise struggle. Easy problems see minor or no gains, confirming that external supervision is most useful when reasoning complexity rises.

Performance gains from using a critique model across five difficulty levels. The improvement is most significant on more challenging queries.

Figure: Performance improvement from critique models across difficulty tiers.

Breaking Through the Performance Ceiling

A popular strategy to boost reasoning is majority voting (Maj@K)—generate \(K\) solutions and choose the most frequent answer. But without a critic, performance quickly plateaus even as computation grows. With a critique model guiding refinements, the ceiling lifts noticeably: performance continues climbing with more samples, showing that “smarter compute” beats “more compute.”
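
The interaction between voting and critique can be sketched as follows: each of the \(K\) samples optionally receives one critique-and-refine pass before the vote. The helper callables (including extract_answer) are illustrative assumptions, not the paper’s code.

```python
from collections import Counter
from typing import Callable, List, Optional

def maj_at_k(question: str, k: int,
             actor: Callable[[str], str],
             extract_answer: Callable[[str], str],
             critic: Optional[Callable[[str, str], str]] = None,
             refiner: Optional[Callable[[str, str, str], str]] = None) -> str:
    answers: List[str] = []
    for _ in range(k):
        solution = actor(question)                        # sample one solution
        if critic is not None and refiner is not None:    # optional critique-and-refine pass
            feedback = critic(question, solution)
            solution = refiner(question, solution, feedback)
        answers.append(extract_answer(solution))          # reduce to a final answer string
    return Counter(answers).most_common(1)[0][0]          # majority vote over final answers
```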

Performance of majority voting (Maj@K) with and without a critique model as the number of samples (computation) increases. The critic consistently raises the performance ceiling.

Figure: Majority voting performance scales much better with critique supervision.


Training-Time Supervision: Critique-in-the-Loop Self-Improvement

The test-time successes inspired a deeper integration: can the critic help during training too?

This builds on self-improvement, where the model learns iteratively from its own correct outputs. Standard self-improvement faces tail narrowing—the training data becomes dominated by easy problems, leaving hard examples underrepresented and stunting progress.

The paper proposes critique-in-the-loop self-improvement to overcome this issue. During training, incorrect responses aren’t discarded; they’re handed to the critique model, which provides feedback. The actor then refines and, if correct, the new solution is added to the dataset. This way, difficult problems receive proportionally more attention, and the model’s learning distribution stays balanced.

They further adopt a difficulty-aware computation allocation: more sampling, critique, and refinement for difficult cases, less for simple ones.
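
One round of this data collection might look like the sketch below. The model-call helpers are hypothetical, and the per-difficulty sample budget is just an illustrative stand-in for the paper’s difficulty-aware allocation.

```python
from typing import Callable, Dict, List, Tuple

def collect_round(
    examples: List[Tuple[str, str, str]],                 # (question, gold_answer, difficulty)
    actor: Callable[[str], str],
    critic: Callable[[str, str], str],
    refiner: Callable[[str, str, str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    budget: Dict[str, int] = {"easy": 2, "medium": 4, "hard": 8}  # illustrative difficulty-aware allocation
    new_data: List[Tuple[str, str]] = []
    for question, gold, difficulty in examples:
        for _ in range(budget[difficulty]):
            solution = actor(question)
            if is_correct(solution, gold):
                new_data.append((question, solution))       # correct attempts go straight to training data
                continue
            feedback = critic(question, solution)           # wrong attempt: ask the critic
            revised = refiner(question, solution, feedback) # actor refines with the feedback
            if is_correct(revised, gold):
                new_data.append((question, revised))        # keep refinements that now succeed
    return new_data                                         # used to fine-tune the actor
```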

The results speak for themselves. As shown in Figure 6, the critique-in-the-loop method exhibits consistent improvements across iterations and outperforms vanilla self-improvement, especially as the number of exploration samples \(N\) increases.

A comparison of vanilla self-improvement and critique-in-the-loop self-improvement over three iterations. The critique-based method achieves higher and more stable performance.

Figure: Performance comparison of vanilla vs. critique-in-the-loop self-improvement.

The training distribution also improves dramatically. Figure 7 shows a higher proportion of hard problems added to the training data, correcting the tail-narrowing imbalance.

The relative change in the proportion of training data across difficulty levels. The critique-in-the-loop method samples more solutions for hard problems and fewer for easy ones compared to the vanilla method.

Figure: Critique-in-the-loop sampling increases representation of tougher problems.

Evaluation on test sets confirms the effect: models trained with critique supervision outperform vanilla self-improvement by a wide margin on the hardest cases.

Performance on the test set broken down by difficulty. The model trained with critique-in-the-loop self-improvement significantly outperforms the vanilla model on harder problems.

Figure: Test performance across difficulty levels after critique-in-the-loop training.


Deeper Analysis and Insights

Synergy Between Training and Test-Time Critique

Combining both test-time and training-time supervision yields the strongest results. Table 3 shows that models trained with critique-in-the-loop already incorporate the critic’s reasoning awareness—performing robustly even without external critique at test-time. In contrast, self-correcting models (that critique themselves) lag far behind due to unreliable self-assessment.

A table showing the performance of different combinations of training-time and test-time methods. The best results are achieved by combining critique-in-the-loop training with test-time critique supervision.

Figure: Comparing combinations of training-time and test-time supervision approaches.

Can a Small Critic Coach a Big Actor?

Surprisingly, yes. A 3B-parameter critique model was able to guide actor models ranging from 1.5B up to 14B in size. The small critic provided consistent benefits—especially on the harder MATH dataset—highlighting that analytical oversight doesn’t necessarily require immense scale.

The performance of actor models of different sizes with and without supervision from a 3B critique model. The smaller critic provides a consistent boost across all scales.

Figure: Even small critics effectively supervise much larger actor models.

Parallel vs. Sequential Refinement

When scaling test-time computation, should responses be refined in parallel or sequentially? Experiments show that parallel refinement, where multiple (response, critique, refinement) triplets are generated independently, tends to yield more diverse exploration and better results on metrics like Pass@K. Sequential strategies, which refine one response repeatedly, can sometimes overwrite correct answers and offer less variety.
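
The two strategies differ mainly in where the extra computation goes, as in this sketch (the helper callables again stand in for actor and critic calls):

```python
from typing import Callable, List

def parallel_refine(question: str, k: int,
                    actor: Callable[[str], str],
                    critic: Callable[[str, str], str],
                    refiner: Callable[[str, str, str], str]) -> List[str]:
    # K independent (response, critique, refinement) triplets: broader, more diverse exploration.
    results = []
    for _ in range(k):
        solution = actor(question)
        feedback = critic(question, solution)
        results.append(refiner(question, solution, feedback))
    return results

def sequential_refine(question: str, k: int,
                      actor: Callable[[str], str],
                      critic: Callable[[str, str], str],
                      refiner: Callable[[str, str, str], str]) -> List[str]:
    # One chain refined k-1 times: deeper but less diverse, and a later refinement
    # can overwrite an answer that was already correct.
    solution = actor(question)
    results = [solution]
    for _ in range(k - 1):
        feedback = critic(question, solution)
        solution = refiner(question, solution, feedback)
        results.append(solution)
    return results
```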

A comparison of Pass@K performance for parallel and sequential computation scaling. The parallel approach tends to have a slight edge.

Figure: Parallel sampling strategies generally outperform sequential refinement when scaling computation.


The Next Frontier: Teaching Models to Self-Talk

The two-player paradigm—actor and critic—is effective but involves two separate models. Inspired by LLMs that engage in self-talk, where they verbalize their reasoning and corrections, the authors attempt to merge both roles into a single model.

They introduce Self-Talk-via-Critique, which converts structured (reasoning → critique → refinement) data into smooth, natural self-dialogue.

An overview of the three-step process for synthesizing self-talk data from critique data.

Figure: The Self-Talk-via-Critique pipeline produces coherent internal reasoning and reflection chains.

The process:

  1. Insert Critiques into the Reasoning Chain — Feedback from the critique model is woven into the reasoning trajectory, creating an annotated thought process with step-level reflections.
  2. Iteratively Refine and Critique — Errors detected in steps are corrected, and new critiques are integrated until all reasoning is sound.
  3. Smooth into Self-Talk Style — Transitional phrases and connectors are added, producing a natural thought flow (“Wait, let’s check that…”).
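
As a toy illustration of steps 1 and 3, step-level critiques can be interleaved with the reasoning steps and joined with connective phrasing. The data shape and the connector phrases here are assumptions for illustration, not the paper’s actual templates.

```python
from typing import List, Tuple

def to_self_talk(annotated_steps: List[Tuple[str, bool, str]]) -> str:
    """Each item is (reasoning_step, step_is_correct, critique_text)."""
    parts: List[str] = []
    for step, is_ok, critique in annotated_steps:
        parts.append(step)
        if is_ok:
            parts.append(f"Wait, let's check that. {critique} Looks right, moving on.")
        else:
            parts.append(f"Hmm, something seems off here. {critique} Let me redo this step.")
    return " ".join(parts)
```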

An example is shown below.

An example comparing a standard chain-of-thought response with a self-talk response generated via the critique-to-self-talk pipeline.

Figure: A conventional reasoning trace versus a step-level self-talk version created by critique synthesis.

When fine-tuned on this self-talk data, models achieved significant gains over standard self-correction baselines. While still behind explicit two-player setups, this demonstrates a promising step toward single models capable of reasoning, reflecting, and correcting themselves more naturally.

Evaluation results showing that the step-level self-talk model outperforms trajectory-level self-correction, though it still lags behind the two-player setting.

Figure: Evaluation showing step-level self-talk improves over traditional self-correction baselines.


Conclusion and Takeaways

This research makes a strong case for critique-based reasoning as a foundation for the next generation of intelligent models. By splitting roles into actor and critic, and letting them interact iteratively, the authors have shown how models can learn to reason more reliably, efficiently, and intelligently.

Key insights:

  1. Automated Data Curation at Scale: The AutoMathCritique framework enables high-quality critique dataset creation without human annotation, paving the way for scalable supervision.
  2. Critics Shine on Hard Problems: Dedicated critique models deliver the biggest gains on the most complex reasoning tasks.
  3. Balanced Self-Improvement: Critique-in-the-loop training solves the tail-narrowing issue, producing reasoning models that improve evenly across problem difficulties.
  4. Toward Scalable Oversight: Separating reasoning and evaluation roles sets the stage for safer, more reliable, self-improving AI systems.

While AutoMathCritique was applied to mathematical reasoning, its principles extend far beyond—to scientific discovery, software engineering, and autonomous decision-making. The future of LLM reasoning may not depend solely on making models bigger but on making them better critics—of themselves, and of each other.