The capabilities of Large Language Models (LLMs) have exploded in recent years, particularly in their ability to perform “Chain-of-Thought” (CoT) reasoning. We’ve seen models solve complex calculus problems and write code by breaking problems down into step-by-step logic. But there is a glaring disparity in where this reasoning works best.

While AI has become a math wizard, its ability to rigorously reason step-by-step in domains like Law, Biology, or Philosophy has lagged behind. Why? Because the mechanisms we use to verify “good reasoning”—specifically Process Reward Models (PRMs)—have been almost exclusively trained on math data.

In this deep dive, we explore a new paper, VersaPRM, which proposes a solution to this generalization problem. The researchers introduce a pipeline for generating synthetic, step-labeled reasoning data across diverse fields and use it to train a “Versatile” PRM that lifts performance not just in algebra but also in legal argumentation and biological classification.

The Problem: The “Math-Only” Trap

To understand the contribution of VersaPRM, we first need to look at how we currently improve LLM reasoning. A popular technique involves generating multiple potential solutions to a problem and then using a separate model—a Reward Model—to score them and pick the best one.

There are two main types of reward models:

  1. Outcome Reward Models (ORMs): These look at the final answer. Did the model get “42”? If yes, good. If no, bad.
  2. Process Reward Models (PRMs): These are far more powerful. They look at every single step of the reasoning. They can tell the model, “Steps 1 and 2 were great, but you made a logic error in Step 3.”

PRMs have been the secret sauce behind recent jumps in mathematical reasoning benchmarks (like GSM8K and MATH). However, because labeling every step of a reasoning chain is expensive and tedious for humans, researchers have mostly relied on existing math datasets.

The result? We have “Math PRMs” that are excellent at spotting a calculation error but are effectively blind when asked to evaluate a legal argument or a chemical process.

Figure 1 illustrating the performance gap between current Math PRMs and the proposed VersaPRM.

As shown in Figure 1 above, existing open-source Math PRMs (top half) perform well in mathematics but fail to generalize. When applied to Law, Philosophy, or Biology, they perform roughly the same as—or sometimes worse than—a simple baseline. They simply don’t know what “good reasoning” looks like outside of an equation.

The bottom half of Figure 1 previews the solution: VersaPRM. By training on a multi-domain dataset, the model achieves consistent performance gains across all categories.

Background: Scoring and Searching

Before we look at how VersaPRM is built, let’s establish the technical foundation of how a PRM actually helps an LLM.

When an LLM generates a Chain-of-Thought (CoT), let’s say a sequence of \(k\) steps \(S = (s_1, s_2, \dots, s_k)\), the PRM assigns a score to each step. A score of 1.0 means the step is perfect; 0.0 means it’s a hallucination or logic error.

But to rank a full solution, we need to aggregate these step scores into a single number. The paper explores three ways to do this:

1. Min-Aggregation: This method takes a “weakest link” approach. The score of the entire solution is determined by its worst step. If you have 10 brilliant steps and one logical fallacy, the whole solution is penalized.

Writing \(\mathrm{PRM}(s_i)\) for the score the PRM assigns to step \(s_i\):

\[
\mathrm{score}_{\min}(S) = \min_{1 \le i \le k} \mathrm{PRM}(s_i)
\]

2. Last-Aggregation: This method relies solely on the score of the final step, assuming that if the reasoning led to a confident final conclusion, the path was likely correct.

\[
\mathrm{score}_{\mathrm{last}}(S) = \mathrm{PRM}(s_k)
\]

3. Average-Aggregation: This simply averages the quality of all steps.

\[
\mathrm{score}_{\mathrm{avg}}(S) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{PRM}(s_i)
\]

The researchers found that Min-Aggregation generally performs best. In reasoning, a single false premise usually invalidates the conclusion, making the “weakest link” logic the most robust metric.
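To make these rules concrete, here is a minimal Python sketch of the three aggregation methods. The function names and the example scores are illustrative, not taken from the paper’s code; the only input assumed is a list of per-step PRM scores.

```python
def min_aggregate(step_scores):
    """'Weakest link': the solution is only as good as its worst step."""
    return min(step_scores)

def last_aggregate(step_scores):
    """Trust only the PRM's judgment of the final step."""
    return step_scores[-1]

def avg_aggregate(step_scores):
    """Average step quality across the whole chain."""
    return sum(step_scores) / len(step_scores)

# Example: a logical fallacy in the middle of an otherwise solid chain.
step_scores = [0.95, 0.90, 0.15, 0.91, 0.93]
print(min_aggregate(step_scores))   # 0.15  -- the error dominates the score
print(last_aggregate(step_scores))  # 0.93  -- the error is missed entirely
print(avg_aggregate(step_scores))   # ~0.77 -- the error is diluted
```

The toy numbers show why the “weakest link” rule is robust: only min-aggregation lets a single flawed step sink the whole solution.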

Inference Strategies

Once we have these scores, we can use “Test-Time Compute” algorithms to improve results. This means spending more computation at inference time, while the model is answering, to raise the quality of the final answer.

  • Majority Voting (MV): Generate \(N\) solutions and pick the answer that appears most often. (No PRM needed).
  • Weighted Majority Voting (WMV): Generate \(N\) solutions, but “weight” their votes based on their PRM score. A high-quality reasoning chain gets more voting power.

If solution \(S_n\) proposes final answer \(a_n\) and has aggregated score \(\mathrm{score}(S_n)\), the chosen answer is

\[
\hat{a} = \arg\max_{a} \sum_{n=1}^{N} \mathbf{1}[a_n = a] \cdot \mathrm{score}(S_n)
\]

  • Best-of-N (BoN): Generate \(N\) solutions and simply pick the one with the highest aggregated PRM score.

\[
\hat{a} = a_{n^\star}, \qquad n^\star = \arg\max_{n \in \{1, \dots, N\}} \mathrm{score}(S_n)
\]
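Below is a small sketch of all three strategies in Python, assuming each sampled solution has already been reduced to a (final_answer, aggregated_prm_score) pair. The representation and the toy data are my assumptions, not the paper’s implementation.

```python
from collections import Counter, defaultdict

def majority_vote(solutions):
    """MV: pick the most frequent final answer; PRM scores are ignored."""
    counts = Counter(answer for answer, _ in solutions)
    return counts.most_common(1)[0][0]

def weighted_majority_vote(solutions):
    """WMV: each answer's 'votes' are the sum of its solutions' PRM scores."""
    weights = defaultdict(float)
    for answer, score in solutions:
        weights[answer] += score
    return max(weights, key=weights.get)

def best_of_n(solutions):
    """BoN: return the answer of the single highest-scoring solution."""
    return max(solutions, key=lambda pair: pair[1])[0]

# Toy example: N = 5 sampled solutions with min-aggregated PRM scores.
solutions = [("B", 0.91), ("A", 0.20), ("A", 0.25), ("B", 0.85), ("A", 0.30)]
print(majority_vote(solutions))           # "A" (3 votes vs. 2)
print(weighted_majority_vote(solutions))  # "B" (0.91 + 0.85 > 0.20 + 0.25 + 0.30)
print(best_of_n(solutions))               # "B" (single best score, 0.91)
```

Note how the PRM flips the outcome: the frequent answer “A” wins the raw vote, but its reasoning chains score poorly, so WMV and BoN both prefer “B”.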

The VersaPRM Method: Synthetic Data Pipeline

The biggest bottleneck to creating a multi-domain PRM is data. Hiring lawyers, biologists, and philosophers to manually label millions of reasoning steps is prohibitively expensive.

The researchers solved this by building a fully automated Synthetic Data Generation Pipeline. They essentially created a “teacher-student” loop using LLMs.

The Pipeline Architecture

The process, illustrated in Figure 2, involves two distinct stages: Generation and Auto-Labeling.

Diagram of the synthetic data generation and auto-labeling pipeline.

1. CoT Generation Stage (The Student): The team sampled questions from the MMLU-Pro dataset, which covers 14 distinct domains including Law, Psychology, and Engineering. They fed these questions into a smaller, efficient model (Llama-3.1-8B-Instruct) and asked it to “Think step by step.”

They generated 16 different solutions for every question. Because the model is smaller, some of these solutions are correct, and some contain subtle reasoning errors—exactly what a PRM needs to learn to differentiate.
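A rough sketch of this generation stage using the Hugging Face transformers pipeline is below. The prompt wording and the sampling hyperparameters (temperature, max tokens) are my assumptions, not the paper’s exact settings.

```python
from transformers import pipeline

# The smaller "student" model, which produces a mix of correct and flawed chains.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

def sample_cots(question, options, n=16):
    """Sample n step-by-step solutions for one MMLU-Pro-style question."""
    prompt = (
        f"Question: {question}\n"
        f"Options: {options}\n"
        "Think step by step, then state your final answer.\n"
    )
    outputs = generator(
        prompt,
        num_return_sequences=n,   # 16 candidate solutions per question
        do_sample=True,           # sample (not greedy) so the chains differ
        temperature=0.8,          # assumed value
        max_new_tokens=512,
        return_full_text=False,
    )
    return [out["generated_text"] for out in outputs]
```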

2. Auto-Labeling Stage (The Teacher): Next, they used a much stronger model (Llama-3.1-70B-Instruct) to act as the judge. This model was given a specific prompt to review the student’s reasoning.

The “Judge” model looks for the first bad step.

  • If a step is logical and correct, it gets a checkmark.
  • The moment a step introduces a factual error or an unjustified logical leap, it is labeled BAD.
  • Crucially, if a step is bad, all subsequent steps are discarded or marked as bad, because reasoning built on a false premise is invalid.
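In code, this first-error rule is simple. Here is a minimal sketch, assuming the judge’s critique has already been parsed into a per-step verdict list (that representation is my assumption), and taking the “mark everything after the error as bad” variant:

```python
def label_steps(judge_verdicts):
    """Convert the judge's per-step verdicts into PRM training labels.

    judge_verdicts: list of "GOOD"/"BAD" strings, one per reasoning step.
    Returns 1/0 labels where everything from the first bad step onward
    is labeled 0, since later steps rest on a false premise.
    """
    labels = []
    seen_bad = False
    for verdict in judge_verdicts:
        if verdict == "BAD":
            seen_bad = True
        labels.append(0 if seen_bad else 1)
    return labels

# Example: the error appears at step 3; steps 4 and 5 are tainted too.
print(label_steps(["GOOD", "GOOD", "BAD", "GOOD", "GOOD"]))  # [1, 1, 0, 0, 0]
```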

This resulted in a massive, labeled dataset called MMLU-Pro-CoT-Train. It contains over 84,000 Chain-of-Thought examples, split roughly 50/50 between fully correct solutions and solutions with errors.

Augmenting the Data

To make the PRM even more robust, the researchers didn’t just rely on the errors the student model made naturally. They also engineered specific types of errors using Counterfactual Augmentation.

Diagram of the counterfactual augmentation pipeline.

As seen in Figure 11 above, they took correct reasoning chains and asked the strong model (Llama-70B) to intentionally insert specific types of errors, such as:

  • Conflicting Steps: Contradicting previous information.
  • Non-sequiturs: Adding irrelevant information.
  • Factual Errors: Getting a date or formula wrong.

This ensures that VersaPRM learns to spot a wide variety of reasoning failures, not just the ones one particular student model happens to make.
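A hedged sketch of what this augmentation step could look like: take a correct chain, ask a strong model to corrupt one step with a named error type, and record where the error was injected so the step labels stay consistent. The prompt wording and the call_llm helper are hypothetical placeholders, not the authors’ prompts.

```python
import random

ERROR_TYPES = [
    "a step that contradicts an earlier step",
    "an irrelevant non-sequitur step",
    "a factual error (wrong date, definition, or formula)",
]

def build_augmentation_prompt(steps, error_type, bad_index):
    """Ask a strong model to rewrite one step so it contains the chosen error."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        "Here is a correct chain of thought:\n"
        f"{numbered}\n\n"
        f"Rewrite Step {bad_index + 1} so that it introduces {error_type}, "
        "keeping the earlier steps unchanged. Return only the rewritten step."
    )

def augment_chain(steps, call_llm):
    """Return (augmented_steps, labels); call_llm is a hypothetical LLM client."""
    bad_index = random.randrange(len(steps))
    error_type = random.choice(ERROR_TYPES)
    corrupted = call_llm(build_augmentation_prompt(steps, error_type, bad_index))
    # Later steps are dropped here; one could also keep them and label them 0.
    augmented = steps[:bad_index] + [corrupted]
    labels = [1] * bad_index + [0]  # the injected step is the first bad one
    return augmented, labels
```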

Experiments and Results

With the dataset built, the researchers trained VersaPRM (fine-tuning from a Llama base) and compared it against the best open-source Math PRMs available, such as Math-Shepherd and Qwen-2.5-Math-PRM.

1. Generalization Beyond Math

The primary question was: Does training on multi-domain data actually help?

The results were decisive. Figure 3 below compares the performance of VersaPRM (red line) against various Math PRMs and baselines.

Comparison graphs of VersaPRM vs Math PRMs across Math, Math-Adjacent, and Non-Math domains.

In the top row (Weighted Majority Voting) and bottom row (Best-of-N), look at the column on the far right: Non-Math-Adjacent Domains (like Law and History).

  • The Math PRMs (blue and orange lines) barely hug the baseline. They offer almost no help.
  • VersaPRM (red line) shows a clear, upward trajectory. As you generate more solutions (moving right on the x-axis), VersaPRM becomes increasingly effective at identifying the correct answer.

Even in the Math column (far left), VersaPRM outperforms the dedicated Math PRMs. This suggests that learning to reason in general domains might actually have positive transfer effects on mathematical reasoning, or at least that the diverse training data prevents overfitting to specific question formats.

2. The Power of Diversity

Is the success just because the model saw MMLU-Pro questions during training? Or is it learning reasoning?

To test this, the researchers ran an ablation study comparing:

  • VersaPRM (Math subset): Trained only on the math questions from their new dataset.
  • VersaPRM (Random subset): Trained on a random mix of all domains (same total size).

Comparison of training on math-only subset vs. random diverse subset.

Figure 4 shows the result. The Random subset (purple/pink line) significantly outperforms the Math-only subset, even on Math tasks (top left). This confirms that domain diversity is integral to training a robust PRM. The model isn’t just memorizing facts; it’s learning the underlying structure of a valid argument, which looks similar whether you are arguing a legal precedent or solving for \(x\).

3. Hold-Out Analysis

To be absolutely sure the model wasn’t just memorizing specific domains, they performed a “Hold-out” experiment. They trained VersaPRM while deliberately excluding specific subjects (like Law or Biology) from the training data, and then tested the model on those excluded subjects.

Hold-out domain ablation results.

Figure 24 (and Figure 5 in the paper) demonstrates that even when the model hasn’t seen training data for a specific domain (like Biology or Law), it still outperforms the baseline and Math PRMs. This is strong evidence of generalized reasoning capability.

4. Advanced Search Algorithms

The paper also investigated using VersaPRM with more complex search algorithms like Beam Search and Monte Carlo Tree Search (MCTS).

In these methods, the AI explores a “tree” of possibilities, using the PRM to decide which branch to follow.
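To make the idea concrete, here is an illustrative beam-search loop guided by step-level PRM scores. The expand_step and prm_score callables are placeholders for a step generator and the reward model, and the loop structure is a generic sketch rather than the paper’s implementation.

```python
def prm_beam_search(question, expand_step, prm_score, beam_width=4,
                    branch_factor=4, max_steps=10):
    """Keep the beam_width partial chains with the best worst-step score.

    expand_step(question, chain) -> a candidate next reasoning step (str)
    prm_score(question, chain)   -> list of per-step scores for a partial chain
    """
    beams = [[]]  # start from an empty chain
    for _ in range(max_steps):
        candidates = []
        for chain in beams:
            for _ in range(branch_factor):
                step = expand_step(question, chain)
                new_chain = chain + [step]
                # Min-aggregation over the partial chain ("weakest link").
                score = min(prm_score(question, new_chain))
                candidates.append((score, new_chain))
        # Prune: keep only the top-scoring partial chains.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = [chain for _, chain in candidates[:beam_width]]
    return beams[0]  # best-scoring chain after the final expansion
```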

Comparison of Beam Search and MCTS using VersaPRM.

Figure 23 highlights that using VersaPRM (red triangles for MCTS, green diamonds for Beam Search) consistently yields higher accuracy than using a math-based PRM (blue/brown lines) across diverse fields like Biology, Philosophy, and Computer Science.

5. Does this help “Smart” Models?

A common critique of PRMs is that they might only help smaller, weaker models. If we use a state-of-the-art reasoning model like DeepSeek-R1, does VersaPRM still add value?

Performance of VersaPRM on DeepSeek-R1 generated solutions.

Figure 8 shows the results on Law and Philosophy tasks using DeepSeek-R1. Even with this powerful model, VersaPRM (pink line) improves performance over simple Majority Voting (black line) and Math PRMs (red line). This shows that external verification remains valuable even as generative models get stronger.

Conclusion

The VersaPRM paper marks a significant step forward in making AI reasoning robust across the board. By moving beyond the “Math-only” paradigm of Process Reward Models, the researchers have shown that reasoning is a transferable skill.

Key takeaways for students and practitioners:

  1. Synthetic Data Works: You don’t always need humans to label data. A strong model (the Teacher) can effectively label the work of a weaker model (the Student) to create high-quality training sets.
  2. Weakest Link Aggregation: When scoring a step-by-step solution, the best metric is often the score of the worst step (Min-Aggregation).
  3. Diversity Matters: Training a verifier on a diverse mix of domains, not just math, makes it better even at math. Exposure to diverse logic structures prevents overfitting and builds a better general reasoner.
  4. Test-Time Compute: The trend in AI is moving toward “thinking longer” rather than just “bigger models.” VersaPRM enables models to think longer and more effectively in every domain, not just STEM.

As AI integrates further into fields like legal analysis, medical diagnosis, and scientific research, tools like VersaPRM will be essential safeguards to ensure that the reasoning behind an answer is just as correct as the answer itself.