When faced with a tough problem, what do you do? You might brainstorm a few different approaches, weigh their pros and cons, and then combine the best parts of each to forge a final, solid solution. It turns out we can teach Large Language Models (LLMs) to do something very similar — and it dramatically improves their ability to solve complex reasoning tasks.
For years, a standard strategy for boosting LLM performance on hard problems like math or coding has been to increase the “test-time compute.” Instead of asking the model for just one answer, we ask it for many. Then we pick the most common answer — a technique called self-consistency or majority voting. It’s simple, often effective, and feels intuitive: if ten different lines of reasoning all point to the answer “42,” then “42” is probably correct.
But what if the correct answer is subtle, non-obvious, and appears in only a fraction of the model’s attempts? What if the majority is wrong? In these cases, majority voting can actually amplify the model’s errors, confidently selecting an incorrect answer. Furthermore, this simple vote misses a huge opportunity: what if several incorrect solutions each contain a piece of the correct reasoning? A simple vote can’t combine these partial insights.
A recent paper from researchers at Meta AI, titled “The Majority is not always right: RL training for solution aggregation,” introduces a clever new approach. Instead of relying on a fixed rule like majority voting, they propose teaching an LLM the skill of aggregation itself. Their method, AggLM, trains a model to act like an expert reviewer: it examines a set of candidate solutions, identifies strengths and weaknesses, reconciles differences, and synthesizes a final, polished answer. The results are impressive: learning to reason over multiple solutions is far more powerful than simply counting votes.
The Problem with Simple Voting
Let’s recap the standard approaches and their limitations.
Rule-Based Voting: The most common method is still majority voting. You generate, say, 32 different solutions (or “chains of thought”) and take the most frequent final answer. This baseline often works well, but it fails when the correct solution is found by only a minority of generation paths. This occurs frequently on problems where the model falls into common misconceptions.
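To make the baseline concrete, here is a minimal sketch of majority voting, assuming the final answers have already been extracted from each sampled chain of thought (the extraction step itself depends on the task format and is omitted):

```python
from collections import Counter

def majority_vote(candidate_answers: list[str]) -> str:
    """Return the most frequent final answer among the sampled solutions."""
    counts = Counter(candidate_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Example failure mode: the correct answer ("42") is in the minority.
samples = ["17"] * 14 + ["42"] * 10 + ["19"] * 8
print(majority_vote(samples))  # -> "17", the confident-but-wrong majority
```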
Model-Based Selection: A more advanced technique uses another model — a “reward model” or “verifier” — to score each candidate solution. Instead of picking the most frequent answer, you pick the highest-scoring one. This can help because the reward model may better judge quality than a simple frequency count. However, these models still fail to spot unconventional but correct answers, and they only select from existing options — they cannot create a new, better one.
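A best-of-n selection loop is just as short; the `score_solution` callable below stands in for whatever reward model or verifier is available and is purely illustrative:

```python
from typing import Callable

def best_of_n(problem: str,
              solutions: list[str],
              score_solution: Callable[[str, str], float]) -> str:
    """Return the candidate that the reward model scores highest.

    Note the built-in limitation: this can only select an existing
    candidate, never synthesize a new, better solution.
    """
    return max(solutions, key=lambda s: score_solution(problem, s))
```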
Both methods leave untapped potential. They can’t salvage correct steps from flawed solutions or merge complementary reasoning from two different attempts. To do that, a model must be able to read, understand, and reason about the solutions.
AggLM: Learning to Aggregate with Reinforcement Learning
Here’s where AggLM comes in. The core idea: treat aggregation as a reasoning task, not a heuristic.
Figure 1: The AggLM pipeline. A standard LLM generates multiple candidate solutions. Then, a specialized Aggregator LLM (which can be the same model or a different one) reviews these candidates to produce a final synthesized answer. This aggregator is trained via reinforcement learning.
The process works in two stages:
Generate Solutions: For a problem \(x\), a standard solution model \(p_{\theta}\) generates \(m\) independent candidate solutions:
\[ y_i \sim p_{\theta}(y \mid x), \quad i \in \{1,\dots,m\} \]

Aggregate Solutions: The candidates, along with the original problem, go to an aggregation model \(p_{\phi}\), which outputs a synthesized answer \(\tilde{y}\):
\[ \tilde{y} \sim p_{\phi}(y \mid x, y_{1:m}) \]
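Putting the two stages together, a rough inference-time sketch looks like this; `solution_model.sample` and `aggregator_model.generate` are hypothetical interfaces standing in for whatever LLM API is actually used:

```python
def aggregate(problem: str, solution_model, aggregator_model, m: int = 8) -> str:
    """Two-stage inference sketch for solution aggregation."""
    # Stage 1: sample m independent candidates y_1..y_m ~ p_theta(y | x).
    candidates = [solution_model.sample(problem) for _ in range(m)]

    # Stage 2: the aggregator conditions on the problem plus all candidates
    # and produces one synthesized answer y_tilde ~ p_phi(y | x, y_1..m).
    prompt = problem + "\n\n" + "\n\n".join(
        f"Candidate solution {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return aggregator_model.generate(prompt)
```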
Many reasoning tasks have verifiable answers, so the authors use Reinforcement Learning from Verifiable Rewards (RLVR). The reward is simple:
- If \(\tilde{y} = y^*\) (ground-truth), reward = 1
- Else, reward = 0
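In code, this verifiable reward is a one-liner; `extract_final_answer` is an assumed helper that parses the final answer out of the aggregator's output:

```python
def verifiable_reward(aggregated_output: str, ground_truth: str,
                      extract_final_answer) -> float:
    """RLVR-style binary reward: 1.0 if the synthesized answer matches
    the ground truth, 0.0 otherwise."""
    return 1.0 if extract_final_answer(aggregated_output) == ground_truth else 0.0
```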
Training on thousands of problems teaches the model:
- To recognize and trust a correct minority solution
- To spot and fix reasoning errors
- To combine correct bits from multiple flawed candidates into a complete, correct answer
The Secret Sauce: Balancing Easy and Hard Examples
The paper’s critical insight is how to construct training data.
If the aggregator is trained only on cases where the candidate solutions are already correct (easy examples), it learns little more than copying. If it is trained only on sets where the candidates are wrong (hard examples), the reward signal is sparse and hard to learn from.
So the authors define:
- Easy: Majority answer is correct
- Hard: Majority answer is incorrect
Hard examples drive the challenging skills (identifying minority-but-correct answers or synthesizing new ones), but easy examples provide essential reinforcement.
They found the sweet spot: all hard examples plus 50% of easy examples. This balance keeps training realistic yet challenging.
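A sketch of this curation step, assuming each training example carries the extracted final answers of its \(m\) candidates plus the ground truth (the field names here are illustrative):

```python
import random
from collections import Counter

def build_training_mixture(examples: list[dict],
                           easy_keep_ratio: float = 0.5,
                           seed: int = 0) -> list[dict]:
    """Keep every hard candidate set and a fraction of the easy ones."""
    rng = random.Random(seed)
    easy, hard = [], []
    for ex in examples:
        majority_answer, _ = Counter(ex["candidate_answers"]).most_common(1)[0]
        if majority_answer == ex["ground_truth"]:
            easy.append(ex)   # majority voting already solves this set
        else:
            hard.append(ex)   # majority is wrong: aggregation has to do better
    kept_easy = [ex for ex in easy if rng.random() < easy_keep_ratio]
    return hard + kept_easy
```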
Figure 2: The prompt that guides AggLM’s reasoning, explicitly instructing the model to review, reconcile, and synthesize.
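The paper’s exact wording is in Figure 2; a template in the same spirit (paraphrased here, not the authors’ actual prompt) might look like:

```python
# Illustrative paraphrase of an aggregation prompt; the actual prompt
# used to train AggLM differs in wording.
AGGREGATION_PROMPT = """You are given a problem and {m} candidate solutions.
Carefully review each candidate, check its reasoning, and reconcile any
disagreements between them. Then write a single, complete solution and
state the final answer.

Problem:
{problem}

{numbered_candidates}
"""
```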
Experimental Results: Does It Work?
The authors tested AGGLM on four competitive math datasets: AIME24, AIME25, HMMT24, and HMMT25. They trained a 1.7B parameter model (AggLM-1.7B) and compared it against strong baselines.
Head-to-Head Performance
First, they evaluated on “in-distribution” data: candidate solutions from the same Qwen3-1.7B model used in training.
Table 1: AggLM-1.7B achieves the highest accuracy across all datasets, outperforming both majority voting and reward models.
AggLM-1.7B wins decisively on all benchmarks. It beats majority voting and even reward-model selection with much larger models (e.g., AceMath-72B). On AIME25:
- Base model: 35.68%
- Majority voting: 45.89%
- AggLM-1.7B: 50.00%
This shows synthesizing beats just selecting.
Generalizing to New Challenges
The team tested AggLM in two harder setups:
- Stronger Solutions: Feeding it solutions from a stronger Qwen3-8B model.
Table 2: Aggregating solutions from a stronger model, AggLM-1.7B still outperforms all baselines — robust and transferable skills.
- Different Styles: Solutions without step-by-step reasoning (“non-thinking” mode).
Table 3: Even with only final answers and no reasoning steps, AggLM-1.7B adapts and leads.
In both scenarios, AggLM-1.7B topped the leaderboard, showing its learned reasoning skill generalizes across input quality and format.
Why AggLM Works: Ablations & Insights
Training Mix Matters
Table 4: Balanced training mixtures (5–50% easy) consistently win. Hard-only or all-easy mixes hurt performance.
A moderate proportion of easy sets provides stability and frequent rewards, enabling the model to master difficult reasoning without stalling.
Biggest Gains in the Hardest Cases
Figure 4: AggLM shines when there’s no strong consensus and the correct answer is rare, precisely where majority voting often fails.
When candidate solutions are diverse and uncertain, AggLM’s reasoning advantage matters most. If the majority is large, both methods succeed, and AggLM wisely doesn’t override an obviously correct majority.
Superior Scaling & Efficiency
Figure 3: AggLM scales better with more candidates. Aggregating 8 candidate solutions often beats majority voting over 16, while using fewer total tokens.
Generating candidate solutions is expensive in compute and tokens. AggLM’s ability to outperform majority voting with fewer candidates (smaller \(m\)) means less compute for more accuracy.
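To see why, a back-of-the-envelope token count helps; every number below is made up purely for illustration:

```python
def total_tokens(num_candidates: int, avg_solution_tokens: int,
                 aggregator_output_tokens: int = 0) -> int:
    """Rough cost model: tokens to generate candidates plus the aggregator's output."""
    return num_candidates * avg_solution_tokens + aggregator_output_tokens

# Hypothetical budgets: majority voting over 16 samples vs. aggregating 8.
majority_16 = total_tokens(16, avg_solution_tokens=4000)   # 64,000 tokens
agglm_8 = total_tokens(8, avg_solution_tokens=4000,
                       aggregator_output_tokens=4000)       # 36,000 tokens
print(majority_16, agglm_8)
```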
Conclusion: A Smarter Test-Time Compute Strategy
The AggLM approach changes how we think about improving LLM reasoning. Rather than brute-forcing more samples and using naive voting, we can train a model to aggregate wisely.
AggLM is:
- More Accurate: Consistently beats strong baselines.
- More Robust: Works across different models and reasoning styles.
- More Efficient: Achieves higher accuracy with fewer solutions — saving compute.
The key is treating aggregation as a first-class reasoning skill, trained with reinforcement learning on a balanced diet of problems. The central takeaway of the paper: the majority is not always right — but a model trained to reason over multiple answers can find the truth anyway.