Introduction: The Power of Thinking Before You Speak
We’ve all heard the advice, “think before you speak.” It’s a core aspect of human intelligence—the ability to pause, reason through the consequences, and formulate a thoughtful response. Nobel laureate Daniel Kahneman described this reflective, deliberate process as System 2 thinking: the kind of mental effort that distinguishes a knee-jerk reaction from a reasoned argument.
For much of their existence, Large Language Models (LLMs) have operated more like System 1 thinkers: remarkably fast, impressively fluent, but too often shallow in reasoning. Recent research has sought to change that by teaching models to “think” before answering, using a strategy called Reinforcement Learning with Verifiable Rewards (RLVR). In RLVR, a model generates a long chain of thought (CoT) before producing its answer, and earns a reward when the final answer can be automatically verified as correct. This works extremely well in math and code—where correctness is objective. If the math checks out or the code passes all the unit tests, the model gets rewarded.
But human conversation is messier. How would you “verify” the correctness of a meal plan, an essay outline, or an imagined philosophical treatise from The Shawshank Redemption? Skills learned from solving math problems don’t transfer cleanly to these subjective, creative tasks. In practice, RLVR-trained models often lag behind standard chatbots when judged on general conversation.
That’s the problem addressed in a new paper from Princeton University: “Language Models that Think, Chat Better”. The researchers introduce a surprisingly simple yet potent approach called Reinforcement Learning with Model-rewarded Thinking (RLMT). By combining RLVR-style chain-of-thought reasoning with RLHF-style preference rewards, RLMT boosts conversational ability dramatically. Their best 8-billion-parameter model not only beats models ten times larger but also rivals industry giants like GPT-4o and Claude-3.7-Sonnet in chat and creative writing.
Let’s explore how they did it.
Background: Two Schools of LLM Alignment
RLMT builds on two major paradigms: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR).
RLHF: The Art of Conversation
RLHF is the cornerstone of most top-tier chatbots. It aligns an LLM’s outputs with human values and preferences:
- Humans judge two or more responses to the same prompt, ranking them from best to worst.
- This preference data trains a reward model—a separate model that predicts which response a human would prefer.
- The original LLM is then fine-tuned with reinforcement learning to maximize the reward model’s scores.
Mathematically, RLHF maximizes:
\[ \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ \mathbb{E}_{y \sim \pi_{\theta}(\cdot|x)} r(x, y) \right] \]

The reward signal is qualitative and subjective—good for open-ended conversational tasks. However, RLHF treats the output as a single block and doesn’t explicitly encourage structured reasoning before answering.
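To make the reward-model step concrete, here is a minimal sketch of the standard pairwise (Bradley–Terry) loss typically used to train RLHF reward models. It illustrates the general technique, not the paper’s code; score_fn is a hypothetical scalar-output reward network.

```python
import torch.nn.functional as F

def preference_loss(score_fn, prompt, chosen, rejected):
    """Pairwise Bradley-Terry loss commonly used to train RLHF reward models.

    score_fn(prompt, response) -> scalar tensor r(x, y); a hypothetical
    stand-in for any scalar-output reward network.
    """
    r_chosen = score_fn(prompt, chosen)      # r(x, y_chosen)
    r_rejected = score_fn(prompt, rejected)  # r(x, y_rejected)
    # Maximize the log-probability that the human-preferred response outranks the other.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```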
RLVR: The Science of Correctness
RLVR specializes in domains with clear, objective correctness: math, code, logic puzzles. The model produces a chain-of-thought \(z\) followed by a final answer \(y\). A rule-based verifier checks the answer against ground truth:
\[ \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ \mathbb{E}_{(y,z) \sim \pi_{\theta}(\cdot|x)} \mathbb{1}\{y = y^*\} \right] \]

It’s highly effective in formal domains—models like DeepSeek-R1 excel here—but these skills don’t generalize well to everyday chat.
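For intuition, a rule-based verifier can be as simple as extracting the final answer from the generation and comparing it to the reference. The helper below is a hypothetical sketch that assumes answers are wrapped in \boxed{...} (a common convention in math RLVR setups); it is not the verifier of any particular system.

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward 1{y = y*}: extract the final answer and compare it to the reference."""
    # Assumes the answer is wrapped as \boxed{...}, a common convention in math RLVR setups.
    marker = r"\boxed{"
    start = model_output.rfind(marker)
    if start == -1:
        return 0.0  # no parseable answer, no reward
    rest = model_output[start + len(marker):]
    end = rest.find("}")
    if end == -1:
        return 0.0
    answer = rest[:end].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```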
Figure 3 from the paper: models trained to “think” only in verifiable domains like math score far below the authors’ RLMT model on the WildBench chat benchmark.
The Core Method: RL with Model-Rewarded Thinking (RLMT)
RLMT fuses RLHF’s flexible supervision with RLVR’s explicit reasoning. The idea: make the model think out loud for any prompt, and use a reward model to judge the quality of the final response.
Figure 1: RLMT extends the chain-of-thought process to general tasks, using a preference-trained reward model rather than rule-based verification.
Formally:
\[ \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ \mathbb{E}_{(y,z) \sim \pi_{\theta}(\cdot|x)} r(x, y) \right] \]

The model generates both \(z\) (reasoning trace) and \(y\) (final response), but the reward \(r(x, y)\) comes from a preference-trained reward model rather than a rule-based verifier.
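Put differently, the rollout keeps RLVR’s think-then-answer structure, but only the final response is scored, and by a learned reward model. Below is a minimal sketch of that scoring step; reward_model.score(prompt, response) is a hypothetical interface, not an API from the paper.

```python
def rlmt_reward(prompt: str, generation: str, reward_model) -> float:
    """Split a generation into the thinking trace z and the response y,
    then score only y with a preference-trained reward model."""
    # Assumes the policy wraps its reasoning in <think>...</think> tags before answering.
    if "</think>" in generation:
        _trace, response = generation.split("</think>", 1)
    else:
        response = generation  # no trace emitted; score the whole output
    # The reward model judges only the user-visible response, not the reasoning trace.
    return reward_model.score(prompt, response.strip())
```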
Here’s an example of the kind of detailed reasoning RLMT produces:
Figure 2: reasoning trace for a query about Andy Dufresne’s hypothetical philosophy. The model reviews the story, synthesizes core themes, and structures the response.
The RLMT Recipe
The authors identify key ingredients:
- Training Algorithm: They tested Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO), and found GRPO most effective. GRPO compares groups of responses to the same prompt, nudging the model toward the better-than-average ones (a schematic sketch follows this list).
- Reward Model: Supervisor quality matters. They used a strong public reward model (SkyworkV2), and ablation studies confirmed that weaker reward models sharply reduce RLMT’s gains.
- Prompt Mixture: Training data shapes capabilities. RLMT’s prompts came from WildChat-IF, a set of 7.5k diverse, challenging real-world queries—a better match for general usage than math-heavy or overly simple datasets.
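To give a sense of how GRPO’s group-relative comparison works, here is a schematic sketch of its advantage computation: every sampled response in a group is scored by the reward model, and each advantage is that response’s reward normalized against the rest of its group. This is an illustrative simplification, not the authors’ training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages from per-response rewards.

    rewards: shape (num_prompts, group_size), one reward-model score per
    sampled response for each prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Better-than-average responses get positive advantages (and are reinforced);
    # worse-than-average ones get negative advantages (and are discouraged).
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, scored by the reward model.
rewards = torch.tensor([[0.1, 0.4, 0.3, 0.9],
                        [0.7, 0.2, 0.5, 0.6]])
print(grpo_advantages(rewards))
```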
Teaching a Model to Think
Base LLMs don’t naturally output <think>...</think> reasoning blocks. The team explored two approaches:
- Warm-Start SFT: Supervised fine-tuning on data distilled from Gemini 2.5 Flash. The teacher’s outputs included reasoning traces, teaching smaller models the desired format before RL.
- “Zero” Training: Skipping SFT entirely. A carefully crafted prompt elicited thought processes directly, allowing RLMT to be applied to base models (a hypothetical prompt template is sketched below).
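The zero-training setup hinges on a prompt that reliably elicits the think-then-answer format from a base model. The template below is a hypothetical illustration of that idea, not the exact prompt used in the paper.

```python
# Hypothetical zero-training prompt template: it asks a base model to reason
# inside <think> tags before answering, so RLMT can be applied without SFT.
ZERO_THINK_TEMPLATE = """You are a helpful assistant. First, reason about the
user's request inside <think> and </think> tags: consider the constraints,
possible structures, and trade-offs. Then write your final answer after the
closing </think> tag.

User request: {prompt}
"""

def build_zero_prompt(user_prompt: str) -> str:
    """Format a raw user query with the think-then-answer template."""
    return ZERO_THINK_TEMPLATE.format(prompt=user_prompt)
```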
Experiments and Results: Thinking Pays Off
Over 40 training runs across Llama-3.1-8B and Qwen-2.5-7B, in both warm-start and zero-training configurations, yielded consistent gains.
Thinking Models Excel at Chat
Table 1: GRPO results show RLMT models outperform non-thinking RLHF counterparts, especially on chat benchmarks.
Example: Warm-started Llama-3.1-8B-Instruct-RLMT scores 44.0 AvgChat vs. 35.8 for its RLHF baseline—a massive +8.2 points.
An 8B Model Punching Above Its Weight
Table 2: the best 8B RLMT model beats Llama-3.1-70B and Qwen-2.5-72B, outperforms GPT-4o on WildBench and CWv3, and is competitive with Claude-3.7-Sonnet.
A small, open-source 8B model, trained with RLMT on just 7.5k prompts, matches or beats state-of-the-art 70B+ models on key conversational benchmarks.
“Zero” Training Challenges the Status Quo
In the zero-training setup, Qwen-2.5-7B-RLMT-Zero scores 29.0 AvgChat, comfortably surpassing the fully instruction-tuned Qwen-2.5-7B-Instruct (23.1)—despite the latter’s multi-stage, millions-of-examples pipeline.
What Makes a Good Thinker? Ablations
Table 4: WildChat-IF prompts outperform others. Strong reward models like SkyworkV2 boost RLMT’s gains; weak ones harm them.
Findings:
- Prompts Matter: WildChat-IF > UltraFeedback or Tülu3-Random.
- Reward Model’s Strength Is Critical: SkyworkV2 > ArmoRM.
Analysis: Inside a Thinking Model
The team examined planning styles before vs. after RLMT. The shift was striking: from rigid, linear checklists to flexible, iterative planning.
Figure 4: After RLMT, traits like grouping ideas, integrating constraints, and weighing trade-offs increased; reliance on checklists decreased.
Post-RLMT, the model:
- Groups ideas into themes.
- Integrates constraints into plans.
- Weighs perspectives.
- Iteratively refines plans.
Figure 5: In zero-training, thought and response lengths increased steadily over RLMT training—indicating deeper reasoning.
Conclusion and Implications
The “Language Models that Think, Chat Better” paper makes a strong case for merging RLVR’s explicit reasoning with RLHF’s flexible supervision. Key takeaways:
- Thinking Helps Everywhere: Chain-of-thought isn’t just for math—it improves open-ended dialogue and writing.
- Small Models Can Be Mighty: An 8B model with the right recipe can challenge or beat models an order of magnitude larger.
- Simpler Training Pipelines: Zero-training RLMT can yield competitive models without extensive SFT and multi-stage post-training.
By teaching models how to think, not just what to say, RLMT points toward a new generation of LLMs—more capable, deliberate, and structured in their reasoning. Future work could refine thought formats and reward models to evaluate reasoning quality directly. In the meantime, RLMT demonstrates that even modest models can achieve heavyweight performance when they’re trained to think before they speak.