![Cover image](https://deep-paper.org/en/papers/2025-10/2509.20357/images/cover.png)
# Beyond Math Puzzles: How Teaching LLMs to 'Think' Unlocks Superior Chat Performance
## Introduction: The Power of Thinking Before You Speak

We’ve all heard the advice, “think before you speak.” It’s a core aspect of human intelligence—the ability to pause, reason through the consequences, and formulate a thoughtful response. Nobel laureate Daniel Kahneman described this reflective, deliberate process as System 2 thinking: the kind of mental effort that distinguishes a knee-jerk reaction from a reasoned argument.

For much of their existence, Large Language Models (LLMs) have operated more like System 1 thinkers: remarkably fast, impressively fluent, but too often shallow in reasoning. Recent research has sought to change that by teaching models to “think” before answering, using a strategy called Reinforcement Learning with Verifiable Rewards (RLVR).

In RLVR, a model generates a long chain of thought (CoT) before producing its answer, and earns a reward when the final answer can be automatically verified as correct. This works extremely well in math and code—where correctness is objective. If the math checks out or the code passes all the unit tests, the model gets rewarded. ...
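To make the RLVR setup concrete, here is a minimal sketch of a verifiable reward function. It assumes a hypothetical convention in which the model ends its chain of thought with a line of the form `Answer: <value>`; the function name and answer format are illustrative, not from the paper.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the
    ground truth, else 0.0. Assumes (hypothetically) that the model
    marks its final answer with a line 'Answer: <value>' after its
    chain of thought."""
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

# A long chain of thought followed by the final, checkable answer.
output = "First, 12 * 7 = 84. Then 84 + 16 = 100.\nAnswer: 100"
print(verifiable_reward(output, "100"))  # -> 1.0
```

Because the reward depends only on the verifiable final answer, the intermediate chain of thought is free-form: the model is rewarded for reasoning that leads somewhere correct, not for any particular reasoning style.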