Large Language Models (LLMs) have become remarkably adept at complex reasoning through a technique called Chain-of-Thought (CoT)—a step-by-step process where the model generates discrete reasoning tokens much like how a person thinks through a problem. However, this approach is inherently rigid: at each step, the model must commit to a single token from its vocabulary.
What if that very rigidity limits its ability to explore abstract or nuanced ideas? What if reasoning could flow more fluidly across possibilities, rather than snapping to one discrete choice at a time?
That question inspires the soft-thinking paradigm—a mode of reasoning where an LLM uses continuous “soft tokens,” each representing a weighted average of multiple possible token embeddings. This approach lets the model hold multiple hypotheses simultaneously, opening doors to richer, more flexible reasoning.
But soft-thinking comes with a catch. Reinforcement Learning (RL), which has propelled discrete token reasoning to new heights, performs poorly when applied to soft tokens. Algorithms like Group Relative Policy Optimization (GRPO), which reward successful reasoning trajectories, cannot easily handle the continuous stochastic space of soft-thinking.
Enter SofT-GRPO, a breakthrough algorithm that successfully extends GRPO into the soft-thinking world. By injecting controlled randomness through Gumbel noise and leveraging the Gumbel reparameterization trick, SofT-GRPO achieves what earlier approaches could not—it enables LLMs trained on soft-thinking to outperform their discrete-token counterparts.
In this deep dive, we’ll unpack how SofT-GRPO reshapes reinforcement learning for LLMs, covering:
- How discrete-token GRPO works and its limitations
- Why previous RL attempts for soft-thinking failed
- How SofT-GRPO uses Gumbel noise and reparameterization to overcome these barriers
- The experimental results showing SofT-GRPO’s superior accuracy and robustness
Let’s start with how reasoning traditionally works in discrete space.
From Discrete Steps to Soft Thoughts
The World of Discrete Tokens: CoT and GRPO
When given a problem—say a math question—an LLM does not simply guess the answer. It first produces a Chain-of-Thought, \( \mathbf{R} = (r_1, r_2, \ldots, r_T) \), a token-by-token reasoning sequence leading to the final answer \( \mathbf{A} \). Each token comes from its vocabulary, and the probability of generating the entire sequence is the product of probabilities of generating each token conditioned on the previous ones.
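Written out (a standard autoregressive factorization, with \( Q \) denoting the query; the paper's exact notation may differ slightly):
\[
p\!\left(\mathbf{R}, \mathbf{A} \mid Q\right) \;=\; \left[\prod_{t=1}^{T} p\!\left(r_t \mid Q, r_{<t}\right)\right] \cdot p\!\left(\mathbf{A} \mid Q, \mathbf{R}\right).
\]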

“Illustration of how discrete Chain-of-Thought tokens are generated step-by-step from model probabilities.”
To enhance this reasoning process, Reinforcement Learning with Verifiable Rewards (RLVR) is used. The objective is simple: let the model try multiple reasoning paths and reward correct ones. A leading algorithm here is Group Relative Policy Optimization (GRPO).
GRPO generates a group of \( G \) reasoning trajectories for each query through multinomial sampling, exploring different reasoning paths that lead to varied rewards.

“Group rollout in GRPO: multiple discrete CoT trajectories are sampled to encourage diverse exploration.”
Each trajectory receives a reward—such as 1 if the answer is correct, 0 otherwise. GRPO computes an advantage that measures how a trajectory’s performance compares to its peers. The model increases the probability of higher-advantage trajectories and decreases that of lower ones through a gradient-based update.

“The GRPO objective balances exploration and exploitation via trajectory advantages and clipping.”
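To make the group-relative advantage and the clipped update concrete, here is a minimal PyTorch-style sketch. It works at the trajectory level and omits the KL penalty and token-level bookkeeping; the function names and toy numbers are ours, not the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: each trajectory's reward normalized by the
    mean and std of its group of G rollouts for the same query."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO/GRPO-style clipped objective over per-trajectory log-probs."""
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

# Toy group of G = 4 rollouts with binary verifiable rewards.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)                                # roughly [+0.87, -0.87, +0.87, -0.87]
logp_old = torch.tensor([-12.3, -15.1, -11.8, -14.0])
logp_new = logp_old + 0.05                                    # pretend the policy drifted slightly
print(clipped_surrogate(logp_new, logp_old, adv))
```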
This policy optimization delivers powerful performance boosts—but it depends on explicit token-level probabilities. Once we replace discrete tokens with continuous soft ones, this framework breaks down.
The Soft-Thinking Paradigm
Soft-thinking abandons the idea of picking a single word at each step. Instead, it constructs a “soft token” \( s_t \) as a weighted sum of token embeddings, where the weights correspond to their predicted probabilities.

“A soft token \( s_t \) is computed as the expectation of token embeddings under their probability distribution.”
Formally:
\[ s_t = \sum_i p_i e_i, \]
where \( e_i \in \mathbb{R}^d \) is the embedding of token \( i \) and \( p_i \in [0,1] \) its probability. This continuous representation lets the model express vague or mixed concepts invisible to simple tokenization—a way to “think in between words.”
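As a minimal illustration, a soft token can be formed in a few lines of PyTorch (the sizes and tensors below are illustrative stand-ins, not the paper's code):

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 32_000, 4096                 # illustrative sizes
embedding = torch.randn(vocab_size, dim)       # stand-in for the LM's input embedding table
logits = torch.randn(vocab_size)               # next-token logits at reasoning step t

probs = F.softmax(logits, dim=-1)              # p_i
soft_token = probs @ embedding                 # s_t = sum_i p_i * e_i, shape (dim,)
```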
To add meaningful variability, researchers turn to the Gumbel-Softmax technique, which inserts random noise into the log-probabilities before forming the weighted sum. This adds stochasticity while maintaining a valid token mixture.

“Gumbel-Softmax introduces controlled randomness, enabling the model to explore multiple soft reasoning paths.”
Yet even with Gumbel-Softmax, applying RL directly to these soft tokens remains extremely difficult. The challenge lies in attributing continuous rewards back to the token probabilities correctly.
The Bottleneck: Why RL Struggled with Soft-Thinking
A prior approach by Butt et al. (2025) introduced Gaussian noise directly to the embedding vector \( s_t \), creating a perturbed token \( \hat{s}_t \).

“Soft token perturbation by adding Gaussian noise for exploration.”
The policy update then computed gradients based on the distance between \( s_t \) and \( \hat{s}_t \).

“Gradient update from Gaussian-based noise perturbation.”
But two fundamental issues arise:
The mismatch problem: the mapping from probabilities \( p_i \) to embeddings \( s_t \) is not one-to-one. Multiple distributions can yield identical \( s_t \), making it unclear which probabilities caused a successful outcome. Noise at the embedding level disrupts the relationship between reward and probability.
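A tiny worked example (with hypothetical embeddings, not from the paper) makes the non-uniqueness concrete: if three token embeddings happen to satisfy \( e_3 = \tfrac{1}{2}(e_1 + e_2) \), then
\[
\tfrac{1}{2} e_1 + \tfrac{1}{2} e_2 + 0 \cdot e_3 \;=\; 0 \cdot e_1 + 0 \cdot e_2 + 1 \cdot e_3 ,
\]
so the distributions \( (\tfrac12, \tfrac12, 0) \) and \( (0, 0, 1) \) produce exactly the same soft token, and a reward observed for that \( s_t \) cannot be attributed to a unique probability vector.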
The “out-of-space” problem: valid soft tokens lie within a specific convex region (the simplex hull of token embeddings). Adding random Gaussian noise usually pushes \( \hat{s}_t \) outside this region, yielding invalid or incomprehensible model inputs.
These limitations cripple learning—updates become misaligned with the model’s probability space, producing unstable performance. The solution must preserve the probabilistic structure while introducing controlled exploration.
Inside SofT-GRPO: Reinforcing Soft-Thinking Correctly
SofT-GRPO presents an elegant fix by integrating Gumbel randomness directly into the policy optimization process. It operates through two complementary stages:
- Group rollout with Gumbel noise (exploration)
- Policy update via Gumbel reparameterization (learning)

“Overview of SofT-GRPO: Gumbel-based rollout followed by Gumbel reparameterized policy update.”
Step 1: Group Rollout with Gumbel Noise
Like GRPO, SofT-GRPO rolls out \( G \) reasoning trajectories per query. At each reasoning step, instead of sampling discrete tokens, it uses Gumbel-Softmax to form stochastic soft tokens.
First, the model computes token probabilities \( p_i \), then injects random noise \( \epsilon_i \sim \text{Gumbel}(0,1) \) into the log-probabilities:
\[ g'_i = \log p_i + \epsilon_i. \]
The noisy logits are normalized to produce the final soft mixture:
\[ y'_i = \frac{\exp(g'_i / \tau_g)}{\sum_j \exp(g'_j / \tau_g)}, \quad s_t = \sum_i y'_i e_i. \]
“Rollout equations for SofT-GRPO showing how Gumbel noise modifies log-probabilities.”
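A hedged PyTorch sketch of this rollout step (variable names are ours; details such as top-\( p \) filtering are omitted, and the paper's implementation may differ):

```python
import torch
import torch.nn.functional as F

def gumbel_soft_token(logits: torch.Tensor,
                      embedding: torch.Tensor,
                      tau_g: float = 0.1):
    """Sample a stochastic soft token via Gumbel-Softmax.

    logits:    (vocab,) next-token logits at step t
    embedding: (vocab, dim) token embedding table
    Returns the soft token s_t and the noisy log-probs g'
    (kept for the later policy update).
    """
    log_p = F.log_softmax(logits, dim=-1)                  # log p_i
    # Gumbel(0,1) noise via inverse CDF: -log(-log(U)), U ~ Uniform(0,1)
    u = torch.rand_like(log_p)
    eps = -torch.log((-torch.log(u.clamp_min(1e-20))).clamp_min(1e-20))
    g = log_p + eps                                        # g'_i
    y = F.softmax(g / tau_g, dim=-1)                       # tempered mixture weights y'_i
    s_t = y @ embedding                                    # s_t = sum_i y'_i e_i
    return s_t, g
```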
This technique relies on the Gumbel-Max Trick—a statistical result guaranteeing that adding Gumbel noise to the logits preserves sampling from the original categorical distribution:

“The Gumbel-Max Trick validates that Gumbel noise induces sampling consistent with the underlying probabilities.”
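The underlying identity, in its standard form, is:
\[
\epsilon_i \overset{\text{i.i.d.}}{\sim} \text{Gumbel}(0,1) \;\;\Longrightarrow\;\; \Pr\!\left[\operatorname*{arg\,max}_i \left(\log p_i + \epsilon_i\right) = k\right] = p_k ,
\]
so the noisy log-probabilities \( g'_i \) remain consistent with the model's own token probabilities, and the tempered softmax with \( \tau_g \) is their smooth relaxation.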
In short, SofT-GRPO achieves stochastic sampling without invalid embeddings. The model can now explore many reasoning paths naturally within its trained space.
Step 2: Policy Update with Gumbel Reparameterization
Once the rollout trajectories receive rewards, SofT-GRPO performs policy optimization to reinforce successful reasoning.
Here lies its most innovative component: the Gumbel reparameterization trick. Instead of working in embedding space, it reformulates gradients in terms of sampled Gumbel noise, whose distribution is analytically differentiable.
Under the old policy \( \pi_{\theta_{\text{old}}} \):
\[ \log p_{\theta_{\text{old}}}\!\left(s_t \mid Q, S_{<t}\right) = \sum_i \left( -\epsilon_i - e^{-\epsilon_i} \right), \qquad \epsilon_i = g'_i - \log p_i^{\text{old}}, \]
where \( p_i^{\text{old}} \) are the token probabilities produced by \( \pi_{\theta_{\text{old}}} \) at this step and \( g'_i \) are the noisy log-probabilities recorded during rollout.
“Log-probability under the old policy expressed via Gumbel noise.”
Under the new policy \( \pi_\theta \):
\[ \log p_{\theta}\!\left(s_t \mid Q, S_{<t}\right) = \sum_i \left( -\hat{\epsilon}_i - e^{-\hat{\epsilon}_i} \right), \qquad \hat{\epsilon}_i = g'_i - \log p_i^{\theta}, \]
where the recorded \( g'_i \) are held fixed and only the current policy's probabilities \( p_i^{\theta} \) change, so the expression is differentiable with respect to \( \theta \).
“Policy update: reparameterization allows differentiable probability estimation for soft tokens.”
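A sketch of how these quantities could be evaluated, using the standard Gumbel(0,1) log-density \( \log f(\epsilon) = -\epsilon - e^{-\epsilon} \); this is our own minimal version, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def soft_token_log_prob(logits: torch.Tensor, g_noisy: torch.Tensor) -> torch.Tensor:
    """Log-density of the Gumbel noise implied by the recorded noisy
    log-probs g' under a policy with the given logits:
    sum_i ( -eps_i - exp(-eps_i) ),  eps_i = g'_i - log p_i."""
    log_p = F.log_softmax(logits, dim=-1)
    eps = g_noisy - log_p
    return (-eps - torch.exp(-eps)).sum(-1)

# In the update, the per-soft-token importance ratio is then
#   exp(soft_token_log_prob(logits_new, g) - soft_token_log_prob(logits_old, g)),
# which plugs into the clipped GRPO-style surrogate.
```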
These log-probabilities are then used to construct an updated GRPO-style loss function that includes both soft-thinking and discrete components.

“SofT-GRPO loss combines discrete GRPO objectives with soft-thinking gradients derived via Gumbel reparameterization.”
The advantage of this approach is precision: improvements in reward directly update the underlying token probabilities rather than ambiguous embedding vectors. SofT-GRPO thereby resolves the core attribution problem that limited previous RL methods for soft-thinking.
Experiments: Putting SofT-GRPO to the Test
The paper benchmarks SofT-GRPO across three base LLMs—DeepSeek-R1-Distill-Qwen-1.5B, LLaMA-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-7B—on five mathematical reasoning datasets: AIME2024, AIME2025, AMC23, MATH-500, and GSM8K.
Main Findings
SofT-GRPO consistently outperforms discrete-token GRPO on all models and datasets.

“Across metrics Pass@1, Pass@16, and Pass@32, SofT-GRPO shows clear gains over discrete GRPO.”
- Pass@1: +0.13% average improvement (single-shot accuracy)
- Pass@16: +1.80% average gain
- Pass@32: +2.19% average gain
These results indicate that SofT-GRPO doesn’t just improve first-try accuracy—it greatly enhances diversity and robustness. The model explores more reasoning paths successfully, meaning that with multiple attempts, correct answers are much more likely.
Comparison to Prior Work and Generalization
SofT-GRPO easily surpasses the Gaussian-based RL method proposed by Butt et al. (2025). Because that approach perturbs embeddings directly, it breaks the link between rewards and token probabilities and yields only minimal gains.

“Comparison with prior soft-token RL method: SofT-GRPO leads by substantial margins.”
SofT-GRPO also generalizes well beyond math tasks. When evaluated on scientific and code reasoning benchmarks—GPQA Diamond, HumanEval, and MBPP—the improvements continue.

“Generalization to out-of-domain tasks highlights SofT-GRPO’s broader reasoning advantage.”
This demonstrates that SofT-GRPO fine-tuning develops general reasoning improvements, not narrow task-specific skills.
Harnessing High Pass Rates: Majority Voting
The higher multi-pass accuracies make majority voting practical: when a model generates many candidate answers, the most common output is chosen as the final answer. SofT-GRPO fine-tuned models perform best under this regime.
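Majority voting itself is only a few lines; a minimal sketch (task-specific answer normalization is omitted):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent final answer among sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "42", "7"]))  # -> "42"
```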

“Integrating SofT-GRPO with majority voting further amplifies reliability in multi-attempt reasoning.”
Ablation Studies: Why Gumbel Matters
To validate its design, the authors compare SofT-GRPO to variants that use Dirichlet or Gaussian noise in place of Gumbel. None perform as well.

“Ablation results: Gumbel noise yields stronger accuracy and stability than alternatives.”
Training reward curves confirm the difference: Gumbel-based training achieves smoother and faster convergence.

“Training and validation curves reveal that Gumbel noise ensures consistent learning progress.”
Further analyses of hyperparameters show that overly high top-\( p \) or temperature settings lead to divergence; maintaining top-\( p = 0.95 \) and \( \tau_g = 0.1 \) keeps optimization stable.

“KL divergence plots verify stability across SofT-GRPO’s tuned parameters.”
Conclusion: A New Way to Teach Machines to Think
SofT-GRPO represents a milestone in reasoning-oriented reinforcement learning for LLMs. Where previous methods failed to handle continuous soft tokens, SofT-GRPO succeeds through elegant mathematical integration of Gumbel-Softmax and reparameterization.
The implications are significant:
- Bridging discrete and continuous reasoning: SofT-GRPO provides the first robust reinforcement learning framework for soft-thinking LLMs.
- Controlled exploration: Gumbel noise allows stochastic exploration that respects model probability distributions.
- Precise policy updates: Gumbel reparameterization enables accurate gradient attribution and stable fine-tuning.
- Better diversity, better results: Superior Pass@k metrics reveal a model that not only reasons well once, but diversely across attempts.
- Toward continuous intelligence: Reinforcing soft-thinking paves the path for LLMs that think in fluid concept spaces—beyond rigid tokens and into more human-like abstraction.
As LLM research moves forward, SofT-GRPO may stand as a foundational technique for reasoning that truly grows beyond words—unlocking models that think softly, but reason sharply.