Large Language Models (LLMs) have become remarkably adept at complex reasoning through a technique called Chain-of-Thought (CoT)—a step-by-step process where the model generates discrete reasoning tokens much like how a person thinks through a problem. However, this approach is inherently rigid: at each step, the model must commit to a single token from its vocabulary.

What if that very rigidity limits its ability to explore abstract or nuanced ideas? What if reasoning could flow more fluidly across possibilities, rather than snapping to one discrete choice at a time?

That question inspires the soft-thinking paradigm—a mode of reasoning where an LLM uses continuous “soft tokens,” each representing a weighted average of multiple possible token embeddings. This approach lets the model hold multiple hypotheses simultaneously, opening doors to richer, more flexible reasoning.

But soft-thinking comes with a catch. Reinforcement Learning (RL), which has propelled discrete token reasoning to new heights, performs poorly when applied to soft tokens. Algorithms like Group Relative Policy Optimization (GRPO), which reward successful reasoning trajectories, cannot easily handle the continuous stochastic space of soft-thinking.

Enter SofT-GRPO, a breakthrough algorithm that successfully extends GRPO into the soft-thinking world. By injecting controlled randomness through Gumbel noise and leveraging the Gumbel reparameterization trick, SofT-GRPO achieves what earlier approaches could not—it enables LLMs trained on soft-thinking to outperform their discrete-token counterparts.

In this deep dive, we’ll unpack how SofT-GRPO reshapes reinforcement learning for LLMs, covering:

  • How discrete-token GRPO works and its limitations
  • Why previous RL attempts for soft-thinking failed
  • How SofT-GRPO uses Gumbel noise and reparameterization to overcome these barriers
  • The experimental results showing SofT-GRPO’s superior accuracy and robustness

Let’s start with how reasoning traditionally works in discrete space.


From Discrete Steps to Soft Thoughts

The World of Discrete Tokens: CoT and GRPO

When given a problem—say a math question—an LLM does not simply guess the answer. It first produces a Chain-of-Thought, \( \mathbf{R} = (r_1, r_2, \ldots, r_T) \), a token-by-token reasoning sequence leading to the final answer \( \mathbf{A} \). Each token comes from its vocabulary, and the probability of generating the entire sequence is the product of probabilities of generating each token conditioned on the previous ones.

“The generation probability of a discrete Chain-of-Thought: tokens are generated step by step from model probabilities.”
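Written out with \( Q \) denoting the query, this is the standard autoregressive factorization the paragraph describes (the paper may fold the answer tokens into the same product):

\[ \pi_\theta(\mathbf{R}, \mathbf{A} \mid Q) \;=\; \prod_{t=1}^{T} \pi_\theta(r_t \mid Q, r_{<t}) \,\cdot\, \pi_\theta(\mathbf{A} \mid Q, \mathbf{R}). \]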

To enhance this reasoning process, Reinforcement Learning with Verifiable Rewards (RLVR) is used. The objective is simple: let the model try multiple reasoning paths and reward correct ones. A leading algorithm here is Group Relative Policy Optimization (GRPO).

GRPO generates a group of \( G \) reasoning trajectories for each query through multinomial sampling, exploring different reasoning paths that lead to varied rewards.

“Group rollout in GRPO: multiple discrete CoT trajectories are sampled for a single query to encourage diverse exploration.”

Each trajectory receives a reward—such as 1 if the answer is correct, 0 otherwise. GRPO computes an advantage that measures how a trajectory’s performance compares to its peers. The model increases the probability of higher-advantage trajectories and decreases that of lower ones through a gradient-based update.
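In its commonly used group-relative form (shown here as a reference; the paper may use a slight variant), the advantage of trajectory \( i \) is its reward standardized against the group:

\[ \hat{A}_i = \frac{R_i - \operatorname{mean}\!\big(\{R_1, \ldots, R_G\}\big)}{\operatorname{std}\!\big(\{R_1, \ldots, R_G\}\big)}. \]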

“The standard GRPO loss balances exploration and exploitation via trajectory advantages and clipping.”
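For reference, a widely used form of the discrete-token objective, written without the optional KL-regularization term and not necessarily matching the paper's exact variant, is:

\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\Big( \rho_{i,t}(\theta)\, \hat{A}_i,\; \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_i \Big) \right], \qquad \rho_{i,t}(\theta) = \frac{\pi_\theta(r_{i,t} \mid Q, r_{i,<t})}{\pi_{\theta_{\text{old}}}(r_{i,t} \mid Q, r_{i,<t})}, \]

where \( T_i \) is the length of trajectory \( i \), \( \varepsilon \) is the clipping range, and \( \rho_{i,t} \) is the per-token probability ratio between the new and old policies.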

This policy optimization delivers powerful performance boosts—but it depends on explicit token-level probabilities. Once we replace discrete tokens with continuous soft ones, this framework breaks down.


The Soft-Thinking Paradigm

Soft-thinking abandons the idea of picking a single word at each step. Instead, it constructs a “soft token” \( s_t \) as a weighted sum of token embeddings, where the weights correspond to their predicted probabilities.

“A soft token \( s_t \) is computed as the expectation of token embeddings under their predicted probability distribution.”

Formally:

\[ s_t = \sum_i p_i e_i, \]

where \( e_i \in \mathbb{R}^d \) is the embedding of token \( i \) and \( p_i \in [0,1] \) its probability. This continuous representation lets the model express vague or mixed concepts invisible to simple tokenization: a way to “think in between words.”

To add meaningful variability, researchers turn to the Gumbel-Softmax technique, which inserts random noise into the log-probabilities before forming the weighted sum. This adds stochasticity while maintaining a valid token mixture.

“Gumbel-Softmax introduces controlled randomness into the log-probabilities, enabling the model to explore multiple soft reasoning paths.”
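For concreteness, a standard Gumbel(0,1) sample can be drawn by inverse-transform sampling from a uniform variable:

\[ \epsilon = -\log\big(-\log u\big), \qquad u \sim \text{Uniform}(0, 1). \]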

Yet even with Gumbel-Softmax, applying RL directly to these soft tokens remains extremely difficult. The challenge lies in attributing continuous rewards back to the token probabilities correctly.


The Bottleneck: Why RL Struggled with Soft-Thinking

A prior approach by Butt et al. (2025) introduced Gaussian noise directly to the embedding vector \( s_t \), creating a perturbed token \( \hat{s}_t \).

“Soft-token perturbation in the prior method: Gaussian noise is added to \( s_t \) for exploration.”
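Schematically, in our notation rather than the original paper's exact parameterization, the perturbation has the form:

\[ \hat{s}_t = s_t + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_d), \]

with \( \sigma \) controlling the exploration scale.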

The policy update then computed gradients based on the distance between \( s_t \) and \( \hat{s}_t \).

“The gradient update is derived from the Gaussian noise perturbation.”

But two fundamental issues arise:

  1. The mismatch problem: the mapping from the probability distribution \( (p_i) \) to the soft token \( s_t \) is many-to-one. Different distributions can yield an identical \( s_t \) (see the short sketch after this list), so it is unclear which probabilities caused a successful outcome, and noise injected at the embedding level breaks the link between reward and probability.

  2. The “out-of-space” problem: valid soft tokens lie inside a specific convex region, the convex hull of the token embeddings. Adding random Gaussian noise usually pushes \( \hat{s}_t \) outside this region, yielding invalid or incomprehensible model inputs.
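To make the mismatch concrete, here is a minimal sketch (toy numbers, not from the paper) in which two different probability distributions produce exactly the same soft token, so a reward attached to \( s_t \) alone cannot tell them apart:

```python
import numpy as np

# Toy vocabulary of 3 tokens with 2-dimensional embeddings (hypothetical values).
E = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

p1 = np.array([0.5, 0.0, 0.5])  # split mass between tokens 0 and 2
p2 = np.array([0.0, 1.0, 0.0])  # all mass on token 1

s1 = p1 @ E  # soft token under p1
s2 = p2 @ E  # soft token under p2

print(s1, s2)               # both are [0.5 0.5]
print(np.allclose(s1, s2))  # True: distinct distributions, identical soft token
```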

These limitations cripple learning—updates become misaligned with the model’s probability space, producing unstable performance. The solution must preserve the probabilistic structure while introducing controlled exploration.


Inside SofT-GRPO: Reinforcing Soft-Thinking Correctly

SofT-GRPO presents an elegant fix by integrating Gumbel randomness directly into the policy optimization process. It operates through two complementary stages:

  1. Group rollout with Gumbel noise (exploration)
  2. Policy update via Gumbel reparameterization (learning)

“Overview of the SofT-GRPO pipeline: group rollout with Gumbel noise, followed by a Gumbel-reparameterized policy update.”


Step 1: Group Rollout with Gumbel Noise

Like GRPO, SofT-GRPO rolls out \( G \) reasoning trajectories per query. At each reasoning step, instead of sampling discrete tokens, it uses Gumbel-Softmax to form stochastic soft tokens.

First, the model computes token probabilities \( p_i \), then injects random noise \( \epsilon_i \sim \text{Gumbel}(0,1) \) into the log-probabilities:

\[ g'_i = \log p_i + \epsilon_i. \]

The noisy log-probabilities are then passed through a softmax with temperature \( \tau_g \) to produce the final soft mixture:

\[ y'_i = \frac{\exp(g'_i / \tau_g)}{\sum_j \exp(g'_j / \tau_g)}, \quad s_t = \sum_i y'_i e_i. \]

“Rollout equations for SofT-GRPO: Gumbel noise perturbs the log-probabilities before a tempered softmax forms a stochastic soft token.”
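A minimal sketch of one rollout step, assuming access to the model's next-token probabilities `probs` and its embedding matrix `E` (the names and shapes here are ours, not the paper's):

```python
import numpy as np

def soft_token_rollout_step(probs: np.ndarray, E: np.ndarray, tau_g: float = 0.1) -> np.ndarray:
    """Form one stochastic soft token via Gumbel-Softmax over the vocabulary.

    probs: next-token probabilities, shape (V,)
    E:     token embedding matrix, shape (V, d)
    """
    u = np.random.uniform(low=1e-10, high=1.0, size=probs.shape)
    eps = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    g = np.log(probs + 1e-12) + eps            # perturbed log-probabilities g'_i
    z = g / tau_g
    y = np.exp(z - np.max(z))                  # tempered softmax (numerically stable)
    y /= y.sum()
    return y @ E                               # probability-weighted embedding = soft token

# Toy usage with hypothetical sizes.
V, d = 8, 4
E = np.random.randn(V, d)
probs = np.random.dirichlet(np.ones(V))
s_t = soft_token_rollout_step(probs, E)
```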

This technique relies on the Gumbel-Max Trick, a classical result stating that taking the argmax of logits perturbed with Gumbel noise is equivalent to sampling from the original categorical distribution:

“The Gumbel-Max Trick validates that Gumbel-noise-based sampling is consistent with the underlying token probabilities.”
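Formally, for independent noise \( \epsilon_i \sim \text{Gumbel}(0,1) \):

\[ \arg\max_i \big( \log p_i + \epsilon_i \big) \;\sim\; \text{Categorical}(p_1, \ldots, p_V), \]

and the tempered softmax used in the rollout is the differentiable relaxation of this argmax, approaching a one-hot sample as \( \tau_g \to 0 \).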

In short, SofT-GRPO achieves stochastic sampling without invalid embeddings. The model can now explore many reasoning paths naturally within its trained space.


Step 2: Policy Update with Gumbel Reparameterization

Once the rollout trajectories receive rewards, SofT-GRPO performs policy optimization to reinforce successful reasoning.

Here lies its most innovative component: the Gumbel reparameterization trick. Instead of working in embedding space, it reformulates gradients in terms of sampled Gumbel noise, whose distribution is analytically differentiable.

Under the old policy \( \pi_{\theta_{\text{old}}} \):

“The log-probability of a soft token under the old policy, \( \log \pi_{\theta_{\text{old}}}(s_t \mid Q, S_{<t}) \), expressed as a function of the sampled Gumbel noise.”

Under the new policy \( \pi_\theta \):

“The log-probability of a soft token under the new policy, \( \log \pi_{\theta}(s_t \mid Q, S_{<t}) \), computed by reusing the same Gumbel noise via the reparameterization trick.”
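One way to make this concrete, as our sketch of the idea rather than the paper's exact expression: the rollout stores the perturbed log-probabilities \( g'_i \), so the Gumbel noise that a policy with probabilities \( p_i^{(\theta)} \) would need in order to reproduce the same soft token is \( g'_i - \log p_i^{(\theta)} \), and the Gumbel(0,1) log-density is available in closed form:

\[ \log \pi_\theta(s_t \mid Q, S_{<t}) \;\propto\; \sum_i \Big[ -\big(g'_i - \log p_i^{(\theta)}\big) - \exp\!\big(-\big(g'_i - \log p_i^{(\theta)}\big)\big) \Big], \]

with the old-policy version obtained by substituting \( p_i^{(\theta_{\text{old}})} \). Since the right-hand side is differentiable in \( \theta \), reward gradients flow directly into the token probabilities.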

These log-probabilities are then used to construct an updated GRPO-style loss function that includes both soft-thinking and discrete components.

“The full SofT-GRPO loss combines the discrete GRPO objective with soft-thinking gradients derived via Gumbel reparameterization.”
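Sketched in our notation (the paper's full loss also covers the discrete answer tokens), each soft-thinking step contributes an importance ratio

\[ \rho_t(\theta) = \exp\Big( \log \pi_\theta(s_t \mid Q, S_{<t}) - \log \pi_{\theta_{\text{old}}}(s_t \mid Q, S_{<t}) \Big), \]

which is clipped and weighted by the group-relative advantage \( \hat{A}_i \), mirroring the discrete objective shown earlier.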

The advantage of this approach is precision: improvements in reward directly update the underlying token probabilities rather than ambiguous embedding vectors. SofT-GRPO thereby resolves the core attribution problem that limited previous RL methods for soft-thinking.


Experiments: Putting SofT-GRPO to the Test

The paper benchmarks SofT-GRPO across three base LLMs—DeepSeek-R1-Distill-Qwen-1.5B, LLaMA-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-7B—on five mathematical reasoning datasets: AIME2024, AIME2025, AMC23, MATH-500, and GSM8K.

Main Findings

SofT-GRPO consistently outperforms discrete-token GRPO on all models and datasets.

“Main results (Table 1 of the paper): across Pass@1, Pass@16, and Pass@32, SofT-GRPO shows clear gains over discrete-token GRPO.”

  • Pass@1: +0.13% average improvement (single-shot accuracy)
  • Pass@16: +1.80% average gain
  • Pass@32: +2.19% average gain

These results indicate that SofT-GRPO doesn’t just improve first-try accuracy—it greatly enhances diversity and robustness. The model explores more reasoning paths successfully, meaning that with multiple attempts, correct answers are much more likely.


Comparison to Prior Work and Generalization

SofT-GRPO easily surpasses the Gaussian-based RL model proposed by Butt et al. (2025). That earlier method loses alignment between tokens and embeddings, leading to minimal gains.

“Comparison with the prior soft-token RL method (Table 2 of the paper): SofT-GRPO leads by substantial margins.”

SofT-GRPO also generalizes well beyond math tasks. When evaluated on scientific and code reasoning benchmarks—GPQA Diamond, HumanEval, and MBPP—the improvements continue.

“Out-of-domain results (Table 3 of the paper): the benefits of SofT-GRPO training transfer to scientific and code reasoning tasks.”

This demonstrates that SofT-GRPO fine-tuning develops general reasoning improvements, not narrow task-specific skills.


Harnessing High Pass Rates: Majority Voting

The higher multi-pass accuracies make majority voting practical: the model generates many candidate answers and the most frequent one is chosen as the final output. SofT-GRPO fine-tuned models perform best under this regime.
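As a minimal sketch of the voting step (the candidate answer strings are hypothetical placeholders):

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Return the most frequent candidate answer; ties resolve to the first one seen."""
    return Counter(candidates).most_common(1)[0][0]

# Toy usage: five sampled answers to the same question.
print(majority_vote(["42", "41", "42", "42", "7"]))  # -> "42"
```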

“Combining SofT-GRPO with majority voting (Table 4 of the paper) further amplifies reliability in multi-attempt reasoning.”


Ablation Studies: Why Gumbel Matters

To validate its design, the authors compare SofT-GRPO to variants that use Dirichlet or Gaussian noise in place of Gumbel. None perform as well.

“Ablation results (Table 5 of the paper): Gumbel noise yields stronger accuracy and stability than Dirichlet or Gaussian alternatives.”

Training reward curves confirm the difference: Gumbel-based training achieves smoother and faster convergence.

“Training and validation reward curves from the ablations: the Gumbel-noise variant shows the most stable and fastest-converging learning trajectory.”

Further analyses of hyperparameters show that overly high top-\(p\) or temperature settings lead to divergence; maintaining top-\(p\) = 0.95 and \( \tau_g = 0.1 \) keeps optimization stable.

“KL-divergence curves confirm that the chosen hyperparameters keep the fine-tuned model from drifting too far from its pre-trained state.”


Conclusion: A New Way to Teach Machines to Think

SofT-GRPO represents a milestone in reasoning-oriented reinforcement learning for LLMs. Where previous methods failed to handle continuous soft tokens, SofT-GRPO succeeds through elegant mathematical integration of Gumbel-Softmax and reparameterization.

The implications are significant:

  1. Bridging discrete and continuous reasoning: SofT-GRPO provides the first robust reinforcement learning framework for soft-thinking LLMs.
  2. Controlled exploration: Gumbel noise allows stochastic exploration that respects model probability distributions.
  3. Precise policy updates: Gumbel reparameterization enables accurate gradient attribution and stable fine-tuning.
  4. Better diversity, better results: Superior Pass@k metrics reveal a model that not only reasons well once, but diversely across attempts.
  5. Toward continuous intelligence: Reinforcing soft-thinking paves the path for LLMs that think in fluid concept spaces—beyond rigid tokens and into more human-like abstraction.

As LLM research moves forward, SofT-GRPO may stand as a foundational technique for reasoning that truly grows beyond words—unlocking models that think softly, but reason sharply.