Reinforcement Learning from Human Feedback (RLHF) is the secret sauce behind the modern revolution in Large Language Models (LLMs). It is the process that turns a raw, text-predicting model into a helpful assistant like ChatGPT, Claude, or Gemini.
The standard recipe for RLHF has been established for years: take a pre-trained model, collect human preferences (e.g., “Answer A is better than Answer B”), train a reward model, and then optimize the language model using an algorithm called PPO (Proximal Policy Optimization).
But there is a problem. PPO is notoriously unstable, complex to tune, and inefficient. Why? Because in the standard framework, the model generates an entire paragraph, and only then receives a single “score” (reward) at the very end. It’s like taking a 50-question exam and getting a single grade of “C-” without knowing which questions you missed.
A new paper, “DPO Meets PPO: Reinforced Token Optimization for RLHF,” tackles this bottleneck head-on. The researchers propose a novel framework called Reinforced Token Optimization (RTO). Their method changes the game by treating RLHF not as a “bandit” problem with sparse rewards, but as a Markov Decision Process (MDP) with dense, token-level rewards.
By cleverly combining the popular Direct Preference Optimization (DPO) method with PPO, they create a system where the model gets feedback on every single word it generates. The result? A model that learns faster, requires less data, and significantly outperforms standard baselines.
In this post, we will unpack the mathematics of why token-level rewards are superior, how RTO extracts them using DPO, and the impressive results this hybrid approach achieves.
The Bottleneck: RLHF as a Bandit Problem
To understand the innovation of RTO, we first need to look at how RLHF is typically formulated.
In the classic setup (used by InstructGPT, Llama 2, etc.), the process looks like this:
- Prompt: The environment gives the model a prompt \(x\).
- Action: The model generates a full response \(y\) (a sequence of tokens).
- Reward: A Reward Model (trained on human data) looks at the entire response \(y\) and assigns a scalar score \(r(x, y)\).
From a mathematical perspective, this models the problem as a Contextual Bandit. The “action” is the entire sentence. The environment transitions immediately to a terminal state after one action.
The issue with the Bandit formulation is sparsity. The PPO algorithm tries to figure out which specific tokens contributed to a high reward, but the signal is only given at the very end. The standard implementation assigns the reward to the last token and gives a reward of zero to all intermediate tokens (except for a small regularization penalty).
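To make the sparsity concrete, here is a minimal sketch of how this reward vector is typically assembled in open-source RLHF pipelines. The function name, the `kl_coef` default, and the toy numbers are illustrative, not taken from the paper:

```python
import torch

def sparse_ppo_rewards(sentence_score: float,
                       kl_per_token: torch.Tensor,
                       kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards in standard RLHF-PPO: a small KL regularization
    penalty everywhere, with the reward model's single scalar score
    added only at the very last token."""
    rewards = -kl_coef * kl_per_token   # essentially zero signal for intermediate tokens
    rewards[-1] += sentence_score       # the whole sentence-level score lands on the final token
    return rewards

# A 6-token response that the reward model scored at +2.0.
kl = torch.tensor([0.02, 0.01, 0.03, 0.02, 0.01, 0.02])
print(sparse_ppo_rewards(2.0, kl))      # near-zero everywhere except the last entry
```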
This creates a “credit assignment” problem. If the model writes a beautiful paragraph but makes one factual error at the end, the whole sentence might get a low score. The model struggles to understand that the first 90% was actually good.
The Solution: RLHF as an MDP
The authors of this paper argue that text generation should be modeled as a Markov Decision Process (MDP).
In an MDP:
- State: The current prompt plus all tokens generated so far (\(x, y_{1:h-1}\)).
- Action: The next single token (\(y_h\)).
- Reward: A specific reward received immediately after generating that token (\(r_h\)).
This shift allows for fine-grained token-wise information. Instead of waiting for the end of the sentence, the model receives a signal saying “good token” or “bad token” at every step.
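In code, the MDP view is little more than a bookkeeping change: the state is the prompt plus everything generated so far, the action is a single token, and a reward arrives on every step. A minimal sketch, with all names purely illustrative:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class TokenState:
    prompt: str                                           # x
    generated: List[str] = field(default_factory=list)    # y_{1:h-1}

def mdp_step(state: TokenState,
             token: str,                                  # action y_h
             token_reward: Callable[[TokenState, str], float]
             ) -> Tuple[TokenState, float]:
    """One transition of the token-level MDP: append the chosen token and
    receive an immediate reward r_h for that token alone."""
    r_h = token_reward(state, token)
    next_state = TokenState(state.prompt, state.generated + [token])
    return next_state, r_h

# Toy usage with a made-up reward function.
state, r = mdp_step(TokenState("Explain RLHF:"), "Sure", lambda s, t: -0.01 * len(t))
```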

As shown in Figure 1, the difference is structural. On the left (Standard PPO), the reward is a block at the end. On the right (RTO), rewards flow continuously throughout the generation process.
Theoretical Insight: Why Dense Rewards Are Better
Before diving into the algorithm, the authors provide a compelling theoretical justification for why token-wise rewards are superior to sentence-wise rewards. They frame this in terms of sample complexity—essentially, how many training examples you need to find the optimal policy.
Imagine the model is trying to generate a sequence of length \(H\) (say, 3 tokens) from a vocabulary of size \(A\) (say, 2 possible words).
- Sentence-level (Sparse): You generate the whole sequence, get a score, and try again. You have to explore the combinations of full sentences.
- Token-level (Dense): You get feedback after every token. If the first token is bad, you know immediately that any sentence starting with that token is suboptimal. You can prune that entire branch of the “tree” of possibilities.
The authors prove a theorem stating that with sentence-level rewards, the sample complexity scales with \(A^H\) (exponential in sequence length). However, with token-level rewards, the complexity drops significantly, scaling with \(A^{\min(\xi+1, H)}\), where \(\xi\) is related to how “obvious” the best tokens are.

Figure 2 visualizes this perfectly. It represents the space of possible responses as a tree.
- In the dense reward setting (shown here), if the model picks a sub-optimal token (a red node), it receives immediate negative feedback. It can effectively “prune” all future branches stemming from that mistake.
- In a sparse reward setting, the model would have to traverse all the way to the leaf nodes (the bottom) for every single path to realize they were wrong.
This theoretical insight explains why RTO is so much more efficient: it stops the model from wasting time exploring paths that started with a bad mistake steps ago.
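A toy way to see the counting argument: with sparse feedback every full sequence must be generated and scored, while with per-token feedback a bad prefix is never expanded. This is a loose illustration of the pruning intuition, not the paper's proof; the vocabulary, scores, and pruning rule below are made up:

```python
from itertools import product

VOCAB, H = ["a", "b"], 10                 # toy vocabulary (A = 2) and sequence length

def token_score(tok: str) -> float:
    return 1.0 if tok == "a" else -1.0    # pretend "a" is always the better token

# Sparse feedback: every full sequence gets generated and scored once.
sparse_evals = len(list(product(VOCAB, repeat=H)))   # A**H = 1024 sequences

# Dense feedback: a token with a bad immediate score prunes its whole subtree.
def dense_evals(depth: int = 0) -> int:
    if depth == H:
        return 0
    total = 0
    for tok in VOCAB:
        total += 1                        # one token-level check
        if token_score(tok) > 0:          # only promising tokens are expanded
            total += dense_evals(depth + 1)
    return total

print(sparse_evals, dense_evals())        # 1024 full sequences vs. 20 token checks
```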
The Core Method: Reinforced Token Optimization (RTO)
We know we want token-level rewards. But where do we get them? Human labelers rank full sentences; they don’t sit there and score every single “the”, “and”, or “cat” the model outputs.
This is where the “DPO meets PPO” magic happens.
Step 1: Extracting Implicit Rewards via DPO
Direct Preference Optimization (DPO) is a popular algorithm that bypasses explicit reward modeling entirely. However, the math behind DPO reveals that the optimized policy implicitly defines a reward function.
The authors derive that for a DPO-trained policy \(\pi_{dpo}\), the optimal token-wise reward \(r^*\) is proportional to the log-ratio between the DPO policy and the reference policy (the original SFT model).
Mathematically, writing the state as \(s_h = (x, y_{1:h-1})\) and the action as \(a_h = y_h\), the “implicit” reward for a specific token \(y_h\) given the context \(x, y_{1:h-1}\) is:
\[ r^*(s_h, a_h) \approx \beta \log \frac{\pi_{dpo}(y_h \mid x, y_{1:h-1})}{\pi_{ref}(y_h \mid x, y_{1:h-1})} \]

This is a profound realization. Even though DPO is trained on sentence-level preferences (A vs. B), the resulting model (\(\pi_{dpo}\)) contains granular, token-level knowledge. If \(\pi_{dpo}\) assigns a much higher probability to a token than the reference model \(\pi_{ref}\) does, that token likely carries a high reward.
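Operationally, this means token-level rewards can be read off from two forward passes: one with the DPO-tuned model and one with the frozen reference model. A minimal PyTorch sketch, assuming you already have the logits each model assigns at the response positions (the function name and the \(\beta = 0.1\) default are illustrative):

```python
import torch
import torch.nn.functional as F

def implicit_token_rewards(dpo_logits: torch.Tensor,    # (seq_len, vocab) from pi_dpo
                           ref_logits: torch.Tensor,    # (seq_len, vocab) from pi_ref
                           response_ids: torch.Tensor,  # (seq_len,) generated token ids
                           beta: float = 0.1) -> torch.Tensor:
    """r_h ~ beta * log( pi_dpo(y_h | x, y_{1:h-1}) / pi_ref(y_h | x, y_{1:h-1}) )."""
    dpo_logp = F.log_softmax(dpo_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    idx = response_ids.unsqueeze(-1)
    chosen_dpo = dpo_logp.gather(-1, idx).squeeze(-1)   # log pi_dpo of each generated token
    chosen_ref = ref_logp.gather(-1, idx).squeeze(-1)   # log pi_ref of each generated token
    return beta * (chosen_dpo - chosen_ref)

# Toy usage with random logits standing in for the two models.
torch.manual_seed(0)
r = implicit_token_rewards(torch.randn(5, 100), torch.randn(5, 100), torch.randint(100, (5,)))
print(r)   # one implicit reward per generated token
```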
Step 2: Constructing the RTO Reward
RTO uses this implicit value to construct a dense reward function for PPO training. Instead of a sparse zero-everywhere-except-the-end signal, RTO assigns a reward to every token.
The actual reward function used in the algorithm, denoted as \(r_{rto}\), is a mix of the DPO signal and the current policy’s deviation, plus a final sentence-level kick.

Let’s break down this equation (shown in the image above); a code sketch of the construction follows the list:
- The Dense Signal (\(\beta_1 \dots\)): For every token \(h\), the reward includes \(\beta_1 \log \frac{\pi_{dpo}(...)}{\pi_{ref}(...)}\). This is the “guidance” from the DPO model. If the DPO model thinks a token is good, the PPO agent gets a positive reward immediately.
- The Constraint (\(\beta_2 \dots\)): It subtracts a term related to the current policy \(\pi\) to maintain stability (similar to standard KL penalties).
- The Sparse Signal (\(\beta_3 \cdot r_{MLE}\)): At the very last token (\(h=H\)), they add the traditional sentence-level reward (\(r_{MLE}\)) derived from a standard Reward Model. This ensures the model doesn’t lose sight of the overall goal (e.g., proper length, coherence) that dense rewards might sometimes miss.
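Putting the three terms together, here is a minimal sketch of the reward construction described above. The coefficient names \(\beta_1, \beta_2, \beta_3\) follow the breakdown; the default values are placeholders, not the paper's settings:

```python
import torch

def rto_token_rewards(dpo_log_ratio: torch.Tensor,      # log pi_dpo / pi_ref, per token
                      policy_log_ratio: torch.Tensor,   # log pi_theta / pi_ref, per token
                      sentence_reward: float,           # r_MLE(x, y) from the ordinary reward model
                      beta1: float = 0.05,
                      beta2: float = 0.1,
                      beta3: float = 1.0) -> torch.Tensor:
    """Dense RTO-style reward: DPO guidance at every token, a KL-style
    constraint on the current policy, and the sentence-level score at the end."""
    rewards = beta1 * dpo_log_ratio - beta2 * policy_log_ratio   # dense signal + constraint
    rewards[-1] += beta3 * sentence_reward                       # sparse signal at h = H
    return rewards
```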
Step 3: PPO Training
With this new \(r_{rto}\) function, the training proceeds using the standard PPO algorithm. However, the experience of the agent is vastly different.
- Standard PPO: Agent generates a 100-token sentence. It gets 99 zeros and one reward of +2.0.
- RTO: Agent generates a 100-token sentence. It gets a stream of rewards: +0.1, +0.2, -0.05, +0.3… summing up to a final value.
This dense signal lowers the variance of the gradient estimates, making PPO more stable and allowing it to learn much faster.
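Where does this reward stream actually enter PPO? Through the advantage estimates. The sketch below uses standard Generalized Advantage Estimation (ordinary PPO machinery, not something introduced by this paper) simply to show the difference between feeding it a sparse reward vector and a dense one; the toy values are made up:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one generated response."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    next_value, running = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # per-token TD error
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

values = torch.zeros(5)                              # toy value estimates
sparse = torch.tensor([0.0, 0.0, 0.0, 0.0, 2.0])     # standard PPO: one spike at the end
dense  = torch.tensor([0.1, 0.2, -0.05, 0.3, 0.4])   # RTO: a signal at every token
print(gae_advantages(sparse, values))
print(gae_advantages(dense, values))
```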
Experiments and Results
The authors evaluated RTO on standard high-quality benchmarks: AlpacaEval 2 (measuring helpfulness and instruction following) and Arena-Hard (a challenging benchmark mimicking the Chatbot Arena). They used Llama-3-8B as the base model.
The baselines included:
- SFT: Supervised Fine-Tuning only.
- DPO: Standard Direct Preference Optimization.
- PPO: Standard sparse-reward PPO.
- SimPO, TDPO, R-DPO: Other advanced preference learning methods.
Benchmark Performance
The results were decisive. RTO outperformed all baselines.

As seen in Table 1, RTO achieves a 27.00% length-controlled win rate on AlpacaEval 2, compared to 19.47% for PPO and 17.40% for DPO. That is a massive margin in the world of LLM alignment. Similarly, on Arena-Hard, RTO leads with a score of 20.3 versus PPO’s 16.2.
This confirms that adding dense, token-level signals allows the PPO algorithm to reach heights that sparse signals simply cannot.
Sample Efficiency: Doing More with Less
One of the main theoretical claims was that token-level rewards are more sample-efficient. The experiments backed this up beautifully.
The researchers trained PPO and RTO using fractions of the dataset (e.g., using only 1/8th of the available prompts).

Look at Figure 4(c) above.
- The Pink line (PPO) struggles when data is limited. It stays relatively flat until it has a large amount of data.
- The Teal line (RTO) shoots up almost immediately. With just 1/8th of the data, RTO already matches the performance of PPO trained on the full dataset.
This behavior aligns perfectly with the “Tree Pruning” theory discussed earlier. Because RTO gets feedback on every token, it squeezes significantly more information out of every single training example than standard PPO does.
Why does it work? (Ablation Studies)
The authors performed “ablation studies” (removing parts of the method to see what matters).
- Granularity: They tried assigning the DPO reward only to the end of sentences (like standard PPO) rather than every token. Performance dropped significantly. This proves that the density of the reward is key.
- Reward Shaping: They showed that the DPO reward acts as “reward shaping.” It guides the optimization landscape, smoothing out the rough terrain that PPO usually has to navigate.
Conclusion
RTO represents a significant step forward in the alignment of Large Language Models. By bridging the gap between Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO), the authors have unlocked the power of dense rewards without needing expensive token-level human annotations.
Key takeaways for students and practitioners:
- Formulation Matters: Modeling RLHF as a Bandit problem (sparse reward) limits performance. Modeling it as an MDP (dense reward) unlocks efficiency.
- Implicit Rewards: You don’t always need an explicit reward model for every step. Models like DPO implicitly contain this information; you just have to know how to extract it.
- Hybrid Approaches: The debate often frames RLHF as “PPO vs. DPO.” This paper shows that the best solution might be “PPO plus DPO.”
As open-source implementations of RLHF continue to mature, techniques like RTO will likely become standard, helping us train smarter, better-aligned models with less data and less computational waste.