Introduction

We are currently witnessing a paradigm shift in Large Language Models (LLMs). We are moving from “chatbots”—models that answer a single query—to Language Agents. These are systems capable of browsing the web to buy products, conducting scientific experiments in simulation, or managing complex workflows.

However, training these agents is significantly harder than training a standard chatbot. While a chatbot only needs to get the next token right, an agent must take a sequence of correct actions to reach a goal. If an agent makes a mistake in step 3 of a 20-step process, the entire trajectory might fail. This introduces the problem of compounding errors, where small deviations from an optimal path spiral into complete failure.

For aligning models with human intent, Direct Preference Optimization (DPO) has recently become the gold standard. It lets us train models on “winning” vs. “losing” examples without a full Reinforcement Learning (RL) loop. But there is a catch: DPO was mathematically derived for single-turn interactions (bandit settings). When you force DPO onto multi-turn agent tasks, the math breaks down, specifically in the normalization of probabilities (the partition function).

In this post, we will take a deep dive into the solution to this problem presented in the paper “Direct Multi-Turn Preference Optimization for Language Agents.” The researchers introduce DMPO, a novel loss function that:

  1. Extends DPO to multi-turn settings with a rigorous theoretical foundation.
  2. Eliminates the dependency of the partition function on the current state.
  3. Introduces length normalization to handle trajectories of varying durations.

Figure 1: Illustration of the DMPO loss, which directly optimizes the RL objective by maximizing the likelihood of the preferred trajectory over the dispreferred trajectory.

As illustrated in Figure 1, DMPO takes an instruction and preference data (a “win” trajectory vs. a “lose” trajectory) and optimizes the agent to maximize the likelihood of the winning path. By the end of this article, you will understand the mathematical “hack” that makes this possible and why it matters for the future of AI agents.

Background: The Challenge of Multi-Turn Agents

To understand DMPO, we first need to understand the limitations of current training methods for agents.

From Chatbots to Markov Decision Processes

In a standard LLM setting, the model generates a response \(y\) given a prompt \(x\). In an agent setting, the problem is formulated as a Markov Decision Process (MDP).

  • State (\(s_t\)): The current context (e.g., the webpage the agent is looking at, plus conversation history).
  • Action (\(a_t\)): What the agent writes or does (e.g., “Click [Buy Now]” or “Search [Blue Shoes]”).
  • Transition: The environment reacts to the action, giving a new state \(s_{t+1}\).
  • Reward: The agent receives a signal indicating success or failure.

The goal is to maximize the expected cumulative reward over the whole trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\).
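
To make the objective concrete, here is a minimal sketch in plain Python of a trajectory and the discounted return the agent tries to maximize. The `Step` record and the toy episode are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str     # e.g., the webpage the agent sees plus conversation history
    action: str    # e.g., "Click [Buy Now]" or "Search [Blue Shoes]"
    reward: float  # feedback from the environment for this step

def discounted_return(trajectory: list[Step], gamma: float = 0.99) -> float:
    """Cumulative (discounted) reward the agent tries to maximize over tau."""
    return sum(gamma ** t * step.reward for t, step in enumerate(trajectory))

# Toy episode: only the final step is rewarded.
tau = [
    Step("search page", "Search [Blue Shoes]", 0.0),
    Step("results page", "Click [Item 3]", 0.0),
    Step("item page", "Click [Buy Now]", 1.0),
]
print(discounted_return(tau))  # 0.99**2 = 0.9801
```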

The Failure of Behavioral Cloning

The simplest way to train an agent is Behavioral Cloning (BC)—essentially supervised fine-tuning (SFT). You take an expert trajectory and tell the model: “Do exactly what the expert did.”
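
Concretely, BC is just token-level cross-entropy on the expert’s actions. A minimal PyTorch-style sketch (the tensor shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_loss(logits: torch.Tensor, expert_tokens: torch.Tensor) -> torch.Tensor:
    """Supervised fine-tuning on an expert trajectory.

    logits:        (num_tokens, vocab_size) scores from the agent LLM
    expert_tokens: (num_tokens,) token ids of the expert's actions
    """
    # "Do exactly what the expert did": plain cross-entropy, with no notion
    # of how to recover from states the expert never visited.
    return F.cross_entropy(logits, expert_tokens)
```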

While easy to implement, BC suffers from covariate shift. If the trained agent encounters a state slightly different from what it saw in the training data (a state the expert never visited), it doesn’t know how to recover.

Figure 2: Illustration of expert trajectories and trajectories learned under the constraints of policy and state-action occupancy measure.

Figure 2 illustrates this perfectly.

  • The Red arrows are the expert path.
  • The Green arrows represent a model trained with standard policy constraints (like BC). At state \(S_2\), the model drifts away (to \(S_5\)) because it blindly mimics the policy without understanding the distribution of valid states.
  • The Blue arrows represent a model constrained by State-Action Occupancy Measure (SAOM)—a key concept in DMPO which we will discuss later. This constraint encourages the model to stay within the distribution of states visited by the expert, effectively pulling the agent back on track if it deviates.

The Limitation of DPO

To fix the issues of BC, we usually turn to Reinforcement Learning (RL). However, RL is notoriously unstable and hard to tune (requiring a separate Critic model, Reward model, etc.).

Direct Preference Optimization (DPO) changed the game by showing that you can optimize the RL objective directly using a classification loss on preference pairs (\(y_{win}\) vs \(y_{lose}\)), skipping the explicit reward modeling phase.

The DPO derivation relies on a mathematical trick involving the Bradley-Terry (BT) model. In single-turn DPO, the probability that response \(y_1\) is better than \(y_2\) depends on the implicit rewards of those responses. The derivation conveniently cancels out a term called the partition function (\(Z\)), which acts as a normalizer.
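
As a refresher, here is the standard single-turn argument (textbook DPO, not specific to this paper): the implicit reward contains a \(\log Z(x)\) term, but it appears once on each side of the Bradley-Terry comparison and subtracts away.

\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)
\]

\[
p(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)
= \sigma\!\left(\beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{ref}(y_1 \mid x)} - \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{ref}(y_2 \mid x)}\right)
\]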

Here is the problem: In single-turn settings, \(Z\) depends on the prompt \(x\), which is constant for both \(y_1\) and \(y_2\). Therefore, it cancels out. In multi-turn settings, the “prompt” changes at every step (the state changes). The partition function \(Z(s)\) becomes dependent on the specific state sequence. Since the winning and losing trajectories visit different states, \(Z(s)\) does not cancel out.

Applying standard DPO to multi-turn tasks ignores this mathematical reality, leading to suboptimal performance. DMPO is designed to fix this specific mathematical gap.

Core Method: Deriving DMPO

This section is the heart of the paper. We will walk through how the authors derive a loss function that looks like DPO but works for long, multi-step trajectories.

Step 1: Rethinking the Constraint

In standard RL optimization, we try to maximize reward while keeping the trained policy \(\pi_{\theta}\) close to a reference policy \(\pi_{ref}\) (usually the pre-trained model) to prevent “mode collapse” or generating gibberish. This is usually done with a KL-divergence constraint on the policy:

Equation showing the RL objective with a KL divergence constraint on the policy.
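
In generic form (the paper's exact equation may discount or average the KL term differently), this objective is:

\[
\max_{\pi_\theta} \; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(a \mid s) \,\|\, \pi_{ref}(a \mid s)\right]
\]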

The equation above says: Maximize expected reward (first term) minus the distance between our policy and the reference policy (second term).

The authors propose a shift. Instead of constraining the policy (\(\pi(a|s)\)), we should constrain the State-Action Occupancy Measure (SAOM), denoted as \(d(s, a)\).

What is SAOM?

The SAOM, \(d^\pi(s, a)\), represents the probability of the agent being in state \(s\) and taking action \(a\) over the course of a trajectory. It’s a global view of “where the agent spends its time.”

Equation defining the discounted state-action occupancy measure.
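
For reference, the standard discounted occupancy measure (the paper's definition may differ by a normalizing constant) is:

\[
d^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s,\, a_t = a \mid \pi)
\]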

Why make this switch? In imitation learning, constraining the SAOM is often more robust to compounding errors (as shown in the Blue arrows of Figure 2). But more importantly for this paper, it fixes the math.

When we change the constraint from policy to SAOM, the RL objective changes to:

Equation showing the RL objective with a KL divergence constraint on the SAOM.

This looks similar, but the variable inside the KL term is now \(d(s,a)\). The optimal solution to this specific objective has a closed form:

Equation showing the optimal SAOM solution.

The Crucial Insight: Look at the equation above. \(Z\) is the partition function (the normalizing constant). Because \(d(s,a)\) is a global measure over the whole MDP, normalizing it results in a partition function \(Z\) that is independent of the current state. \(Z\) becomes a global constant for the task.

This means we can rearrange the equation to express the reward \(r(s,a)\) in terms of the optimal occupancy measure:

Equation rearranging the optimal SAOM to solve for reward r(s,a).

Because \(Z\) is now a constant, when we compare two trajectories, \(Z\) will eventually cancel out!
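
Writing the two displayed equations out in symbols, a sketch that follows the standard KL-regularized derivation (the paper's notation may differ slightly) is:

\[
d^{*}(s, a) = \frac{1}{Z}\, d^{\pi_{ref}}(s, a)\, \exp\!\big(r(s, a)/\beta\big),
\qquad
Z = \sum_{s, a} d^{\pi_{ref}}(s, a)\, \exp\!\big(r(s, a)/\beta\big)
\]

\[
r(s, a) = \beta \log \frac{d^{*}(s, a)}{d^{\pi_{ref}}(s, a)} + \beta \log Z
\]

The sum defining \(Z\) runs over all state-action pairs rather than being conditioned on any particular state, which is exactly why it is a single global constant.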

Step 2: The Multi-Turn Bradley-Terry Model

Now that we have a definition for reward, we need to plug it into a preference model. We assume that the probability of a “win” trajectory (\(\tau^w\)) being preferred over a “lose” trajectory (\(\tau^l\)) follows the Bradley-Terry model based on the sum of their rewards.

Equation showing the probability of win vs lose based on summed rewards.

However, there is a catch. The “win” trajectory might be 5 steps long, while the “lose” trajectory might be 20 steps long. Simply summing the rewards introduces a bias based on length. The partition function \(Z\) (from the previous step) would be added \(T_w\) times for the winner and \(T_l\) times for the loser. If \(T_w \neq T_l\), \(Z\) won’t cancel out.

To solve this, the authors introduce Length Normalization. They normalize the cumulative reward by the effective length of the trajectory.

Equation showing the Bradley-Terry model with length normalization added.

By normalizing the sum, the contribution of the constant \(Z\) becomes balanced on both sides, allowing it to be eliminated mathematically. This provides a theoretical justification for length normalization, which had previously been used only as a heuristic.
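
To see why, note that the effective (discounted) length of a \(T\)-step trajectory is \(\sum_{t=0}^{T-1} \gamma^{t} = \frac{1-\gamma^{T}}{1-\gamma}\). A sketch of the length-normalized comparison (my paraphrase of the idea, not necessarily the paper's exact Eq. 13):

\[
p(\tau^{w} \succ \tau^{l}) = \sigma\!\left(
\frac{\sum_{t=0}^{T_w-1} \gamma^{t}\, r(s_t^{w}, a_t^{w})}{\sum_{t=0}^{T_w-1} \gamma^{t}}
\;-\;
\frac{\sum_{t=0}^{T_l-1} \gamma^{t}\, r(s_t^{l}, a_t^{l})}{\sum_{t=0}^{T_l-1} \gamma^{t}}
\right)
\]

Since every \(r(s, a)\) carries the same additive \(\beta \log Z\), each normalized sum contributes exactly one copy of \(\beta \log Z\), and the two copies cancel in the difference.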

Step 3: The Final DMPO Loss

We now have all the ingredients:

  1. Reward Function: Defined via the ratio of the optimal SAOM to the reference SAOM (Eq 11).
  2. Preference Model: A length-normalized Bradley-Terry model (Eq 13).

We substitute the Reward Function into the Preference Model. We want to maximize the likelihood of the preferred trajectory.

This results in the intermediate loss function involving occupancy measures:

Equation showing the intermediate DMPO loss using occupancy measures.

This equation is theoretically sound but practically difficult. We don’t easily know the full occupancy measure \(d^\pi(s,a)\) because it depends on the environment’s transition dynamics (\(P(s'|s,a)\)), which are often unknown or complex (like the dynamics of a web browser).

The Cancellation Trick: The occupancy measure \(d^\pi\) is composed of the policy \(\pi\) and the environment transitions \(P\). Fortunately, we are looking at the ratio between the trained model and the reference model: \(\frac{d^{\pi_\theta}}{d^{\pi_{ref}}}\).

Since the environment dynamics (\(P\)) are the same regardless of which policy is acting, the transition probabilities cancel out!

Equation showing how SAOM decomposes into policy and transition probabilities.
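
The same cancellation is easiest to see at the level of whole-trajectory probabilities (the paper performs it on the occupancy-measure ratio, but the mechanism is identical):

\[
\frac{P_{\pi_\theta}(\tau)}{P_{\pi_{ref}}(\tau)}
= \frac{p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)}
       {p(s_0) \prod_{t} \pi_{ref}(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)}
= \prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{ref}(a_t \mid s_t)}
\]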

When we plug this decomposition back into the loss function, the unknown transition probabilities vanish, leaving us with only the policy terms \(\pi_\theta\) and \(\pi_{ref}\), which we can compute.

This yields the final DMPO Loss Function:

The final DMPO loss equation.

Interpreting the Formula

Let’s look closely at the final equation (Eq. 16 in the paper, shown above). It resembles the standard DPO loss, but with a critical addition: the term \(\phi(t, T)\).

  • Log-Ratio: We represent the “reward” as the log-ratio of our policy to the reference policy. If our model is more likely to take the winning action than the reference model, the reward is high.
  • Discount Function \(\phi(t, T)\): This is the new component. It reweights the importance of each step \(t\) in a trajectory of length \(T\): \[ \phi(t,T) = \gamma^t \cdot \frac{1 - \gamma^{T-t}}{1 - \gamma^T} \] Rather than averaging all steps equally, DMPO weights each step according to the discount factor \(\gamma\) (a code sketch of the full loss appears after the list of properties below).

Interesting properties:

  1. Early Step Bias: The gradient analysis shows that DMPO assigns higher weights to state-action pairs in the early steps of a trajectory. This aligns with intuition: if you make a mistake at step 1, the rest of the trajectory is doomed. Correcting early actions is crucial.
  2. DPO Compatibility: If you set \(\gamma \to 0\) (essentially ignoring the future), the loss degenerates back into the single-turn DPO loss.
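
Putting the pieces together, here is a minimal sketch of what a DMPO-style loss could look like in code. This is my reading of the formula described above, not the authors' reference implementation; `logp_policy_*` and `logp_ref_*` are assumed to be per-step log-probabilities of the chosen actions under the trained policy and the frozen reference model, and `beta` plays the same role as in DPO.

```python
import torch
import torch.nn.functional as F

def phi(t: torch.Tensor, T: int, gamma: float) -> torch.Tensor:
    """Per-step weight phi(t, T) = gamma^t * (1 - gamma^(T - t)) / (1 - gamma^T)."""
    return gamma ** t * (1 - gamma ** (T - t)) / (1 - gamma ** T)

def weighted_log_ratio(logp_policy: torch.Tensor, logp_ref: torch.Tensor, gamma: float) -> torch.Tensor:
    """Sum of phi-weighted policy/reference log-ratios over one trajectory."""
    T = logp_policy.shape[0]
    t = torch.arange(T, dtype=logp_policy.dtype)
    return (phi(t, T, gamma) * (logp_policy - logp_ref)).sum()

def dmpo_loss(logp_policy_win, logp_ref_win, logp_policy_lose, logp_ref_lose,
              beta: float = 0.1, gamma: float = 0.99) -> torch.Tensor:
    """DPO-style logistic loss on phi-weighted trajectory log-ratios."""
    margin = beta * (weighted_log_ratio(logp_policy_win, logp_ref_win, gamma)
                     - weighted_log_ratio(logp_policy_lose, logp_ref_lose, gamma))
    return -F.logsigmoid(margin)

# Toy usage: a 3-step winning trajectory vs. a 5-step losing one.
win_pol, win_ref = torch.randn(3), torch.randn(3)
lose_pol, lose_ref = torch.randn(5), torch.randn(5)
print(dmpo_loss(win_pol, win_ref, lose_pol, lose_ref))
```

Note that \(\phi(0, T) = 1\) for any \(\gamma\), while \(\phi(t, T) \to 0\) for \(t \geq 1\) as \(\gamma \to 0\), so in that limit only the first step carries weight, consistent with the DPO-compatibility property above.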

Experiments & Results

Does this mathematical rigor translate to better agents? The authors tested DMPO on three complex benchmarks: WebShop (e-commerce), ScienceWorld (scientific experiments), and ALFWorld (household tasks).

They compared DMPO against standard DPO and other baselines in two settings: Noisy and Clean.

The “Noisy” Setting (Robustness Test)

In the real world, “losing” trajectories aren’t always terrible; they might just be slightly suboptimal or noisy. To simulate this, the authors generated “lose” trajectories using a fine-tuned model that occasionally hallucinated or repeated actions.

Table 2: Comparison of DPO and DMPO results in the Noisy setting.

Table 2 shows the results.

  • DMPO Wins: Across almost all datasets (WebShop, ScienceWorld, ALFWorld), DMPO outperforms standard DPO.
  • Why? DPO treats every step equally and struggles with the variable lengths of noisy trajectories. DMPO’s length normalization and discounting allow it to filter out the noise effectively. With Mistral-7B the margins are narrower but still favor DMPO, e.g., on ScienceWorld (0.700 vs. 0.708) and ALFWorld (0.883 vs. 0.888).

The “Clean” Setting (Performance Ceiling)

Here, the “lose” trajectories are high-quality filtered trajectories (hard negatives). This allows us to see the maximum potential of the method.

Table 3: Comparison of DMPO against baselines such as SFT, PPO, and ETO in the Clean setting.

Table 3 compares DMPO against a wide range of baselines, including PPO (standard RL), SFT (Behavioral Cloning), and ETO (another trajectory optimization method).

  • SOTA Performance: DMPO achieves the highest scores. On ScienceWorld, it reaches 0.724, beating PPO (0.594) by a massive margin. This confirms that directly optimizing preferences is more stable than PPO, and doing it with the correct math (DMPO) is better than single-turn approximations.
  • Outperforming GPT-4: On WebShop, the Llama-2-7B model trained with DMPO (0.701) significantly outperforms the base GPT-4 (0.632).

Ablation: The Impact of Trajectory Length

One of the key claims of DMPO is that it handles length discrepancies better. The authors verified this by grouping trajectories by length.

Figure 4: Bar chart showing the effect of trajectory length on DPO vs. DMPO performance.

Figure 4 is very telling.

  • DPO (Blue bars): As the trajectory length increases (from 50 to 200), DPO’s performance drops significantly. The partition function error accumulates, and the length bias hurts the model.
  • DMPO (Orange bars): Performance remains stable even as trajectories get longer. This empirically validates the theoretical inclusion of length normalization.

Ablation: The Discount Factor \(\gamma\)

The hyperparameter \(\gamma\) controls how much the model cares about future rewards versus immediate actions.

Figure 3: Line charts showing relative performance vs. \(\gamma\) for the noisy and clean settings.

Figure 3 reveals an interesting dynamic:

  • In Noisy settings (Red line), a smaller \(\gamma\) is often better. This effectively tells the model: “Focus on the immediate next step, the distant future is too noisy to trust.”
  • In Clean settings (Orange line), a larger \(\gamma\) works better. Since even the “lose” trajectories are high quality here, the model can trust the long-horizon signal and learn strategic planning.

Conclusion

Adapting Large Language Models to act as agents is one of the most important frontiers in AI. While we have mastered the art of the single-turn chat, multi-step reasoning requires different tools.

The paper “Direct Multi-Turn Preference Optimization for Language Agents” identifies a subtle but critical flaw in applying standard DPO to agent tasks: the partition function does not cancel out when trajectories vary in states and lengths.

By shifting the optimization constraint from the policy to the State-Action Occupancy Measure, DMPO makes the partition function a global constant. By adding length normalization, it accounts for the fact that a 5-step success is different from a 20-step failure.

The result is a loss function that is theoretically sound for MDPs and practically superior in experiments. For students and practitioners building the next generation of agents, DMPO offers a robust alternative to PPO and standard DPO, enabling agents that can plan longer, stay on track better, and handle the noise of the real world.

Key Takeaways:

  • Don’t just copy-paste: Algorithms like DPO designed for bandits (single-step) don’t automatically work for MDPs (multi-step).
  • Math matters: The derivation of DMPO shows that rigorous mathematical grounding (fixing the \(Z\) term) leads to tangible performance gains.
  • Normalization is key: Handling trajectory length differences is essential when learning from preferences in agent tasks.