Introduction
The rapid evolution of Large Language Models (LLMs) has brought us closer to artificial general intelligence, but raw intelligence isn’t enough. We need models that are aligned—helpful, harmless, and honest. Traditionally, achieving this alignment has been a resource-heavy endeavor. It typically involves Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF). While effective, this pipeline is expensive, computationally demanding, and reliant on vast amounts of human-annotated preference data.
But what if a model could align itself without updating its weights? What if, instead of expensive training, we could optimize the instructions we give the model so effectively that it outperforms fine-tuned versions?
This is the premise behind DRPO (Dynamic Rewarding with Prompt Optimization). In a recent paper, researchers introduced a tuning-free approach that treats alignment not as a training problem, but as a search and optimization problem. By leveraging the model’s own capabilities to critique and improve itself, DRPO enables base models to surpass their RLHF-tuned counterparts without a single gradient update.

As illustrated in Figure 1, DRPO occupies a sweet spot. It avoids the high training costs of standard alignment (RLHF) and the performance ceiling of static prompting methods (tuning-free), offering a low-cost, high-performance alternative.
In this post, we will deconstruct how DRPO works, the mathematics behind its optimization framework, and why “Dynamic Rewarding” is a game-changer for self-alignment.
Background: The Alignment Landscape
To appreciate DRPO, we first need to understand the current bottlenecks in LLM alignment.
Standard Alignment (SFT & RLHF)
The industry standard involves training a reward model on human preferences and then using reinforcement learning to steer the LLM to maximize that reward. The cost of this pipeline is high: it requires curating preference datasets, managing unstable training dynamics, and spending massive compute resources.
Self-Alignment
Researchers have explored “self-alignment,” where the model generates its own training data or critiques. While this reduces the need for human annotation, it usually still ends in a fine-tuning step: the model is still being trained, just on synthetic data.
Tuning-Free Alignment
This is where DRPO fits in. Tuning-free methods attempt to align a model purely through inference-time strategies, usually via clever prompting or decoding-time intervention. Until now, these methods have been static: a human writes a “system prompt” (e.g., “You are a helpful assistant…”) and hopes for the best. The limitation is obvious: a fixed prompt cannot account for the model’s specific blind spots or the nuances of every possible user query.
The Core Method: Dynamic Rewarding with Prompt Optimization
The researchers behind DRPO formulated alignment as a search-based prompt optimization problem. Instead of manually guessing the best prompt, they built an automated system that iteratively refines the prompt until the model’s behavior is optimized.
The framework consists of two main stages:
- Optimization of In-Context Learning (ICL) Examples: Finding the perfect examples to show the model.
- Optimization of the System Prompt: Crafting the perfect instructions.
Let’s break down the architecture and the math.
Problem Formulation
The goal is to generate an aligned response \(y\) given an input \(x\). The response is conditioned on a system prompt \(\mathcal{P}\) and a set of In-Context Learning examples \(\mathcal{I}_K\).
\[ y \sim \mathcal{B}(y \mid \mathcal{P}, \mathcal{I}_K, x) \]
Here, \(\mathcal{B}\) represents the Base LLM. The objective of DRPO is to find the optimal prompt \(\mathcal{P}^*\) and the optimal subset of examples \(\mathcal{I}_K^*\) that maximize the expected alignment score across a distribution of inputs.
\[ (\mathcal{P}^*, \mathcal{I}_K^*) = \arg\max_{\mathcal{P},\, \mathcal{I}_K} \; \mathbb{E}_{x}\big[ \mathcal{R}\big( \mathcal{B}(\mathcal{P}, \mathcal{I}_K, x) \big) \big] \]

where \(\mathcal{R}(\cdot)\) is the alignment score assigned to the model’s response.
This equation essentially says: “Find the specific instructions and examples that, on average, produce the best possible responses from the model.”
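In code, that objective is just a Monte Carlo estimate of the expected reward over sampled queries. Here is a minimal sketch, where `base_llm` and `alignment_reward` are hypothetical stubs standing in for the base model \(\mathcal{B}\) and the reward \(\mathcal{R}\), not real APIs:

```python
def base_llm(system_prompt: str, icl_examples: list[str], query: str) -> str:
    """Stub for the base model B conditioned on (P, I_K, x)."""
    return f"[response to {query!r}]"

def alignment_reward(query: str, response: str) -> float:
    """Stub for the alignment score R(y); DRPO computes it via dynamic rewarding."""
    return 4.0

def expected_score(system_prompt: str, icl_examples: list[str],
                   queries: list[str]) -> float:
    """Estimate E_x[ R(B(P, I_K, x)) ] by averaging over a sample of queries.
    DRPO searches for the (P, I_K) pair that maximizes this quantity."""
    scores = [alignment_reward(q, base_llm(system_prompt, icl_examples, q))
              for q in queries]
    return sum(scores) / len(scores)
```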
The Framework: LLMs as Optimizers
DRPO treats the prompt generation process as a Markov Decision Process (MDP).
- State (\(S\)): The current prompt or ICL example.
- Action (\(A\)): The modification applied to the prompt (e.g., rewriting a sentence, adding a constraint).
- Reward (\(R\)): A score indicating how well the current prompt performs.
The researchers use a search algorithm—specifically Beam Search—to navigate this space. They start with a basic prompt, generate variations, evaluate them, and keep the best ones for the next round of refinement.

As shown in Figure 3, the process is cyclical. The system generates a response, critiques it, calculates a reward, and then updates the prompt (“Next State”) to address specific failures found in the critique.
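To make the search loop concrete, here is a minimal beam-search sketch in Python. The `propose_edits` and `score_prompt` functions are hypothetical stubs standing in for the Optimizer and Evaluator LLM calls; the real DRPO implementation is more involved, but the control flow is the same idea: expand, score, prune, repeat.

```python
import random

def propose_edits(prompt: str, n: int) -> list[str]:
    """Stub for the Optimizer LLM: return n candidate rewrites of `prompt`.
    In DRPO this would be an LLM call that edits the prompt to address
    weaknesses surfaced by the critique."""
    return [f"{prompt} [revision {random.randint(0, 999)}]" for _ in range(n)]

def score_prompt(prompt: str) -> float:
    """Stub for the Evaluator LLM: return a dynamic-reward score for `prompt`."""
    return random.random()

def beam_search(initial_prompt: str, beam_width: int = 3,
                expansions: int = 4, depth: int = 5) -> str:
    """Keep the `beam_width` best prompt states at each depth, expand each
    with `expansions` candidate edits, and return the best prompt found."""
    beam = [(score_prompt(initial_prompt), initial_prompt)]
    for _ in range(depth):
        candidates = list(beam)
        for _, prompt in beam:
            for edited in propose_edits(prompt, expansions):
                candidates.append((score_prompt(edited), edited))
        # Prune: keep only the top `beam_width` states for the next round.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beam, key=lambda c: c[0])[1]

if __name__ == "__main__":
    print(beam_search("You are a helpful assistant."))
```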
The Innovation: Dynamic Rewarding
The most critical contribution of this paper is Dynamic Rewarding.
In traditional RLHF, the reward model is often a “black box” trained to predict a single scalar score representing human preference. However, “alignment” is not a single dimension. A good response to a medical question requires accuracy and safety. A good response to a creative writing prompt requires creativity and engagement. A static reward function often fails to capture these context-dependent nuances.
DRPO solves this by allowing the model to decide which metrics matter for a specific query.

In this equation:
- \(\mathbb{R}_q\) is the set of relevant reward criteria selected dynamically for the query \(q\).
- \(r(\sigma)\) is the score for a specific criterion (e.g., “Helpfulness: 5/5”).
If the user asks for the current weather, the dynamic reward mechanism might select “Factuality” and “Limitations” (checking if the model admits it doesn’t have internet access) as the key metrics. If the user asks for a joke, it selects “Creativity” and “Humor.”
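A rough sketch of how dynamic rewarding could look in code, assuming a hypothetical `llm_judge` that returns 1–5 scores. The keyword-based criterion selection and the mean aggregation are placeholders for what DRPO does with an LLM evaluator, not the paper’s exact formulation:

```python
REWARD_POOL = [
    "helpfulness", "factuality", "safety", "clarity",
    "depth", "creativity", "humor", "limitations",
]

def select_criteria(query: str) -> list[str]:
    """Stub for dynamic reward selection: in DRPO an LLM picks the criteria
    that matter for this specific query. Keyword matching is only a placeholder."""
    if "weather" in query.lower():
        return ["factuality", "limitations"]
    if "joke" in query.lower():
        return ["creativity", "humor"]
    return ["helpfulness", "clarity", "safety"]

def llm_judge(response: str, criterion: str) -> float:
    """Hypothetical evaluator call: score `response` on `criterion` (1-5).
    A real system would prompt an LLM judge; we return a fixed score."""
    return 4.0

def dynamic_reward(query: str, response: str) -> float:
    """Score the response only on criteria relevant to this query,
    then aggregate (mean used here purely for illustration)."""
    criteria = select_criteria(query)
    scores = [llm_judge(response, c) for c in criteria]
    return sum(scores) / len(scores)

print(dynamic_reward("Tell me a joke about compilers.", "..."))
```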
Step 1: Optimizing In-Context Learning (ICL) Examples
The first phase of DRPO focuses on the “examples” provided to the model. In-context learning is powerful; showing the model a pair of (Query, Good Response) guides it significantly.
The algorithm takes a base set of examples and optimizes the response part of each example.
- State (\(s_t\)): The current draft of the response in the example.
- Dynamic Reward (\(r_t\)): The evaluator checks the response against dynamically selected criteria.
- Action/Transition (\(a_t\)): An “Optimizer LLM” (e.g., GPT-4) rewrites the response to improve the score.

The state is then updated based on this feedback:

This results in a “Universal Set” of highly optimized ICL examples (\(\mathcal{I}^*\)) that demonstrate perfect behavior.
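Here is a compressed sketch of the per-example refinement loop described above, with hypothetical `critique_and_score` and `rewrite_response` helpers standing in for the Evaluator and Optimizer LLMs (and greedy improvement shown instead of full beam search, for brevity):

```python
def critique_and_score(query: str, response: str) -> tuple[str, float]:
    """Hypothetical evaluator: return (critique text, dynamic-reward score)."""
    return ("The answer should admit it cannot browse the internet.", 3.5)

def rewrite_response(query: str, response: str, critique: str) -> str:
    """Hypothetical optimizer LLM: rewrite the response to address the critique."""
    return response + " (Note: I don't have real-time internet access.)"

def optimize_icl_example(query: str, draft: str, steps: int = 3) -> str:
    """State = current response draft; each step critiques the draft,
    computes a reward, and asks the optimizer to improve it.
    The best-scoring draft seen so far is kept."""
    best_draft, best_score = draft, float("-inf")
    state = draft
    for _ in range(steps):
        critique, score = critique_and_score(query, state)
        if score > best_score:
            best_draft, best_score = state, score
        state = rewrite_response(query, state, critique)
    return best_draft

example = optimize_icl_example(
    "What's the weather in Paris right now?",
    "It is sunny and 22°C in Paris.",
)
print(example)
```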
Step 2: Optimizing the System Prompt
Once the examples are optimized, DRPO fixes them and moves on to the system prompt—the high-level instructions (e.g., “You are an ethical assistant…”).
This is harder because a system prompt must be generalizable. It cannot just be good for one query; it must work for any query.
To solve this, the algorithm:
- Samples a batch of training queries (\(x_t\)).
- Retrieves the optimized ICL examples (\(\mathcal{I}_K^*\)) relevant to those queries.
- Generates a response using the current system prompt (\(s_t\)).
- Evaluates the response.
- Updates the System Prompt to address weaknesses observed in the responses.

If the model consistently fails to refuse harmful queries, the Optimizer LLM injects a specific instruction into the system prompt: “If the user asks for illegal acts, politely decline.” If the model is too robotic, it adds: “Use a conversational tone.”
This iterative process builds a highly detailed, model-specific system prompt that acts as a “patch” for the model’s inherent weaknesses.
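The sketch below mirrors that outer loop, with `generate`, `evaluate`, and `revise_system_prompt` as hypothetical stubs for the base, evaluator, and optimizer models (greedy updates shown for brevity; DRPO uses beam search here as well):

```python
import statistics

TRAIN_QUERIES = [
    "How do I pick a lock?",                # safety-sensitive
    "Summarize the theory of relativity.",  # factuality/depth
    "Write a haiku about autumn.",          # creativity
]

def generate(system_prompt: str, icl_examples: list[str], query: str) -> str:
    """Stub for the base LLM conditioned on the system prompt and ICL examples."""
    return f"[response to '{query}' under prompt of length {len(system_prompt)}]"

def evaluate(query: str, response: str) -> tuple[float, str]:
    """Stub evaluator: return (dynamic-reward score, critique)."""
    return 3.0, "Refusals of harmful requests should be more explicit."

def revise_system_prompt(system_prompt: str, critiques: list[str]) -> str:
    """Stub optimizer: patch the prompt to address recurring failures."""
    return system_prompt + " If the user asks for illegal acts, politely decline."

def optimize_system_prompt(initial_prompt: str, icl: list[str], rounds: int = 3) -> str:
    prompt = initial_prompt
    for _ in range(rounds):
        scores, critiques = [], []
        for q in TRAIN_QUERIES:                 # 1. sample a batch of queries
            resp = generate(prompt, icl, q)     # 2-3. retrieve ICL examples, generate
            score, crit = evaluate(q, resp)     # 4. evaluate with dynamic rewards
            scores.append(score)
            critiques.append(crit)
        print(f"mean reward: {statistics.mean(scores):.2f}")
        prompt = revise_system_prompt(prompt, critiques)  # 5. update the prompt
    return prompt

print(optimize_system_prompt("You are a helpful, honest assistant.", icl=[]))
```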
Experiments and Results
The researchers evaluated DRPO using just-eval-instruct, a comprehensive benchmark covering helpfulness, clarity, factuality, depth, and safety. They tested across 8 models, including open-source (Mistral, Llama 2/3) and closed-source (GPT-3.5, GPT-4) models.
Main Results: Beating the Baselines
The results were striking. DRPO consistently outperformed standard baselines.

As Figure 2 demonstrates, for models like Mistral 7b and Llama 2 70b, the Base model + DRPO outperformed the Official Instruct/Chat versions. This implies that inference-time optimization can be more effective than the heavy fine-tuning (RLHF) performed by the model creators.
Table 1 provides a granular look at the metrics:

Take Mistral 7b as an example:
- Base Model (No alignment): Score 2.10
- URIAL (Previous SOTA tuning-free): Score 3.56
- Mistral 7b Instruct (RLHF tuned): Score 3.66
- Base Model + DRPO: Score 4.06
DRPO essentially “unlocked” alignment capabilities in the base model that surpassed the fine-tuned version. It also significantly improved already-tuned models (e.g., GPT-3.5-Turbo improved from 4.14 to 4.55).
Categorized Performance
Does this improvement hold up across different topics? The researchers analyzed performance across domains like STEM, Reasoning, and Humanities.


The radar charts (Figures 5 and 6) show that the gains are broad. DRPO (in blue) almost completely encompasses the RLHF/SFT performance (in orange).
Ablation Studies: What Matters?
The researchers performed rigorous ablation studies to confirm which components of DRPO drove these results.
1. Is the System Prompt or ICL more important? Table 3 (below) shows that removing the optimized system prompt or the ICL examples drops performance. However, using both yields the best results. Interestingly, removing ICL examples caused a larger drop than removing the system prompt, highlighting the power of showing the model what to do.

2. Does the search algorithm matter? Could we just use Greedy Search (taking the first improvement we find) instead of Beam Search (exploring multiple paths)?

Table 4 confirms that Beam Search is superior. The ability to maintain multiple potential prompt candidates allows the algorithm to escape local optima and find a truly robust prompt.
3. Does Dynamic Rewarding matter? This is the core hypothesis check. They compared Dynamic Rewarding against “Static Rewarding” (using a fixed set of all criteria every time).

Table 5 confirms that dynamic rewarding is better. By focusing only on relevant criteria, the optimizer receives sharper, more actionable feedback. If a query is about coding, critiquing the “empathy” of the response creates noise. Dynamic rewarding filters out that noise.
4. How many examples (K) do we need? Context window space is valuable. How many ICL examples are ideal?

Figure 4 reveals a “less is more” trend. Performance peaked at just 2 examples. Adding more examples actually degraded performance slightly, likely due to context dilution or distraction. This makes DRPO highly efficient at inference time.
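To see why K = 2 matters for efficiency, here is a small sketch of what inference with DRPO’s frozen artifacts might look like: the optimized system prompt plus two optimized examples are simply prepended to the user query. The chat-style template and the prompt text are illustrative assumptions, not the paper’s actual artifacts:

```python
# Illustrative placeholder; a real DRPO prompt is longer and model-specific.
OPTIMIZED_SYSTEM_PROMPT = (
    "You are a helpful, honest assistant. You do not have access to the "
    "internet. Avoid unnecessary repetition. If the user asks for illegal "
    "acts, politely decline."
)

# K = 2 optimized ICL examples (placeholders for the "universal set" entries).
OPTIMIZED_ICL = [
    ("What's the weather like right now?",
     "I don't have real-time internet access, so I can't check current "
     "conditions, but here is how you can find out..."),
    ("Tell me a joke about compilers.",
     "Why did the compiler quit its job? It didn't get arrays."),
]

def build_context(user_query: str) -> str:
    """Prepend the optimized system prompt and the two ICL examples to the
    user's query. This string is all DRPO adds at inference time; the base
    model's weights are untouched."""
    parts = [OPTIMIZED_SYSTEM_PROMPT, ""]
    for q, a in OPTIMIZED_ICL:
        parts += [f"User: {q}", f"Assistant: {a}", ""]
    parts += [f"User: {user_query}", "Assistant:"]
    return "\n".join(parts)

print(build_context("Explain beam search in one paragraph."))
```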
Qualitative Analysis: The Optimized Prompts
What do these “optimized prompts” actually look like? They are fascinating artifacts of the model’s self-reflection.

Looking at Table 8, we see distinct differences based on the model:
- Mistral 7b (Smaller Model): The prompt includes specific instructions like “Avoid unnecessary repetition” and “You do not have access to the internet.” DRPO detected that Mistral tends to repeat itself and hallucinate internet access, so it patched those behaviors in the prompt.
- GPT-3.5-Turbo (Larger Model): The prompt focuses on higher-level goals like “Avoid overly technical jargon” and “Delve into authorial intent.” It assumes basic competency and pushes for stylistic refinement.
Conclusion
The “Dynamic Rewarding with Prompt Optimization” (DRPO) paper presents a compelling argument: We don’t always need to train models to align them.
By modeling alignment as an inference-time optimization problem, DRPO allows models to:
- Self-Correct: Using their own reasoning to improve instructions.
- Adapt: Using dynamic rewarding to apply the right criteria to the right queries.
- Outperform: Surpassing traditionally fine-tuned models by simply finding the “magic words” (prompts and examples) that unlock their latent potential.
Why does this matter? For students and researchers, this highlights a shift in the AI development lifecycle. We are moving from a paradigm of Training \(\to\) Deployment to one of Training \(\to\) Optimization \(\to\) Deployment. DRPO proves that a significant portion of “alignment” is latent within the base model, waiting to be unlocked by the right context.
While DRPO introduces a one-time computational cost to search for these prompts (see the cost equations in the appendices below), the resulting inference is efficient (requiring only ~2 examples) and the prompts are reusable.

As LLMs continue to grow in size and training costs skyrocket, tuning-free methods like DRPO offer a democratized, accessible path to safer and more helpful AI.