Introduction
If you have spent any time working with Large Language Models (LLMs), you have likely encountered the frustration of “prompt brittleness.” You spend hours crafting the perfect instruction, only to find that changing a single adjective or the order of examples drastically changes the output. This sensitivity is often seen as a bug, forcing engineers to hunt for a single “magic prompt” that solves a specific task.
But what if we stopped trying to find the one perfect prompt? What if the sensitivity of LLMs to different instructions is actually a feature we can exploit?
This is the core question behind the paper “Improving Minimum Bayes Risk Decoding with Multi-Prompt”. The researchers propose a method that moves away from relying on a single “best” prompt. Instead, they embrace the diversity of possible instructions. By generating outputs from many different prompts and using a statistical consensus method called Minimum Bayes Risk (MBR) decoding, they achieve state-of-the-art results across code generation, text simplification, and machine translation.
In this post, we will break down why standard decoding methods fail to capture the full potential of LLMs, how Multi-Prompt MBR works under the hood, and why “prompt ensembling” might be the future of robust text generation.
Background: The Problem with Being “Most Likely”
To understand why this new method is necessary, we first need to look at how LLMs typically generate text.
Maximum Likelihood vs. Quality
When you ask an LLM a question, standard decoding strategies (like Greedy Search or Beam Search) try to find the sequence of words that has the highest probability. The assumption is simple: High Probability = High Quality.
However, research has repeatedly shown that this assumption is flawed. The “most likely” sequence is often generic, repetitive, or short. It plays it safe. Conversely, human-like, high-quality text often contains surprising (lower probability) words.
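To make the contrast concrete, here is a minimal sketch using Hugging Face `transformers` (with `gpt2` standing in for any causal LLM): greedy decoding chases the single most probable continuation, while temperature sampling trades some probability mass for variety. The model name and generation settings are illustrative choices, not taken from the paper.

```python
# Sketch: mode-seeking decoding vs. temperature sampling with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Simplify this sentence:", return_tensors="pt")

# Greedy search: always pick the single most probable next token.
greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Temperature sampling: draw from the distribution, accepting lower-probability words.
sampled = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                         temperature=0.8, num_return_sequences=4)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
for seq in sampled:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```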
Enter Minimum Bayes Risk (MBR) Decoding
To fix the disconnect between probability and quality, researchers use Minimum Bayes Risk (MBR) decoding.
Instead of asking the model for the one most likely sentence, MBR works like a democratic process:
- Sample: The model generates a large list of candidate sentences (hypotheses).
- Compare: Every candidate is compared against every other candidate using a specific utility metric (like BERTScore or COMET).
- Select: The candidate that is most “similar” to all the others—the one that minimizes the risk of being wrong—is selected as the winner.
Mathematically, MBR selects the output \(\hat{y}\) that maximizes the expected utility against the distribution of candidates:
\[
\hat{y} = \arg\max_{y \in \mathcal{H}} \; U(y, \mathcal{R})
\]
Here, \(\mathcal{H}\) is the set of hypotheses (candidates), and \(U(y, \mathcal{R})\) represents the utility function comparing a candidate \(y\) against the reference set \(\mathcal{R}\) (which is usually the hypothesis set itself).
In simpler terms: Standard decoding looks for the mode of the probability distribution. MBR looks for the consensus of the semantic distribution.
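Here is a minimal sketch of that consensus step, assuming a generic pairwise `utility(candidate, reference)` function (for example, a wrapper around BERTScore or COMET). As described above, the hypothesis set doubles as the pseudo-reference set.

```python
# Sketch: MBR selection over a candidate set with a generic pairwise utility metric.
from typing import Callable, List

def mbr_select(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest average utility against all candidates."""
    best, best_score = None, float("-inf")
    for y in candidates:
        # Expected utility of y, using the candidate set itself as pseudo-references.
        score = sum(utility(y, r) for r in candidates) / len(candidates)
        if score > best_score:
            best, best_score = y, score
    return best
```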
The Limitation of Single-Prompt MBR
Standard MBR is powerful, but it has a bottleneck: the Candidate Set. Typically, to get a diverse set of candidates from a single prompt, you have to increase the sampling “temperature” (randomness).
However, high temperature introduces noise. As you turn up the heat, the model starts making mistakes or hallucinating just to be “diverse.” You are trading quality for variety. This is where the authors’ contribution changes the game.
The Core Method: Multi-Prompt MBR
The researchers propose Multi-Prompt MBR. Instead of forcing diversity through high-temperature randomness on a single prompt, they generate diversity by asking the model to solve the task using many different prompts.
The Intuition
The intuition here is fascinating: diverse prompts guide the model toward different “modes” (regions) of the output space.
- Prompt A might encourage the model to be concise.
- Prompt B might encourage the model to use formal language.
- Prompt C might focus on structural simplification.
Each prompt produces a valid, high-quality distribution of answers. By combining them, you get a “super-distribution” that covers more ground than any single prompt could.

As shown in Figure 2, the process is straightforward:
- Prompt Bank: Create a collection of different prompts for the same task.
- Generate: Feed all prompts into the LLM to generate a massive set of candidates (\(\{y_1, y_2, ..., y_n\}\)).
- Score: Use a utility metric (like COMET or LENS) to compute a similarity matrix between all candidates.
- Rank: Select the candidate with the highest average similarity score.
This results in a hypothesis set that is the union of all outputs from the individual prompts:
\[
\mathcal{H} = \bigcup_{i=1}^{k} \mathcal{H}_{\rho_i}
\]
where \(\mathcal{H}_{\rho_i}\) is the set of candidates generated from prompt \(\rho_i\).
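A compact sketch of the whole pipeline might look like the following, assuming hypothetical helpers `generate(prompt, input_text, n)` (returns `n` sampled outputs from the LLM for one prompt) and a pairwise `utility` metric; neither helper comes from the paper's code.

```python
# Sketch: Multi-Prompt MBR — pool candidates from many prompts, then pick the consensus.
from typing import Callable, List

def multi_prompt_mbr(prompt_bank: List[str],
                     input_text: str,
                     generate: Callable[[str, str, int], List[str]],
                     utility: Callable[[str, str], float],
                     samples_per_prompt: int = 5) -> str:
    # 1. Generate: pool candidates from every prompt (the union of per-prompt sets).
    candidates: List[str] = []
    for prompt in prompt_bank:
        candidates.extend(generate(prompt, input_text, samples_per_prompt))

    # 2. Score: average utility of each candidate against the whole pool.
    def expected_utility(y: str) -> float:
        return sum(utility(y, r) for r in candidates) / len(candidates)

    # 3. Rank: the consensus candidate wins.
    return max(candidates, key=expected_utility)
```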
Why “Prompt Diversity” beats “Sampling Noise”
The authors demonstrate that varying the prompt is a much safer way to get diverse answers than just increasing the randomness.
Take a look at Figure 3 below.

In Graph (a), look at the difference between the top row (Single Prompt) and the bottom row (Multi-Prompt).
- Single Prompt (Top): As you increase temperature (\(\tau\)), the candidates spread out (diversity), but many drift into low-quality territory (yellow/green dots).
- Multi-Prompt (Bottom): Even at temperature \(\tau=0\), with no sampling randomness at all, using different prompts naturally finds distinct clusters of high-quality outputs.
Graph (c) is particularly telling. It shows that while individual prompts vary in quality (some are great, some are mediocre), the Multi-Prompt ensemble (the blue bar) outperforms even the best single prompt.
This confirms the hypothesis: The consensus of diverse perspectives is smarter than the single smartest perspective.
Constructing the Prompt Bank
You might be wondering: “Where do I get 100 different prompts for the same task?” The authors use a semi-automated approach.
- Seed Prompts: Humans write a small set of instructions (e.g., 10 prompts).
- Paraphrasing: Use a strong model (like GPT-4) to rephrase these instructions into many variations, as sketched below.
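A minimal sketch of this semi-automated construction, assuming a hypothetical `paraphrase(instruction, n)` helper backed by a strong model such as GPT-4:

```python
# Sketch: expand a handful of human-written seed prompts into a larger prompt bank.
from typing import Callable, List

def build_prompt_bank(seed_prompts: List[str],
                      paraphrase: Callable[[str, int], List[str]],
                      variations_per_seed: int = 10) -> List[str]:
    """Grow a small set of seed instructions into many paraphrased variants."""
    bank = list(seed_prompts)
    for seed in seed_prompts:
        bank.extend(paraphrase(seed, variations_per_seed))
    # Deduplicate while preserving order.
    return list(dict.fromkeys(bank))
```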
Prompt Selection Strategies
Once you have a bank of prompts, you don’t necessarily use all of them blindly. The paper investigates two main ways to pick which prompts to use at inference time:
- Prompt Selection (Heuristics): Picking a fixed subset based on properties like semantic distance (to ensure diversity) or accuracy on a test set.
- Prompt Sampling (Probabilistic): Learning a probability distribution over the prompts based on how often they produce the winning candidate on a training set.
The authors propose a Top-p Prompt Sampling method. They calculate the probability \(p(\rho)\) of a prompt being useful and truncate the distribution to remove “bad” prompts that never help:
\[
p'(\rho) \propto
\begin{cases}
p(\rho) & \text{if } \rho \in \mathcal{P}_{\text{top-}p} \\
0 & \text{otherwise}
\end{cases}
\]
where \(\mathcal{P}_{\text{top-}p}\) is the smallest set of prompts whose cumulative probability reaches \(p\).
This ensures that the system focuses its compute resources on prompts that are statistically likely to yield high-quality candidates.
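Here is a sketch of what Top-p Prompt Sampling could look like in practice, assuming \(p(\rho)\) has already been estimated (for example, from how often each prompt produced the MBR winner on held-out data); the paper's exact estimation and truncation details may differ.

```python
# Sketch: top-p truncation over an estimated prompt distribution.
import random
from typing import Dict

def top_p_prompt_sampling(prompt_probs: Dict[str, float], p: float = 0.9) -> str:
    """Sample a prompt from the renormalized top-p subset of the prompt bank."""
    # Keep the highest-probability prompts whose cumulative mass first reaches p.
    ranked = sorted(prompt_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for prompt, prob in ranked:
        kept.append((prompt, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize over the kept prompts; "bad" prompts outside the nucleus are never drawn.
    total = sum(prob for _, prob in kept)
    return random.choices([prompt for prompt, _ in kept],
                          weights=[prob / total for _, prob in kept], k=1)[0]
```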
Experiments & Results
The authors tested this method on three distinct tasks using various open-source LLMs (Llama 2, ALMA, CodeLlama):
- Code Generation (HumanEval)
- Text Simplification (SimpEval)
- Machine Translation (WMT ‘22)
1. Does Multi-Prompt outperform Single-Prompt?
Resoundingly, yes.

Figure 1 shows the performance as the number of candidates increases.
- Blue Line (Multi-Prompt): Consistently higher than the red line.
- Task-dependent gaps: Notice that for Code Generation (Left), the gap remains wide even as candidates are added. For Translation (Right), the gap is narrower, likely because translation models are already very strong, leaving less room for improvement.
2. Does Candidate Diversity Matter?
The authors investigated whether the improvement was truly due to diversity.

Figure 4 plots the number of “Novel Bigrams” (a proxy for diversity) against Temperature.
- Left Graph: Multi-Prompt (Blue) produces significantly more novel bigrams than Single Prompt (Red) at the same temperature.
- Right Graph: This diversity correlates directly with higher LENS scores (quality).
This proves that Multi-Prompt allows us to maintain high diversity at low temperatures, avoiding the quality degradation usually associated with high-temperature sampling.
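As a rough illustration of the diversity proxy, here is one plausible way to count novel bigrams in a candidate pool (measuring novelty against the source text); the paper's exact operationalization may differ.

```python
# Sketch: count distinct bigrams in the candidate pool that do not appear in the source.
from typing import List, Set, Tuple

def bigrams(text: str) -> Set[Tuple[str, str]]:
    tokens = text.split()
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def novel_bigram_count(candidates: List[str], source: str) -> int:
    """A simple diversity proxy: more novel bigrams means a more varied candidate pool."""
    source_bigrams = bigrams(source)
    pool: Set[Tuple[str, str]] = set()
    for candidate in candidates:
        pool |= bigrams(candidate)
    return len(pool - source_bigrams)
```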
3. Impact of Prompt Sampling Strategies
Does it matter how we pick the prompts? The data suggests it does.

Table 1 compares different strategies. The key takeaway is that Top-p Prompt Sampling (the last row in the top section) consistently yields the best results. It beats random selection, proving that not all prompts are created equal—we should favor the ones that historically perform well, while still maintaining enough variety for the MBR consensus to work.
4. Scaling Across Models
One might assume this technique is a crutch for smaller, weaker models. However, the experiments show consistent gains even as model size increases.

Figure 5 shows the improvement (\(\Delta\)) gained by switching to Multi-Prompt.
- Code Generation (Top): Massive gains across all model sizes.
- Cross-over Effect: A fascinating finding (detailed in the paper’s full results) is that smaller models using Multi-Prompt often outperform larger models using Single-Prompt. For example, a 13B model with Multi-Prompt can beat a 70B model with standard decoding. This has huge implications for deployment efficiency—you might be able to use a cheaper model if you decode it intelligently.
Here is a detailed look at the absolute performance across different specific models:

Figure 10 reinforces that the blue line (Multi-Prompt) sits above the single-prompt baseline for almost every model architecture tested, from Llama 2 to specialized CodeLlama models.
5. Efficient Alternatives
The “elephant in the room” with MBR is cost. Comparing every candidate against every other candidate is computationally expensive (\(O(n^2)\) complexity).
The authors explored “Reference-Free Reranking” as a cheaper alternative (\(O(n)\)). In this setup, a separate model scores each candidate individually, rather than comparing them to each other.
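The difference between the two selection rules is easy to see in code. Below is a sketch of reference-free reranking, assuming a generic `quality_model(candidate)` scorer (for example, a reference-free quality-estimation model); each candidate is scored once, so the cost is linear in the number of candidates rather than quadratic.

```python
# Sketch: O(n) reference-free reranking — score each candidate independently.
from typing import Callable, List

def rerank_select(candidates: List[str],
                  quality_model: Callable[[str], float]) -> str:
    """Pick the candidate with the highest standalone quality score; no pairwise comparisons."""
    return max(candidates, key=quality_model)
```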

Figure 6 shows mixed results:
- Translation (Right): A simple Reranker (green line) works almost as well as full MBR.
- Code Generation (Left): Full MBR (blue line) is significantly better.
- Simplification (Middle): A hybrid approach (“Multi-turn MBR”) works best.
This suggests that while Multi-Prompt is powerful, the optimal selection mechanism depends on the specific task.
Comparison to Beam Search and Oracle
To rigorously validate the method, the authors compared Multi-Prompt MBR against a Beam Search baseline and an “Oracle” (the theoretical upper bound if we perfectly selected the best candidate from the set).

Figure 7 is illuminating:
- Beam Search (Black Diamonds): Performs poorly compared to MBR approaches, reinforcing the industry shift away from Beam Search for open-ended generation tasks.
- Oracle (Grey Circles): Notice that for Code Generation (Left) and Translation (Right), the Multi-Prompt method (Blue) is actually tracking closer to the Oracle performance than the Single Prompt method. This indicates that Multi-Prompt isn’t just generating more options; it is generating better options that simply need to be selected.
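For reference, the Oracle baseline is simply candidate selection with access to the gold reference and the true task metric, which no real system has at inference time. A minimal sketch (with hypothetical `metric` and `reference` arguments) looks like this:

```python
# Sketch: the Oracle upper bound — pick the best candidate using the gold reference.
from typing import Callable, List

def oracle_select(candidates: List[str],
                  reference: str,
                  metric: Callable[[str, str], float]) -> str:
    """Upper bound on any selection rule over this candidate set."""
    return max(candidates, key=lambda y: metric(y, reference))
```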
Conclusion & Implications
The research presented in “Improving Minimum Bayes Risk Decoding with Multi-Prompt” offers a compelling argument for moving beyond “Prompt Engineering” toward “Prompt Ensembling.”
Key Takeaways:
- Don’t bet on one prompt: Even the best human-written prompt is unlikely to capture the full distribution of correct answers.
- Diversity is quality: By varying instructions, we access different capabilities of the model, creating a richer pool of candidates.
- MBR is the filter: Multi-Prompt generation builds the haystack; MBR is the magnet that pulls out the needle.
- Punching above weight: This technique allows smaller, open-source models to rival the performance of much larger models (and even closed-source giants like GPT-4 in some metrics).
As LLMs continue to integrate into critical workflows, techniques like Multi-Prompt MBR will be essential for trading a bit of inference time for significantly higher reliability and quality. Instead of searching for the perfect magic words to control the AI, we should perhaps just ask it to solve the problem in many different ways—and trust the consensus.