Recent advances in Generative AI, particularly Large Language Models (LLMs), have cemented “learning from preferences” (like Reinforcement Learning from Human Feedback, or RLHF) as a critical step in model training. We know empirically that telling a model “Response A is better than Response B” often yields better results than simply showing it “Response A” as a good example.
But from a statistical perspective, why is this the case? Does preference data simply add more information, or does it fundamentally change the mathematical nature of the learning process?
In the research paper “Learning Parametric Distributions from Samples and Preferences,” researchers Marc Jourdan, Gizem Yüce, and Nicolas Flammarion dig into the statistical underpinnings of this phenomenon. They uncover a fascinating gap: while standard sample-based learning converges at a rate of \(O(1/\sqrt{n})\), utilizing deterministic preferences allows for a significantly faster convergence rate of \(O(1/n)\).
In this post, we will break down their framework, explore the difference between noisy and deterministic preferences, and understand the geometry that allows preference-based learning to break the standard speed limits of statistical estimation.
1. The Problem Setup: Samples vs. Preferences
To understand the core contribution, we must first strip away the complexity of neural networks and focus on a fundamental statistical problem: Parametric Estimation.
Imagine you have a probability distribution \(p_{\theta^*}\) governed by an unknown parameter \(\theta^*\) (for example, the mean and variance of a Gaussian distribution). Your goal is to estimate \(\theta^*\).
Two Types of Data
The learner has access to two sources of information:
- Samples: You observe pairs of data points \((X, Y)\), each drawn independently from the distribution \(p_{\theta^*}\).
- Preferences: For each pair, you are given a label \(Z\) indicating which sample is “better,” as judged by a preference function \(\ell_{\theta}\) evaluated at the true parameter \(\theta^*\).
The relationship between the parameter and the preference is crucial. The researchers distinguish between two types of feedback mechanisms.
Deterministic (Hard) Feedback: Here, the preference is strictly determined by the sign of the preference function. If \(X\) is better than \(Y\) according to \(\theta^*\), the label reports this with 100% certainty.
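Schematically, in notation of our own choosing (the paper’s exact formulation may differ), hard feedback labels the pair by a simple sign:

\[
Z \;=\; \operatorname{sign}\bigl(\ell_{\theta^*}(X) - \ell_{\theta^*}(Y)\bigr) \;\in\; \{-1, +1\}.
\]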

Stochastic (Noisy) Feedback: This is the scenario typically modeled in RLHF (using the Bradley-Terry model). The probability of preferring \(X\) over \(Y\) passes through a sigmoid function \(\sigma\). There is noise; you might occasionally prefer the “worse” option.
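In the same illustrative notation, the Bradley–Terry-style feedback replaces the hard sign with a sigmoid, so the label is only probably correct:

\[
\mathbb{P}(Z = +1 \mid X, Y) \;=\; \sigma\bigl(\ell_{\theta^*}(X) - \ell_{\theta^*}(Y)\bigr), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}.
\]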

2. The Baseline: Sample-Only (SO) Estimation
Before adding preferences, let’s establish the baseline. The standard way to estimate parameters is the Sample-Only Maximum Likelihood Estimator (SO MLE). You simply find the parameter \(\theta\) that maximizes the probability of observing your data.
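In symbols, with \(n\) observed pairs \((X_i, Y_i)\) (a standard formulation; the paper’s notation may differ in minor details):

\[
\hat{\theta}_n^{SO} \;\in\; \arg\max_{\theta} \; \sum_{i=1}^{n} \Bigl( \log p_{\theta}(X_i) + \log p_{\theta}(Y_i) \Bigr).
\]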

Standard statistical theory (specifically the properties of M-estimators) tells us that this estimator is asymptotically normal: as the number of samples \(n\) grows, \(\sqrt{n}(\hat{\theta}_n^{SO} - \theta^*)\) converges in distribution to a centered Gaussian with covariance \(V_{SO}\). The error distribution looks like a bell curve centered at zero, and crucially, the error scales at a rate of \(1/\sqrt{n}\).

This \(1/\sqrt{n}\) rate is the usual “speed limit” for sample-based estimators (the classic parametric rate). To halve your error, you need four times as much data.
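Here is a minimal sketch of that speed limit in the simplest setting we can write down (estimating the mean of a unit-variance Gaussian, where the SO MLE is just the sample mean); the setup is illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 1.0

# Sample-only MLE for the mean of a unit-variance Gaussian is the sample
# mean; its typical error shrinks like 1/sqrt(n).
for n in [100, 400, 1600, 6400]:
    errors = [abs(rng.normal(theta_star, 1.0, n).mean() - theta_star)
              for _ in range(200)]
    print(f"n={n:5d}   mean |error| ≈ {np.mean(errors):.4f}")
# Quadrupling n roughly halves the error, exactly the 1/sqrt(n) behaviour.
```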
3. Scenario A: Stochastic Preferences (The Mild Improvement)
What happens when we add preference data \(Z\) to our samples? If the preferences are noisy (stochastic), we use the Stochastic Preferences MLE (SP MLE). This method minimizes the standard negative log-likelihood of the samples plus a classification loss term based on the preferences.
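As a rough sketch of this objective, here is what it looks like for a one-dimensional unit-variance Gaussian with a log-probability reward; the model, reward, and optimizer are our illustrative choices, not the paper’s exact construction:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sp_mle_objective(theta, xs, ys, zs):
    """Sample negative log-likelihood plus a Bradley-Terry (logistic) loss
    on the preference labels, for a unit-variance Gaussian with reward
    r_theta(x) = -0.5 * (x - theta)**2 (illustrative choices)."""
    nll = 0.5 * np.sum((xs - theta) ** 2) + 0.5 * np.sum((ys - theta) ** 2)
    gap = 0.5 * ((ys - theta) ** 2 - (xs - theta) ** 2)    # r(X) - r(Y)
    pref_loss = np.sum(np.log1p(np.exp(-zs * gap)))        # -log sigma(Z * gap)
    return nll + pref_loss

rng = np.random.default_rng(1)
theta_star, n = 1.0, 500
xs, ys = rng.normal(theta_star, 1.0, n), rng.normal(theta_star, 1.0, n)
gap_star = 0.5 * ((ys - theta_star) ** 2 - (xs - theta_star) ** 2)
zs = np.where(rng.random(n) < 1 / (1 + np.exp(-gap_star)), 1, -1)  # noisy labels

theta_sp = minimize_scalar(lambda t: sp_mle_objective(t, xs, ys, zs),
                           bounds=(-5.0, 5.0), method="bounded").x
print(f"SP MLE estimate ≈ {theta_sp:.4f}  (true value {theta_star})")
```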

The researchers found that while this helps, it doesn’t change the rules of the game. The estimator is still asymptotically normal, and the convergence rate remains \(O(1/\sqrt{n})\).

However, it is statistically better. The asymptotic variance (the spread of the error) decreases. As shown below, the variance of the preference-based estimator (\(V_{SP}\)) is “smaller” (in the matrix sense) than the sample-only variance (\(V_{SO}\)).
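Spelled out (a paraphrase of the ordering, not the paper’s exact statement), “smaller in the matrix sense” is the positive semi-definite (Loewner) order:

\[
V_{SP} \;\preceq\; V_{SO}
\quad\Longleftrightarrow\quad
u^{\top} V_{SP}\, u \;\le\; u^{\top} V_{SO}\, u \;\; \text{for every direction } u.
\]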

The Takeaway: With noisy preferences, you get a sharper estimate, but you are still stuck in the “slow” lane of \(1/\sqrt{n}\) convergence.
4. Scenario B: Deterministic Preferences (The Breakthrough)
The most striking contribution of this paper is the analysis of Deterministic Preferences. When preferences are noise-free, they stop acting like soft probabilistic suggestions and start acting like hard constraints.
If you know for a fact that sample \(X_i\) is preferred to \(Y_i\), then the true parameter \(\theta^*\) must lie in the region of space that satisfies this condition.
The Feasible Set
We can define a set \(\mathcal{C}_n\) containing all parameters \(\theta\) that correctly classify every observed preference pair in our dataset. Equivalently, every parameter in \(\mathcal{C}_n\) achieves zero 0–1 loss on the preference labels (perfect classification).
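Schematically, in the same illustrative notation as before, the feasible set collects every parameter that agrees with all \(n\) observed labels:

\[
\mathcal{C}_n \;=\; \bigl\{\, \theta \;:\; \operatorname{sign}\bigl(\ell_{\theta}(X_i) - \ell_{\theta}(Y_i)\bigr) = Z_i \;\; \text{for all } i \le n \,\bigr\}.
\]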

Any parameter outside this set is strictly impossible. The researchers propose the Deterministic Preferences MLE (DP MLE), which optimizes the likelihood subject to these hard constraints.
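Here is a minimal sketch of this constrained estimator, assuming the simplest concrete setting (a one-dimensional unit-variance Gaussian with a log-probability reward, so “\(X\) preferred to \(Y\)” means “\(\theta\) is closer to \(X\) than to \(Y\)” and the feasible set is an interval); these modelling choices are ours for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star, n = 1.0, 500
xs, ys = rng.normal(theta_star, 1.0, n), rng.normal(theta_star, 1.0, n)

# Deterministic labels: X is preferred iff it is closer to theta* (this is
# what a log-probability reward gives for a unit-variance Gaussian).
prefer_x = np.abs(xs - theta_star) <= np.abs(ys - theta_star)

# Each pair cuts the parameter line at the midpoint (x + y) / 2: "theta is
# closer to the preferred point" is a half-line, so C_n is an interval.
mid = 0.5 * (xs + ys)
pref_is_larger = np.where(prefer_x, xs > ys, ys > xs)
lo = np.max(mid[pref_is_larger], initial=-np.inf)    # lower bounds on theta
hi = np.min(mid[~pref_is_larger], initial=np.inf)    # upper bounds on theta

# DP MLE: maximise the sample likelihood subject to theta in [lo, hi].
# For a Gaussian mean, this is the sample mean clipped to the interval.
sample_mean = np.concatenate([xs, ys]).mean()
theta_dp = np.clip(sample_mean, lo, hi)
print(f"feasible set C_n ≈ [{lo:.4f}, {hi:.4f}]   DP MLE ≈ {theta_dp:.4f}")
```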

Why is this faster?
Think about estimating the upper endpoint \(\theta\) of a uniform distribution on \([0, \theta]\) from samples. An estimator built from the sample average (such as twice the empirical mean) converges at \(1/\sqrt{n}\), but the maximum of the observed samples converges to \(\theta\) at a rate of \(1/n\). Hard boundaries provide much more information than averages.
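A quick simulation of this classical example (illustrative, not from the paper) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0
for n in [100, 1000, 10000]:
    samples = rng.uniform(0.0, theta, size=(500, n))
    err_moment = np.abs(2 * samples.mean(axis=1) - theta).mean()  # averaging
    err_max = np.abs(samples.max(axis=1) - theta).mean()          # boundary
    print(f"n={n:6d}   2*mean error ≈ {err_moment:.4f}   max error ≈ {err_max:.5f}")
# The averaging estimator shrinks like 1/sqrt(n); the maximum shrinks like 1/n.
```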
Because deterministic preferences slice the parameter space with hard cuts, the set of feasible parameters \(\mathcal{C}_n\) shrinks rapidly. The researchers prove that any estimator residing within this feasible set (including DP MLE) converges at the accelerated rate of \(O(1/n)\).

This is a fundamental shift. In high-dimensional settings (\(d > 1\)), the rate includes logarithmic factors and dimension dependence, but the dependence on sample size \(n\) remains \(1/n\).

5. The Geometry of Acceleration
To prove this \(1/n\) rate, the paper analyzes the geometry of the constraints.
Every time we get a preference pair \((X, Y)\), it creates a “cut” in the parameter space. The true parameter \(\theta^*\) is on one side of the cut. The distance from the boundary of this cut to \(\theta^*\) tells us how informative that sample is.
The researchers define an “informative sample” set \(\mathcal{G}_1\) and a scaling factor \(V_{\theta^*, u}\) which quantifies how much a specific observation helps restrict the parameter space along a direction \(u\).

The error of the estimator is bounded by the “worst” direction in the feasible set. This effectively behaves like the minimum of a collection of random variables. Since the minimum of \(n\) i.i.d. non-negative random variables with positive density at zero scales as \(1/n\), the error of the estimator does too.
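A back-of-the-envelope version of that last step (a standard argument, not a quote from the paper): if the relevant slack variables \(D_1, \dots, D_n\) are i.i.d., non-negative, and have density at least \(c > 0\) near zero, then for small \(t\)

\[
\mathbb{P}\Bigl(\min_{i \le n} D_i > t\Bigr) \;=\; \prod_{i=1}^{n} \mathbb{P}(D_i > t) \;\le\; (1 - c\,t)^{n} \;\le\; e^{-c\,n\,t},
\]

so the minimum is of order \(1/(c\,n)\) with high probability, which is exactly the \(1/n\) behaviour of the estimation error.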

6. Is O(1/n) the Limit?
Is it possible to go even faster? The paper provides a lower bound using information-theoretic techniques (Assouad’s Lemma and Hellinger distance).

This proves that the \(O(1/n)\) rate is indeed minimax optimal (up to constants and log factors). You generally cannot estimate a parameter faster than \(1/n\) using this type of feedback.
7. Experimental Validation
The theory is compelling, but do the experiments match? The researchers tested these estimators on Gaussian distributions with log-probability rewards.
The results, shown in Figure 1, are visually distinct.
- Blue Line (SO): Sample-Only MLE.
- Light Blue (SP sto): Stochastic Preferences.
- Pink/Magenta (DP): Deterministic Preferences.

In the paper’s log-log plot (Figure 1):
- Slope matters: The slope of the line corresponds to the exponent in the convergence rate.
- The SO (Sample-Only) and SP (Stochastic) lines have a slope of roughly -0.5, corresponding to \(n^{-0.5}\) or \(1/\sqrt{n}\).
- The DP (Deterministic) line is much steeper, with a slope closer to -1.0, corresponding to \(n^{-1}\) or \(1/n\).
This empirical evidence perfectly aligns with the theoretical derivation: deterministic constraints radically accelerate learning.
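To read rates off such a plot yourself, fit a straight line to the log-log data. The snippet below does this on synthetic error curves that decay at the theoretical rates (stand-ins for Figure 1, not the paper’s actual numbers):

```python
import numpy as np

rng = np.random.default_rng(4)
n = np.array([100, 316, 1000, 3162, 10000], dtype=float)

# Synthetic stand-ins: errors decaying at the theoretical rates, plus noise.
err_so = n ** -0.5 * np.exp(rng.normal(0, 0.05, n.size))   # sample-only
err_dp = n ** -1.0 * np.exp(rng.normal(0, 0.05, n.size))   # deterministic prefs

# In a log-log plot, error ~ n^s shows up as a straight line of slope s.
slope_so = np.polyfit(np.log(n), np.log(err_so), 1)[0]
slope_dp = np.polyfit(np.log(n), np.log(err_dp), 1)[0]
print(f"fitted slopes: SO ≈ {slope_so:.2f}   DP ≈ {slope_dp:.2f}")  # ≈ -0.5, -1.0
```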
The Covariance Gap
We mentioned earlier that stochastic preferences (SP) offer a modest improvement over samples alone (SO). The experiments quantify this gap: a companion figure in the paper shows the difference between the asymptotic covariance matrices. While SP is better than SO (the values are positive), the gap is relatively small and vanishes as the dimension \(d\) increases.

This reinforces that the “big win” in preference learning comes from the constraints (deterministic regime) rather than just the variance reduction in the stochastic regime.
8. Conclusion and Implications
This paper provides a rigorous statistical foundation for why preference-based learning is so effective.
- Preferences vs. Samples: If preferences are noisy, they act like standard data, improving accuracy (variance) but not speed (convergence rate).
- The Power of Constraints: If preferences are deterministic (or treated as such via strict losses), they act as geometric constraints. This changes the learning regime from averaging (slow, \(1/\sqrt{n}\)) to cutting (fast, \(1/n\)).
While real-world preferences (like human feedback on LLM outputs) are rarely perfectly deterministic, this theory suggests that methods treating preferences as hard constraints or utilizing high-confidence feedback signals might unlock faster learning rates than those that treat preferences purely as noisy probabilistic signals.
By moving “beyond M-estimators” and leveraging the geometry of the feasible set, we can extract significantly more information from the same amount of data—a crucial insight for training the next generation of efficient AI models.