Introduction
In the era of Large Language Models (LLMs), we are obsessed with benchmarks. We look at massive leaderboards and see that a model achieves “85% accuracy on MMLU” or “90% on HellaSwag.” These aggregate numbers give us a general sense of capability, but they often hide a critical problem: models are not equally good at everything.
A practitioner often cares about specific, granular topics. You might not care about general “Law,” but you care deeply about “Intellectual Property Law in the 19th Century.” The problem is data scarcity. While we have thousands of questions for broad categories, niche subgroups might only have ten or twenty examples available for testing.
How do you accurately estimate a model’s performance on a topic when you only have a handful of test questions?
If you simply average the results (e.g., the model got 7 out of 10 right, so it’s 70% accurate), your estimate is plagued by high variance. One lucky guess or one tricky question swings the score wildly. Conversely, if you rely on the model’s general performance to guess its niche performance, you introduce bias—just because a model is good at general history doesn’t mean it’s good at niche legal history.
In the paper “Precise Model Benchmarking with Only a Few Observations,” researchers from Amazon Web Services and UC Berkeley propose a robust statistical solution to this dilemma: an Empirical Bayes (EB) estimator. By intelligently combining observed data with predictive modeling, this method allows us to benchmark models precisely, even when data is scarce.
The Problem: The Variance vs. Bias Trade-off
To understand the solution, we first need to formalize the problem. We have a dataset split into subgroups (topics, domains, or tasks). Let’s say we have a subgroup \(g\) (e.g., “High Jump” questions). We want to know the true performance \(\mu_g\) of our model on this subgroup.
However, we only have a small set of observations. The standard way to measure performance is the Direct Estimator (DT).
The Direct Estimator (DT)
The Direct Estimator is what most of us use intuitively: we take the average accuracy of the questions in the subgroup. If there are 10 questions and the model answers 6 correctly, the DT is 0.6.
The issue is variance. When the sample size (\(n_g\)) is small, the DT is unstable.
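To make this concrete, here is a minimal sketch of the Direct Estimator for a single subgroup, assuming each question is scored as 0 or 1; the names (`direct_estimate`, `scores`) are illustrative rather than taken from the paper:

```python
import numpy as np

def direct_estimate(scores):
    """Direct Estimator (DT): the plain mean of 0/1 correctness scores,
    plus the variance of that mean, which blows up when n_g is small."""
    scores = np.asarray(scores, dtype=float)
    n_g = len(scores)
    z_g = scores.mean()                                        # observed accuracy Z_g
    sigma2_g = scores.var(ddof=1) / n_g if n_g > 1 else np.inf # variance of Z_g
    return z_g, sigma2_g

# Example: 6 correct answers out of 10 questions -> DT = 0.6
z_g, sigma2_g = direct_estimate([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(z_g, sigma2_g ** 0.5)  # 0.6, with a standard error of roughly 0.16
```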

Look at Figure 1 above. The solid red circles represent the Direct Estimator. Notice the wide error bars (vertical lines) for categories like “Running a marathon.” Because the sample size is small, the confidence interval is huge. We can’t say with certainty whether the model is actually good or just got lucky.
Synthetic Regression (SR)
The alternative is Synthetic Regression (SR). This approach assumes that performance on related topics is correlated. We can train a regression model (like XGBoost) that looks at the features of the subgroup (e.g., text embeddings of the questions) and predicts the model’s accuracy.
This reduces variance because the regression model learns from the entire dataset, not just the small subgroup. However, it introduces bias. As seen in Figure 1 (the dashed red crosses), the SR estimates might be stable, but if the regression model imperfectly captures the nuance of a specific task (like “High jump”), the estimate will be consistently wrong (biased).
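A minimal sketch of this idea, assuming each subgroup is summarized by a single feature vector (in practice, e.g., the mean text embedding of its questions) and that we regress observed subgroup accuracies on those features; the paper's exact features and regressor settings may differ, and the random data below exists only so the snippet runs end to end:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_groups, dim = 200, 32
X = rng.normal(size=(n_groups, dim))      # one feature vector per subgroup (stand-in for embeddings)
y = rng.uniform(0.3, 0.9, size=n_groups)  # observed subgroup accuracies Z_g

# Fit a regressor that maps subgroup features to accuracy.
sr_model = XGBRegressor(n_estimators=200, max_depth=3)
sr_model.fit(X, y)

# Synthetic Regression (SR) estimates: stable across subgroups, but biased
# whenever the features fail to capture what makes a subgroup hard.
f_hat = sr_model.predict(X)               # \hat{f}(X_g) for every subgroup
```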
The goal of the paper is to minimize the Mean Squared Error (MSE), which accounts for both bias and variance:
\[
\mathrm{MSE}(\hat{\mu}_g) \;=\; \mathbb{E}\big[(\hat{\mu}_g - \mu_g)^2\big] \;=\; \mathrm{Bias}(\hat{\mu}_g)^2 + \mathrm{Var}(\hat{\mu}_g)
\]
The researchers needed a method that minimizes this error by finding the “sweet spot” between the noisy Direct Estimator and the potentially biased Synthetic Regression.
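A quick illustration with made-up numbers shows why neither extreme wins everywhere. Suppose a subgroup's true accuracy is 0.7 and the Direct Estimator is unbiased, while a Synthetic Regression estimate has negligible variance but a bias of 0.10. With \(n_g = 10\) questions,

\[
\mathrm{MSE}_{\mathrm{DT}} = 0^2 + \frac{0.7 \cdot 0.3}{10} = 0.021,
\qquad
\mathrm{MSE}_{\mathrm{SR}} \approx 0.10^2 + 0 = 0.010,
\]

so the biased but stable SR estimate wins; with \(n_g = 100\), the Direct Estimator's MSE drops to 0.0021 and the ranking flips.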
The Solution: Empirical Bayes (EB)
The researchers propose an Empirical Bayes (EB) estimator. Conceptually, Empirical Bayes is a “shrinkage” method. It starts with the noisy Direct Estimator and “shrinks” it toward the stable Synthetic Regression estimate.
The magic lies in how much it shrinks. It doesn’t just average the two; it dynamically calculates a weighting factor based on how reliable the data is for that specific subgroup.
The Estimator Formula
The core of the paper is the following equation for the EB estimator, \(\hat{\mu}_g\):
\[
\hat{\mu}_g \;=\; \frac{\hat{\sigma}_g^2}{\hat{\sigma}_g^2 + \hat{A}}\,\hat{f}(X_g) \;+\; \frac{\hat{A}}{\hat{\sigma}_g^2 + \hat{A}}\,Z_g
\]
Let’s break this down. The final estimate is a weighted sum of two components:
- \(\hat{f}(X_g)\): The prediction from the regression model (SR).
- \(Z_g\): The observed average accuracy (DT).
The balance is determined by the variance (\(\hat{\sigma}_g^2\)) and the heterogeneity (\(\hat{A}\)).
- \(\hat{\sigma}_g^2\) (Observation Variance): This represents the noise in the direct data. If the sample size is small, this variance is high.
- \(\hat{A}\) (Model Variance/Signal): This represents the variation in the true underlying performances that the regression model cannot explain.
How It Works in Practice
Think of the weighting coefficients in the equation above as a “trust mechanism.”
Case 1: Small Sample Size (High Noise). If a subgroup has very few questions, \(\hat{\sigma}_g^2\) is large. Looking at the fraction attached to the regression term \(\hat{f}(X_g)\), the large \(\hat{\sigma}_g^2\) in the numerator makes the weight close to 1. The estimator ignores the noisy observed data (\(Z_g\)) and trusts the regression model.
Case 2: Large Sample Size (Low Noise). If we have hundreds of questions for a subgroup, \(\hat{\sigma}_g^2\) becomes small. The weight on the regression term drops, and the weight on \(Z_g\) increases. The estimator trusts the hard data because there is enough of it to be statistically reliable on its own.
Case 3: Poor Regression Fit. If the regression model is bad at predicting this type of task, the unexplained variance \(\hat{A}\) increases. This shifts the weight back toward the direct observation \(Z_g\), protecting the estimate from the regression model’s bias.
This dynamic adjustment is what allows EB to consistently outperform the baselines. It effectively says: “Trust the data when there’s enough of it; otherwise, trust the pattern found in similar data.”
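A minimal sketch of this shrinkage rule, following the formula above; the function name is illustrative, and estimating \(\hat{A}\) from the data (here it is simply passed in) is where the "empirical" part of Empirical Bayes does its work in the paper:

```python
def eb_estimate(z_g, sigma2_g, f_hat_g, A_hat):
    """Empirical Bayes estimate for one subgroup.

    z_g      : observed subgroup accuracy (Direct Estimator)
    sigma2_g : estimated variance of z_g (large when n_g is small)
    f_hat_g  : regression prediction f_hat(X_g) (Synthetic Regression)
    A_hat    : estimated variance of true accuracies around f_hat(X_g)
    """
    w_reg = sigma2_g / (sigma2_g + A_hat)   # weight on the regression term
    return w_reg * f_hat_g + (1.0 - w_reg) * z_g

# Case 1: tiny sample -> sigma2_g is large -> lean on the regression prediction.
print(eb_estimate(z_g=0.60, sigma2_g=0.030, f_hat_g=0.80, A_hat=0.005))  # ~0.77

# Case 2: large sample -> sigma2_g is small -> lean on the observed accuracy.
print(eb_estimate(z_g=0.60, sigma2_g=0.002, f_hat_g=0.80, A_hat=0.005))  # ~0.66
```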
Confidence Intervals
A major contribution of this work is not just the point estimate (the single number), but the uncertainty quantification. The authors utilize robust confidence intervals adapted for this method.

As shown in Figure 1 earlier, the EB confidence intervals (dash-dot red triangles) are significantly tighter (narrower) than the Direct Estimator’s intervals, while still maintaining accuracy. This allows practitioners to make stronger claims about model performance without needing to collect more data.
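For intuition only, here is a naive normal-approximation interval built on the usual shrinkage variance; this is a simplified stand-in, not the paper's robust construction, which also accounts for the error in estimating \(\hat{f}\) and \(\hat{A}\):

```python
import numpy as np
from scipy.stats import norm

def eb_interval(z_g, sigma2_g, f_hat_g, A_hat, alpha=0.05):
    """Naive normal-approximation interval around the EB estimate.

    The shrinkage variance sigma2_g * A_hat / (sigma2_g + A_hat) is always
    smaller than sigma2_g alone, which is the intuition for why EB intervals
    come out narrower than DT intervals."""
    w_reg = sigma2_g / (sigma2_g + A_hat)
    mu_eb = w_reg * f_hat_g + (1.0 - w_reg) * z_g
    var_eb = sigma2_g * A_hat / (sigma2_g + A_hat)
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(var_eb)
    return mu_eb - half, mu_eb + half
```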
Experiments and Results
To validate this approach, the authors tested the estimators across a wide variety of datasets, including BIG-bench, HellaSwag, MMLU, and MedMCQA. They simulated data scarcity by subsampling these large datasets and then compared the estimated accuracy against the “ground truth” (calculated using the full dataset).
Reducing Mean Squared Error
The primary metric for success was the ratio of MSE compared to the Direct Estimator. A ratio less than 1.0 means the method is better than the standard approach.
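Computing this ratio is straightforward once per-subgroup ground truth is available from the full dataset; a minimal sketch (the paper's exact subsampling protocol is not reproduced here):

```python
import numpy as np

def mse_ratio(estimates, dt_estimates, ground_truth):
    """Ratio of an estimator's MSE to the Direct Estimator's MSE across subgroups.
    Values below 1.0 mean the estimator beats the plain subgroup average."""
    estimates, dt_estimates, truth = map(np.asarray, (estimates, dt_estimates, ground_truth))
    mse = np.mean((estimates - truth) ** 2)
    mse_dt = np.mean((dt_estimates - truth) ** 2)
    return mse / mse_dt
```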

Figure 2 paints a clear picture. The Direct Estimator (the baseline at 1.0) is consistently beaten by Empirical Bayes (red crosses).
- SR (Blue Diamonds): While often better than Direct, SR fails spectacularly in some cases (points high above the line), specifically on datasets like BIG-bench where the regression model likely failed to capture the complexity of the tasks.
- EB (Red Crosses): The Empirical Bayes estimates are clustered at the bottom, consistently providing the lowest error. In many cases, EB achieves 20-30% lower MSE than the standard approach.
Handling Subgroup Sizes
One of the most interesting findings is how the different methods behave depending on the size of the subgroup.

Figure 3 splits the results into smaller subgroups (\(\leq 15\) questions) and larger ones (\(> 15\)).
- Left (Small Subgroups): The Direct Estimator (DT) struggles here due to high variance. Synthetic Regression (SR) performs well, and EB effectively mimics SR to capture those gains.
- Right (Large Subgroups): Here, SR starts to perform worse than DT. Why? Because with enough data, the Direct Estimator is very precise, so the bias introduced by SR becomes a liability. However, notice that EB (the crosses) adapts. It recognizes the data is sufficient and aligns itself with DT, maintaining low error.
Better Confidence Intervals
Precision isn’t just about the estimate; it’s about knowing how wrong you might be. Ideally, a 95% confidence interval should contain the true value 95% of the time (Coverage) and be as narrow as possible (Width).
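Both quantities are simple to compute once you have an interval for every subgroup; a minimal sketch with illustrative names:

```python
import numpy as np

def coverage_and_width(lowers, uppers, ground_truth):
    """Fraction of intervals containing the true subgroup accuracy (coverage)
    and their average length (width)."""
    lowers, uppers, truth = map(np.asarray, (lowers, uppers, ground_truth))
    coverage = np.mean((lowers <= truth) & (truth <= uppers))
    width = np.mean(uppers - lowers)
    return coverage, width
```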

Figure 4 compares the “Average Width” vs. “Average Coverage.”
- DT (Circles): High coverage (good), but high width (bad). The intervals are wide and uninformative.
- EB (Crosses): The EB intervals are significantly shifted to the left, meaning they are narrower (tighter), yet they maintain coverage levels very close to the nominal 95% (the dashed line).
This result is crucial for benchmarking. It means you can get a tighter range of plausible performance values without collecting more data.
Beyond Text: Vision and Tabular Data
While the focus is often on LLMs, the mathematical foundation of Empirical Bayes is domain-agnostic. The authors extended their experiments to Computer Vision (using CLIP models) and Tabular data.
Computer Vision Results
The researchers evaluated zero-shot accuracy on vision classification tasks (like identifying objects in CIFAR-10 or ImageNet).

As shown in Figure 5, the trend holds. The Empirical Bayes estimator (Red Crosses) consistently stays below the 1.0 line, indicating it produces less error than the Direct Estimator. In contrast, the Synthetic Regression (Blue Diamonds) is volatile—sometimes helping, but often hurting performance when the visual features don’t correlate perfectly with accuracy.
Tabular Data Results
Finally, they applied the method to fairness tasks in tabular data, such as predicting income or employment based on demographic subgroups.

Figure 7 confirms the versatility of the method. Whether minimizing Mean Squared Error or Cross-Entropy, EB provides the most reliable estimates for subgroup performance.
Conclusion
Evaluating machine learning models is becoming harder as the models become more capable and the tasks more specific. We can no longer rely on a single global “accuracy” score. We need to understand performance on specific, often niche, subgroups where data is expensive or rare.
The paper “Precise Model Benchmarking with Only a Few Observations” demonstrates that we don’t always need more data to get better benchmarks—we need better statistics. By moving from a simple average (Direct Estimator) to an Empirical Bayes approach, we can:
- Borrow strength from the entire dataset to stabilize estimates for small subgroups.
- Avoid bias by reverting to the direct data when sample sizes are large enough.
- Quantify uncertainty with tighter, more informative confidence intervals.
For students and practitioners alike, this is a valuable tool in the evaluation toolkit. Implementing an EB estimator allows you to extract more signal from your limited observations, ensuring that when you claim a model works for a specific use case, you have the statistical rigour to back it up.