Imagine you are running a clinical trial for a new drug, or perhaps testing a new feature on a high-traffic e-commerce website. In the traditional “A/B testing” world, you might flip a coin: 50% of people get the treatment, 50% get the control. You run this for a month, collect data, and analyze the results.

But what if, halfway through the month, the data starts hinting that the treatment group has much higher variance than the control group? Or what if certain subgroups of people react differently? A fixed 50/50 split is rarely the most efficient way to estimate the truth. It wastes samples and time.

This brings us to Adaptive Experimental Design. Instead of fixing the rules before the experiment starts, we update the probability of assigning a treatment on the fly, based on the data we’ve seen so far. The goal is not necessarily to find the “best” arm (like in multi-armed bandits), but to estimate the Average Treatment Effect (ATE) with the highest possible precision (lowest variance).

In a recent paper, “Stronger Neyman Regret Guarantees for Adaptive Experimental Design,” researchers Noarov, Fogliato, Bertran, and Roth propose powerful new algorithms that make these adaptive experiments significantly more efficient. They introduce methods that converge faster than previous state-of-the-art approaches and handle complex, overlapping subgroups of populations.

In this post, we will break down their contributions, explain the math behind “Neyman Regret,” and explore how “Sleeping Experts” can help us run better experiments.

The Objective: Estimating the ATE

The core of causal inference in experiments is the Average Treatment Effect (ATE). In an ideal world, for every person \(t\), we would know what happens if they took the drug (\(y_t(1)\)) and what happens if they didn’t (\(y_t(0)\)). The ATE is simply the average difference between these two potential outcomes across the population.

\[
\tau \;=\; \frac{1}{T} \sum_{t=1}^{T} \big( y_t(1) - y_t(0) \big)
\]

However, we have a fundamental problem: we can’t observe both outcomes for the same person. We only see the outcome for the treatment actually assigned (\(Z_t\)). To estimate the ATE without bias, we use the Inverse Propensity Weighting (IPW) estimator.

\[
\hat{\tau} \;=\; \frac{1}{T} \sum_{t=1}^{T} \left( \frac{Z_t \, y_t(1)}{p_t} \;-\; \frac{(1 - Z_t) \, y_t(0)}{1 - p_t} \right)
\]

Here, \(p_t\) is the probability that unit \(t\) was assigned the treatment. By dividing the observed outcome by this probability, we correct for the fact that we might be assigning treatment more or less frequently over time.
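To make this concrete, here is a minimal NumPy sketch of the adaptive IPW estimator described above (the function name and array-based interface are illustrative, not from the paper):

```python
import numpy as np

def ipw_estimate(z, y_obs, p):
    """Adaptive IPW estimate of the ATE.

    z     : 0/1 treatment assignments, one per unit
    y_obs : the observed outcome for each unit
    p     : the treatment probability p_t used at each round
    """
    z, y_obs, p = map(np.asarray, (z, y_obs, p))
    # A treated unit contributes y/p; an untreated unit contributes
    # -y/(1-p). The reweighting keeps the estimator unbiased even when
    # p_t drifts over the course of the experiment.
    terms = np.where(z == 1, y_obs / p, -y_obs / (1 - p))
    return float(terms.mean())
```

Note that each unit contributes through only one of the two fractions, since only one potential outcome is ever observed.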

The Metric: Neyman Regret

How do we measure if our adaptive strategy is “good”? We compare it to the best fixed strategy we could have chosen in hindsight.

If we knew all potential outcomes beforehand, we could calculate the exact fixed probability \(p^*\) (e.g., assign treatment 60% of the time) that minimizes the variance of our estimator. Since we don’t know the future, our adaptive algorithm tries to learn outcome patterns and adjust \(p_t\) to approach that optimal variance.

The cost of not knowing the optimal fixed design immediately is called Neyman Regret. It is the difference between the variance of our adaptive design and the variance of the best non-adaptive design.

\[
\mathcal{R}_T \;=\; \sum_{t=1}^{T} V_t(p_t) \;-\; \min_{p \in [0,1]} \sum_{t=1}^{T} V_t(p),
\]

where \(V_t(p)\) is the variance contributed by round \(t\) when the treatment probability is \(p\).

If our Neyman Regret grows sublinearly in the number of samples \(T\), our design becomes asymptotically as efficient as the best fixed design.
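The hindsight-optimal \(p^*\) even has a closed form under the standard variance proxy \(V_t(p) = y_t(1)^2/p + y_t(0)^2/(1-p)\): minimizing the sum over \(p\) yields the classical Neyman allocation. A short sketch (this assumes full knowledge of both potential outcomes, which only a hindsight oracle has; that is precisely why an adaptive algorithm must learn \(p^*\) instead):

```python
import numpy as np

def neyman_optimal_p(y1, y0):
    """Best fixed treatment probability in hindsight.

    Minimizing f(p) = sum_t [ y_t(1)^2 / p + y_t(0)^2 / (1 - p) ]
    gives p* = sqrt(A) / (sqrt(A) + sqrt(B)),
    with A = sum of y_t(1)^2 and B = sum of y_t(0)^2.
    """
    a = np.sqrt(np.sum(np.square(y1)))
    b = np.sqrt(np.sum(np.square(y0)))
    return float(a / (a + b))
```

For instance, if treated outcomes are consistently twice as large as control outcomes, the allocation tilts toward treatment: \(p^* = 2/3\).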

Contribution 1: Breaking the Speed Limit with ClipOGD\(^{SC}\)

Prior to this work, the best adaptive algorithm for this setting was ClipOGD (Clipped Online Gradient Descent), which achieved a regret of \(\tilde{O}(\sqrt{T})\). The researchers asked: can we do better?

The answer is yes, provided we make a natural assumption: the outcomes are bounded and have some variance (they aren’t all zero).

Assumptions on potential outcomes: bounded magnitude and lower-bounded variance.

Strong Convexity

The key insight involves the mathematical property of Strong Convexity. The function determining the variance of the estimator is convex. Under the assumptions above, the researchers prove it is actually strongly convex.

In optimization terms, a standard convex function looks like a wide bowl, where finding the bottom can take a while (\(\sqrt{T}\)). A strongly convex function looks like a steep valley, where gravity pulls you to the bottom much faster.

By modifying the learning rate of the original ClipOGD algorithm to exploit this geometry, the researchers developed ClipOGD\(^{SC}\). This new method achieves a regret bound of \(\tilde{O}(\log T)\). This is an exponential improvement over the previous \(\sqrt{T}\) standard.

The Neyman Regret bound for ClipOGD\(^{SC}\) is \(\tilde{O}(\log T)\), logarithmic in \(T\).
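To see why strong convexity buys speed, here is a toy projected-gradient loop using the aggressive \(\eta_t = 1/(\mu t)\) step size that strong convexity licenses. This is a full-information illustration, not the paper's algorithm: ClipOGD\(^{SC}\) works from bandit-style IPW gradient estimates and uses carefully derived clipping and step-size schedules, and \(\mu\) below is a hand-set curvature constant.

```python
import numpy as np

def clip_ogd_sc_sketch(y1, y0, mu=100.0, c=0.1):
    """Toy projected gradient descent on the per-round variance proxy
    f_t(p) = y_t(1)^2 / p + y_t(0)^2 / (1 - p).

    Strong convexity of f_t on the clipped interval [c, 1 - c] is what
    permits the fast eta_t = 1 / (mu * t) step size -- the driver of
    the logarithmic regret.
    """
    p = 0.5
    history = [p]
    for t, (a, b) in enumerate(zip(y1, y0), start=1):
        grad = -a**2 / p**2 + b**2 / (1 - p)**2  # f_t'(p)
        p -= grad / (mu * t)                      # eta_t = 1/(mu * t)
        p = float(np.clip(p, c, 1 - c))           # stay away from {0, 1}
        history.append(p)
    return history
```

Run on constant outcomes \(y_t(1) = 2\), \(y_t(0) = 1\), the iterates settle near the hindsight-optimal \(p^* = 2/3\) within a few dozen rounds.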

Visualizing the Improvement

Does this theoretical speedup matter in practice? Absolutely.

The researchers tested both the old method (ClipOGD\(^0\)) and the new method (ClipOGD\(^{SC}\)) on synthetic data with varying noise levels (\(\sigma\)).

Comparison of treatment probabilities and Neyman regret between ClipOGD^0 and ClipOGD^SC.

In the bottom row of Figure 1, look at the Neyman Regret (y-axis). The orange line (the old method) stays high or decreases slowly. The blue line (ClipOGD\(^{SC}\)) plummets toward zero rapidly. This means the algorithm figures out the optimal treatment ratio very quickly and sticks to it, minimizing estimation error.

Contribution 2: Contextual Experiments and “Sleeping Experts”

The first contribution dealt with “vanilla” experiments where units are indistinguishable. But in reality, experimental units have features (covariates). In a medical trial, patients have ages, genders, and medical histories. In a web test, users have locations and device types.

We want an algorithm that guarantees efficiency not just for the whole population, but for every subgroup defined by these features.

The Challenge of Overlapping Groups

Suppose you want your experiment to be efficient for “Women,” “People over 50,” and “People with Diabetes.” A single patient might belong to all three groups. Optimizing the variance for “Women” might require a treatment probability of 0.4, while “People over 50” requires 0.7.

The researchers introduce Multigroup Neyman Regret to solve this.

\[
\mathcal{R}_T(G) \;=\; \sum_{t \in G} V_t(p_t) \;-\; \min_{p \in [0,1]} \sum_{t \in G} V_t(p) \quad \text{for every group } G,
\]

where \(V_t(p)\) is the variance contributed by round \(t\) under treatment probability \(p\), and the sums run over the rounds at which a member of \(G\) arrives.

This metric demands that the adaptive design performs as well as the best fixed design for every group simultaneously, considering only the time steps where members of that group appeared.

The Solution: Sleeping Experts

To achieve this, the authors utilize a concept from online learning called Sleeping Experts.

Imagine you have a panel of experts. Each expert is responsible for one group (e.g., the “Diabetes Expert”).

  1. When a patient arrives, we check their features.
  2. If the patient has diabetes, the Diabetes Expert “wakes up” and suggests a treatment probability. If the patient is also over 50, the “Over 50 Expert” also wakes up and makes a suggestion. Experts for irrelevant groups stay asleep.
  3. The algorithm aggregates the advice from all awake experts to make a final decision.
  4. After observing the outcome, we evaluate how well each awake expert predicted the variance gradient and update their “credibility” weights.
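The loop above can be sketched with a simple exponential-weights scheme. This is illustrative only; MGATE's actual "scale-free" aggregation and gradient-based losses are more sophisticated.

```python
import numpy as np

def aggregate_awake(weights, proposals, awake):
    """One aggregation step of a sleeping-experts scheme.

    weights   : dict expert -> current credibility weight
    proposals : dict expert -> suggested treatment probability
    awake     : set of experts whose group contains the current unit
    """
    total = sum(weights[e] for e in awake)
    # Weighted average of the awake experts' suggestions; sleeping
    # experts contribute nothing this round.
    return sum(weights[e] / total * proposals[e] for e in awake)

def update_weights(weights, losses, awake, eta=0.5):
    """Exponential-weights update applied only to the experts that
    were awake; sleeping experts keep their weight unchanged."""
    for e in awake:
        weights[e] *= np.exp(-eta * losses[e])
    return weights
```

For a diabetic 55-year-old, only the "Diabetes" and "Over 50" experts would appear in `awake`; the other experts' weights and proposals are untouched that round.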

The researchers propose an algorithm called MGATE (Multi-Group ATE). It combines the strong gradient descent updates of ClipOGD\(^{SC}\) with a “scale-free” sleeping experts aggregation strategy.

Theoretical analysis proves that MGATE achieves a regret of \(\tilde{O}(\sqrt{T})\) for every group simultaneously.

The regret bound for the Multigroup Adaptive Design (MGATE).

MGATE in Action

To validate the contextual approach, the authors tested MGATE on real-world microfinance data. They created overlapping groups based on the data and compared the regret of MGATE against non-contextual baselines.

Group-conditional Neyman regret comparison.

In Figure 3, the red line represents MGATE. Notice that for every group (Group 0, 1, and 2), MGATE achieves the lowest or near-lowest regret. The non-contextual methods (blue and orange) might get lucky on one group but fail on others because they treat the entire population as a monolith. MGATE adapts to the specific variance structure of each sub-population.

Real-World Application: LLM Benchmarking

One fascinating application of this work is benchmarking Large Language Models (LLMs). Evaluating LLMs is expensive and time-consuming. We want to estimate their accuracy (ATE between the model and a baseline) with as few samples as possible.

The researchers applied the ClipOGD algorithms to LLM evaluation datasets such as BigBench and MMLU.

Treatment probabilities and variance on LLM benchmarking data.

In Figure 4, we see the results on LLM benchmarks. The middle row is crucial: it shows the Variance. The blue line (ClipOGD\(^{SC}\)) consistently achieves lower variance faster than the baseline. This translates to needing fewer examples to get a statistically significant score for an AI model—saving compute and time.

Conclusion

This research marks a significant step forward for adaptive experiments. By mathematically proving that we can exploit the strong convexity of the variance objective, the authors provided a method (ClipOGD\(^{SC}\)) that converges exponentially faster than previous approaches. Furthermore, by integrating the “Sleeping Experts” framework, they created MGATE, a versatile tool that ensures experiments are efficient across complex, overlapping subpopulations.

For data scientists and researchers, this implies that moving from static A/B tests to adaptive designs is not just a theoretical curiosity—it is a practical way to get more precise answers with less data.

For full mathematical proofs and algorithm details, refer to the original paper: “Stronger Neyman Regret Guarantees for Adaptive Experimental Design” by Noarov et al.