Introduction
Imagine you have trained a machine learning model for a critical task—perhaps detecting tumors in medical scans or controlling a robotic arm in a factory. During training, the model seemed to perform well. But is “seeming to perform well” enough when safety is on the line?
In the real world, the gap between training performance and deployment reliability can be dangerous. To bridge this gap, we often perform calibration: selecting the right hyperparameters (settings) to ensure the model meets a strict safety standard, such as “95% accuracy on the true population.”
The standard approach to this problem has been a framework called “Learn-then-Test” (LTT). While statistically rigorous, LTT is rigid. It requires you to decide exactly how many tests to run beforehand, and it generally wastes resources testing models that are clearly failing.
What if we could be smarter? What if our testing procedure could “learn” as it goes, focusing its budget on the most promising models and stopping early when it finds a winner?
In this post, we dive deep into a new methodology called Adaptive Learn-then-Test (aLTT). This approach leverages the mathematics of “e-values” and “betting scores” to create a testing framework that is just as safe as the old methods but drastically more efficient. We will explore how it works, the math behind it, and how it performs in complex tasks like Reinforcement Learning and Prompt Engineering.
The Problem: Finding the Reliable Few
Before we fix the problem, let’s define it mathematically. We have an AI application, \(\mathcal{M}_{\lambda}\), which behaves differently depending on a hyperparameter \(\lambda\). We want to choose a \(\lambda\) from a candidate set \(\Lambda\) such that the model’s risk (error rate) is below a certain threshold \(\alpha\).
The “true” risk of a model, \(R(\lambda)\), is the expected loss over the entire population of data:

\[
R(\lambda) \;=\; \mathbb{E}_{Z \sim P_Z}\bigl[\ell(\lambda, Z)\bigr],
\]

where \(\ell(\lambda, Z)\) is the loss that \(\mathcal{M}_{\lambda}\) incurs on a data point \(Z\) drawn from the (unknown) data distribution \(P_Z\).
Our goal is to identify the set of reliable hyperparameters, \(\Lambda^{\text{rel}}\), where the risk is safely below our target \(\alpha\):

\[
\Lambda^{\text{rel}} \;=\; \{\lambda \in \Lambda \,:\, R(\lambda) \leq \alpha\}.
\]
The problem is that we don’t know the true distribution of data, so we can never calculate \(R(\lambda)\) perfectly. We have to estimate it using finite samples. If we are careless, we might think a model is safe when it is actually dangerous (a “false discovery”).
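To make the danger concrete, here is a tiny, purely illustrative simulation (all numbers are made up for this post). Every candidate has a true risk of 0.12, above a target of \(\alpha = 0.1\), yet if we naively pick whichever candidate looks best on a modest validation set, we will almost always “discover” one that appears safe:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, true_risk = 0.10, 0.12          # every candidate is actually unsafe
num_candidates, n_val, n_trials = 100, 200, 1_000

false_discoveries = 0
for _ in range(n_trials):
    # Empirical risk of each candidate from n_val Bernoulli(true_risk) losses.
    emp_risk = rng.binomial(n_val, true_risk, size=num_candidates) / n_val
    if emp_risk.min() <= alpha:        # naive rule: "the best one looks safe"
        false_discoveries += 1

print(f"Unsafe model declared safe in {false_discoveries / n_trials:.0%} of trials")
```

This is exactly the multiple-testing trap that both LTT and aLTT are designed to avoid.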
The Baseline: Learn-then-Test (LTT)
The existing standard, LTT, treats this as a Multiple Hypothesis Testing (MHT) problem. For every candidate hyperparameter, it sets up a “null hypothesis” that the model is unsafe (\(R(\lambda) > \alpha\)). It then collects data to try to disprove this hypothesis.
LTT uses p-values. A p-value tells you: “If this model were actually unsafe, how unlikely is it that we would see data this good?” If the p-value is very low, we reject the null hypothesis and declare the model safe.
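For intuition, here is one standard way to build such a p-value for a loss bounded in \([0, 1]\), using Hoeffding’s inequality (a minimal sketch; the exact p-value construction used in LTT may differ):

```python
import numpy as np

def hoeffding_p_value(losses, alpha):
    """P-value for the null "R(lambda) > alpha", given i.i.d. losses in [0, 1]."""
    n = len(losses)
    r_hat = np.mean(losses)                    # empirical risk on n held-out samples
    if r_hat >= alpha:
        return 1.0                             # no evidence against the null
    return float(min(1.0, np.exp(-2 * n * (alpha - r_hat) ** 2)))
```

The smaller the p-value, the stronger the evidence for rejecting the null and declaring the candidate safe.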
To visualize how this process works in practice, let’s look at a prompt engineering example:

In the standard LTT version of Figure 1 (ignoring the “adaptive” part for a moment), you would:
- Gather a fixed dataset.
- Test all candidate prompts on the entire dataset.
- Calculate p-values at the very end.
- Apply a correction (like Bonferroni or Benjamini-Hochberg) to ensure you don’t make too many false discoveries.
This guarantees statistical validity, usually controlling the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR).
- FWER: The probability of letting even one unsafe model through is capped (e.g., \(\leq 5\%\)).
- FDR: The expected proportion of unsafe models among the selected ones is capped.
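For reference, here is a minimal sketch of the Benjamini-Hochberg step mentioned above, which controls the FDR at level \(\delta\). LTT applies it once, to the final p-values; we will reuse it later when sketching aLTT’s stopping check.

```python
import numpy as np

def benjamini_hochberg(p_values, delta):
    """Return the indices of hypotheses rejected at FDR level delta."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                           # sort p-values ascending
    thresholds = delta * np.arange(1, m + 1) / m    # BH step-up thresholds
    passing = np.nonzero(p[order] <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)              # nothing is declared safe
    k = passing.max()                               # largest index passing its threshold
    return order[: k + 1]                           # reject the k+1 smallest p-values
```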
While safe, LTT is non-adaptive. If you have 100 candidate models and 90 of them are terrible, LTT wastes time testing those 90 models on thousands of data points just to calculate a final p-value that says “definitely bad.”
The Innovation: Adaptive Learn-then-Test (aLTT)
The researchers propose aLTT to solve the efficiency bottleneck. The core idea is simple: don’t wait until the end. If a model looks promising, test it more. If it looks bad, stop testing it. If you have found enough good models, stop the whole experiment.
To do this mathematically, aLTT moves away from static p-values and adopts e-values and e-processes.
E-values: Testing by Betting
An e-value is a different way to measure evidence against a null hypothesis. You can think of it as a betting score.
Imagine a gambler playing against the “house” (nature).
- The House Claims: This model is unsafe (the null hypothesis \(H_i\)).
- The Gambler Believes: This model is safe.
The gambler starts with $1. As data comes in, the gambler bets on the outcome. If the model performs well (low loss), the gambler’s wealth grows. If the model performs poorly, the wealth shrinks.
If the “House” is telling the truth (the model is unsafe), the gambler cannot expect to make money in the long run. Mathematically, the expected value of an e-value \(E_i\) under the null hypothesis is at most 1:

\[
\mathbb{E}\bigl[E_i\bigr] \;\leq\; 1 \quad \text{whenever } H_i \text{ is true.}
\]
However, if the model is actually reliable, the gambler’s wealth can grow to infinity. A high e-value (large wealth) is strong evidence that the null hypothesis is false (i.e., the model is reliable).
The E-Process
The beauty of e-values is that they handle sequential data naturally. We can multiply the outcomes of sequential bets to track wealth over time. This sequence is called an e-process.
At any time step \(t\), the e-process for hyperparameter \(\lambda_i\) is updated using the new data point \(Z^t\). The update rule is:

\[
E_i^{t} \;=\;
\begin{cases}
E_i^{t-1}\,\bigl(1 + \mu_i^{t}\,(\alpha - \ell(\lambda_i, Z^{t}))\bigr) & \text{if } \lambda_i \in \mathcal{I}^{t} \text{ (tested this round)},\\[4pt]
E_i^{t-1} & \text{otherwise},
\end{cases}
\]

with initial wealth \(E_i^{0} = 1\).
Here, \(\mu_i^{t} \geq 0\) is the bet size: how much money the gambler puts on the line for candidate \(\lambda_i\) at round \(t\). The bet is constrained so that the wealth can never go negative (for losses in \([0, 1]\), this means \(\mu_i^{t} \leq 1/(1 - \alpha)\)).
- If the model is tested (\(\lambda_i \in \mathcal{I}^t\)) and the observed loss \(\ell(\lambda_i, Z^t)\) is below \(\alpha\), the factor \(1 + \mu_i^t(\alpha - \ell(\lambda_i, Z^t))\) is greater than 1 and the wealth grows.
- If the observed loss is above \(\alpha\), the factor drops below 1 and the wealth shrinks.
This equation allows aLTT to update the “score” of every hyperparameter after every single data point.
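Here is a minimal sketch of that betting update, assuming losses bounded in \([0, 1]\) and, for simplicity, a single fixed bet `mu` shared by all candidates (in the paper the bet can vary per candidate and per round):

```python
import numpy as np

def update_e_processes(E, losses, tested, alpha, mu):
    """One betting round for all candidates.

    E       : current e-values (wealth), shape (num_candidates,)
    losses  : loss of each candidate on the new data point, values in [0, 1]
    tested  : boolean mask of the candidates actually queried this round
    alpha   : target risk level
    mu      : bet size; mu <= 1/(1 - alpha) keeps the wealth nonnegative
    """
    E = E.copy()
    factor = 1.0 + mu * (alpha - losses)   # > 1 when the observed loss beats alpha
    E[tested] *= factor[tested]            # untested candidates keep their wealth
    return E
```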
From Betting to P-values
To integrate this with standard statistical guarantees (like FWER and FDR control), we need to convert these betting scores back into something comparable to p-values.
Because of the mathematical properties of e-processes (specifically, they are nonnegative supermartingales under the null), we can convert the running maximum of the gambler’s wealth into an anytime-valid p-value:

\[
p_i^{t} \;=\; \min\Bigl\{1,\; \bigl(\max_{\tau \leq t} E_i^{\tau}\bigr)^{-1}\Bigr\}.
\]
This equation is powerful. It says that at any point in time \(t\), we can look at the highest wealth achieved so far, \(\max_{\tau \leq t} E_i^{\tau}\), and take its reciprocal. This gives us a valid p-value that we can check right now, without waiting for the experiment to end.
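In code, the conversion is a one-liner over the wealth history (a sketch consistent with the update above):

```python
import numpy as np

def anytime_p_values(E_history):
    """Anytime-valid p-values from the running maximum of each e-process.

    E_history : array of shape (t, num_candidates) with the e-values observed so far.
    """
    running_max = E_history.max(axis=0)          # best wealth achieved per candidate
    return np.minimum(1.0, 1.0 / running_max)    # valid at any stopping time
```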
The aLTT Algorithm
With these tools, the Adaptive Learn-then-Test algorithm operates in a loop:
- Adaptive Acquisition: Look at the current e-values. Which models are “rich” (promising)? Which are “poor”? Use an algorithm (like \(\epsilon\)-greedy) to select the next batch of models to test.
- Test: Run the selected models on a new data point.
- Update Evidence: Update the e-processes (wealth) for the tested models using the betting equation.
- Check Stopping Condition: Calculate the anytime p-values. Apply a selection rule (like Benjamini-Hochberg for FDR). If we have found enough reliable models, stop early.
This approach allows aLTT to discard bad models quickly and focus the testing budget on confirming the good ones.
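Putting the pieces together, here is a compact, simplified sketch of the loop. It reuses `benjamini_hochberg` and the betting update from the earlier sketches; names such as `sample_loss` and `target_discoveries` are illustrative choices, not the paper’s notation.

```python
import numpy as np

def altt(sample_loss, num_candidates, alpha, delta, eps=0.25,
         mu=1.0, budget=5_000, batch_size=5, target_discoveries=10):
    """Simplified aLTT loop: epsilon-greedy acquisition plus a BH check each round.

    sample_loss(i) -> a fresh loss in [0, 1] for candidate i (e.g., one API call).
    """
    rng = np.random.default_rng()
    E = np.ones(num_candidates)                      # every gambler starts with $1
    E_max = E.copy()                                 # running maximum for p-values
    selected = np.array([], dtype=int)
    for t in range(budget):
        # 1) Adaptive acquisition: explore with prob. eps, otherwise test the richest.
        if rng.random() < eps:
            tested = rng.choice(num_candidates, size=batch_size, replace=False)
        else:
            tested = np.argsort(E)[-batch_size:]
        # 2) Test and 3) update evidence via the betting rule.
        for i in tested:
            E[i] *= 1.0 + mu * (alpha - sample_loss(i))
        E_max = np.maximum(E_max, E)
        # 4) Stopping check: anytime-valid p-values + Benjamini-Hochberg.
        p = np.minimum(1.0, 1.0 / E_max)
        selected = benjamini_hochberg(p, delta)
        if len(selected) >= target_discoveries:
            return selected, t                       # stop early: enough discoveries
    return selected, budget
```

With `mu = 1.0` and losses in \([0, 1]\), each round’s growth factor is at least \(\alpha\), so the wealth can never go negative.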
Experimental Results
The researchers validated aLTT across several domains. Let’s look at two of the most compelling cases: Reinforcement Learning and Prompt Engineering.
Case Study 1: Online Policy Selection for RL
In Offline Reinforcement Learning, we train agents on static datasets. However, a robot trained on static data might fail in the real world. We need to validate these policies online, but real-world interactions (like moving a physical robot) are expensive and potentially dangerous.
The researchers tested 20 different control policies for a “Half Cheetah” robot simulation. They wanted to find policies that achieved a certain reward threshold.
Efficiency Gains:
The graph below shows the True Positive Rate (TPR)—the percentage of valid policies successfully identified—over time.

- Black Dotted Line (LTT): It stays at zero until the very end (\(t = 5000\)). It gives you nothing until the experiment is completely finished.
- Colored Lines (aLTT): These rise sharply. Specifically, the solid green lines (using an \(\epsilon\)-greedy strategy to pick promising models) identify almost all reliable policies using only a fraction of the data.
Statistical Validity:
Does this speed come at the cost of safety? No. The researchers measured the actual error rates (FWER and FDR) to ensure they stayed below the target \(\delta\) (0.1).

As shown in Figure 3, the error rates behave as expected, respecting the user-defined tolerance levels. This confirms that the “betting” approach provides mathematically valid safety guarantees.
Case Study 2: Automated Prompt Engineering
Large Language Models (LLMs) are notoriously sensitive to prompts. “Automated Prompt Engineering” involves generating hundreds of candidate prompts and testing them to see which ones consistently produce the right output. Testing requires calling the LLM API, which costs money and time.
The researchers used Llama-3 models to generate and test prompts for various NLP tasks.
Discovery Speed:

Figure 4 mirrors the RL results. aLTT identifies reliable prompts almost immediately. For example, with \(\epsilon=0.25\) (a strategy that heavily favors testing promising prompts), aLTT finds 50% of the reliable prompts within the first 1,000 rounds. The non-adaptive strategy (dashed line) is far slower.
Quality of Selected Prompts:
Here is a fascinating secondary benefit. Because aLTT is so efficient, it finds more reliable prompts than LTT given a fixed budget. This larger pool of winners allows for better post-selection optimization.
In this experiment, the goal was to find the shortest reliable instruction (shorter prompts save tokens and money).

In Figure 5, the y-axis is the length of the best prompt found. Lower is better. The solid green line (aLTT) is consistently lower than the others. Because aLTT didn’t waste time testing garbage prompts, it had enough budget to verify a wider variety of good prompts, eventually finding shorter, more efficient ones.
The Impact of Betting Strategies
One final nuance: how the “gambler” bets matters. The researchers compared different betting strategies, such as “Unit Bet” (betting a constant amount) vs. “aGRAPA” (an adaptive strategy that optimizes the bet size based on history).

As Figure 6 shows, smart betting strategies like aGRAPA (blue squares) yield higher True Positive Rates faster than naive strategies like Unit Bet (black crosses). This highlights that the “wealth” in the algorithm isn’t just a metaphor—optimizing wealth growth directly correlates to finding good models faster.
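To make the contrast concrete, here is a hedged sketch of the two ends of that spectrum. The constant “Unit Bet” ignores history, while the adaptive rule below sizes the bet from the running mean and variance of past losses, in the spirit of aGRAPA; treat it as an illustrative approximation, not the paper’s exact formula.

```python
import numpy as np

def unit_bet(past_losses, alpha):
    """Constant bet, independent of the data seen so far."""
    return 1.0

def agrapa_style_bet(past_losses, alpha, c=0.5):
    """Adaptive bet: large when past losses sit well below alpha, small when noisy."""
    if len(past_losses) == 0:
        return c                                       # default bet before any data
    margin = alpha - np.mean(past_losses)              # how far below the threshold we are
    var = np.var(past_losses) + 1e-6                   # avoid division by zero
    mu = margin / (var + margin ** 2)                  # bet ~ margin / (variance + margin^2)
    return float(np.clip(mu, 0.0, c / (1.0 - alpha)))  # clip so wealth stays nonnegative
```

A larger bet grows the wealth faster when the candidate really is reliable, which is exactly why the adaptive strategies reach high TPR sooner.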
Conclusion
The transition from AI development to deployment is fraught with risks. We cannot simply trust that a model trained on a dataset will behave safely in the wild. We must test it.
However, safety shouldn’t require infinite resources. The Adaptive Learn-then-Test (aLTT) framework demonstrates that we can have our cake and eat it too. By abandoning the rigid structure of traditional p-value testing and embracing the dynamic, sequential nature of e-processes, we can:
- Stop Early: Quit testing once we have enough reliable models.
- Adapt: Focus our testing budget on the models that actually show promise.
- Guarantee Safety: Maintain strict FWER or FDR control.
Whether it is finding the perfect prompt for an LLM or ensuring a robot doesn’t crash, aLTT offers a mathematically sound path to efficient and reliable AI calibration. It turns the calibration process from a blind data-gathering exercise into a strategic game of betting—one where the odds are rigorously calculated to ensure the house (safety) always wins.