Introduction

In the world of machine learning, we are often tasked with optimizing “black-box” functions—functions that are expensive to evaluate, have no known gradients, and are essentially mysterious boxes where you put in an input \(x\) and get out a noisy output \(y\). This is the domain of Bayesian Optimization (BO).

If you have studied BO, you know there is a bit of a divide in the community regarding Acquisition Functions (AFs)—the mathematical rules that decide where to sample next.

On one side, we have Expected Improvement (EI). It is the workhorse of BO. It is simple, computationally cheap, and aggressive. It asks: “How much better is this point likely to be compared to the best point I’ve found so far?”

On the other side, we have Information-Theoretic functions, such as Entropy Search (ES) and Max-value Entropy Search (MES). These are the sophisticated strategists. They ask: “How much information does this point give me about the global optimum?”

For years, these two families were viewed as fundamentally distinct philosophies: one focuses on value improvement, the other on information gain.

But what if they are actually the same thing?

In the paper A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization, researchers present a groundbreaking perspective. They utilize Variational Inference to prove that Expected Improvement is actually just a special, approximate case of Max-value Entropy Search.

By identifying this link, they don’t just solve a theoretical puzzle; they build a bridge. They introduce a new acquisition function, VES-Gamma, that balances the best of both worlds, leading to strong performance on high-dimensional optimization tasks.

In this post, we will tear down the wall between EI and Entropy Search, derive the new framework step-by-step, and see how a Gamma distribution can supercharge your optimization strategy.


Background: The Tale of Two Families

To understand the unification, we first need to define the two players involved.

1. The Pragmatist: Expected Improvement (EI)

Expected Improvement is arguably the most popular acquisition function. It relies on a Gaussian Process (GP) surrogate model to predict the mean and variance of the objective function at unobserved points.

The logic is straightforward: given the best value we have observed so far (\(y^*_t\)), we calculate the expectation of how much a new point \(x\) will exceed that value.

Equation for Expected Improvement.
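For reference, here is the standard definition of EI together with its closed form under a Gaussian posterior with mean \(\mu_t(x)\) and standard deviation \(\sigma_t(x)\):

\[
\mathrm{EI}_t(x) = \mathbb{E}\big[\max(y_x - y^*_t,\,0)\big]
= \big(\mu_t(x) - y^*_t\big)\,\Phi(z) + \sigma_t(x)\,\varphi(z),
\qquad z = \frac{\mu_t(x) - y^*_t}{\sigma_t(x)},
\]

where \(\Phi\) and \(\varphi\) are the standard normal CDF and PDF.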

This formula balances exploration (high variance regions might yield a huge improvement) and exploitation (high mean regions are “safe” bets).
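To make this concrete, here is a minimal Python sketch of the closed-form EI computation, assuming you already have the GP posterior mean and standard deviation at your candidate points (the function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for a Gaussian posterior (maximization).

    mu, sigma : posterior mean / standard deviation at the candidate points
    y_best    : best value observed so far (y*_t)
    """
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: a point with a modest mean but large uncertainty can beat a "safe" one.
mu = np.array([0.75, 0.50])
sigma = np.array([0.05, 0.60])
print(expected_improvement(mu, sigma, y_best=0.80))
```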

2. The Strategist: Information-Theoretic AFs

While EI looks for immediate gains, Information-Theoretic methods play the long game. Their goal is to reduce the uncertainty (entropy) regarding the location or value of the global maximum.

The original Entropy Search (ES) tries to reduce the entropy of the posterior over the location of the optimum, \(x^*\). However, computing this quantity is extremely expensive.

Equation for Entropy Search.

Later, Predictive Entropy Search (PES) improved efficiency, but the real breakthrough for tractability came with Max-value Entropy Search (MES). Instead of looking for the location \(x^*\), MES focuses on the maximum value \(y^*\). It seeks the point \(x\) that maximizes the mutual information between the observation \(y_x\) and the global maximum value \(y^*\).

The MES formulation looks like this:

Equation for Max-value Entropy Search (MES).
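In its standard form, MES scores a candidate \(x\) by the mutual information between the observation \(y_x\) and the maximum value \(y^*\):

\[
\alpha_{\mathrm{MES}}(x)
= I\big(y_x;\,y^* \mid \mathcal{D}_t\big)
= H\big[p(y_x \mid \mathcal{D}_t)\big]
- \mathbb{E}_{y^*}\Big[H\big[p(y_x \mid \mathcal{D}_t,\,y^*)\big]\Big],
\]

where the outer expectation is taken over \(p(y^* \mid \mathcal{D}_t)\).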

Here is the catch: the second term in the equation above involves an expectation over the global maximum \(y^*\), and it has no closed form, so we cannot compute it directly. In practice it is approximated via sampling or heuristics, which can be inaccurate or slow.

The Conflict

Historically, EI and MES were treated as different beasts: EI is an improvement-based heuristic, while MES is grounded in information theory. The authors of this paper challenge this view. They argue that if we look at MES through the lens of Variational Inference, we can derive EI directly from it.


The Core Method: Variational Entropy Search (VES)

The core contribution of this work is the Variational Entropy Search (VES) framework. To understand it, we need a quick refresher on Variational Inference (VI).

In Bayesian statistics, when a posterior distribution is too hard to compute (like the term in MES), we approximate it with a simpler distribution \(q(z)\) from a tractable family (such as Gaussians or exponentials). We then make \(q(z)\) as close as possible to the true posterior \(p(z \mid \mathcal{D})\) by maximizing a quantity called the Evidence Lower Bound (ELBO).

General definition of the Evidence Lower Bound (ELBO).
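In its generic form, for data \(\mathcal{D}\) and latent variable \(z\), the ELBO reads

\[
\log p(\mathcal{D})
\;\ge\;
\mathbb{E}_{q(z)}\big[\log p(\mathcal{D}, z) - \log q(z)\big]
= \mathbb{E}_{q(z)}\big[\log p(\mathcal{D} \mid z)\big]
- \mathrm{KL}\big(q(z)\,\|\,p(z)\big),
\]

and maximizing it over \(q\) is equivalent to minimizing \(\mathrm{KL}\big(q(z)\,\|\,p(z \mid \mathcal{D})\big)\).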

The Entropy Search Lower Bound (ESLBO)

The researchers applied this VI logic to the MES acquisition function. They derived a lower bound for MES, which they call the ESLBO.

Instead of struggling to calculate the exact entropy reduction, they maximize this lower bound:

Derivation of the Entropy Search Lower Bound from MES.

By ignoring the constant terms that don’t depend on our choice of \(x\), we get the clean definition of the ESLBO:

Definition of the ESLBO.
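In my notation (which may not match the paper’s exactly), the idea is the standard variational lower bound on mutual information: swap the intractable conditional \(p(y^* \mid \mathcal{D}_t, y_x)\) for a tractable density \(q\),

\[
I\big(y_x;\,y^* \mid \mathcal{D}_t\big)
\;\ge\;
H\big[p(y^* \mid \mathcal{D}_t)\big]
+ \mathbb{E}_{p(y_x,\,y^* \mid \mathcal{D}_t)}\Big[\log q\big(y^* \mid \mathcal{D}_t,\,y_x\big)\Big],
\]

and since the entropy term does not depend on \(x\), maximizing the bound reduces to maximizing the expectation term; that remaining term plays the role of the ESLBO.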

Here, \(q(y^* | \mathcal{D}_t, y_x)\) is our “variational distribution.” It is a guess we make about what the distribution of the global maximum value \(y^*\) looks like, given our current data and a new potential sample.

This is the pivotal moment in the paper. The choice of the distribution \(q\) determines the behavior of the acquisition function.

Visualizing the Framework

The figure below perfectly encapsulates the VES framework.

  1. Left: We have a Gaussian Process with some data points (crosses) and a potential new sample (red star).
  2. Right: We are trying to approximate the distribution of the global maximum \(y^*\) (the blue curve).
  • If we approximate it with an Exponential distribution (green dashed line), we get EI.
  • If we approximate it with a Gamma distribution (red line), we get a more flexible, powerful method.

Illustration of the VES framework comparing true distribution, exponential approximation, and gamma approximation.

Let’s break down these two choices.


1. VES-Exp: The Theoretical Bridge (Recovering EI)

The authors asked a fascinating question: What if we restrict our variational family \(Q\) to be Exponential distributions?

They defined the variational density \(q\) as an exponential distribution starting at the current best observed value:

Exponential variational density definition.

When you plug this specific \(q\) into the general ESLBO equation derived earlier, the math collapses into something very familiar.

ESLBO under exponential assumption equals log lambda minus constant plus Expected Improvement.

Look closely at the final term: It is exactly the Expected Improvement (EI).

The first two terms depend on \(\lambda\), but for a fixed \(\lambda\), maximizing this bound is mathematically identical to maximizing EI. This proves Theorem 3.2:

Expected Improvement is effectively a variational approximation of Max-value Entropy Search in which the posterior of the maximum value is assumed to be exponential.

This changes how we view EI. It is not just a heuristic; it is an information-theoretic method that makes a very rigid assumption (exponentiality) about the unknown maximum.
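To see why the collapse happens, here is a hedged reconstruction (the paper’s exact constants and scaling may differ). Take \(q\) to be an exponential density with rate \(\lambda\), supported above the incumbent \(\tau := \max(y^*_t, y_x)\), i.e. the best value after hypothetically observing \(y_x\):

\[
q\big(y^* \mid \mathcal{D}_t, y_x\big) = \lambda\, e^{-\lambda\,(y^* - \tau)},
\qquad y^* \ge \tau .
\]

Substituting this into the expectation term and using \(y^* - \tau = (y^* - y^*_t) - \max(y_x - y^*_t,\,0)\) gives

\[
\mathbb{E}\big[\log q\big]
= \log\lambda
- \lambda\,\mathbb{E}\big[y^* - y^*_t\big]
+ \lambda\,\underbrace{\mathbb{E}\big[\max(y_x - y^*_t,\,0)\big]}_{\mathrm{EI}_t(x)} .
\]

Only the last term depends on \(x\), so for any fixed \(\lambda\) the bound is maximized exactly where EI is maximized.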

2. VES-Gamma: The Evolution

The exponential distribution is simple, but as we saw in the visualization (Figure 1), it is often a poor fit. The true distribution of the global maximum, \(p(y^*)\), is rarely monotonically decreasing: it usually starts low, rises to a peak, and then tails off, a shape the exponential density cannot match.

To fix this, the authors propose VES-Gamma. They replace the Exponential distribution with a Gamma distribution, which generalizes the exponential but adds a shape parameter \(k\).

Gamma variational density definition.
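A shifted Gamma density with shape \(k\) and rate \(\beta\), written in the same notation as the exponential sketch above (the paper’s parameterization may differ), looks like

\[
q\big(y^* \mid \mathcal{D}_t, y_x\big)
= \frac{\beta^{k}}{\Gamma(k)}\,\big(y^* - \tau\big)^{k-1}\, e^{-\beta\,(y^* - \tau)},
\qquad y^* \ge \tau := \max\big(y^*_t,\,y_x\big),
\]

which reduces to the exponential density when \(k = 1\), so VES-Exp sits inside this family.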

This flexibility allows the approximation to have a “hump” (when \(k > 1\)), fitting the true posterior much better. When we plug this Gamma distribution into the ESLBO, we get the VES-Gamma Acquisition Function:

The full VES-Gamma acquisition function equation.

This equation is a beauty.

  • The last term is EI (scaled by \(\beta\)).
  • The other terms (involving \(k\) and logs) act as regularizers or “correction terms” based on information-theoretic principles.
  • If \(k=1\), this collapses back to EI (see the sketch below).
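Concretely, substituting the shifted Gamma density into the expectation term yields terms of the following shape (my reconstruction of the structure, not the paper’s exact expression):

\[
\mathbb{E}\big[\log q\big]
= k\log\beta - \log\Gamma(k)
+ (k-1)\,\mathbb{E}\big[\log\big(y^* - \tau\big)\big]
- \beta\,\mathbb{E}\big[y^* - y^*_t\big]
+ \beta\,\mathrm{EI}_t(x),
\]

with \(\tau = \max(y^*_t, y_x)\) as before. Setting \(k = 1\) removes the Gamma-specific terms and recovers the VES-Exp expression, with \(\beta\) playing the role of \(\lambda\).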

Auto-Tuning Hyperparameters

One challenge with VES-Gamma is determining the parameters \(k\) and \(\beta\). The paper proposes an auto-determination method. For every candidate point \(x\), they solve for the optimal \(k\) and \(\beta\) that maximize the lower bound.

This involves solving a specific equation involving the digamma function \(\psi(k)\):

Equation relating log k and digamma to expectations.
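A reconstruction from the first-order conditions of the Gamma-based sketch above (the paper’s exact statement may differ): setting the derivative with respect to \(\beta\) to zero gives \(\beta = k \,/\, \mathbb{E}[y^* - \tau]\), and substituting this into the \(k\) condition leaves a single equation in \(k\),

\[
\log k - \psi(k)
= \log \mathbb{E}\big[y^* - \tau\big]
- \mathbb{E}\big[\log\big(y^* - \tau\big)\big],
\]

whose right-hand side is nonnegative by Jensen’s inequality.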

As shown below, the function \(\log k - \psi(k)\) is strictly decreasing, ensuring that a unique solution exists.

Plot of log k minus psi(k) showing strictly decreasing behavior.

This auto-tuning makes VES-Gamma parameter-free for the user. It dynamically adjusts the balance between “pure EI” behavior and “information-seeking” behavior based on the landscape of the function.
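For illustration only, here is a small Python sketch (not the paper’s implementation; the helper name and the way the gap samples are obtained are my assumptions) of how the one-dimensional equation for \(k\) could be solved by root-finding, given Monte Carlo samples of the gap \(y^* - \tau\), e.g. drawn from the GP’s approximate posterior over the maximum:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def fit_gamma_shape_rate(gap_samples):
    """Fit shape k and rate beta of a Gamma density to positive gap samples.

    gap_samples: Monte Carlo samples of y* - tau (must be positive).
    Returns (k, beta) satisfying the first-order conditions
        log k - digamma(k) = log E[gap] - E[log gap],   beta = k / E[gap].
    """
    gaps = np.asarray(gap_samples, dtype=float)
    gaps = gaps[gaps > 0]
    rhs = np.log(gaps.mean()) - np.mean(np.log(gaps))  # >= 0 by Jensen's inequality
    if rhs < 1e-6:
        # Nearly deterministic gaps: the optimal shape blows up, so cap it.
        k = 1e6
    else:
        # log(k) - digamma(k) is strictly decreasing, so bracketing finds the root.
        k = brentq(lambda k: np.log(k) - digamma(k) - rhs, 1e-6, 1e6)
    beta = k / gaps.mean()
    return k, beta

# Example with synthetic gap samples drawn from a Gamma(shape=2, scale=0.5):
rng = np.random.default_rng(0)
print(fit_gamma_shape_rate(rng.gamma(shape=2.0, scale=0.5, size=5000)))
```

Because \(\log k - \psi(k)\) is strictly decreasing, any bracketing root-finder converges to the unique solution.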


Experiments & Results

Does this unified theory translate to better optimization performance? The authors tested VES-Gamma against standard EI and MES across synthetic functions, GP samples, and real-world problems.

Verification: Is VES-Exp really EI?

First, they had to prove their theory. They ran an optimization using standard EI and their derived “VES-Exp.”

Comparison traces of VES-Exp and EI showing they are nearly identical.

As shown in the traces above, the two methods behave almost identically. The researchers backed this up with a Kolmogorov-Smirnov (KS) test, which found no statistically significant difference between the two. The minor deviations come from the numerical approximations (Monte Carlo sampling) used in VES, whereas EI is computed in closed form.

Performance: Synthetic Benchmarks

On standard test functions (Branin, Levy, Hartmann, Griewank), VES-Gamma (blue triangles) consistently performs at the top of the pack.

Results on synthetic benchmarks. VES-Gamma performs best on Branin and Hartmann.

It is particularly strong on the Hartmann (6D) function, where it significantly outperforms MES. This suggests that the Gamma approximation captures the uncertainty of the optimum much better than the standard approximations used in MES or the implicit exponential assumption of EI.

Performance: GP Samples (High Dimensionality)

The strength of VES-Gamma becomes most obvious on high-dimensional functions sampled directly from a GP prior (up to 100 dimensions).

Results on 100-dimensional GP samples. VES-Gamma dominates at lower length scales.

In the plots above, look at the difference when the lengthscale (\(l\)) is small (top left). A small lengthscale implies a “wiggly,” complex function. Here, standard EI and MES struggle to make progress, likely getting stuck in local optima. VES-Gamma, however, continues to improve the objective value significantly.

Performance: Real-World Benchmarks

Finally, the authors tested on real-world engineering and ML tuning problems:

  • Rover: Trajectory optimization (60D).
  • Mopta08: Vehicle design (124D).
  • Lasso-DNA: Sparse regression (180D).
  • SVM: Hyperparameter tuning (388D).

Results on real-world benchmarks. VES-Gamma is superior on SVM and competitive elsewhere.

On the SVM benchmark (bottom right), VES-Gamma is the clear winner, finding much better hyperparameters than EI or MES. On the other tasks, it remains highly competitive, often matching the best-performing baseline.

The Cost of Sophistication

There is no free lunch. Because VES-Gamma involves an inner optimization loop to find the best \(k\) and \(\beta\) parameters for every candidate point, it is computationally more expensive than the closed-form EI.

Table showing runtime comparison. VES is significantly slower per iteration.

As shown in Table 2, VES takes about 10x longer per iteration than EI. However, in Bayesian Optimization, the objective function (e.g., training a neural network or running a physical simulation) often takes minutes or hours. In that context, spending an extra 10 seconds to choose a better sampling point is a negligible cost for a potentially massive reduction in the number of required function evaluations.


Conclusion & Implications

This paper provides a satisfying unification of two disparate subfields in Bayesian Optimization.

  1. Theoretical Unification: By viewing Expected Improvement as a variational approximation of Entropy Search (specifically using an exponential posterior), the authors demystified the relationship between value-based and information-based optimization.
  2. Practical Innovation: The VES-Gamma acquisition function capitalizes on this insight. By using a more flexible Gamma distribution, it adapts to the problem geometry better than EI (which is too rigid) and MES (which relies on difficult approximations).

The implications are exciting. Now that the framework is established, future research could explore even more complex variational families beyond Gamma distributions, potentially unlocking even more efficient optimization algorithms for high-dimensional, expensive black-box functions.

For students and practitioners, the takeaway is clear: Don’t view EI and Entropy Search as enemies. They are part of the same family, and understanding their connection allows us to build better tools for solving the hardest optimization problems.