Imagine you’re running an online advertising campaign. Every time a user visits a website, you must decide which ad to show them. Your goal is to maximize clicks, but you face two major challenges. First, you don’t know beforehand which ad a user will like—you only find out whether they clicked after the ad is shown. This is called bandit feedback: you only get information about the action you took, not the others you could have taken. Second, feedback can be delayed. A user might not click immediately, and the report confirming a click might take minutes, hours, or even days to arrive.
How can you make optimal decisions now when the feedback for your most recent actions hasn’t arrived, and the feedback you do have reflects actions taken long ago? This is the central problem of Bandit Convex Optimization (BCO) with delayed feedback—a model for many real-world sequential decision-making tasks, from online routing to financial trading.
For years, researchers developed algorithms for this problem, yet a substantial gap persisted between the upper bound (the best proven performance of algorithms) and the lower bound (theoretical best possible performance). A recent paper, Improved Regret for Bandit Convex Optimization with Delayed Feedback, introduces a new algorithm that nearly closes this gap. Its key insight is elegantly simple: sometimes, the best strategy is to be patient—update less frequently. By using a blocking mechanism, the authors’ algorithm, Delayed Follow-The-Bandit-Leader (D-FTBL), achieves a near-optimal performance guarantee, demonstrating that a little patience can go a long way when facing uncertainty and delays.
In this post, we’ll explore the intuition behind this result: how blocking transforms the painful problem of delayed feedback into one that’s tractable and even optimal.
Background: The Landscape of Online Optimization
To understand the contribution, let’s build the concepts step-by-step: Online Convex Optimization (OCO), the bandit setting, and the complication of delays.
Online Convex Optimization (OCO)
At its heart, online learning is a repeated game between a player (our algorithm) and an adversary (the environment). Over a sequence of rounds \(T\):
- At each round \(t\), the player chooses an action \( \mathbf{x}_t \) from a set of possible actions \( \mathcal{K} \).
- The adversary reveals a convex loss function \( f_t(\cdot) \).
- The player suffers a loss \( f_t(\mathbf{x}_t) \).
- The player receives feedback about \( f_t \) to inform the next decision.
The goal is not perfection in any individual round but to minimize the total regret, the difference between the player’s cumulative loss and the loss of the best single fixed action chosen in hindsight.
Regret quantifies how much worse the algorithm performs compared to the best fixed action in hindsight.
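Concretely, if the player's actions are \( \mathbf{x}_1, \dots, \mathbf{x}_T \), the regret is

\[ R_T = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\mathbf{x}). \]

If \( R_T \) grows sublinearly in \( T \), the average per-round gap to the best fixed action shrinks to zero.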
In standard OCO, the player gets full information after each round—they can access the entire function \( f_t \) or its gradient \( \nabla f_t(\mathbf{x}_t) \). This enables efficient algorithms like Online Gradient Descent (OGD), achieving optimal regret that grows slowly with \(T\), typically as \(O(\sqrt{T})\).
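To make the full-information baseline concrete, here is a minimal sketch of projected Online Gradient Descent. The ball-shaped action set, the fixed step size `eta`, and the toy quadratic losses are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def project_to_ball(x, radius=1.0):
    """Euclidean projection onto a ball of the given radius (our example action set K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(grad_fs, dim, eta=0.1):
    """Projected OGD: after each round the full gradient of f_t is revealed and used."""
    x = np.zeros(dim)
    actions = []
    for grad_f in grad_fs:
        actions.append(x.copy())           # play the current action x_t
        g = grad_f(x)                      # full-information feedback: gradient of f_t at x_t
        x = project_to_ball(x - eta * g)   # gradient step, then project back into K
    return actions

# Toy usage: quadratic losses f_t(x) = ||x - z_t||^2 with random targets z_t.
rng = np.random.default_rng(0)
targets = [rng.normal(size=3) for _ in range(100)]
grad_fs = [lambda x, z=z: 2 * (x - z) for z in targets]
played = online_gradient_descent(grad_fs, dim=3)
```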
The Bandit Challenge: Learning with One Hand Tied Behind Your Back
Bandit Convex Optimization (BCO) is the more difficult version of this game. Here, the player receives minimal information: only the loss value for the specific action taken, \( f_t(\mathbf{x}_t) \).
How can one perform gradient descent without knowing the gradient? The breakthrough came from Flaxman et al. (2005), who introduced the one-point gradient estimator. They estimated the gradient of a smoothed version of the loss function rather than the function itself.
First, define a smoothed function \( \hat{f}_{\delta}(\mathbf{x}) \) by averaging \( f \) around \( \mathbf{x} \) within a small ball of radius \( \delta \): \( \hat{f}_{\delta}(\mathbf{x}) = \mathbb{E}_{\mathbf{v}}\left[ f(\mathbf{x} + \delta \mathbf{v}) \right] \), where \( \mathbf{v} \) is drawn uniformly from the unit ball.
Smoothing transforms a non-differentiable function into one whose gradient can be estimated using random perturbations.
The magic is that the gradient of this smoothed function can be estimated using just a single function value. If you sample a random direction \( \mathbf{u} \) uniformly from the unit sphere, then \( \frac{n}{\delta} f(\mathbf{x} + \delta \mathbf{u})\,\mathbf{u} \) is an unbiased estimate of \( \nabla \hat{f}_{\delta}(\mathbf{x}) \).
This trick allows gradient-descent-like updates using only one function evaluation per round.
This enables algorithms such as Bandit Gradient Descent (BGD) to operate without access to full gradients. However, there’s a trade-off: the estimator has high variance, and the quality of the approximation depends on the dimension \( n \). Thus, standard BCO typically achieves a regret of \(O(\sqrt{n}T^{3/4})\)—worse than the \(O(\sqrt{T})\) for full-information OCO.
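A small sketch of the one-point trick, in the spirit of BGD; the smoothing radius `delta`, the step size, and the omission of projection are simplifying assumptions for illustration.

```python
import numpy as np

def random_unit_vector(n, rng):
    """Sample u uniformly from the unit sphere in R^n."""
    u = rng.normal(size=n)
    return u / np.linalg.norm(u)

def one_point_gradient_estimate(f, x, delta, rng):
    """Return (n / delta) * f(x + delta * u) * u, an unbiased estimate of the
    gradient of the delta-smoothed version of f at x (Flaxman et al., 2005)."""
    n = x.shape[0]
    u = random_unit_vector(n, rng)
    return (n / delta) * f(x + delta * u) * u

# One BGD-style round: query f_t only at the single perturbed point, then step.
rng = np.random.default_rng(1)
x = np.zeros(5)
f_t = lambda z: float(np.sum(z ** 2))         # example convex loss
g_hat = one_point_gradient_estimate(f_t, x, delta=0.1, rng=rng)
x = x - 0.01 * g_hat                          # gradient-descent-like update (projection omitted)
```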
The Delay Dilemma
Now introduce delays. Feedback for the loss \( f_t(\mathbf{x}_t) \) doesn’t arrive immediately—it appears only after an arbitrary number of rounds, \(d_t\). This means decisions are made based on outdated information.
Previous algorithms, like GOLD and its improved variants, tackled this by using the oldest available feedback at every step. These methods achieved a regret bound of approximately \(O(\sqrt{n}T^{3/4} + (n\bar{d})^{1/3}T^{2/3})\), where \( \bar{d} \) is the average delay.
Let’s unpack this:
- Delay-independent part \(O(\sqrt{n}T^{3/4})\): matches the best known bound for non-delayed BCO, so this part is already as good as one can expect.
- Delay-dependent part \(O((n\bar{d})^{1/3}T^{2/3})\): problematic, since the known theoretical lower bound is \( \Omega(\sqrt{dT}) \), where \(d\) is the maximum delay.
This revealed a wide gap—algorithms weren’t delay-efficient.
The new paper’s innovation centers on closing that gap.
The Core Idea: Decoupling Delay and Bandit Noise Through Blocking
The authors identified the culprit behind poor performance: a damaging coupling between delay and bandit noise.
Think intuitively:
- The one-point estimator has high variance, proportional to \(n^2/\delta^2\).
- Delays mean updates rely on stale feedback. The longer \(d\), the more outdated the gradients.
- Combining old information with high-variance noise amplifies error. This interaction leads to the unwieldy \(O((n\bar{d})^{1/3}T^{2/3})\) term.
Their remedy is disarmingly simple: update less frequently.
Instead of updating every round, the algorithm groups the \(T\) rounds into equal-sized blocks of \(K\) rounds. Within each block, it plays around a fixed “preparatory” action \( \mathbf{y}_m \), exploring using random perturbations \( \mathbf{x}_t = \mathbf{y}_m + \delta \mathbf{u}_t \). Then it updates only once per block, aggregating all feedback received during those rounds.
This blocking strategy produces two crucial effects:
1. Variance Reduction
Summing the \(K\) noisy gradient estimates within a block lets much of the random exploration noise cancel out. With a properly chosen block size \(K\), the expected norm of the aggregated gradient estimate behaves as \(O(K)\), instead of \(O(Kn/\delta)\).
By aggregating gradient estimates within a block, their combined magnitude grows only linearly with the block size K, removing the extra \(n/\delta\) factor contributed by the exploration noise.
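A back-of-the-envelope calculation (not the paper's exact lemma) shows why. Assume the loss values are bounded by a constant \(C\), the smoothed gradients by \(G\), and that the perturbation directions \( \mathbf{u}_t \) within a block are independent. Then, up to constants,

\[ \mathbb{E}\left\| \sum_{t \in \text{block}} \frac{n}{\delta} f_t(\mathbf{x}_t)\,\mathbf{u}_t \right\|_2 \;\lesssim\; \underbrace{K G}_{\text{signal}} \;+\; \underbrace{\sqrt{K}\,\frac{nC}{\delta}}_{\text{noise}}, \]

because the signal terms add up coherently while the zero-mean noise terms only accumulate at rate \( \sqrt{K} \). Once \(K\) is at least on the order of \( n^2/\delta^2 \), the whole expression is \(O(K)\).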
2. Delay Amortization
Delays now occur at the block level. Even if the feedback for the final action in a block arrives \(d\) rounds later, all information for that block will be usable at round \(mK + d - 1\). Thus, the effective delay per block is amortized over \(K\): the algorithm “waits” but learns more efficiently.
Together, these effects decouple the noisy gradient estimator from delay-related drift. The once-problematic regret term proportional to \(d n/\delta\) simplifies neatly to depend only on \(d\).
Decoupling delay and variance eliminates the amplification effect that previously drove up regret.
The D-FTBL Algorithm
The proposed Delayed Follow-The-Bandit-Leader (D-FTBL) implements blocking within the Follow-The-Regularized-Leader (FTRL) framework. Here’s how it works:
- Initialization: Choose a starting action \( \mathbf{y}_1 \in \mathcal{K}_\delta \) (a slightly shrunk version of \( \mathcal{K} \), so that the perturbed points stay feasible) and set the cumulative gradient estimate \( \bar{\mathbf{g}}_0 = \mathbf{0} \).
- Within each block:
- Play for \(K\) rounds using \( \mathbf{x}_t = \mathbf{y}_m + \delta \mathbf{u}_t \).
- Accumulate delayed feedback as it arrives:
\( \bar{\mathbf{g}}_t = \bar{\mathbf{g}}_{t-1} + \sum_{k \in \mathcal{F}_t} \frac{n}{\delta} f_k(\mathbf{x}_k)\mathbf{u}_k \), where \( \mathcal{F}_t \) is the set of rounds whose feedback arrives at round \(t\).
- Block-level update: At the block's end, compute the next preparatory action \( \mathbf{y}_{m+1} \) by minimizing the sum of all observed linearized losses plus a regularization term, i.e., an FTRL step over \( \mathcal{K}_\delta \).
FTRL-based updates at block boundaries ensure stability and use all available delayed feedback.
This design elegantly balances exploration (within each block) and delayed exploitation (at block transitions); a simplified sketch follows.
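Below is a simplified, self-contained sketch of this blocking structure. It is not the paper's exact algorithm: the FTRL step is instantiated with a plain Euclidean regularizer centered at the origin, the shrunk set \( \mathcal{K}_\delta \) is taken to be a Euclidean ball, and the parameters `K`, `delta`, and `eta` are placeholders rather than the tuned values from the analysis.

```python
import numpy as np
from collections import defaultdict

def unit_sphere(n, rng):
    u = rng.normal(size=n)
    return u / np.linalg.norm(u)

def project(x, radius):
    """Projection onto a Euclidean ball, standing in for the shrunk set K_delta."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def d_ftbl_sketch(losses, delays, dim, T, K=10, delta=0.1, eta=0.01, radius=1.0, seed=0):
    """Blocked bandit learning in the spirit of D-FTBL (illustrative only).

    losses[t](x) -> scalar bandit feedback for round t (a value, never a gradient)
    delays[t]    -> number of rounds after which round t's feedback becomes available
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(dim)              # preparatory action y_1
    g_bar = np.zeros(dim)          # cumulative gradient estimate
    arrivals = defaultdict(list)   # round -> list of (loss value, direction) arriving then
    actions = []

    for t in range(T):
        # Within the current block, explore around the fixed preparatory action.
        u = unit_sphere(dim, rng)
        x = y + delta * u
        actions.append(x)

        # The bandit feedback for round t shows up only delays[t] rounds later.
        arrivals[t + delays[t]].append((losses[t](x), u))

        # Fold in whatever delayed feedback arrives at round t.
        for value, direction in arrivals.pop(t, []):
            g_bar += (dim / delta) * value * direction

        # Block boundary: one FTRL-style update using all feedback received so far.
        if (t + 1) % K == 0:
            # argmin_y <g_bar, y> + ||y||^2 / (2 * eta) over the ball equals
            # the projection of the unconstrained minimizer -eta * g_bar.
            y = project(-eta * g_bar, radius)
    return actions

# Toy usage: quadratic bandit losses with a fixed delay of 5 rounds.
T, dim = 200, 3
rng = np.random.default_rng(42)
targets = [rng.normal(scale=0.5, size=dim) for _ in range(T)]
losses = [lambda x, z=z: float(np.sum((x - z) ** 2)) for z in targets]
delays = [5] * T
played = d_ftbl_sketch(losses, delays, dim, T)
```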
Tighter Bounds and Theoretical Triumphs
Performance guarantees are where the new approach shines.
General Convex Functions
For convex loss functions, D-FTBL achieves:
Regret \( = O(\sqrt{n}\,T^{3/4} + \sqrt{dT}) \), combining the best non-delayed rate with a delay term that matches the lower bound.
Meaning:
- The term \(O(\sqrt{n}T^{3/4})\) matches the delay-free optimum from standard BGD.
- The term \(O(\sqrt{dT})\) aligns perfectly with the known theoretical lower bound.
This closes a longstanding gap between achievable and fundamental limits. Whenever the maximum delay \(d\) isn’t extremely large, D-FTBL strictly outperforms previous approaches.
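Ignoring constants, simple algebra on the two delay-dependent terms quoted above makes "not extremely large" precise:

\[ \sqrt{dT} \;\le\; (n\bar{d})^{1/3} T^{2/3} \quad\Longleftrightarrow\quad d \;\le\; (n\bar{d})^{2/3}\, T^{1/3}, \]

so unless the maximum delay grows faster than \( (n\bar{d})^{2/3} T^{1/3} \), the new \( \sqrt{dT} \) term is the smaller of the two.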
Strongly Convex and Smooth Functions
When loss functions have additional structure, D-FTBL becomes even stronger.
Strongly Convex Functions: If each loss function is strongly convex (it curves upward at least quadratically, which in particular guarantees a unique minimizer), D-FTBL achieves:
Regret \( = O((nT)^{2/3}\log^{1/3} T + d\log T) \), optimal under strong convexity.
The delay-dependent part matches the best achievable \(O(d \log T)\) known from full-information delayed settings.
Strongly Convex & Smooth + Unconstrained Action Space: In this most favorable scenario, regret improves further to \(O(n\sqrt{T\log T} + d\log T)\), again matching non-delayed optimum rates.
Broader Insights and Future Directions
The paper’s results go beyond an algorithm—they offer a principle: decoupling sources of error yields stability and optimality.
Blocking separates exploration (noisy bandit sampling) from update decisions (delayed, aggregate learning). By waiting to collect sufficient information before updating, D-FTBL learns more effectively despite delayed feedback.
The technique sets a new state of the art for delayed bandit convex optimization and opens several questions:
- Can similar blocking mechanisms help other bandit algorithms handle delay or high noise?
- Can we derive bounds depending on average delay (\(\bar{d}\)) rather than maximum delay (\(d\))?
- How can blocking adapt dynamically in environments with unpredictable feedback latencies?
Regardless of future progress, one lesson persists:
Patience is powerful.
In uncertain, slow-feedback environments, waiting—strategically—can be the fastest way to learn.