In the world of machine learning and operations research, textbook problems often assume that the data comes from a single, static distribution. You train your model, the data behaves politely, and you find an optimal solution.

But the real world is rarely so cooperative. Financial markets fluctuate, user preferences drift, and sensor networks experience environmental changes. In these scenarios, the underlying probability distribution generating your data changes over time. This is the domain of time-varying distributions.

When you combine this shifting landscape with complex, nonsmooth (pointy/kinked), and nonconvex (multiple valleys) objective functions, standard optimization algorithms often fall apart. They either converge too slowly because they use too much old data, or they fail to converge at all because they can’t handle the mathematical “sharpness” of the functions.

In this post, we are doing a deep dive into a 2025 research paper by Ye, Cui, and Wang that tackles this exact intersection of challenges. They propose a novel method: the Online Stochastic Proximal Difference-of-Convex Algorithm (ospDCA) using an Adaptive Sampling strategy.

We will explore how they solved the problem of getting accurate gradients for nonsmooth functions and how they designed an algorithm that “knows” when to increase its sample size to maintain accuracy without wasting computational resources.

The Problem: Nonsmooth, Nonconvex, and Non-Stationary

Let’s formally define the beast we are trying to tame. We are looking at a Difference-of-Convex (DC) optimization problem. DC programming is powerful because almost any nonconvex function can be represented as the difference between two convex functions.

The objective is to minimize a function \(f(x)\) constrained to a convex set \(C\):

\[
\min_{x \in C} \; f(x) := g(x) - h(x), \qquad g(x) = \mathbb{E}_{P_\xi}\big[G(x, \xi)\big], \quad h(x) = \mathbb{E}_{P_\zeta}\big[H(x, \zeta)\big]
\]

Here, \(G(x, \xi)\) and \(H(x, \zeta)\) are convex functions, but their difference, \(f(x)\), is nonconvex. The variables \(\xi\) and \(\zeta\) represent random data vectors.
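For a concrete (and purely illustrative, not-from-the-paper) example of such a decomposition: the double-well function \(x^4 - x^2\) is nonconvex, with two valleys, yet it is exactly the difference of two convex functions:

\[
x^4 - x^2 = \underbrace{x^4}_{\text{convex}} - \underbrace{x^2}_{\text{convex}}.
\]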

There are three distinct layers of difficulty here:

  1. Stochastic: We cannot compute the expected values \(g(x)\) and \(h(x)\) exactly; we have to estimate them using sampled data.
  2. Nonsmooth: The functions \(G\) and \(H\) might not have smooth derivatives (think of the absolute value function \(|x|\) or ReLU). This means we are dealing with subgradients (sets of valid slopes) rather than single gradient vectors (see the quick example after this list).
  3. Time-Varying: The distributions \(P_\xi\) and \(P_\zeta\) change over time. Data collected at time \(t=1\) might be misleading at time \(t=100\).
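To make point 2 concrete: away from the origin the absolute-value function has the single slope \(\operatorname{sign}(x)\), but at the kink its subdifferential is the whole interval

\[
\partial |x|\,\big|_{x=0} = [-1, 1].
\]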

Measuring the Shift

To handle time-varying distributions, we need a way to measure how much the “ground truth” is moving under our feet. The authors utilize the Wasserstein-1 distance, often called the “Earth Mover’s Distance.” It effectively measures the work required to transform one probability distribution into another.

\[
W_1(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int \|u - v\| \, \mathrm{d}\pi(u, v),
\]

where \(\Pi(P, Q)\) is the set of all couplings (joint distributions) with marginals \(P\) and \(Q\).

The assumption made in this work is that while the distributions change, they don’t change infinitely fast. The cumulative shift over time is bounded, meaning the environment eventually settles or moves slowly enough for the algorithm to catch up.
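As a quick, hypothetical illustration (my own, not the paper's code), SciPy ships an empirical Wasserstein-1 distance for one-dimensional samples, which makes it easy to see a drifting mean show up as transport cost:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Batches drawn at time t and time t+1, where the underlying mean has drifted by 0.5.
samples_t  = rng.normal(loc=0.0, scale=1.0, size=10_000)
samples_t1 = rng.normal(loc=0.5, scale=1.0, size=10_000)

# Empirical Wasserstein-1 ("earth mover's") distance between the two batches.
w1 = wasserstein_distance(samples_t, samples_t1)
print(f"W1 between consecutive batches: {w1:.3f}")  # close to the mean shift of 0.5
```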

The Theoretical Breakthrough: Subdifferential Convergence

Before we get to the algorithm, we must address a major theoretical hurdle. In smooth stochastic optimization, we know that if we take an average of sampled gradients, that average converges to the true gradient very quickly. Specifically, the expected squared error drops at a rate of \(O(1/n)\), where \(n\) is the number of samples.

\[
\mathbb{E}\big[\|\nabla \hat{g}_n(x) - \nabla g(x)\|^2\big] \le \frac{C}{n},
\]

where \(\hat{g}_n\) is the sample-average approximation built from \(n\) samples.

However, dealing with nonsmooth functions is much harder. When a function has a “kink,” the derivative isn’t a single number; it’s a set of numbers (the subdifferential). When we sample subgradients from a nonsmooth function, simply averaging them doesn’t guarantee the same nice convergence behavior as the smooth case.

Previous work often relied on restrictive assumptions, such as a specific “measurable selector” for the subgradient, which is hard to implement in code.

Using Support Functions

The authors of this paper derived a new convergence rate for the Sample Average Approximation (SAA) of set-valued mappings. They measure the error using the Hausdorff distance, which quantifies how far two sets (the estimated subdifferential set vs. the true subdifferential set) are from each other.

\[
\mathbb{D}\big(\partial \hat{g}_n(x), \, \partial g(x)\big), \qquad \hat{g}_n(x) = \frac{1}{n}\sum_{i=1}^{n} G(x, \xi_i),
\]

where \(\mathbb{D}\) denotes the Hausdorff distance between the estimated and true subdifferential sets.

To solve this, they utilized a mathematical tool called the support function, \(\sigma(u, S)\). Instead of comparing the sets directly, they compare the support functions of the sets. By leveraging Hörmander’s formula, they converted the distance between sets into a maximization problem over unit vectors.

\[
\mathbb{D}(A, B) = \sup_{\|u\| \le 1} \big|\sigma(u, A) - \sigma(u, B)\big|
\]

for compact convex sets \(A\) and \(B\), where \(\sigma(u, S) = \sup_{s \in S} \langle u, s \rangle\).
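To see the support-function trick in action, here is a small self-contained sketch (my illustration, not the paper's code): for finite point sets the support function is just a maximum of inner products, and scanning random unit directions approximates the Hausdorff distance between the sets' convex hulls.

```python
import numpy as np

def support(u, S):
    """Support function sigma(u, S) = max_{s in S} <u, s> for a finite set S (rows of S)."""
    return np.max(S @ u)

def hausdorff_via_support(A, B, n_dirs=5_000, seed=0):
    """Approximate the Hausdorff distance between conv(A) and conv(B) using
    Hormander's formula: sup over unit vectors u of |sigma(u, A) - sigma(u, B)|."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, A.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
    return max(abs(support(u, A) - support(u, B)) for u in dirs)

# Two "subdifferential-like" sets in the plane: a unit square and a copy shifted by 0.3.
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = A + np.array([0.3, 0.0])
print(hausdorff_via_support(A, B))   # approximately 0.3
```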

The Result: Matching the Smooth Rate

Through this geometric approach, the authors proved a powerful theorem. They showed that the expected squared error of the subdifferential mapping converges at a rate of roughly \(O(1/n)\).

\[
\mathbb{E}\Big[\mathbb{D}\big(\partial \hat{g}_n(x), \, \partial g(x)\big)^2\Big] \le \frac{C}{n}
\]

Why does this matter? This is a theoretical heavy-hitter. It proves that sampling subgradients for nonsmooth functions is statistically just as efficient as sampling gradients for smooth functions. This result allows the algorithm to pick any valid subgradient from the sample set (computationally easy) rather than hunting for a specific mathematical selection (computationally hard), while still guaranteeing fast convergence.

The Core Method: Adaptive ospDCA

Now that the theory assures us our samples are reliable, let’s look at the algorithm.

The classical DC Algorithm (DCA) works by exploiting the structure \(f(x) = g(x) - h(x)\). Since \(h(x)\) is convex, we can linearize it. At every step, we find a subgradient of \(h(x)\), draw a line (or plane) touching \(h(x)\), and replace the difficult nonconvex problem with a sequence of easier convex problems.
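Here is a minimal deterministic DCA sketch on a toy problem of my own choosing (not the paper's setting): minimize \(f(x) = x^2 - |x|\), with \(g(x) = x^2\) and \(h(x) = |x|\). Each iteration picks a subgradient of \(h\), replaces \(h\) by that linearization, and solves the resulting convex problem in closed form.

```python
import numpy as np

def dca_toy(x0, n_iters=20):
    """Classical DCA on f(x) = x^2 - |x| with g(x) = x^2 and h(x) = |x|.
    Each step linearizes h at x_t using the subgradient y_t = sign(x_t),
    then minimizes the convex surrogate x^2 - y_t * x, whose minimizer is y_t / 2."""
    x = x0
    for _ in range(n_iters):
        y = np.sign(x)    # any valid subgradient of |x| (0 is allowed at x = 0)
        x = y / 2.0       # argmin_x  x^2 - y * x
    return x

print(dca_toy(2.0))   # converges to the critical point  0.5
print(dca_toy(-3.0))  # converges to the critical point -0.5
```

Both limits are critical points of the toy objective; which one you reach depends on the starting point, exactly the "multiple valleys" behaviour that nonconvexity creates.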

The Stochastic Model

At time step \(t\), the algorithm draws a batch of samples to approximate the functions. It constructs a convex approximation model, \(\bar{M}_t(d)\). Note that they do not use old samples. They only use fresh data from the current distribution to avoid bias from the distribution shift.

\[
\bar{M}_t(d) = \bar{g}_t(x_t + d) - \big(\bar{h}_t(x_t) + \langle \bar{y}_t, d \rangle\big) + \frac{\mu_t}{2}\|d\|^2
\]

Here, \(\bar{y}_t\) is the estimated subgradient of the concave part \(h\). The term \(\frac{1}{2}\mu_t \|d\|^2\) is a proximal regularization term that keeps the new step \(d\) from straying too far, ensuring stability.

The subproblem to solve at each iteration becomes:

\[
\bar{d}_t = \operatorname*{arg\,min}_{d \,:\, x_t + d \,\in\, C} \; \bar{M}_t(d)
\]
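As a hedged sketch of what one subproblem solve might look like (my construction; I assume the sampled convex part is an averaged absolute-error loss, the feasible set \(C\) is all of \(\mathbb{R}^p\), and a generic solver stands in for whatever specialised routine would really be used):

```python
import numpy as np
from scipy.optimize import minimize

def solve_subproblem(x_t, y_bar, X_batch, r_batch, mu_t):
    """Minimize the convex model
        M_t(d) = g_bar(x_t + d) - <y_bar, d> + (mu_t / 2) * ||d||^2
    over d, where g_bar is the sample-average absolute-error loss on the fresh
    batch (X_batch, r_batch) and y_bar is a sampled subgradient of the concave part."""
    def model(d):
        resid = r_batch - X_batch @ (x_t + d)
        g_bar = np.mean(np.abs(resid))                    # sampled convex part g
        return g_bar - y_bar @ d + 0.5 * mu_t * (d @ d)   # minus linearized h, plus prox term
    d0 = np.zeros_like(x_t)
    return minimize(model, d0, method="Nelder-Mead").x    # derivative-free: tolerates the kinks

# Tiny usage example on synthetic data.
rng = np.random.default_rng(1)
p, n = 3, 200
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.0, -2.0])
r = X @ beta_true + 0.1 * rng.normal(size=n)
d_t = solve_subproblem(np.zeros(p), np.zeros(p), X, r, mu_t=0.1)
print(d_t)   # roughly a shrunken version of beta_true
```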

The Adaptive Sampling Strategy

The standard approach to stochastic optimization is to increase the sample size at a fixed rate (e.g., \(t^2\) or linearly) to reduce variance as the algorithm proceeds.

However, a fixed schedule is inefficient.

  1. Early training: We are far from the solution. The steps are large. We don’t need high precision to know roughly which direction to go. A small sample size is fine.
  2. Late training: We are close to the critical point. The steps are tiny. Noise in the gradient can cause us to bounce around the optimum. We need high precision (large sample size).

The authors propose an Adaptive strategy. They link the required sample sizes (\(N_{g,t}\) and \(N_{h,t}\)) to the squared norm of the step, \(\|\bar{d}_t\|^2\).

If the algorithm proposes a large step \(\|\bar{d}_t\|\), it implies the current point is not a critical point, so we can afford rougher estimates. If the step is small, we are likely refining a solution, so we enforce tighter error bounds by demanding more samples.

The condition they derived to ensure convergence is:

The stepsize norm condition relating the step size d to the required sample sizes N_g and N_h.

This inequality acts as a regulator. The left side represents the progress (magnitude of the step). The right side represents the error tolerance, which shrinks as \(N\) increases. As the step size \(\|\bar{d}_t\|\) approaches zero (convergence), the required \(N\) must increase to satisfy the inequality.
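Schematically, the adaptive loop might look like the sketch below. The test \(\|\bar{d}_t\|^2 \ge c_g/N_g + c_h/N_h\) is my stand-in for the paper's condition (the constants, exponents, and budget cap are assumptions), but it captures the mechanism: if the proposed step is small relative to the sampling-error budget, enlarge the batches and recompute.

```python
import numpy as np

def adaptive_step(x_t, subproblem, N_g, N_h, c_g=0.05, c_h=0.05,
                  growth=2, max_rounds=4, N_max=100_000):
    """One iteration of an adaptive-sampling scheme (schematic sketch).
    `subproblem(x_t, N_g, N_h)` draws fresh batches of the given sizes and returns
    a proposed step d_t. Batch sizes grow (up to a budget cap) until the stand-in
    test  ||d_t||^2 >= c_g / N_g + c_h / N_h  is satisfied."""
    for _ in range(max_rounds):
        d_t = subproblem(x_t, N_g, N_h)
        if d_t @ d_t >= c_g / N_g + c_h / N_h:   # step is large relative to the noise budget
            break
        N_g = min(growth * N_g, N_max)           # likely near a critical point:
        N_h = min(growth * N_h, N_max)           # demand more samples
    return x_t + d_t, N_g, N_h

# Toy usage: a damped stochastic gradient step for E[(x - xi)^2] with xi ~ N(1, 1).
rng = np.random.default_rng(2)
def toy_subproblem(x_t, N_g, N_h):
    xi = rng.normal(loc=1.0, size=N_g)
    return -0.5 * (x_t - xi.mean())              # noisy step toward the optimum x = 1

x, N_g, N_h = np.array([4.0]), 4, 4
for t in range(15):
    x, N_g, N_h = adaptive_step(x, toy_subproblem, N_g, N_h)
print(float(x[0]), N_g)   # x settles near 1; batches stay small early, grow once steps shrink
```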

Convergence Analysis

Does this actually work? The authors provide a rigorous proof that the algorithm converges almost surely to a DC critical point.

The analysis relies on the sufficient descent property. In deterministic optimization, we expect the objective value to decrease at every step. In the stochastic, time-varying setting, we can only hope for this in expectation, and we must account for the error introduced by sampling and the distribution shift (Wasserstein distance).

The descent inequality in expectation, accounting for sampling errors and distribution shifts.

The equation above shows that the expected improvement depends on the step size (the good part), minus the sampling errors (the \(C/N^\alpha\) terms) and the distribution shifts (\(W_1\) terms).

By ensuring the cumulative distribution shift is finite (the environment eventually stabilizes) and increasing sample sizes adaptively, the authors make these error terms shrink fast enough that the descent term dominates.

\[
\mathbb{E}\big[\|\bar{d}_t\|^2\big] \longrightarrow 0 \quad \text{as } t \to \infty
\]

This result shows that the steps generated by the algorithm eventually shrink to zero, meaning the iterates settle at a DC critical point, where the subgradient “forces” of the two convex components balance each other out.

Experiments & Results

To validate the theory, the authors applied the Adaptive ospDCA to a problem of Online Sparse Robust Regression.

The goal is to find a sparse vector \(\beta\) that minimizes prediction error.

  • Robust: They use absolute error (L1 loss) instead of squared error to handle outliers.
  • Sparse: They use a capped-L1 penalty, which is a nonconvex DC function, to force many coefficients to zero.
  • Online: The data arrives in a stream with a shifting distribution.

The problem formulation looks like this:

\[
\min_{\beta} \; \mathbb{E}\big[\,|y - x^\top \beta|\,\big] + \lambda \sum_{j=1}^{p} \min\{|\beta_j|, \tau\}
\]
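To connect the pieces, here is a hedged sketch (my construction, following the bullets above rather than the paper's code) of the DC split behind this objective: the capped-L1 penalty satisfies \(\min\{|\beta_j|, \tau\} = |\beta_j| - \max\{|\beta_j| - \tau, 0\}\), so the convex part \(g\) collects the absolute-error loss plus a plain L1 penalty, and the convex part \(h\) collects the subtracted excess term.

```python
import numpy as np

def g_part(beta, X, y, lam):
    """Convex part g: sample-average absolute error plus a plain L1 penalty."""
    return np.mean(np.abs(y - X @ beta)) + lam * np.sum(np.abs(beta))

def h_part(beta, lam, tau):
    """Convex part h (subtracted off): lam * sum_j max(|beta_j| - tau, 0)."""
    return lam * np.sum(np.maximum(np.abs(beta) - tau, 0.0))

def objective(beta, X, y, lam, tau):
    """Sparse robust objective: L1 loss + capped-L1 penalty, which equals g - h."""
    return np.mean(np.abs(y - X @ beta)) + lam * np.sum(np.minimum(np.abs(beta), tau))

# Sanity check of the DC identity on random data.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
beta = rng.normal(size=5)
assert np.isclose(objective(beta, X, y, 0.1, 1.0),
                  g_part(beta, X, y, 0.1) - h_part(beta, 0.1, 1.0))
```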

Performance Comparison

The experiments compared the Adaptive ospDCA (blue) against non-adaptive versions (green/magenta) and other baselines.

In the charts below, the Y-axis is the distance to the optimal solution (log scale). The Adaptive algorithm clearly outperforms the others, achieving a lower error much faster.

Performance comparison graphs showing Adaptive ospDCA converging faster than baselines.

The Behavior of Adaptive Sampling

The most illuminating result is the plot of sample sizes over time.

Look at the blue dashed line in the charts below.

  • Phase 1: For the first ~60-80 iterations, the sample size stays very low (near zero on the Y-axis). The algorithm is taking large steps and making easy progress.
  • Phase 2: As it approaches the solution, the sample size skyrockets (the spike at the end).

Sample size evolution showing a sharp increase in samples only in the final stages of optimization.

This visualizes exactly why the adaptive method is efficient. The standard methods (green line) increase sample size gradually, wasting computation early on when it isn’t needed, and potentially not having enough precision later on. The adaptive method saves its “computational budget” for the end game, where precision is critical to nail the exact solution.

Conclusion & Implications

The paper “An Online Adaptive Sampling Algorithm for Stochastic Difference-of-convex Optimization with Time-varying Distributions” provides a robust framework for modern optimization problems.

Key Takeaways:

  1. Nonsmooth is fine: Thanks to the new \(O(\sqrt{p/n})\) pointwise convergence rate for subdifferentials, we can handle sharp kinks in our objective functions with the same sampling efficiency as smooth functions.
  2. Adaptivity wins: Linking sample size to the optimization step size allows the algorithm to be computationally cheap when exploring and computationally expensive only when refining.
  3. Robustness to Shift: The algorithm theoretically handles time-varying distributions, provided the total “drift” is bounded.

This work bridges the gap between theoretical nonsmooth analysis and practical online learning. For students and practitioners, it offers a blueprint for designing algorithms that don’t just blindly consume data, but adapt their hunger based on how close they are to the goal.