Introduction: The Data Bottleneck in Robotics
Imagine a fleet of hundreds of autonomous vehicles or delivery drones operating in the real world. Every second, these machines capture high-resolution video, LiDAR scans, and telemetry. Collectively, they generate terabytes of data every single day. Ideally, we would upload all this data to the cloud, label it, and use it to train smarter, safer AI models.
But in reality, this is impossible.
We face two crushing bottlenecks. First, bandwidth is limited. A car operating in a busy city or a remote rural area cannot upload terabytes of raw sensor data over a 4G or 5G connection; the network simply can’t handle it. Second, annotation is expensive. Even if we could upload everything, human annotators (or expensive foundation models) cannot label millions of images a day. We have a limited budget for ground-truth labeling.
This creates a dilemma: We have massive amounts of data, but we can only use a tiny fraction of it. The critical question becomes: Which specific data points should a robot upload, and which of those should the cloud decide to label?
In this post, we dive deep into a research paper titled “Distributed Upload and Active Labeling for Resource-Constrained Fleet Learning” (DUAL). The researchers propose a mathematically grounded framework that allows robots to make decentralized decisions about what to upload, while the cloud makes centralized decisions about what to label. The result is a system that learns significantly faster and better than traditional methods, even under strict resource constraints.
The Core Problem: Bandwidth vs. Budget
To understand the solution, we must first rigorously define the problem. We are dealing with a multi-robot system where \(N\) robots are deployed in different environments.
- Local Constraints (The Robot): Each robot observes a stream of data. However, it has a “cache limit” or upload bandwidth limit. It can only send a small subset of its observations to the cloud.
- Global Constraints (The Cloud): The cloud receives data from all robots. However, it has a “labeling budget.” It can only send a small subset of the uploaded data to human annotators.
Standard approaches usually fail here. If robots upload randomly, they might send redundant data (e.g., thousands of images of empty highways). If they upload based only on “uncertainty” (where the model is confused), they might upload noisy or outlier data that doesn’t help the model generalize. Furthermore, if the robots don’t coordinate, they might all upload the same type of “hard” examples, leading to a lack of diversity in the training set.
The Solution: DUAL Framework
The researchers introduce DUAL (Distributed Upload and Active Labeling). It is a two-stage active learning framework designed to maximize the utility of the final labeled dataset.

As illustrated in Figure 1, the process operates in a loop:
- Distributed Upload: Robots independently process their local data streams using the current model. They calculate embeddings (compressed representations of the data) and uncertainty scores. Based on a specific utility function, they select the most “valuable” subset to upload.
- Active Labeling: The cloud aggregates these uploads into a candidate pool. It then runs a second selection process to pick the absolute best samples from this pool to send for annotation, respecting the global budget.
- Retraining: The model is retrained on this new data and deployed back to the fleet.
The Mathematics of “Usefulness”
How do we define “valuable” data? The authors use Submodular Maximization.
In simple terms, a submodular function models “diminishing returns.” If you have a dataset of 100 images of sunny roads, adding the 101st image of a sunny road adds very little value. However, adding one image of a snowy road adds massive value. We want to select a set of data points that provides the maximum information gain.
The paper specifically uses a Facility Location function to measure utility. This function rewards selecting data points that are “representative” of the underlying data distribution in the embedding space.
The utility function \(f\) is defined as:
\[
f(\mathcal{D}, \mathcal{T}) \;=\; \sum_{x \in \mathcal{T}} \; \max_{d \in \mathcal{D}} \; \mathrm{sim}\big(emb(x),\, emb(d)\big)
\]
Here:
- \(\mathcal{D}\) is the set of selected data points.
- \(\mathcal{T}\) is the target dataset (the pool of data we want to represent).
- \(emb\) represents the embedding vector of a data point.
- The formula essentially sums up the similarity between every point in the target set and the closest point in the selected set. Maximizing this value ensures the selected set “covers” the embedding space well.
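To make this concrete, here is a minimal NumPy sketch of a facility-location utility. It assumes cosine similarity as the `sim` kernel and treats embeddings as rows of a matrix; the paper's exact similarity measure may differ.

```python
import numpy as np

def facility_location_utility(selected, target):
    """Sum, over every target point, of its similarity to the closest
    selected point: a high value means the selection "covers" the pool.

    selected: (k, d) embedding matrix of the chosen set D
    target:   (m, d) embedding matrix of the pool T to be represented
    """
    sel = selected / np.linalg.norm(selected, axis=1, keepdims=True)
    tgt = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = tgt @ sel.T                    # (m, k) cosine similarities
    return float(sims.max(axis=1).sum())  # best available cover per point
```

Note the diminishing returns: adding a near-duplicate of an already-selected embedding barely moves the score, while a point that covers a previously unrepresented cluster raises it sharply.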
The Optimization Problem
The goal of DUAL is to select local subsets \(S_i\) for each robot to upload, such that when the cloud picks the final set, the total utility is maximized. This is formulated as a constrained optimization problem:
\[
\max_{S_1, \ldots, S_N,\; \mathcal{L}} \; f(\mathcal{L})
\quad \text{s.t.} \quad
S_i \subseteq X_i, \quad |S_i| \le N^{cache}, \quad
\mathcal{L} \subseteq \bigcup_{i=1}^{N} S_i, \quad |\mathcal{L}| \le N^{label}
\]

where \(\mathcal{L}\) denotes the final set the cloud sends for labeling.
The constraints are:
- Local: Each robot \(i\) can only select from its own observations (\(X_i\)).
- Bandwidth: The size of the upload \(|S_i|\) cannot exceed the robot’s cache limit (\(N^{cache}\)).
- Global: The total number of labeled samples cannot exceed the labeling budget (\(N^{label}\)).
The Two-Stage Greedy Algorithm
Solving this optimization problem exactly is NP-hard, i.e., computationally intractable for fleets of any realistic size. However, because the utility function is submodular (exhibits diminishing returns) and monotone (adding data never hurts), a greedy algorithm can approximate the optimal solution with provable quality.
Stage 1: Robot-Side (Greedy Upload) Each robot runs a greedy algorithm locally. It starts with an empty set and iteratively adds the one data point from its stream that increases the utility function the most, until it hits its bandwidth limit. Importantly, the robot calculates this utility with respect to the current cloud dataset, ensuring it tries to upload data that is different from what the model already knows.
Stage 2: Cloud-Side (Greedy Labeling) The cloud collects all the greedy uploads. It then runs its own greedy algorithm on this combined pool to select exactly \(N^{label}\) items for human annotation.
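The two stages can be sketched with a single greedy routine, reused on robot and cloud. This is an illustrative implementation under assumptions of mine, not the paper's code: it uses a facility-location utility with cosine similarity, each robot tries to cover its own local stream, and the `seed` argument models conditioning on the current cloud dataset.

```python
import numpy as np

def greedy_select(candidates, budget, target, seed=None):
    """Greedy submodular maximization (sketch).

    candidates: (n, d) embeddings to choose from
    budget:     max number of items to pick
    target:     (m, d) embeddings the selection should cover
    seed:       (s, d) embeddings already held (e.g. the cloud dataset), so
                marginal gains are computed relative to what the model knows
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    tgt = unit(target)
    cand_sims = tgt @ unit(candidates).T                  # (m, n) similarities
    if seed is not None and len(seed):
        best = (tgt @ unit(seed).T).max(axis=1)           # current coverage
    else:
        best = np.zeros(len(target))

    chosen = []
    for _ in range(min(budget, len(candidates))):
        # Marginal gain of each candidate = improvement in total coverage.
        gains = np.maximum(cand_sims, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf                           # never re-pick
        j = int(np.argmax(gains))
        chosen.append(j)
        best = np.maximum(best, cand_sims[:, j])
    return chosen

def dual_round(robot_streams, cloud_data, n_cache, n_label):
    """One DUAL-style round (sketch): Stage 1 per robot, Stage 2 at the cloud."""
    uploads = []
    for X in robot_streams:           # Stage 1: runs locally on each robot
        uploads.append(X[greedy_select(X, n_cache, target=X, seed=cloud_data)])
    pool = np.vstack(uploads)         # candidate pool aggregated at the cloud
    return pool[greedy_select(pool, n_label, target=pool, seed=cloud_data)]
```

Because both stages call the same subroutine, the cloud's second pass automatically discards uploads that turn out to be redundant across robots.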
Theoretical Guarantees
One of the paper’s strengths is providing theoretical bounds. The authors prove that this two-stage greedy approach isn’t just a heuristic; it is mathematically guaranteed to be close to the optimal solution.
Specifically, the selected set satisfies an approximation guarantee of the form:

\[
f(S_{DUAL}) \;\ge\; \alpha \, f(S^{*}), \qquad \alpha \in (0, 1),
\]

where \(\alpha\) is a constant factor in the spirit of the classical \(1 - 1/e\) guarantee for greedy maximization of monotone submodular functions. This inequality states that the utility of the set selected by DUAL (\(S_{DUAL}\)) is at least a constant fraction of the utility of the optimal set (\(S^*\)). This theoretical backing gives us confidence that DUAL remains effective even as the fleet size or data volume grows.
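The paper's exact constant is not reproduced here, but the flavor of such a guarantee can be checked empirically: for a monotone submodular facility-location objective (nonnegative similarities), classical greedy selection provably recovers at least \(1 - 1/e \approx 63\%\) of the brute-force optimum. A small self-contained check:

```python
import itertools
import numpy as np

def fl_utility(sel, sims):
    """Facility-location value of selecting candidate columns `sel`,
    given a nonnegative (target x candidate) similarity matrix."""
    return float(sims[:, list(sel)].max(axis=1).sum()) if sel else 0.0

def greedy(sims, k):
    """Pick k candidates, each time taking the largest marginal gain."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(sims.shape[1]) if j not in chosen]
        chosen.append(max(rest, key=lambda j: fl_utility(chosen + [j], sims)))
    return chosen

rng = np.random.default_rng(0)
sims = rng.random((15, 12))          # 15 target points, 12 candidates
k = 3
g_val = fl_utility(greedy(sims, k), sims)
opt = max(fl_utility(c, sims) for c in itertools.combinations(range(12), k))
# Theorem: greedy achieves at least a (1 - 1/e) fraction of the optimum.
assert g_val >= (1 - 1 / np.e) * opt
```

In practice greedy often lands much closer to the optimum than the worst-case bound suggests, which is consistent with DUAL nearly matching its upper bound in the experiments below.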
Experimental Setup
To validate DUAL, the researchers tested it across varying modalities (audio, images, 3D point clouds) and tasks (classification, trajectory prediction, robotic manipulation).
Introducing RoadNet: A New Real-World Dataset
A major contribution of this paper is the release of RoadNet. Many existing autonomous driving datasets are curated “snippets.” To truly test fleet learning, we need raw, continuous video streams that simulate the redundancy and volume a real car sees.
RoadNet consists of recordings from vehicle-mounted cameras across multiple cities in Turkey, capturing diverse weather (sunny, rainy, overcast) and locations (highway, urban, rural).

As shown in Figure 2, the dataset captures high variability. Figure 6 below breaks down the distribution, showing a healthy mix of conditions which is essential for testing whether DUAL can find “rare” events (like rainy rural roads) amidst common ones.

Baselines and Comparisons
DUAL was compared against several standard strategies:
- Random: Robots upload random samples.
- Entropy: Robots upload samples where the model is most unsure (prediction probability is spread out).
- Margin: Similar to entropy, focusing on the decision boundary.
- FAL (Fleet Active Learning): A previous state-of-the-art method.
- Upper Bound: An idealized scenario where robots have unlimited bandwidth to upload everything, and the cloud picks the best from the entire pool.
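Entropy and margin are standard acquisition scores from the active-learning literature; for reference, here are minimal implementations (textbook definitions, not code from the paper):

```python
import numpy as np

def entropy_score(probs):
    """Predictive entropy: high when probability mass is spread out."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def margin_score(probs):
    """Negative margin between the top-2 classes: high (near zero) when
    the sample sits close to the decision boundary."""
    part = np.sort(probs, axis=-1)
    return -(part[..., -1] - part[..., -2])
```

A robot using these baselines would simply upload the highest-scoring samples, with no notion of diversity, which is exactly the failure mode on repetitive driving streams described below.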
Results: DUAL in Action
1. Classification Performance
The researchers simulated network conditions ranging from standard connections (“Always”) to realistic, fluctuating 5G speeds based on real-world coverage maps (“Ookla”, “5G”).

Figure 3 tells a compelling story. In almost every chart:
- The Green line (DUAL) is consistently at the top, often overlapping with the Grey dashed line (Upper Bound).
- This means DUAL achieves near-optimal performance despite bandwidth constraints.
- On the RoadNet dataset (bottom row), DUAL shows a massive gap compared to random or entropy-based methods. Because driving data is highly repetitive (frame \(t\) looks a lot like frame \(t+1\)), entropy methods often get stuck picking redundant “noisy” frames. DUAL’s diversity-based approach avoids this.
Table 3 provides the hard numbers. On the RoadNet dataset, DUAL achieved improvements of 14-16% over the strongest baselines.

2. Autonomous Driving: Trajectory Forecasting
Moving beyond simple classification, DUAL was tested on nuScenes, a complex dataset for predicting where cars will drive next.

In Table 1, lower numbers are better (measuring error).
- MinADE (Minimum Average Displacement Error): DUAL achieves an error of 1.09 meters at 10 seconds, compared to 1.19 m for the next best method.
- MissRate: DUAL reduces the rate of dangerously wrong predictions from 51% to 48%.
- Again, DUAL performs almost identically to the “Upper Bound,” proving that we don’t need to upload all the data to get the best results—we just need to upload the right data.
3. Real-World Robotics: “Place Red in Green”
Finally, the authors moved from simulation to the physical world using a Franka Emika Panda robot arm. The task: identify a red block in a cluttered scene and place it into a green bowl using visual input.

This is a “sim-to-real” gap challenge. The data selection happened in simulation, but the final model was tested on the real hardware.

The results in Table 2 are striking:
- Random sampling achieved an 82% success rate.
- Entropy/Margin methods actually performed worse (34-37%), likely because they focused on noisy, confusing parts of the simulation that didn’t transfer to reality.
- DUAL achieved a 95% success rate, matching the Upper Bound perfectly. It successfully identified the visual features that matter for robust physical manipulation.
Conclusion
The “Big Data” era in robotics is shifting toward a “Smart Data” era. We can no longer rely on brute-force uploading and labeling. The DUAL framework offers a principled, mathematically sound way to navigate the resource constraints of modern robotic fleets.
By decoupling the problem into two stages—Distributed Upload (filtering at the edge) and Active Labeling (filtering at the cloud)—and connecting them via submodular maximization, DUAL ensures that the limited bits we send over the air and the limited hours humans spend labeling are used as efficiently as possible.
Whether for self-driving cars navigating Turkish highways or robot arms sorting objects, DUAL demonstrates that intelligent data curation is just as important as the learning algorithm itself. As robotic fleets grow from hundreds to millions, frameworks like this will become the standard infrastructure for continuous learning.