Visual reinforcement learning (RL) has pushed the boundaries of what learning agents can do, from beating Atari games to performing complex dexterous robotic manipulation. However, a significant gap remains between a policy that performs well in a controlled simulation and one that is robust enough for the real world. A major part of this challenge lies in vision.

In the real world, depth perception is crucial. Humans achieve this naturally through binocular vision—using two eyes to triangulate 3D structure. Similarly, robots benefit immensely from multiple camera views. Merging these views creates a richer representation of the world, overcoming occlusions and improving learning speed (sample efficiency).

But there is a catch. If you train a robot to rely on three specific cameras, and one of them malfunctions or gets blocked during deployment, the robot usually fails. The policy becomes “overfitted” to that specific multi-view setup.

How do we design a system that benefits from the rich data of multiple cameras during training but remains robust enough to function if those cameras disappear at test time?

This is the problem addressed by Almuzairee et al. in their paper, “Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation.” They introduce a novel algorithm called MAD (Merge And Disentangle).

Figure 1 illustrating the concept of merging views for better representations and disentangling them for robust deployment.

As shown in Figure 1, the goal is twofold: Merge views to learn better representations, and Disentangle them so the policy can function with any singular view during deployment.

The Two Competing Goals

To understand why this research is significant, we first need to understand the tension between two competing objectives in robotic learning:

  1. Merging for Efficiency: To learn a task quickly (sample efficiency), the robot needs the best possible data. Combining a first-person view (from the robot’s wrist) with third-person views (from the room) provides a complete picture of the state space.
  2. Disentangling for Robustness: To be deployable, a robot cannot be fragile. If a sensor fails, the robot should degrade gracefully, not crash. This requires the neural network to understand that “View A” contains useful information independently of “View B.”

Prior work generally picked one lane. Some methods focused on fusing views for maximum performance, resulting in fragile policies. Others focused on robustness (disentanglement) but often sacrificed the learning speed gained from combined views. MAD attempts to unify these by combining a simple merging architecture (feature summation) with feature-level data augmentation.

The Setup: Defining the Environment

The researchers tested their hypothesis on two popular robotic manipulation benchmarks: Meta-World and ManiSkill3.

Figure 2 showing the environment setup with First Person, Third Person A, and Third Person B views.

As illustrated in Figure 2, the setup typically involves three cameras:

  • First Person: Mounted on the robot arm (great for precision, bad for context).
  • Third Person A & B: Fixed in the environment (great for context, prone to occlusion).

The challenge is to train a single agent that can utilize all three inputs when available, but still succeed if handed only one.

The Core Method: MAD

The MAD framework is built on top of the DrQ (Data-regularized Q) algorithm, a standard approach for visual RL. The innovation lies in how the image data flows through the network and how the loss functions are calculated.

1. The Merge Module (Architecture)

The first step is handling the inputs. In a typical multi-view setup, one might stack images together or concatenate their features. MAD takes a slightly different approach to ensure flexibility.

It uses a single shared CNN encoder (\(f_\xi\)). Whether the input is Camera 1, Camera 2, or Camera 3, it passes through the same neural network weights. This produces a feature vector for each view, denoted as \(\mathcal{V}_t^i\).

To create the “Merged” representation (\(\mathcal{M}_t\)), the system simply sums the feature vectors:

\[ \mathcal{M}_t = \sum_{i=1}^{n} \mathcal{V}_t^i \]

Why Summation? Summation is crucial here. Unlike concatenation, which changes the size of the feature vector depending on how many cameras you have, summation keeps it constant. This allows the downstream policy (the brain of the robot) to accept input from one camera, three cameras, or five cameras without changing its architecture. And unlike averaging, summation preserves the overall magnitude of the signal, giving the network a hint about how much information is present.
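A minimal sketch of this merge step in PyTorch is shown below; the encoder architecture, layer sizes, and feature dimension are illustrative placeholders, not the authors' exact network.

```python
import torch
import torch.nn as nn

class SharedViewEncoder(nn.Module):
    """One CNN (f_xi) applied to every camera view; sizes are illustrative."""
    def __init__(self, in_channels=3, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(feature_dim)  # avoids hard-coding the flattened size

    def forward(self, view):                     # view: (B, C, H, W)
        return self.proj(self.conv(view))        # V_t^i: (B, feature_dim)

def merge_views(encoder, views):
    """views: a list of (B, C, H, W) tensors, one per available camera.
    Returns the per-view features and their sum M_t; works for any number of views."""
    feats = [encoder(v) for v in views]          # each V_t^i: (B, feature_dim)
    merged = torch.stack(feats, dim=0).sum(0)    # M_t = sum_i V_t^i
    return feats, merged
```

Because the merged vector has the same dimensionality as each single-view vector, the downstream actor and critic never need to know how many cameras are plugged in.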

2. Disentanglement via Augmentation

This is the heart of the paper’s contribution. If you simply train the robot on the merged vector \(\mathcal{M}_t\), it will never learn to use the single views \(\mathcal{V}_t^i\).

However, if you naively train on both merged and single views simultaneously, the network gets confused: it treats them as completely different states, which destabilizes learning and slows it down.

The researchers’ solution is to treat the single-view features as Data Augmentations of the merged features.

In visual RL, data augmentation (like randomly cropping an image) is used to make networks robust. MAD applies this logic at the feature level. It tells the network: “This single camera view is just a noisy/augmented version of the full multi-view reality. Learn to predict the same action/value from this partial view as you would from the full view.”

Figure 3 diagram showing the Shared Encoder, Actor Update, and Critic Update logic.

Figure 3 details this flow.

  • Left: The shared encoder processes views individually.
  • Middle/Right: The merged features (\(\mathcal{M}_t\)) are the “clean” stream. The single views (\(\mathcal{V}_t^i\)) are the “augmented” stream.
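In code, the two streams fall directly out of the merge_views sketch above (the names clean and augmented are mine, mirroring the figure):

```python
def make_streams(encoder, views):
    """Build the 'clean' (merged) and 'augmented' (single-view) feature streams."""
    feats, merged = merge_views(encoder, views)
    clean = merged                          # M_t: anchors the learning targets
    augmented = torch.stack(feats, dim=0)   # each V_t^i, treated as an augmented copy of M_t
    return clean, augmented                 # augmented: (n_views, B, feature_dim)
```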

3. The Loss Function (The Math)

To stabilize this training, the authors adapt a technique called SADA (Stabilizing Actor-Critic under Data Augmentation). The core idea is to anchor the learning targets using the clean, unaugmented data (the merged views), while forcing the network to produce consistent results from the augmented data (the single views).

Let’s look at the Actor Loss (the policy update).

Equation 1: The Actor Loss formulation showing unaugmented and augmented components.

Here is the breakdown of the equation above:

  1. \(\mathcal{L}^{UnAug}\): This is the standard loss calculated using the Merged Features. This ensures the robot learns efficiently using all available information.
  2. \(\mathcal{L}^{Aug}\): This is the loss calculated using Single View Features (\(\mathcal{V}_t^i\)). Note that it tries to match the Q-values (expected rewards) derived from the merged state.
  3. \(\alpha\) (Alpha): This is a hyperparameter that balances the two. It weighs how much the network should prioritize the “perfect” multi-view data versus the “robust” single-view data. The authors found \(\alpha=0.8\) to be optimal.
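Reading the breakdown above literally, the combined objective has the shape \(\mathcal{L}_{Actor} = \alpha\,\mathcal{L}^{UnAug} + (1-\alpha)\,\mathcal{L}^{Aug}\). Below is a minimal sketch of that update, assuming a deterministic actor and a single critic for brevity; the paper's exact formulation, regularizers, and weighting convention may differ.

```python
def mad_actor_loss(actor, critic, feats, merged, alpha=0.8):
    """Sketch of the actor update described above.

    feats:  list of single-view features V_t^i
    merged: merged features M_t = sum_i V_t^i
    alpha:  weight between the unaugmented and augmented terms (0.8 per the post)
    """
    # Unaugmented term: the standard policy loss, computed from merged features only.
    loss_unaug = -critic(merged, actor(merged)).mean()

    # Augmented term: the policy acts from a single view, but its action is scored
    # with Q-values computed from the merged state, anchoring it to the clean stream.
    loss_aug = sum(-critic(merged, actor(v)).mean() for v in feats) / len(feats)

    return alpha * loss_unaug + (1.0 - alpha) * loss_aug
```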

The Critic Loss (the Q-function update) follows a similar logic:

Equation 2: The Critic Loss formulation.

The Critic learns the value of actions. By calculating the target value using the stable, merged next-step state (\(\mathcal{M}_{t+1}\)), the training remains stable even though the input is the partial, single-view state.
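A matching sketch of the critic update, under the same simplifying assumptions (single critic, no twin-Q or n-step details): the TD target is always built from the merged next-state features \(\mathcal{M}_{t+1}\), while the Q-network is fit on both the merged and the single-view current features.

```python
import torch
import torch.nn.functional as F

def mad_critic_loss(critic, target_critic, actor, feats, merged, action,
                    merged_next, reward, discount, alpha=0.8):
    """Sketch of the critic update described above."""
    with torch.no_grad():
        # Target computed only from the stable merged next-state M_{t+1}.
        next_action = actor(merged_next)
        target_q = reward + discount * target_critic(merged_next, next_action)

    # Unaugmented term: Q fitted on the merged current features M_t.
    loss_unaug = F.mse_loss(critic(merged, action), target_q)

    # Augmented term: Q fitted on each single-view feature V_t^i,
    # regressed toward the same merged-state target.
    loss_aug = sum(F.mse_loss(critic(v, action), target_q) for v in feats) / len(feats)

    return alpha * loss_unaug + (1.0 - alpha) * loss_aug
```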

Experimental Results

The theory sounds solid, but does it work? The researchers compared MAD against several strong baselines:

  • MVD: A method focusing specifically on disentanglement using contrastive learning.
  • VIB: A method using information bottlenecks to merge views.
  • Single Camera: A standard agent trained on just one view.

1. Robustness and Sample Efficiency

The primary results are striking. The graphs below show the success rate (Y-axis) over training steps (X-axis).

Figure 4 showing performance graphs across Meta-World and ManiSkill3.

Key Takeaways from Figure 4:

  • Sample Efficiency (Blue Line vs. Others): In the “All Cameras” plots (far left), MAD (Blue) learns much faster and reaches a higher success rate than the baselines. This confirms that the Merging strategy is working.
  • Robustness (The Single View plots): Look at the “Third Person A” or “Third Person B” plots. Even though MAD was trained with all cameras available, it performs exceptionally well when tested on just one camera.
  • Comparison to Single Camera Baseline: In many cases, MAD evaluated on a single view outperforms a model trained specifically on that single view (Orange line). This implies that seeing the task from multiple angles during training helped the robot understand the single view better than if it had only ever seen that one view.

2. Why does it work? (Ablation Studies)

To prove that every part of their engine is necessary, the authors performed ablation studies.

Figure 5 showing ablation studies for components and alpha values.

In Figure 5 (Right), we see the component breakdown:

  • Naive Both (Brown): Training on merged and single views without the specific MAD/SADA loss formulation fails to learn efficiently; it essentially confuses the agent.
  • Merged Only (Gray): Training only on merged views works great if you have all cameras, but fails if you lose one (this isn’t shown in this specific crop, but is discussed in the paper).
  • Singular Only (Green): Training only on single views is too slow.

MAD (Blue) is the only curve that achieves high performance, confirming that the specific combination of feature summation and selective augmentation is key.

3. Does the merge method matter?

The researchers also asked: “Does it matter if we sum the features? What if we use attention mechanisms or simple concatenation?”

Figure 6 comparing different merging strategies like Concat, Sum, Attention, and ViT.

Surprisingly, Figure 6 shows that Summation (Blue) is highly competitive with more complex methods like Attention (Brown) or Vision Transformers (Green). Given that summation is computationally cheaper and handles variable numbers of inputs naturally, it is the pragmatic winner.
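To make the structural difference concrete, here are toy versions of three merge operators (none of these claim to match the paper's implementations): summation and attention pooling both produce a fixed-size output for any number of views, while concatenation bakes the camera count into the layer sizes.

```python
import torch
import torch.nn as nn

def merge_sum(feats):                        # feats: (n_views, B, D)
    return feats.sum(0)                      # (B, D), for any n_views

def merge_concat(feats, head: nn.Linear):    # head must be sized for a fixed n_views * D
    n, b, d = feats.shape
    return head(feats.permute(1, 0, 2).reshape(b, n * d))

class MergeAttention(nn.Module):
    """A learned query attends over the view features (one illustrative variant)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                # feats: (n_views, B, D)
        tokens = feats.permute(1, 0, 2)      # (B, n_views, D)
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)                # (B, D), also handles a variable n_views
```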

Adaptability: Occlusions and Modalities

One of the most impressive demonstrations of MAD is its ability to handle “bad” views.

In a specific experiment on ManiSkill3, the researchers set up the environment such that two of the three cameras were blocked or uninformative. Only “Third Person A” could see the task.

Figure 7 showing performance under occlusion conditions.

As shown in Figure 7, MAD (Blue) matches the performance of a Single Camera agent trained on the good view.

  • MVD (Brown) fails because it tries to find shared information between the good view and the bad views, which dilutes the useful signal.
  • VIB (Gray) fails because it relies heavily on the first-person view, which was occluded in this test.

MAD ignores the noise from the bad cameras and successfully leverages the one working camera.

From RGB to RGB-D

Finally, the authors showed that MAD isn’t just for cameras located at different positions; it can also merge different types of data, such as RGB color and Depth.

Figure 8 showing MAD adapting to RGB and Depth modalities.

By treating RGB and Depth as two separate “views” and merging them, MAD (Blue) outperforms agents trained on RGB alone or Depth alone. It essentially learns a policy that is robust to lighting changes (by relying on depth) and textureless objects (by relying on RGB).
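The same machinery carries over with almost no changes. The sketch below reuses the SharedViewEncoder from earlier and simply treats depth as one more view; the post does not say whether the paper shares encoder weights across modalities, so separate encoders with matching output sizes are assumed here.

```python
import torch

rgb_encoder   = SharedViewEncoder(in_channels=3, feature_dim=256)
depth_encoder = SharedViewEncoder(in_channels=1, feature_dim=256)

rgb   = torch.randn(8, 3, 64, 64)   # dummy batch of RGB frames
depth = torch.randn(8, 1, 64, 64)   # dummy batch of depth maps

feats  = [rgb_encoder(rgb), depth_encoder(depth)]   # one V_t^i per modality
merged = torch.stack(feats, dim=0).sum(0)           # same summation merge as before

# The actor/critic updates above apply unchanged: each modality's features form the
# "augmented" stream, and their sum is the "clean" merged stream.
```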

Conclusion

The “Merging and Disentangling Views” (MAD) paper offers a practical, elegant solution to a common problem in robotics. It moves away from complex auxiliary losses or contrastive learning schemes and instead relies on a strong architectural choice (feature summation) combined with a rigorous training strategy (feature-level augmentation).

For students and practitioners in Visual RL, the takeaways are clear:

  1. Don’t train on merged views alone. You create a dependency that breaks in the real world.
  2. Treat inputs as augmentations. If you want your robot to handle partial data, train it to predict the full state from that partial data.
  3. Simplicity scales. Simple feature summation outperformed complex attention mechanisms in this context, proving that sometimes the simplest tool is the right one for the job.

By enabling robots to learn efficiently from everything they can see, while preparing them for the moment they can’t, MAD brings us one step closer to truly deployable autonomous systems.