Introduction
Imagine you are standing in a kitchen, holding a pitcher of water. You are pouring it into a cup. Now, imagine you close your eyes. Can you still fill the cup without spilling?
Most likely, yes. As the water level rises, the pitch of the sound changes—the “glug-glug” becomes higher and faster as the resonant frequency of the remaining air space shifts. This is a prime example of multimodal perception. Humans don’t just see the world; we hear, touch, and feel it. We integrate these senses to perform complex tasks effortlessly.
For robots, however, this is an immense challenge. While we have made massive strides in computer vision, allowing robots to “see,” we often leave them “deaf.” Collecting real-world data that synchronizes vision, touch, and audio is incredibly expensive and labor-intensive. To get around this, roboticists often turn to simulation—training robots in virtual worlds before deploying them in the real one. This is known as Sim-to-Real transfer.
But here lies the problem: while we have great visual simulators (think video games with realistic physics), simulating sound is a nightmare. Calculating how sound waves bounce off complex geometries in real time requires heavy computational physics that simply doesn’t scale.
In a fascinating new paper titled “The Sound of Simulation,” researchers from UC Berkeley propose a novel solution. Instead of trying to calculate the physics of sound, why not hallucinate it? They introduce MULTIGEN, a framework that uses generative AI to synthesize realistic audio for silent simulations. By training robots in these “sound-enabled” simulations, they achieved a breakthrough: robots that can pour liquids accurately in the real world, relying on audio cues learned entirely from synthetic data.
The Context: Why Robots Need to Listen
Before diving into the method, let’s establish why this research matters.
The Data Bottleneck
Modern AI, particularly deep learning, is hungry for data. In robotics, gathering this data is physical and slow. You have to move a real robot arm, record the video, record the audio, and hope nothing breaks. Large datasets like “Open X-Embodiment” exist, but they are overwhelmingly visual. They lack the diverse sensory constellations—like audio and tactile feedback—that are crucial for dexterous manipulation.
The Limits of Simulation
To bypass the data bottleneck, researchers use physics simulators (like MuJoCo or Isaac Gym). In these virtual worlds, a robot can attempt a task millions of times in minutes. If the simulator is “high-fidelity” (realistic enough), the robot can transfer that skill to the real world.
However, standard simulators are mute. They model rigid body dynamics and light transport (rendering), but they rarely model acoustic propagation. Simulating sound involves solving the wave equation, which depends on material properties, room geometry, and fluid dynamics. Doing this accurately is computationally prohibitive for large-scale training.
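To make the cost concrete: a faithful acoustic simulator must, at minimum, solve the wave equation for the pressure field $p(\mathbf{x}, t)$,

$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p,$$

where $c$ is the speed of sound, subject to boundary conditions imposed by every surface in the scene and by the constantly changing liquid volume, and it must do so at audio sample rates of tens of kilohertz. That is the cost that makes per-rollout acoustic simulation impractical.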
This leaves us with a gap: How can we train a robot to use sound if our simulators are silent and real-world data is too scarce?
The Core Method: The MULTIGEN Framework
The researchers’ answer is to stop treating audio generation as a physics problem and start treating it as a generative modeling problem.
The framework, called MULTIGEN, creates a hybrid pipeline. It runs a traditional physics engine to handle the robot’s movement and visuals, but it runs a generative AI model in parallel to “watch” the silent simulation and generate the corresponding sound effects in real-time.

As shown in Figure 1, the pipeline works in two phases:
- Generation in Sim: A physics simulator creates the visual scene. This visual data is fed into a generative model (trained on real-world data) which synthesizes audio.
- Deployment in Real: The robot uses a policy trained on that synthetic audiovisual data to act in the real world.
Let’s break down the architecture step-by-step.
1. The Physics Simulator (RoboVerse)
The team used RoboVerse, a scalable simulation platform. They set up a pouring task involving a robot arm, a pouring container, and a receiving container.
The simulator handles the “eyes and hands” of the robot:
- Visuals: It renders photorealistic images.
- Physics: It calculates how the liquid flows (using particle-based fluid dynamics).
- Action: It tracks the robot’s joint positions.
Crucially, they applied Domain Randomization (DR). This means they constantly randomized the lighting, table textures, liquid colors, and camera positions in the simulation. This prevents the robot from memorizing a specific environment, forcing it to learn robust features that apply to the messy real world.
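To make the idea concrete, here is a minimal sketch of per-episode domain randomization. The parameter names, ranges, and simulator calls (`sim.reset`, `sim.step`, and so on) are illustrative assumptions, not RoboVerse’s actual API or the paper’s randomization schedule:

```python
import random

def randomized_scene_config():
    """Sample a fresh visual configuration for each training episode.

    Every parameter and range here is an illustrative placeholder.
    """
    return {
        "light_intensity": random.uniform(0.4, 1.6),
        "light_direction": [random.uniform(-1.0, 1.0) for _ in range(3)],
        "table_texture": random.choice(["wood", "marble", "plastic", "cloth"]),
        "liquid_color": [random.random() for _ in range(3)],            # random RGB
        "camera_offset_cm": [random.uniform(-2.0, 2.0) for _ in range(3)],
    }

def collect_demo(sim, scripted_controller):
    """Run one scripted pouring demonstration in a freshly randomized scene."""
    sim.reset(**randomized_scene_config())     # hypothetical simulator call
    trajectory = []
    obs = sim.observe()
    while not sim.done():
        action = scripted_controller(obs)
        obs = sim.step(action)
        trajectory.append((obs, action))
    return trajectory
```

Because every episode looks different, a policy trained on these rollouts cannot latch onto any single texture or lighting setup; it has to rely on cues that survive the transfer to reality.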

Figure 3 illustrates the fidelity of the simulation. The top row shows the simulation (baskets and particles), while the bottom row shows the real-world execution. The visual alignment is close, but not perfect—which is exactly why the next step is so important.
2. The Generative Audio Model
This is the heart of the innovation. The researchers needed a model that could look at a silent video of pouring and generate accurate sound. They chose MMAudio, a state-of-the-art video-to-audio diffusion model.
However, using MMAudio “out of the box” failed. Why?
- General vs. Specific: Pre-trained models are trained on general internet videos. They know what a dog barking sounds like, but they lack the fine-grained understanding of how pitch changes when water fills a specific glass.
- The Sim-to-Real Domain Gap: MMAudio expects real video. When you feed it simulation video (which looks slightly cartoonish or distinct from reality), the model gets confused and generates poor audio.
The Fix: Finetuning and Segmentation Masks
To solve this, the authors made two critical adjustments:
- Finetuning on Real Data: They curated a dataset of real-world pouring sounds from YouTube and the EPIC-Kitchens dataset. This taught the model the specific acoustic nuances of liquids (the splash, the glug, the trickle).
- Bridging the Visual Gap with SAMv2: This is the clever part. To help the audio model understand the simulation video, they didn’t just feed it raw pixels. They used SAMv2 (Segment Anything Model 2) to generate “segmentation masks”—basically, color-coded maps telling the AI “this blob is the cup,” “this blob is the liquid.”
Since segmentation masks look the same in simulation and reality (a cup mask is just a shape), this provided a common language for the model. By conditioning the audio generator on these masks, the model could ignore the “fake” look of the simulation textures and focus on the geometry and movement of the liquid.
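Below is a rough sketch of how the two adjustments combine during fine-tuning. The class, its internals, and the loss are schematic stand-ins, not MMAudio’s real architecture or API; in particular, a real diffusion model would predict denoising targets for an audio latent rather than regress a pooled embedding:

```python
import torch
import torch.nn as nn

class MaskConditionedAudioModel(nn.Module):
    """Schematic stand-in for a video-to-audio generator conditioned on masks."""
    def __init__(self, cond_channels=4, audio_dim=128):
        super().__init__()
        # Encode (RGB + mask) video clips into a single audio embedding.
        self.cond_encoder = nn.Sequential(
            nn.Conv3d(cond_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, audio_dim),
        )

    def loss(self, condition, audio_target):
        # Placeholder objective; a diffusion model would predict noise instead.
        return nn.functional.mse_loss(self.cond_encoder(condition), audio_target)

def finetune_step(model, optimizer, video, masks, audio_embedding):
    """One fine-tuning step on a real pouring clip and its audio.

    Concatenating segmentation masks with the RGB frames gives the model a
    conditioning signal that looks the same for simulated video later on.
    """
    condition = torch.cat([video, masks], dim=1)    # (B, 3+1, T, H, W)
    loss = model.loss(condition, audio_embedding)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether the masks replace the raw frames or accompany them, as in this sketch, is a design detail; the key point is that the conditioning signal no longer depends on photorealistic textures.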

Figure 2 visualizes this complete pipeline.
- Top (Real): The model learns the relationship between visuals, masks, and audio using real videos.
- Bottom (Sim): The simulator generates silent video and masks. The trained generative model fills in the audio.
- Result: A dataset of millions of pouring events with synchronized Vision, Action, and Synthetic Audio.
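Once the audio model is fine-tuned, assembling the synthetic dataset is a straightforward loop. The three callables below are placeholders standing in for the simulator rollout, SAMv2 segmentation, and the fine-tuned audio generator; none of them are the paper’s actual interfaces:

```python
def generate_multimodal_dataset(run_sim_episode, segment_frames, generate_audio,
                                num_episodes):
    """Build a vision + action + synthetic-audio dataset from silent simulation."""
    dataset = []
    for _ in range(num_episodes):
        # 1. Silent rollout: the physics engine produces frames and robot actions.
        frames, actions, joint_states = run_sim_episode()
        # 2. Segmentation masks bridge the visual sim-to-real gap.
        masks = segment_frames(frames)
        # 3. The generative model "fills in" the missing audio track.
        audio = generate_audio(frames, masks)
        dataset.append({
            "rgb": frames,
            "audio": audio,
            "proprio": joint_states,
            "actions": actions,
        })
    return dataset
```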
3. The Robot Policy
Finally, a Diffusion Policy (a type of imitation learning model) was trained on this purely synthetic dataset. The policy takes in:
- Visual frames (RGB).
- Audio spectrograms (sound).
- Proprioception (joint angles).
It outputs the necessary motor commands to pour the liquid.
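As a rough sketch of how those three streams might be fused before the diffusion policy head (encoder choices, dimensions, and the 7-joint proprioception vector are illustrative assumptions, not the paper’s architecture):

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Illustrative fusion of RGB frames, an audio spectrogram, and joint angles."""
    def __init__(self, proprio_dim=7, embed_dim=256):
        super().__init__()
        self.rgb_encoder = nn.Sequential(      # stand-in for a ResNet/ViT backbone
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.audio_encoder = nn.Sequential(    # treats the spectrogram as a 1-channel image
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.proprio_encoder = nn.Linear(proprio_dim, embed_dim)

    def forward(self, rgb, spectrogram, proprio):
        # Concatenate the three per-modality embeddings into one conditioning
        # vector for the diffusion policy's action denoiser.
        return torch.cat([
            self.rgb_encoder(rgb),
            self.audio_encoder(spectrogram),
            self.proprio_encoder(proprio),
        ], dim=-1)
```

The policy consumes spectrograms regardless of whether the underlying waveform came from a microphone or from a generative model; what matters for transfer is that the synthetic audio distribution is close enough to reality.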
Experiments & Results
The researchers evaluated their system on a Kinova Gen3 robot. The task was to pour specific amounts of liquid (e.g., “pour half,” “fill completely”) into various containers.
The catch? The robot had never seen or heard a real pour. It was trained entirely in the matrix of MULTIGEN. This is known as Zero-Shot Sim-to-Real Transfer.
Metric: Normalized Mean Absolute Error (NMAE)
To measure success, they calculated the error between the desired amount of liquid and the actual amount poured.
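In symbols, with $m_{\text{target}}$ the commanded amount of liquid and $m_{\text{poured}}$ the amount actually delivered (the exact normalizer is a detail of the paper; dividing by the target is one natural choice):

$$\mathrm{NMAE} = \frac{\lvert m_{\text{poured}} - m_{\text{target}} \rvert}{m_{\text{target}}}$$

Lower is better, and 0 corresponds to a perfect pour.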

Does Audio Actually Help?
The researchers compared a Vision-Only policy against their Vision + Audio policy.
The hypothesis was that audio should help significantly when visual cues are blocked (occlusion). For example, if you are pouring into a metal thermos or an opaque red cup, you cannot see the liquid level rising. You must listen.

Table 1 confirms the hypothesis strongly.
- Vision + Audio (V+A) consistently outperformed Vision-Only (V).
- Opaque Containers (Top 4 rows): Look at the first column (Red Cup). The error dropped from 0.54 (Vision) to 0.44 (Vision+Audio). In the metal thermos task, error dropped from 0.54 to 0.33.
- Transparent Containers (Bottom 4 rows): Even when the robot could see the liquid (plastic cups), audio still helped reduce error, likely by providing better flow-rate estimation.
Is Generative Audio Better than Noise?
A skeptic might ask: “Do we really need a complex Generative AI model? Can’t we just play random splashing noises?”
To test this, they compared MULTIGEN against a baseline that simply augmented the data with random environmental noise (standard practice in robotics).

Figure 4 presents a compelling analysis:
- Graph (a): This scatter plot measures the quality of the audio.
  - X-axis (Diversity): How varied are the sounds?
  - Y-axis (Fidelity/SDR): How physically accurate is the sound to the video?
  - The blue dots (MULTIGEN) are distributed high and to the right. This means the audio is both diverse (covering many pouring scenarios) and high-fidelity (accurate to the physics). The red dots (Noise Augmented) cluster at the bottom—low quality.
- Graph (b): This shows the learning curve. As the robot sees more training trajectories (x-axis), the error rate (y-axis) drops much faster and lower with MULTIGEN (blue line) compared to noise augmentation (red line).
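For reference, SDR usually denotes the signal-to-distortion ratio, a standard audio-quality measure comparing an estimated signal $\hat{s}$ against a reference $s$; higher means less distortion (how the paper obtains its reference audio is beyond the scope of this summary):

$$\mathrm{SDR} = 10 \log_{10} \frac{\lVert s \rVert^{2}}{\lVert \hat{s} - s \rVert^{2}}$$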
Conclusion and Implications
The “Sound of Simulation” paper presents a significant shift in how we think about robot learning. For years, the difficulty of mathematically modeling sound prevented it from being a first-class citizen in simulation training. By leveraging the power of Generative AI, this team bypassed the physics equations entirely.
Key Takeaways:
- Sim-to-Real works for Audio: We can train robots to “hear” using entirely synthetic data.
- Generative AI as a Physics Engine: We don’t always need to solve differential equations. Sometimes, a well-trained diffusion model is a better, faster, and more scalable simulator.
- Multimodality is Key: Robots, like humans, perform better when they use all their senses.
This approach opens the door to simulating other “hard-to-model” modalities. Could we use generative models to synthesize tactile data (touch) or thermal data? As generative models become more sophisticated, the line between simulation and reality continues to blur, bringing us closer to robots that can perceive the world as fully as we do.