The dream of Virtual Reality (VR) has always been the “Holodeck” concept—the ability to step into a digital recording of the real world and experience it exactly as if you were there. You want to be able to walk around, lean in to see details, look behind you, and hear the soundscape change as you move.

While we have made massive strides with technologies like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, we hit a wall when it comes to dynamic scenes. Most current datasets are either static (frozen in time), object-centric (looking at a single object from the outside), or silent (no audio).

To achieve true immersion, we need a new standard. We need Immersive Volumetric Video.

In this post, we are diving deep into a paper titled “ImViD: Immersive Volumetric Videos for Enhanced VR Engagement”. The researchers behind this work have introduced a groundbreaking dataset and a reconstruction pipeline designed to satisfy the four pillars of immersive media:

  1. Full 360° views (foreground and background).
  2. 6-DoF Interaction (you can walk around in the video).
  3. Multimodality (synchronized spatial audio and video).
  4. High-Fidelity Dynamics (5K resolution, 60 FPS, long duration).

Let’s explore how they built the hardware to capture reality and the software to reconstruct it.

Figure 1. We introduce ImViD, a dataset for immersive volumetric videos. ImViD records dynamic scenes using a multi-view audio-video capture rig moving in a space-oriented manner.

The Problem: Why Current Datasets Fall Short

To train AI models that can reconstruct reality, you need data—specifically, video data captured from multiple angles simultaneously. However, existing datasets have limitations that break the illusion of immersion.

Most datasets fall into two traps:

  1. Monocular/Handheld: Someone walks around with a single camera. This is great for static scenes, but if the scene is moving (dynamic), a single camera can’t capture the “frozen time” state from multiple angles.
  2. Fixed Camera Arrays: A dome of cameras surrounds a small area. This allows for “bullet time” effects, but the viewer is usually stuck looking inward at a central object. You can’t explore the environment.

Furthermore, almost none of these datasets prioritize audio. In the real world, if you step closer to a piano, it gets louder. If you turn your head, the sound shifts from one ear to the other. Current datasets largely ignore this multimodal aspect.

Table 8. Existing real-world datasets for dynamic novel view synthesis.

As shown in the table above, previous datasets like PanopticSports or ZJU-Mocap are often low resolution, static, or lack audio. ImViD (the bottom row) stands out by combining 46 cameras, a moving rig, high resolution (5K), high frame rate (60 FPS), and synchronized audio.

The Solution: A Moving Capture Rig

To capture a “space-oriented” video—where the user is inside the scene looking out—the researchers built a custom mobile rig.

This isn’t just a tripod. It is a remotely controlled mobile cart equipped with a hemispherical array of 46 synchronized GoPro cameras.

Figure 3. Our rig supports two kinds of capture strategies for high-resolution, high-frame-rate, and 360° dynamic data acquisition.

The Hardware Setup

  • Cameras: 46 GoPro action cameras shooting at 5K resolution and 60 FPS.
  • Synchronization: A custom control system ensures every camera fires at the exact same millisecond.
  • Mobility: The entire array is mounted on a wheeled cart that acts as a “robot,” moving through the environment to capture more volume.

Two Capture Strategies

The researchers employed a two-step strategy to ensure high fidelity:

  1. Static Capture: First, they capture the static environment (the room, the trees) with high-density photos.
  2. Dynamic Capture: Then, they record the action. This can be done with the rig Fixed (stationary while actors move) or Moving (the rig drives through the scene).

The “Moving” strategy is particularly novel. By moving the camera array through the scene, the researchers drastically increase the Spatiotemporal Capture Density.

Figure 4. Calculation method for spatiotemporal capture density.

As illustrated in Figure 4, a handheld camera (1) captures a thin line of data. A fixed array (2) captures a small bubble. The ImViD moving rig (3) sweeps through the space, capturing a massive volume of visual data over time (\(0.10\ \mathrm{m^3/s}\)). This allows for a much larger explorable area in VR.
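Figure 4 defines the exact calculation; as a rough way to think about it (an illustrative simplification, not the paper's formula), the spatiotemporal capture density is the volume covered by the camera array per unit of capture time:

\[
\rho_{st} \;\approx\; \frac{V_{\text{covered}}}{T_{\text{capture}}}
\]

A rig that keeps moving sweeps out a much larger \(V_{\text{covered}}\) for the same \(T_{\text{capture}}\) than a fixed array or a single handheld camera.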

The Reconstruction Pipeline

Capturing the data is only half the battle. The raw footage consists of 46 separate video files. How do we turn that into a cohesive 3D hologram you can view in a headset?

The researchers proposed a complete pipeline covering both Dynamic Light Field Reconstruction (the visuals) and Sound Field Reconstruction (the audio).

Figure 2. The pipeline to realize the multimodal 6-DoF immersive VR experiences.

Part 1: Visuals with Spacetime Gaussian Splatting (STG++)

For the visual component, the team chose to build upon 3D Gaussian Splatting (3DGS). If you are unfamiliar with 3DGS, imagine representing a scene not as triangles (meshes), but as millions of fuzzy, colored 3D blobs (Gaussians).

Standard 3DGS is great for static images. For video, the researchers utilized a method called Spacetime Gaussians (STG).

In STG, the opacity and motion of the Gaussians change over time. The opacity \(\alpha\) at a specific time \(t\) is modeled using a Radial Basis Function (RBF), while motion and rotation are fitted with polynomials. The equation for a Gaussian at time \(t\) looks like this:

Equation for Spacetime Gaussian opacity and geometry over time.
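The equation appears as an image above; as a sketch based on the original Spacetime Gaussians formulation that the paper builds on (the symbols here are illustrative and may not match ImViD's notation exactly), the temporal model is roughly:

\[
\alpha_i(t) = \alpha_i^{s}\,\exp\!\left(-s_i^{\tau}\,\lvert t-\mu_i^{\tau}\rvert^{2}\right),\qquad
\mu_i(t) = \sum_{k=0}^{n_p} b_{i,k}\,(t-\mu_i^{\tau})^{k},\qquad
q_i(t) = \sum_{k=0}^{n_q} c_{i,k}\,(t-\mu_i^{\tau})^{k}
\]

where \(\alpha_i^{s}\) is the spatial opacity, \(s_i^{\tau}\) and \(\mu_i^{\tau}\) control the temporal radial basis function, and the polynomials \(\mu_i(t)\) and \(q_i(t)\) describe the motion of the center and the rotation of Gaussian \(i\) over time.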

The “Flicker” Problem and STG++

The researchers discovered a flaw when applying standard STG to their real-world data. Even with high-end cameras, auto-exposure and white balance can vary slightly between different lenses. When you move your head in VR, transitioning from Camera A’s view to Camera B’s view causes the colors to shift or flicker. This breaks immersion.

To fix this, they introduced STG++. They added a learnable Affine Color Transformation.

Equation for Affine Color Transformation.

Here, the rendered color \(C'\) is adjusted by a transformation matrix \(W\) and an offset \(T\) to match the specific characteristics of the camera view. This ensures that as you rotate your head, the colors remain consistent, smoothing out the differences between the 46 cameras.
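Written out from that description (the notation here is mine, not necessarily the paper's), the per-view correction is simply an affine map applied to the rendered color:

\[
C'_{k} = W_{k}\,C + T_{k}
\]

where \(C\) is the color rendered by the Gaussians, and \(W_k\) (a \(3\times3\) matrix) and \(T_k\) (an RGB offset) are learned separately for each camera view \(k\).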

This improvement significantly reduced “floaters” (artifacts floating in space) and flickering.

Figure 6. The continuity of pixels at the same location across different frames and segments.

Figure 6 demonstrates this improvement. The top row (Original STG) shows inconsistent brightness (green channel) over time. The bottom row (with Color Mapping) is much smoother and more consistent.

Part 2: Sound Field Reconstruction

A truly immersive video can't be silent. The researchers developed a geometric approach to reconstruct the sound field without needing complex neural network training for the audio.

They treat the recording microphone as the origin \((0,0)\). They calculate the position of the Sound Source (the actor) and the Listener (the VR user).

Using these coordinates, they calculate two critical factors:

  1. Direction Mapping (\(\theta_s\)): The angle of the sound relative to the user’s head. Equation for calculating sound source angle.
  2. Distance Mapping (\(\lambda\)): How far away the sound is (which determines volume/attenuation). Equation for calculating sound attenuation based on distance.
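Both equations appear as images above; a plausible reconstruction from the geometric setup just described (source at \(p_s = (x_s, y_s)\), listener at \(p_l = (x_l, y_l)\) with head yaw \(\varphi\); the exact forms in the paper may differ) is

\[
\theta_s = \operatorname{atan2}(y_s - y_l,\; x_s - x_l) - \varphi,
\qquad
\lambda = \frac{1}{\max\!\left(\lVert p_s - p_l\rVert,\; d_{\min}\right)}
\]

i.e., the angle of the source relative to where the listener is facing, and an inverse-distance gain clamped at a minimum distance \(d_{\min}\).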

Once they have the angle and distance, they apply a Head-Related Transfer Function (HRTF). The HRTF modifies the audio frequencies to simulate how sound waves bounce off your ears and head, tricking your brain into hearing the sound from a specific direction.

Equation for Left and Right ear audio synthesis.

This creates a binaural 3D audio stream that updates in real-time as the user walks around the virtual room.
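To make the idea concrete, here is a minimal Python sketch of this kind of geometric spatializer. It uses inverse-distance attenuation and a crude constant-power pan in place of a real HRTF (the paper applies a proper HRTF); the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def spatialize(mono, source_xy, listener_xy, head_yaw, ref_dist=1.0):
    """Crude geometric spatializer: inverse-distance gain + constant-power panning.

    mono        : 1-D numpy array, the dry source signal recorded at the mic
    source_xy   : (x, y) position of the sound source in scene coordinates
    listener_xy : (x, y) position of the VR user
    head_yaw    : listener head orientation in radians (0 = facing +x)
    """
    dx, dy = source_xy[0] - listener_xy[0], source_xy[1] - listener_xy[1]
    dist = max(np.hypot(dx, dy), 1e-3)

    # Distance mapping: gain falls off with 1/r, clamped at the reference distance.
    gain = min(ref_dist / dist, 1.0)

    # Direction mapping: source angle relative to where the head is pointing.
    theta = np.arctan2(dy, dx) - head_yaw

    # Constant-power pan as a stand-in for HRTF filtering.
    # pan = 0 is hard left, pan = 1 is hard right (theta > 0 means the source is to the left).
    pan = 0.5 * (1.0 - np.sin(theta))
    left = mono * gain * np.cos(pan * np.pi / 2)
    right = mono * gain * np.sin(pan * np.pi / 2)
    return np.stack([left, right], axis=0)

# Example: a 440 Hz tone heard from the listener's right side, 2 m away.
sr = 48_000
t = np.linspace(0, 1, sr, endpoint=False)
stereo = spatialize(np.sin(2 * np.pi * 440 * t),
                    source_xy=(2.0, 0.0), listener_xy=(0.0, 0.0), head_yaw=np.pi / 2)
```

In the actual pipeline, the panning step would be replaced by convolving the attenuated signal with the left/right HRTF pair for angle \(\theta_s\), which is what produces convincing binaural audio.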

Experiments and Results

The researchers benchmarked their method (STG++) against other leading dynamic rendering techniques, including 4DGS and 4D Rotor Gaussian Splatting.

Visual Performance

They evaluated the methods on scenes like an opera singer, a laboratory, and a puppy playing outdoors. They used metrics like PSNR (Peak Signal-to-Noise Ratio—higher is better) and LPIPS (Perceptual Similarity—lower is better).
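As a quick reference, PSNR has a simple closed form (LPIPS, by contrast, requires a pretrained perceptual network, e.g. the lpips Python package); a minimal sketch:

```python
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images with values in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```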

Table 3. Test-view performance of 3DGS-based dynamic scene reconstruction methods on the ImViD dataset.

As seen in Table 3, STG++ consistently outperforms the other methods, achieving higher PSNR scores (over 31 dB in the Opera scene) and lower perceptual error.

Visually, the difference is stark. In the figure below, look at the sharpness of the opera singer’s face and the laboratory equipment. 4DGS often blurs details in motion, while STG++ maintains crisp edges.

Figure 5. Comparison of the rendering results of four baselines on Scene 1 Opera, Scene 2 Laboratory, and Scene 6 Puppy.

In Scene 6 (Puppy), which is a complex outdoor environment with grass and fur (the nightmares of computer vision), STG++ manages to retain texture that other methods smooth out.

Audio Performance

Since there is no “ground truth” for how a reconstructed sound field should feel subjectively, the researchers conducted a user study with 21 experts.

Table 5. User study for the sound field construction.

The results were overwhelmingly positive. Over 60% of participants rated the spatial perception as “Excellent,” and over 90% found the experience immersive. This validates that the simple geometric mapping combined with high-quality capture is effective for VR.

The Final Experience

By combining the high-fidelity visual reconstruction of STG++ with the spatial audio pipeline, the researchers created a fully navigable volumetric video.

Figure 7. Visualization of the interaction trajectory and corresponding visual & auditory results.

In Figure 7, you can see a visualization of a user’s path (the blue line) through the virtual space. As the user moves from point 1 to point 5, the visual perspective shifts smoothly, and the audio waveform changes intensity and channel balance based on their proximity to the sound source (the orange dot).

Conclusion: A New Benchmark for VR

The ImViD paper marks a significant step forward in the maturity of volumetric video. By moving away from static, silent datasets and embracing the complexity of moving cameras and multimodal capture, the researchers have provided the community with a challenging new benchmark.

Key takeaways for students and researchers:

  1. Data Matters: You cannot build immersive VR algorithms on low-resolution, static data. The ImViD dataset fills a critical gap.
  2. Hardware Innovation: Sometimes, better software isn’t enough. You need to build a robot (the capture rig) to get the data you need.
  3. Color Consistency: In multi-view geometry, normalizing color and exposure between cameras is just as important as the geometry itself.
  4. Audio is Vital: Geometric sound reconstruction is a computationally efficient way to drastically increase immersion.

This work paves the way for future VR experiences where we can relive memories—concerts, family gatherings, or historical events—not just by watching them on a screen, but by standing inside them.