Introduction
In the rapidly evolving field of embodied AI, quadruped robots—often called “robot dogs”—are becoming essential tools for inspection, search and rescue, and industrial security. To navigate complex environments autonomously, these robots rely heavily on visual perception. Panoramic cameras, which capture a comprehensive 360-degree view, are particularly well-suited for this task, offering a field of view that standard cameras cannot match.
However, training perception models for these robots faces a significant bottleneck: data scarcity. Collecting high-quality panoramic video data is labor-intensive, costly, and technically difficult. Unlike wheeled robots or drones, quadruped robots have a unique gait that introduces high-frequency vertical vibrations—essentially, they bounce as they walk. This “jitter” creates motion patterns that are difficult to simulate and often result in blurry or unusable real-world data.
To bridge this gap, researchers have introduced QuaDreamer, the first panoramic data generation engine specifically tailored for quadruped robots.

As illustrated in Figure 1, QuaDreamer functions as a “world model.” By taking simple inputs—such as a single panoramic image and a bounding box for an object—it generates highly realistic, controllable panoramic videos that mimic the specific motion and vibration patterns of a walking robot dog. Crucially, the researchers demonstrate that this synthetic data is of high enough quality to train real-world perception models, significantly improving performance in multi-object tracking tasks.
Background: The Challenge of Panoramic Generation
Generating video for robots involves more than just creating pretty pictures; the physics must be correct. Standard video generation models, like Stable Video Diffusion (SVD), are trained on general internet videos. They fail to capture the specific, rhythmic vertical jitter inherent to quadruped locomotion. Furthermore, panoramic images (often represented in equirectangular projection) suffer from severe distortion at the top and bottom of the frame. Standard Convolutional Neural Networks (CNNs) often struggle with this, leading to geometric inconsistencies in generated videos.
The researchers identified three primary requirements for a successful quadruped world model:
- Motion Fidelity: It must replicate the high-frequency vertical jitter of the robot.
- Controllability: It needs to generate specific scenarios based on prompts (e.g., “a person walking here”).
- Panoramic Consistency: It must handle the distortion of 360-degree lenses without breaking scene geometry.
The QuaDreamer Methodology
To address these challenges, QuaDreamer introduces a sophisticated architecture built upon Latent Diffusion Models (LDM). The framework is composed of three novel components: Vertical Jitter Encoding (VJE), the Scene-Object Controller (SOC), and the Panoramic Enhancer (PE).

1. Vertical Jitter Encoding (VJE)
The most distinctive characteristic of a robot dog’s video feed is the bounce. If a generative model cannot replicate this, the data is useless for training navigation algorithms. The researchers observed that the vertical movement in a video can be decomposed into two parts: the low-frequency trajectory (the robot moving forward) and the high-frequency jitter (the robot’s steps).
The VJE module uses a spectral analysis approach. It applies a high-pass filter to the vertical coordinates of objects in the scene to isolate the jitter signal.

The mathematical basis for extracting this signal involves a Butterworth high-pass filter, whose frequency response \(H(f)\) has the standard magnitude form

$$|H(f)| = \frac{1}{\sqrt{1 + \left(\frac{f_c}{f}\right)^{2n}}},$$

where \(f_c\) is the cutoff frequency and \(n\) is the filter order. By applying this filter to the vertical coordinate signal \(y(t)\), the researchers extract the specific vibration pattern of the robot, denoted \(y_w(t)\):

$$y_w(t) = \mathcal{F}^{-1}\!\left[H(f)\,\mathcal{F}[y(t)]\right],$$

where \(\mathcal{F}\) denotes the Fourier transform.
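As a concrete illustration, here is a minimal sketch of that decomposition in Python, assuming the vertical coordinate signal is sampled at the video frame rate; the cutoff frequency and filter order are illustrative choices rather than values from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_jitter(y, fs, cutoff_hz=1.5, order=4):
    """Split a vertical-coordinate signal y(t) into a low-frequency trajectory
    and high-frequency jitter using a Butterworth high-pass filter.

    y         : 1-D array of vertical coordinates, one value per frame
    fs        : sampling rate in Hz (the video frame rate)
    cutoff_hz : cutoff separating trajectory from gait-induced jitter (illustrative)
    order     : Butterworth filter order (illustrative)
    """
    b, a = butter(order, cutoff_hz, btype="highpass", fs=fs)
    y_w = filtfilt(b, a, y)   # zero-phase filtering keeps the jitter aligned with frames
    trajectory = y - y_w      # the low-frequency component that remains
    return y_w, trajectory

# Example: a slow forward drift plus a 3 Hz bounce, sampled at 30 fps
t = np.arange(0, 4, 1 / 30)
y = 0.2 * t + 0.05 * np.sin(2 * np.pi * 3 * t)
jitter, trend = extract_jitter(y, fs=30)
```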
Once extracted, this jitter signal is not treated as a bare sequence of numbers. It is projected into 3D world coordinates and then mapped into the camera’s coordinate system. To ensure precise control over the visual details, the researchers convert these camera poses into Plücker embeddings. This representation allows the model to learn the relationship between the camera’s vibration and the resulting pixel shifts more effectively than raw coordinates would.
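The sketch below shows one common way to form per-pixel Plücker embeddings for an equirectangular camera from a camera-to-world pose; the axis conventions and the exact construction used in QuaDreamer may differ.

```python
import numpy as np

def plucker_embedding(R, o, height, width):
    """Per-pixel Plücker embedding (direction, moment) for an equirectangular view.

    R : (3, 3) camera-to-world rotation
    o : (3,)   camera center in world coordinates
    Returns an array of shape (height, width, 6): ray direction d and moment o x d.
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    lon = (u + 0.5) / width * 2 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2 - (v + 0.5) / height * np.pi     # latitude in [-pi/2, pi/2]
    d_cam = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)], axis=-1
    )
    d_world = d_cam @ R.T                            # rotate rays into the world frame
    moment = np.cross(np.broadcast_to(o, d_world.shape), d_world)
    return np.concatenate([d_world, moment], axis=-1)
```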
2. Scene-Object Controller (SOC)
With the jitter signal extracted, the next step is to control the scene content. The Scene-Object Controller (SOC) is responsible for orchestrating the movement of the background (the world passing by) and the foreground (objects like pedestrians or cars).
The SOC effectively decomposes the scene into two fields:
- Background Motion Field: This uses the features from the VJE (the camera jitter) to inform the model how the static world should shake and move.
- Object Motion Field: This dictates how specific objects move within that world.
To represent object motion, the model uses Fourier embeddings, which map bounding box coordinates into a high-dimensional frequency space. This lets the network capture multi-scale motion patterns far more effectively than raw coordinate inputs would.
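A minimal sketch of such a sinusoidal Fourier feature mapping for normalized box coordinates is shown below; the number of frequency bands is an illustrative choice.

```python
import math
import torch

def fourier_embed(boxes, num_bands=8):
    """Map normalized bounding boxes (x1, y1, x2, y2) in [0, 1] to
    multi-scale sinusoidal features.

    boxes : (N, 4) tensor
    Returns a (N, 4 * 2 * num_bands) tensor of concatenated sin/cos features.
    """
    freqs = 2.0 ** torch.arange(num_bands, dtype=boxes.dtype)    # 1, 2, 4, ...
    angles = boxes[..., None] * freqs * math.pi                  # (N, 4, num_bands)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)      # (N, 4, 2 * num_bands)
    return feats.flatten(start_dim=-2)                           # (N, 4 * 2 * num_bands)

# Example: one pedestrian box, normalized by the panorama's width and height
box = torch.tensor([[0.40, 0.55, 0.46, 0.80]])
emb = fourier_embed(box)   # shape (1, 64)
```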

The model also accounts for visibility, since objects might be occluded or move out of frame. A visibility mask \(m_t\) acts as a switch:

$$m_t = \begin{cases} 1, & \text{if the object is visible in frame } t,\\ 0, & \text{otherwise,} \end{cases}$$

so that an object's features contribute to the motion field only in frames where it is actually visible.
Finally, the background features (\(A_{bg}\)) and object features (\(B\)) are fused. This fused representation (\(F_{sum}\)) is injected into the diffusion process via a Gated Self-Attention mechanism. This ensures that the generated video respects both the camera’s shake and the object’s trajectory simultaneously.
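The gated self-attention injection is reminiscent of the mechanism popularized by GLIGEN; the sketch below follows that style, with dimensions and layer placement chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """GLIGEN-style gated self-attention: visual tokens attend jointly to the
    fused control tokens (F_sum), and the result is added back through a
    learnable gate that starts at zero."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: identity at init

    def forward(self, visual, control):
        # visual:  (B, N_vis, dim) latent video tokens
        # control: (B, N_ctl, dim) fused background + object features (F_sum)
        tokens = self.norm(torch.cat([visual, control], dim=1))
        attended, _ = self.attn(tokens, tokens, tokens)
        # keep only the visual positions, scaled by the learned gate
        return visual + torch.tanh(self.gate) * attended[:, : visual.shape[1]]
```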

3. Panoramic Enhancer (PE)
Generating 360-degree video introduces geometric distortions that standard models fail to correct. To solve this, the researchers propose the Panoramic Enhancer, a dual-stream module that processes features in both the spatial and frequency domains.
This module is inserted symmetrically into the U-Net architecture (see Figure 2). It consists of two specialized technologies working in tandem:
A. State Space Models (SSM): To handle the spatial structure and the global consistency of the panoramic image (ensuring the left edge matches the right edge), the researchers utilize the S6 block (a type of State Space Model). SSMs are excellent at modeling long-range dependencies, which is critical for unwrapped spherical images where pixels far apart in the image might be geometrically close in the real world.
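To make the recurrence concrete, here is a highly simplified, sequential sketch of an S6-style selective scan over a flattened token sequence; a real S6 block derives B, C, and the step sizes from the input and uses a hardware-aware parallel scan, so this is only illustrative.

```python
import torch

def selective_scan(x, A, B, C, delta):
    """Sequential form of an S6-style selective state-space recurrence.

    x     : (batch, length, d)  input features (e.g., flattened panorama tokens)
    A     : (d, n)              state matrix (negative values for stability)
    B, C  : (batch, length, n)  input-dependent projections
    delta : (batch, length, d)  input-dependent step sizes
    Returns y of shape (batch, length, d).
    """
    bsz, length, d = x.shape
    h = x.new_zeros(bsz, d, A.shape[1])                 # hidden state per channel
    ys = []
    for t in range(length):
        dt = delta[:, t].unsqueeze(-1)                  # (batch, d, 1)
        A_bar = torch.exp(dt * A)                       # discretized state transition
        B_bar_x = dt * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
        h = A_bar * h + B_bar_x                         # state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))   # per-channel readout
    return torch.stack(ys, dim=1)
```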

B. Fast Fourier Convolutions (FFC): While SSMs handle the structure, the model also needs to preserve high-frequency textures and details. Standard convolutions often create grid-like artifacts in panoramic generation. FFCs operate in the spectral domain, allowing them to capture global periodic structures and fine details without the resolution-sensitivity issues of standard CNNs.
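The heart of an FFC is a pointwise convolution applied in the Fourier domain, which gives every output position a global receptive field. The sketch below shows that spectral branch in isolation (a full FFC also keeps a local convolutional branch), with channel sizes as assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Spectral branch of a Fast Fourier Convolution: FFT, 1x1 convolution over
    stacked real/imaginary channels, inverse FFT."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")  # (B, C, H, W // 2 + 1), complex
        feat = torch.cat([spec.real, spec.imag], dim=1)
        feat = self.act(self.conv(feat))
        real, imag = feat.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag),
                                s=x.shape[-2:], norm="ortho")
```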

The combination of these two streams allows QuaDreamer to generate video that is both geometrically consistent and texturally rich, even under the stress of high-frequency jitter.
Experiments and Results
The researchers evaluated QuaDreamer using the QuadTrack dataset, which contains over 19,000 panoramic frames captured from a quadruped robot’s perspective. They benchmarked their model against state-of-the-art video generation models like Stable Video Diffusion (SVD) and TrackDiffusion.
Visual Fidelity and Control
A major innovation in this paper is the introduction of a new metric called PTrack. Existing metrics (like FVD or LPIPS) measure overall video quality but do not measure how well the video mimics the specific camera jitter of a robot. PTrack uses a point-tracking algorithm (CoTracker) to analyze the trajectories of pixels in the generated video and compares them to the ground-truth vibration profiles.
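The exact PTrack formulation is defined in the paper; the sketch below only conveys the underlying idea of scoring how far trajectories tracked in the generated video deviate from those tracked in the real one, using a simple mean displacement as a stand-in for the actual score.

```python
import numpy as np

def trajectory_error(gen_tracks, gt_tracks):
    """Compare point trajectories from a generated video with ground-truth ones.

    gen_tracks, gt_tracks : arrays of shape (T, N, 2), the pixel positions of N
    tracked points over T frames (e.g., produced by a point tracker such as
    CoTracker). Returns the mean per-point displacement in pixels.
    """
    assert gen_tracks.shape == gt_tracks.shape
    return float(np.linalg.norm(gen_tracks - gt_tracks, axis=-1).mean())
```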

The results were significant. As shown in the visualizations below, QuaDreamer (Ours) generates trajectories that closely follow the erratic, jittery path of the ground truth (GT), whereas baseline models tend to smooth out the motion, losing the “robot dog” feel.

Quantitative results back this up. In Table 1, QuaDreamer outperforms the baseline (TrackDiffusion) on visual quality metrics (LPIPS, SSIM) and significantly improves controllability, as shown by the sharp reduction in the PTrack score (lower is better).

Ablation Study: Do the components matter?
The researchers performed an ablation study to verify if the Scene-Object Controller (SOC) and Panoramic Enhancer (PE) were actually necessary.

The results show a clear trend:
- Adding SOC drastically improves the tracking metrics (MOTA), proving it helps control object placement.
- Adding PE significantly improves the video quality (FVD), proving it helps generate cleaner, more consistent panoramic images.
- Combining both yields the best overall performance.
The Ultimate Test: Downstream Perception Tasks
The most critical question for a world model is: Can a robot learn from this dream?
To answer this, the researchers trained a multi-object tracking model (OmniTrack) using a mix of real data and synthetic data generated by QuaDreamer. If the synthetic data is high-quality, the tracker’s performance on real-world test data should improve.

The results in Table 3 are compelling. Training with QuaDreamer data improved the HOTA (Higher Order Tracking Accuracy) by 10.1% and MOTA (Multiple Object Tracking Accuracy) by 14.8% compared to using real data alone. This confirms that QuaDreamer captures the challenging dynamics of robot locomotion well enough to serve as a valuable data augmentation tool.
Handling Real-World Challenges
One of the subtle advantages of QuaDreamer is its ability to generate clean data even when the source domain is messy. In real-world datasets, the robot’s vibration often causes motion blur in the captured images. QuaDreamer, however, can generate sharp, clean video sequences while maintaining the motion dynamics, effectively “de-blurring” the training data for downstream models.

Conclusion
QuaDreamer represents a significant step forward in embodied AI. By explicitly modeling the unique vertical jitter of quadruped robots and addressing the geometric challenges of panoramic imagery, the researchers have created a powerful engine for data generation.
This system does more than just generate video; it provides a scalable way to train robust perception systems without the need for hundreds of hours of expensive field testing. As the authors note, future work may extend this to include other sensors (like depth or infrared) and more complex robot maneuvers, paving the way for robots that can learn to navigate the world before they ever take their first step.