Imagine you are holding your smartphone, recording a video of your friend running down a beach or a car racing around a track. To you, the scene is clear. But to a computer trying to reconstruct that scene in 3D, it is a nightmare.
For decades, Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) algorithms have relied on two golden rules: the scene must be static (rigid), and the camera must move enough to create parallax (the effect where close objects move faster than far ones). Casual videos break both rules constantly. We have moving objects, we rotate cameras without moving our feet, and we film scenes where “dynamic” elements (like people or cars) dominate the view.
Today, we are diving deep into MegaSaM, a new system from researchers at Google DeepMind, UC Berkeley, and the University of Michigan. This paper proposes a robust pipeline that can take a casual, shaky, dynamic video and extract accurate camera paths and dense 3D depth maps.
What makes MegaSaM special? It bridges the gap between traditional geometric SLAM and modern deep learning, utilizing uncertainty-aware optimization and monocular depth priors to solve problems that usually cause 3D reconstruction to crash and burn.
Figure 1: MegaSaM enables accurate estimation of cameras and scene structure from casual videos. Top: Input frames. Bottom: Estimated camera path and 3D point clouds.
The Problem with “Casual” Video
Why do traditional methods fail on your iPhone videos?
- Dynamic Scenes: Traditional SLAM assumes the world is a rock. If a car moves across the frame, the algorithm might mistakenly think the camera is moving in the opposite direction. This destroys the estimated camera trajectory.
- Limited Parallax: If you stand in one spot and pan your camera (rotation-only motion), mathematically, it is extremely difficult to tell how far away objects are. A small object close up looks exactly like a huge object far away. This leads to “scale ambiguity.”
- Unknown Intrinsics: Unlike a calibrated robot, your phone changes focus and maybe even focal length on the fly, and the algorithm often doesn’t know the exact lens properties.
Recent neural approaches like CasualSAM or DROID-SLAM have attempted to fix this, but they often require hours of “test-time training” (fine-tuning a network on that specific video) or fail when the camera movement is subtle. MegaSaM aims to be fast, accurate, and robust without needing expensive per-video training.
The Architecture: Building on DROID-SLAM
To understand MegaSaM, we first need to look at its foundation: DROID-SLAM. This is a “Deep Visual SLAM” system. Unlike traditional methods that use hand-crafted features (like ORB or SIFT), DROID-SLAM uses a neural network to predict optical flow and “confidence” weights.
However, the magic isn’t just in the neural network; it’s in the Differentiable Bundle Adjustment (BA) layer.
What is Differentiable BA?
In computer vision, Bundle Adjustment is an optimization process that tweaks camera poses and 3D points to minimize the “reprojection error”—the difference between where a point should be on the screen (based on 3D geometry) and where the network sees it.
The core equation for the error \(\mathcal{C}\) looks like this:

$$\mathcal{C} = \sum_{(i,j)} \left\| \hat{\mathbf{u}}_{ij} - \mathbf{u}_{ij} \right\|^{2}_{\Sigma_{ij}}$$

Here, \(\hat{\mathbf{u}}_{ij}\) is the predicted flow between frames \(i\) and \(j\), and \(\mathbf{u}_{ij}\) is the flow induced by the estimated camera motion and depth. The term \(\Sigma_{ij}\) represents the weights (or confidence).
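To make the notation concrete, here is a minimal NumPy sketch of this weighted cost for a diagonal \(\Sigma_{ij}\); the array names and shapes are illustrative, not taken from the actual implementation.

```python
import numpy as np

def reprojection_cost(flow_pred, flow_induced, weights):
    """Weighted reprojection cost: sum over frame pairs of ||u_hat_ij - u_ij||^2 under Sigma_ij.

    flow_pred    : (N, H, W, 2) predicted flow u_hat_ij for N frame pairs
    flow_induced : (N, H, W, 2) flow u_ij induced by current camera poses and depth
    weights      : (N, H, W, 2) per-pixel confidence (diagonal of Sigma_ij)
    """
    residual = flow_pred - flow_induced          # where geometry disagrees with the network
    return float(np.sum(weights * residual**2))  # confidence-weighted squared error
```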
MegaSaM keeps this differentiable layer but fundamentally changes how the weights are calculated and how the system is trained to handle moving objects.
The MegaSaM Pipeline
The system is split into two main phases, mimicking a standard SfM pipeline but supercharged with deep learning:
- Camera Tracking (Frontend): Estimates camera poses and a low-res disparity (depth) map.
- Consistent Video Depth (Backend): Refines the depth map to be high-resolution and temporally consistent.
Figure 9: The MegaSaM system overview. Left: The tracking module uses differentiable Bundle Adjustment with learned motion probability. Right: The backend optimizes for consistent video depth.
Core Innovation 1: Handling Dynamics
If you feed a video of a moving person into a standard SLAM system, it treats the person as static structure, ruining the camera track. MegaSaM solves this by learning a Movement Probability Map.
The Motion Module
The researchers introduce a specific network module, \(F_m\), which looks at the current frame and its neighbors to decide which pixels represent moving objects.
Figure 10: The architecture includes a standard flow network (gray) and a specific Motion Module (blue, \(F_m\)) that predicts movement maps.
Crucially, this map allows the Bundle Adjustment layer to downweight dynamic regions. If the network sees a moving person, it assigns a low weight to those pixels in the optimization, effectively telling the geometric solver: “Ignore this part; it’s unreliable.”
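One hedged way to picture this (the exact combination rule in the paper may differ): scale the learned flow confidence by the probability that a pixel is static, so dynamic pixels contribute almost nothing to the solver.

```python
import numpy as np

def ba_weights(confidence, move_prob, eps=1e-6):
    """Downweight pixels that the motion module F_m flags as moving.

    confidence : (H, W) learned flow confidence in [0, 1]
    move_prob  : (H, W) movement probability from F_m in [0, 1]
    Returns per-pixel weights for the bundle adjustment residuals.
    """
    static_prob = 1.0 - move_prob                      # probability the pixel belongs to the static scene
    return np.maximum(confidence * static_prob, eps)   # keep weights strictly positive for the solver
```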
Figure 3: Learned movement map. Darker areas (right) indicate high motion probability, effectively masking out the dynamic character.
Two-Stage Training
You can’t just train this all at once. If you do, the network gets confused between camera motion and object motion. The authors use a clever two-stage training process:
- Ego-Motion Pretraining: Train the base network on static scenes only. This teaches the model to understand pure camera movement (using the loss in Eq. 7 of the paper).
- Dynamic Fine-Tuning: Freeze the base network and train only the motion module \(F_m\) on dynamic videos. This forces the motion module to absorb the errors caused by moving objects (see the sketch after this list).
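In a deep learning framework, the second stage boils down to freezing the pretrained flow/confidence network and optimizing only \(F_m\). A minimal PyTorch-style sketch, where `base_net`, `motion_module`, `dynamic_loader`, and `training_loss` are hypothetical stand-ins rather than the paper's actual components:

```python
import torch

def freeze_and_finetune(base_net, motion_module, dynamic_loader, training_loss, steps=1000):
    """Stage 2 of the (hypothetical) recipe: freeze the ego-motion network,
    train only the motion module F_m on dynamic videos."""
    for p in base_net.parameters():
        p.requires_grad = False                      # base flow/confidence net stays fixed

    optimizer = torch.optim.Adam(motion_module.parameters(), lr=1e-4)
    for _, (frames, supervision) in zip(range(steps), dynamic_loader):
        with torch.no_grad():
            flow, conf = base_net(frames)            # frozen predictions from stage 1
        move_prob = motion_module(frames)            # movement probability maps (trainable)
        loss = training_loss(flow, conf, move_prob, supervision)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```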
Core Innovation 2: Solving the “Rotation” Problem
When you pan your camera without moving (rotation-only), geometric triangulation fails. You cannot calculate depth from rotation alone. This is where MegaSaM integrates Monocular Depth Priors.
Models like DepthAnything or UniDepth are trained on millions of images to predict depth from a single picture. They are great at relative depth but bad at absolute scale and temporal consistency.
MegaSaM uses these priors to initialize the disparity in the SLAM system. Instead of starting with a random guess (which leads to failure in rotation scenarios), they start with the DepthAnything prediction.
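Conceptually, the initialization just converts the monocular depth prediction into disparity (inverse depth) and hands it to the solver as a starting point. A small sketch, assuming a generic `mono_depth` array from whichever depth model is used; the clamping range is an illustrative choice:

```python
import numpy as np

def init_disparity(mono_depth, min_depth=0.1, max_depth=100.0):
    """Turn a monocular depth prediction into an initial disparity map for the solver.

    mono_depth : (H, W) per-pixel depth from a single-image model (relative scale).
    Returns an (H, W) disparity map used instead of a random or constant initialization.
    """
    depth = np.clip(mono_depth, min_depth, max_depth)  # guard against zeros and outliers
    return 1.0 / depth                                 # disparity = inverse depth
```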
Uncertainty-Aware Global Bundle Adjustment
Here is the most technically elegant part of the paper. We don’t always want to trust the monocular prior. If the camera is moving sideways (good parallax), the geometric signal is better than the neural network’s guess. If the camera is only rotating (bad parallax), we must rely on the neural network.
How does the system decide? It calculates the Epistemic Uncertainty.
In the optimization solver (the Levenberg-Marquardt algorithm), the system computes the Hessian matrix \(\mathbf{H}\). The inverse of the Hessian approximates the covariance (uncertainty) of the parameters:

$$\Sigma \approx \mathbf{H}^{-1}$$
If the uncertainty for the disparity variables is high, it means the geometry is weak (likely rotation-only motion). In this case, the system dynamically increases the regularization weight \(w_d\), forcing the solution to stick closer to the monocular priors.
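A hedged sketch of this decision rule: read approximate disparity variances off the diagonal of the Hessian and map the result to the regularization weight \(w_d\). The thresholding scheme below is illustrative; the paper derives its own weighting.

```python
import numpy as np

def prior_weight_from_uncertainty(hessian_disp_diag, w_low=0.01, w_high=1.0, tau=1.0):
    """Choose the mono-depth regularization weight w_d from epistemic uncertainty.

    hessian_disp_diag : (P,) diagonal of the disparity block of the BA Hessian H.
    Large entries mean strong geometric constraints (low uncertainty); small entries
    mean weak parallax, so we lean on the monocular prior instead.
    """
    variance = 1.0 / np.maximum(hessian_disp_diag, 1e-8)  # Sigma ~ H^{-1} on the diagonal
    mean_uncertainty = float(np.mean(variance))
    # Illustrative rule: high uncertainty -> trust the prior more (larger w_d).
    return w_high if mean_uncertainty > tau else w_low
```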
Figure 4: Visualization of uncertainty. Top: Rotation-only motion results in high uncertainty everywhere (yellow). Bottom: Forward motion yields low uncertainty overall, except for a “hole” of high uncertainty around the epipole, where parallax vanishes.
This creates a hybrid system: Geometry-based when possible, Neural-based when necessary.
Core Innovation 3: Consistent Video Depth
Once the camera poses are locked in, MegaSaM performs a final pass to generate high-quality, dense depth maps. This isn’t just running a depth network on every frame; it’s a global optimization.
The system optimizes for a depth map that satisfies three conditions, combined into a total cost of the form:

$$\mathcal{C}_{\text{total}} = \mathcal{C}_{\text{flow}} + w_{\text{temp}}\,\mathcal{C}_{\text{temp}} + w_{\text{prior}}\,\mathcal{C}_{\text{prior}}$$
- Flow (\(C_{flow}\)): Points should move according to optical flow.
- Temporal (\(C_{temp}\)): Depth should not flicker over time.
- Prior (\(C_{prior}\)): The structure should resemble the monocular depth prediction.
Unlike previous methods that fine-tune a neural network (which takes 20+ minutes), MegaSaM optimizes the depth values directly, taking only seconds to minutes depending on resolution.
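Because the optimization acts on the depth (disparity) values themselves rather than on network weights, the backend can be sketched as plain gradient descent on the combined cost. The sketch below is schematic: the three cost callables stand in for the paper's actual terms, and the weights are placeholders.

```python
import torch

def refine_video_depth(disparity, c_flow, c_temp, c_prior,
                       w_temp=0.1, w_prior=0.05, iters=400, lr=1e-2):
    """Directly optimize per-frame disparity maps against
    C_total = C_flow + w_temp * C_temp + w_prior * C_prior.

    disparity : (T, H, W) tensor of initial disparity maps with requires_grad=True.
    c_flow, c_temp, c_prior : differentiable callables returning scalar costs.
    """
    optimizer = torch.optim.Adam([disparity], lr=lr)
    for _ in range(iters):
        loss = c_flow(disparity) + w_temp * c_temp(disparity) + w_prior * c_prior(disparity)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return disparity.detach()
```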
Experiments and Results
Does it work? The authors tested MegaSaM on standard benchmarks like MPI Sintel (animated, complex motion), DyCheck (real handheld video), and In-the-Wild footage.
Camera Tracking Accuracy
The results show a massive leap in performance. In the tables below, ATE stands for Absolute Translation Error (lower is better).
On the Sintel dataset, MegaSaM drastically outperforms competitors, including the concurrent “MonST3R” method.
Table 1: MegaSaM achieves significantly lower error rates (ATE 0.018 vs 0.036 for the next best) on calibrated Sintel sequences.
The performance gap is even visible on real-world “In-the-Wild” footage, where MegaSaM maintains high accuracy while others drift.
Table 3: On in-the-wild footage, the error (0.004) is an order of magnitude lower than other robust SLAM methods.
Visualizing Trajectories
Numbers are one thing, but looking at the estimated paths tells the real story. In Figure 5, notice how MegaSaM (Red Dashed Line) sticks tightly to the Ground Truth (Blue Line), while baselines like RoDynRF or CasualSAM veer off course due to scene dynamics.
Figure 5: Camera trajectory estimation. MegaSaM (red) deviates significantly less from the ground truth than baselines.
Depth Reconstruction Quality
The depth maps produced by MegaSaM are sharper and more stable. In the comparison below, look at the segmentation and edges. CasualSAM and MonST3R often produce blurry artifacts or “ghosting” around moving objects.
Figure 6: Depth comparisons. The x-t slices (even columns) show temporal stability. MegaSaM produces smooth, continuous depth over time compared to the jittery results of other methods.
Why Every Component Matters (Ablation)
You might ask, “Do we really need the uncertainty stuff? Can’t we just use the depth priors?”
The researchers performed an ablation study to prove the necessity of each part.
- w/o mono-depth init: The system fails on rotation-only videos.
- w/o uncertainty-aware BA: The system relies too much on priors even when geometry is better, reducing accuracy.
Figure 2 visualizes this beautifully. Without the full configuration (c), the reconstructed fences and ground planes are distorted or completely wrong.
Figure 2: (a) Without mono-depth, reconstruction fails. (b) Without uncertainty-aware BA, geometry is distorted. (c) Full MegaSaM yields correct structure.
Limitations
No system is perfect. MegaSaM relies on finding some static background to track.
- Dominant Motion: If a moving object (like a bus) covers 90% of the screen, the system might track the bus instead of the world.
- Collinear Motion: If you are walking and filming yourself (a selfie video), the camera moves with your face. The system struggles to separate your face from the background.
Figure 8: Failure cases. Top: Moving object dominates the view. Bottom: Selfie video where object moves with the camera.
Conclusion
MegaSaM represents a significant step forward in computer vision. It accepts that the real world is messy—dynamic, uncalibrated, and often filmed with poor camera motion. By intelligently combining deep learning priors (which guess what the scene looks like) with geometric constraints (which calculate where things actually are), and mediating between them using uncertainty, it achieves state-of-the-art results.
For students and researchers, MegaSaM illustrates a powerful trend in modern AI: purely end-to-end black boxes are often beaten by systems that integrate deep learning into robust, mathematically grounded optimization frameworks.
For more details, interactive results, and code, you can check out the project page referenced in the paper: mega-sam.github.io.