[ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wei_ALIEN_Implicit_Neural_Representations_for_Human_Motion_Prediction_under_Arbitrary_CVPR_2025_paper.pdf)

Solving the Lag: How ALIEN Predicts Human Motion Despite Arbitrary Network Latency

Imagine you are playing a high-stakes match of virtual reality table tennis against a friend halfway across the world. You swing your controller, expecting your avatar to mirror the movement instantly. But there’s a catch: the internet connection fluctuates. Your swing data travels through a Wide Area Network (WAN), encountering unpredictable delays before reaching the game server or your opponent’s display. In the world of computer vision and robotics, this is known as the latency problem. Whether it is a surrogate robot replicating a human’s movements or a metaverse avatar interacting with a virtual environment, time delays caused by network transmission and algorithm execution are inevitable. ...

9 min · 1775 words
[AIpparel: A Multimodal Foundation Model for Digital Garments 🔗](https://arxiv.org/abs/2412.03937)

AIpparel: The First Foundation Model for Designing Digital Fashion

Fashion is an intrinsic part of human culture, serving as a shield against the elements and a canvas for self-expression. However, the backend of the fashion industry—specifically the creation of sewing patterns—remains a surprisingly manual and technical bottleneck. While generative AI has revolutionized 2D image creation (think Midjourney or DALL-E), generating manufacturable garments is a different beast entirely. A sewing pattern isn’t just a picture of a dress; it is a complex set of 2D panels with precise geometric relationships that must stitch together to form a 3D shape. To date, AI models for fashion have been “single-modal,” meaning they could perhaps turn a 3D scan into a pattern, or text into a pattern, but they lacked the flexibility to understand images, text, and geometry simultaneously. ...
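
To make the “set of 2D panels with precise geometric relationships” concrete, here is a minimal sketch of such a structure in Python. The `Panel` and `Stitch` classes and their fields are purely illustrative assumptions, not AIpparel’s actual pattern representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Panel:
    name: str
    outline: List[Tuple[float, float]]   # 2D vertices of the flat panel, in order

@dataclass
class Stitch:
    edge_a: Tuple[str, int]              # (panel name, edge index on that panel)
    edge_b: Tuple[str, int]              # edge it must be sewn to

# Two rectangular panels joined along their side seams (toy example).
front = Panel("front", [(0, 0), (50, 0), (50, 70), (0, 70)])
back = Panel("back", [(0, 0), (50, 0), (50, 70), (0, 70)])
side_seams = [Stitch(("front", 1), ("back", 3)), Stitch(("front", 3), ("back", 1))]
print(len(side_seams), "seams joining", front.name, "and", back.name)
```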

2024-12 · 7 min · 1382 words
[A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_A_Unified_Approach_to_Interpreting_Self-supervised_Pre-training_Methods_for_3D_CVPR_2025_paper.pdf)

Why Does 3D Pre-training Work? Unlocking the Black Box with Game Theory

In the rapidly evolving world of 3D computer vision, self-supervised pre-training has become the gold standard. Whether you are building perception systems for autonomous vehicles or analyzing 3D medical scans, the recipe for success usually involves taking a massive, unlabeled dataset, pre-training a Deep Neural Network (DNN) on it, and then fine-tuning it for your specific task. We know that it works. Pre-training consistently boosts performance. But why does it work? ...

10 min · 2003 words
[4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion 🔗](https://arxiv.org/abs/2412.04462)

Beyond 2D: How 4Real-Video Generates Consistent 4D Worlds in Seconds

Imagine you are watching a video of a cat playing with a toy. In a standard video, you are a passive observer, locked into the camera angle the videographer chose. Now, imagine you could pause that video at any second, grab the screen, and rotate the camera around the frozen cat to see the toy from the back. Then, you press play, and the video continues from that new angle. ...

2024-12 · 8 min · 1646 words
[3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion 🔗](https://arxiv.org/abs/2409.12957)

3DTopia-XL: The Future of High-Fidelity 3D Asset Generation with Primitive Diffusion

The demand for high-quality 3D assets is exploding. From the immersive worlds of video games and virtual reality to the practical applications of architectural visualization and film production, the need for detailed, realistic 3D models is higher than ever. Traditionally, creating these assets has been a labor-intensive bottleneck, requiring skilled artists to sculpt geometry, paint textures, and tune material properties manually. In recent years, Generative AI has promised to automate this pipeline. We’ve seen models that can turn text into 3D shapes or turn a single image into a rotating mesh. However, a significant gap remains between what AI generates and what professional graphics engines actually need. Most current AI models produce “baked” assets—meshes with color painted directly onto the vertices. They often look like plastic toys or clay models, lacking the complex material properties (like how shiny metal is versus how matte rubber is) required for photorealistic rendering. ...
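
To illustrate the gap between “baked” color and engine-ready assets, here is a small sketch using the standard metallic/roughness material convention. The `PBRMaterial` class and its values are illustrative assumptions, not 3DTopia-XL’s output format.

```python
from dataclasses import dataclass

@dataclass
class PBRMaterial:
    albedo: tuple        # base color
    metallic: float      # 1.0 = metal, 0.0 = dielectric
    roughness: float     # 0.0 = mirror-like, 1.0 = fully matte

# A "baked" asset collapses all appearance into one color per vertex...
baked_vertex_color = (0.8, 0.8, 0.82)

# ...while a graphics engine expects separate shading properties per material.
brushed_steel = PBRMaterial(albedo=(0.8, 0.8, 0.82), metallic=1.0, roughness=0.35)
rubber_grip = PBRMaterial(albedo=(0.1, 0.1, 0.1), metallic=0.0, roughness=0.95)
print(brushed_steel, rubber_grip, sep="\n")
```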

2024-09 · 8 min · 1577 words
[3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes 🔗](https://arxiv.org/abs/2411.14974)

Beyond Gaussians: Why 3D Smooth Convexes are the Future of Radiance Fields

In the rapidly evolving world of computer vision, the quest to reconstruct reality inside a computer has seen massive leaps in just a few years. We started with photogrammetry, moved to the revolutionary Neural Radiance Fields (NeRFs), and most recently arrived at 3D Gaussian Splatting (3DGS). 3DGS changed the game by allowing for real-time rendering and fast training speeds that NeRFs struggled to achieve. It represents a scene not as a continuous volume, but as millions of discrete 3D Gaussian “blobs.” While this works incredibly well for organic, fuzzy structures, it hits a wall when dealing with the man-made world. Look around you—walls, tables, screens, and buildings are defined by sharp edges and flat surfaces. Gaussians, by their nature, are soft, round, and diffuse. Trying to represent a sharp cube with round blobs is like trying to build a Lego house out of water balloons; you need an excessive number of them to approximate the flat sides, and it’s still never quite perfect. ...
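
To see why Gaussians struggle with flat geometry, here is a minimal sketch that evaluates a single anisotropic 3D Gaussian “blob”: even when squashed nearly flat, its density falls off smoothly instead of stopping at an edge. The function name and toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_splat_density(x, mean, cov):
    """Unnormalized density of one anisotropic 3D Gaussian 'blob'.

    x:    (N, 3) query points
    mean: (3,)   blob center
    cov:  (3, 3) covariance (size/orientation of the ellipsoid)
    """
    diff = x - mean                              # (N, 3)
    sol = np.linalg.solve(cov, diff.T).T         # cov^{-1} (x - mu)
    mahalanobis = np.sum(diff * sol, axis=1)     # (x - mu)^T cov^{-1} (x - mu)
    return np.exp(-0.5 * mahalanobis)

# A wall-like blob: very thin along z, wide along x and y.
mean = np.zeros(3)
cov = np.diag([1.0, 1.0, 0.0001])

# Even this "flattened" Gaussian fades gradually toward its sides, which is
# why sharp corners and flat faces need many overlapping blobs to approximate.
query = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [1.5, 0.0, 0.0]])
print(gaussian_splat_density(query, mean, cov))  # ~[1.0, 0.67, 0.32]
```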

2024-11 · 10 min · 1998 words
[Zero-Shot Monocular Scene Flow Estimation in the Wild 🔗](https://arxiv.org/abs/2501.10357)

Taming the Wild - A New Standard for Zero-Shot Monocular Scene Flow

Imagine you are looking at a standard video clip. It’s a 2D sequence of images. Your brain, processing this monocular (single-eye) view, instantly understands two things: the 3D structure of the scene (what is close, what is far) and the motion of objects (where things are moving in that 3D space). For computer vision models, replicating this human intuition is an incredibly difficult task known as Monocular Scene Flow (MSF). While we have seen massive leaps in static depth estimation and 2D optical flow, estimating dense 3D motion from a single camera remains an elusive frontier. ...
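
As a rough illustration of what “dense 3D motion” means, the sketch below lifts one pixel to 3D at two time steps (using a known depth and an optical-flow correspondence) and takes the difference, assuming the camera itself is static. The intrinsics and values are made up; this is not the paper’s method.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth to a 3D point in camera coordinates."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

# Toy intrinsics (assumed): focal length 500px, principal point (320, 240).
fx = fy = 500.0
cx, cy = 320.0, 240.0

p_t0 = backproject(400.0, 240.0, depth=4.0, fx=fx, fy=fy, cx=cx, cy=cy)
p_t1 = backproject(410.0, 240.0, depth=3.8, fx=fx, fy=fy, cx=cx, cy=cy)  # where optical flow says it went
scene_flow = p_t1 - p_t0
print(scene_flow)  # the 3D displacement of that point between frames (meters)
```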

2025-01 · 8 min · 1609 words
[VGGT: Visual Geometry Grounded Transformer 🔗](https://arxiv.org/abs/2503.11651)

One Pass to Rule Them All: Understanding VGGT for Instant 3D Reconstruction

For decades, the field of computer vision has chased a specific “Holy Grail”: taking a handful of flat, 2D photos scattered around a scene and instantly transforming them into a coherent 3D model. Traditionally, this process—known as Structure-from-Motion (SfM)—has been a slow, mathematical grind. It involves detecting features, matching them across images, solving complex geometric equations to find camera positions, and then running iterative optimization algorithms like Bundle Adjustment to refine everything. While effective, it is computationally expensive and often brittle. ...
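
For a sense of what that iterative grind optimizes, here is a minimal sketch of the reprojection error that Bundle Adjustment minimizes in classical SfM; the function and toy numbers are illustrative, and VGGT’s whole point is to replace this refinement loop with a single feed-forward pass.

```python
import numpy as np

def reprojection_error(X_world, R, t, K, uv_observed):
    """Classical SfM residual that Bundle Adjustment iteratively minimizes.

    X_world:     (N, 3) estimated 3D points
    R, t:        camera rotation (3, 3) and translation (3,)
    K:           (3, 3) camera intrinsics
    uv_observed: (N, 2) detected 2D feature locations in the image
    """
    X_cam = X_world @ R.T + t                 # world -> camera coordinates
    uv_hom = X_cam @ K.T                      # apply intrinsics
    uv_pred = uv_hom[:, :2] / uv_hom[:, 2:3]  # perspective divide
    return np.linalg.norm(uv_pred - uv_observed, axis=1)  # pixel error per point

# Toy example: identity pose, focal length 500px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
X = np.array([[0.1, -0.2, 2.0]])
uv = np.array([[345.0, 190.0]])
print(reprojection_error(X, np.eye(3), np.zeros(3), K, uv))  # [0.] -> perfectly consistent
```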

2025-03 · 9 min · 1886 words
[UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming 🔗](https://arxiv.org/abs/2307.16375)

Breaking the Distributed Bottleneck: How UniAP Unifies Parallel Training Strategies

If you have ever tried to train a massive Large Language Model (LLM) like Llama or a vision giant like ViT, you know the struggle: a single GPU simply doesn’t cut it. To train these behemoths, we need distributed learning across clusters of GPUs. But here is the catch: simply having a cluster isn’t enough. You have to decide how to split the model. Do you split the data? Do you split the layers? Do you split the tensors inside the layers? ...
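
To ground one of those choices, the sketch below shows the “split the tensors inside the layers” option (tensor parallelism) with plain NumPy standing in for two devices. It is purely illustrative and says nothing about how UniAP actually searches over strategies.

```python
import numpy as np

# Intra-layer (tensor) parallelism sketch: a linear layer y = x @ W is split
# column-wise across two workers; each computes its own slice of the output.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))        # a batch of activations
W = rng.normal(size=(64, 128))      # full weight matrix

W_dev0, W_dev1 = np.split(W, 2, axis=1)   # each "device" holds half the columns
y_dev0 = x @ W_dev0                       # computed on device 0
y_dev1 = x @ W_dev1                       # computed on device 1
y_parallel = np.concatenate([y_dev0, y_dev1], axis=1)  # gather the output shards

assert np.allclose(y_parallel, x @ W)     # identical result to the unsplit layer
```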

2023-07 · 8 min · 1647 words
[The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition 🔗](https://arxiv.org/abs/2502.21201)

Can AI See the Chimp for the Trees? Mitigating Background Bias in Wildlife Monitoring

Introduction Imagine you are training a computer vision model to recognize a chimpanzee climbing a tree. You feed it thousands of hours of video footage. The model achieves high accuracy, and you are thrilled. But then, you test it on a video of an empty forest with no chimpanzee in sight, and the model confidently predicts: “Climbing.” Why does this happen? The model has fallen into a trap known as shortcut learning. Instead of learning the complex motion of the limbs or the texture of the fur, the model took the path of least resistance: it learned that “vertical tree trunks” usually equal “climbing.” It memorized the background, not the behavior. ...

2025-02 · 9 min · 1735 words
[TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion 🔗](https://arxiv.org/abs/2504.11773)

TacoDepth: Breaking the Speed Limit in Radar-Camera Depth Estimation

In the rapidly evolving world of autonomous driving and robotics, perception is everything. Vehicles need to know not just what is around them, but exactly how far away it is. While LiDAR sensors provide excellent depth data, they are expensive. A more cost-effective alternative is fusing data from cameras (rich visual detail) and mmWave Radar (reliable depth and velocity). However, Radar-Camera fusion has a major bottleneck: efficiency. Existing methods are often slow and computationally heavy, relying on complex, multi-stage pipelines that stand in the way of real-time applications. ...

2025-04 · 8 min · 1658 words
[Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models 🔗](https://arxiv.org/abs/2501.01423)

Breaking the Trade-off: How Aligning VAEs with Foundation Models Supercharges Diffusion Training

In the rapidly evolving world of generative AI, Latent Diffusion Models (LDMs) like Stable Diffusion and Sora have become the gold standard for creating high-fidelity images and videos. These models work their magic not on pixels directly, but in a compressed “latent space.” This compression is handled by a component called a Visual Tokenizer, typically a Variational Autoencoder (VAE). For a long time, the assumption was simple: if we want better images, we need better tokenizers. Specifically, we assumed that increasing the capacity (dimensionality) of the tokenizer would allow it to capture more details, which would, in turn, allow the diffusion model to generate more realistic images. ...
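
Some back-of-the-envelope arithmetic helps here. The numbers below assume a common VAE configuration (8× spatial downsampling, 4 latent channels) purely for illustration, not values from the paper, to show what “increasing tokenizer capacity” does to the latent tensor the diffusion model must learn.

```python
# A VAE tokenizer downsamples each spatial side by a factor f and keeps c latent
# channels, so the diffusion model sees a much smaller tensor than the image.
H, W = 512, 512           # input image resolution
f, c = 8, 4               # assumed: 8x spatial downsampling, 4 latent channels

pixels = H * W * 3
latents = (H // f) * (W // f) * c
print(latents, pixels / latents)   # 16384 latent values -> a 48x compression

# "More tokenizer capacity" usually means more latent channels, e.g. c = 16,
# which preserves more detail for reconstruction but leaves the diffusion model
# a larger, harder-to-model latent space -- the trade-off the paper examines.
latents_big = (H // f) * (W // f) * 16
print(pixels / latents_big)        # only 12x compression
```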

2025-01 · 10 min · 1955 words
[Navigation World Models 🔗](https://arxiv.org/abs/2412.03572)

Can Robots Dream of Walking? Understanding Navigation World Models

How do you navigate a crowded room to reach the exit? You likely don’t just stare at your feet and react to obstacles the moment they touch your toes. Instead, you project a mental simulation. You imagine a path, predict that a person might step in your way, and adjust your trajectory before you even take a step. You possess an internal model of the world that allows you to simulate the future. ...

2024-12 · 9 min · 1736 words
[Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models 🔗](https://arxiv.org/abs/2409.17146)

Breaking the Cycle of Distillation: How Molmo Builds State-of-the-Art VLMs from Scratch

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) have become ubiquitous. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet can describe complex images, interpret charts, and answer questions about the visual world with startling accuracy. However, these proprietary models are “walled gardens.” We interact with them via APIs, but we don’t know exactly how they were built or what data they were trained on. ...

2024-09 · 9 min · 1712 words
[MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos 🔗](https://arxiv.org/abs/2412.04463)

Taming the Chaos: How MegaSaM Solves 3D Reconstruction for Casual, Dynamic Videos

Imagine you are holding your smartphone, recording a video of your friend running down a beach or a car racing around a track. To you, the scene is clear. But to a computer trying to reconstruct that scene in 3D, it is a nightmare. For decades, Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) algorithms have relied on two golden rules: the scene must be static (rigid), and the camera must move enough to create parallax (the effect where close objects move faster than far ones). Casual videos break both rules constantly. We have moving objects, we rotate cameras without moving our feet, and we film scenes where “dynamic” elements (like people or cars) dominate the view. ...
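
To make the parallax rule concrete, here is a tiny numeric sketch (all values invented): for a sideways camera translation, the image shift of a point scales with 1/depth, which is exactly the cue that vanishes when you only rotate the camera in place.

```python
import numpy as np

# When a camera translates sideways by t meters, a point at depth Z shifts in
# the image by roughly f * t / Z pixels: nearby points move far more than
# distant ones, and pure rotation (t = 0) produces no such depth cue at all.
focal_px = 600.0
translation_m = 0.3
depths_m = np.array([1.0, 5.0, 50.0])

shift_px = focal_px * translation_m / depths_m
print(shift_px)   # [180.  36.   3.6]
```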

2024-12 · 9 min · 1787 words
[FoundationStereo: Zero-Shot Stereo Matching 🔗](https://arxiv.org/abs/2501.09898)

FoundationStereo: Bringing Zero-Shot Generalization to Stereo Depth Estimation

In the rapid evolution of computer vision, we have seen “Foundation Models” transform how machines understand images. Models like Segment Anything (SAM) or DepthAnything have demonstrated an incredible ability to generalize: they can perform tasks on images they have never seen before without needing specific fine-tuning. However, one corner of computer vision has lagged behind in this zero-shot revolution: Stereo Matching. Stereo matching—the process of estimating depth by comparing two images taken from slightly different viewpoints—has historically relied on training deep networks on specific datasets. A model trained on driving scenes (like KITTI) usually fails when tested on indoor scenes (like Middlebury). It’s a classic case of overfitting to the domain. ...
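
The geometry underneath stereo matching is simple even though the matching itself is hard. The sketch below applies the textbook depth-from-disparity relation with a made-up focal length and baseline, just to show why finding the correct per-pixel disparity matters so much.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Core stereo geometry: depth = f * B / d.

    disparity_px: horizontal pixel shift of a point between left and right images
    focal_px:     focal length in pixels
    baseline_m:   distance between the two cameras in meters
    """
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)

# Toy numbers: 700px focal length, 12cm baseline (values are illustrative).
disparities = np.array([84.0, 42.0, 8.4])
print(depth_from_disparity(disparities, focal_px=700.0, baseline_m=0.12))
# -> [ 1.  2. 10.] meters: the smaller the disparity, the farther the point
```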

2025-01 · 9 min · 1750 words
[Descriptor-In-Pixel : Point-Feature Tracking for Pixel Processor Arrays 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Bose_Descriptor-In-Pixel__Point-Feature_Tracking_For_Pixel_Processor_Arrays_CVPR_2025_paper.pdf)

Smart Sensors: How Computing Inside the Pixel enables 3000 FPS Feature Tracking

Computer vision has a bottleneck problem. In a traditional setup—whether it’s a smartphone, a VR headset, or a drone—the camera sensor acts as a “dumb” bucket. It captures millions of photons, converts them to digital values, and then sends a massive stream of raw data to an external processor (CPU or GPU) to figure out what it’s looking at. ...
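
A quick back-of-the-envelope calculation (with assumed, purely illustrative sensor numbers) shows why shipping raw frames off-sensor at thousands of frames per second is painful, and why sending only sparse feature results is attractive.

```python
# Illustrative bandwidth comparison: streaming full frames versus feature tracks.
width, height, bytes_per_pixel, fps = 256, 256, 1, 3000   # assumed sensor settings
raw_bandwidth = width * height * bytes_per_pixel * fps
print(raw_bandwidth / 1e6, "MB/s of raw pixels")          # ~196.6 MB/s off-sensor

tracked_features, bytes_per_feature = 100, 8              # e.g. coordinates plus an id
feature_bandwidth = tracked_features * bytes_per_feature * fps
print(feature_bandwidth / 1e6, "MB/s of feature tracks")  # 2.4 MB/s
```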

7 min · 1380 words
[DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models 🔗](https://arxiv.org/abs/2503.01774)

Cleaning Up the Mess: How Single-Step Diffusion is Revolutionizing 3D Reconstruction

We are currently witnessing a golden age of neural rendering. Technologies like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have allowed us to turn a handful of 2D photographs into immersive, navigable 3D scenes. The results are often breathtaking—until you stray too far from the original camera path. As soon as you move the virtual camera to a “novel view”—an angle not seen during training—the illusion often breaks. You encounter “floaters” (spurious geometry hanging in the air), blurry textures, and ghostly artifacts. This happens because these regions are underconstrained; the 3D model simply doesn’t have enough data to know what should be there, so it guesses, often poorly. ...

2025-03 · 8 min · 1603 words
[Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World 🔗](https://arxiv.org/abs/2505.04788)

Solved via Relaxation - A New Global Approach to Vanishing Point Estimation

If you look down a long, straight hallway or stare at a skyscraper from the street, you intuitively understand perspective. Parallel lines in the real world—like the edges of a ceiling or the sides of a building—appear to converge at a specific spot in the distance. In Computer Vision, these are called Vanishing Points (VPs). Locating these points is crucial for tasks like camera calibration, 3D reconstruction, and autonomous navigation. In structured environments (like cities or indoors), we often rely on the Manhattan World assumption, which posits that the world is built on three mutually orthogonal axes (up-down, left-right, forward-backward). ...
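
As a refresher on the underlying projective geometry (not the paper’s convex relaxation), the sketch below intersects two image lines expressed in homogeneous coordinates; if those lines come from parallel 3D edges, their intersection is the vanishing point.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([*p, 1.0], [*q, 1.0])

def vanishing_point(line_a, line_b):
    """Intersection of two homogeneous lines; for parallel 3D edges this is the VP."""
    vp = np.cross(line_a, line_b)
    return vp[:2] / vp[2]

# Two image segments from parallel 3D edges (e.g., the two ceiling edges of a
# hallway). Both are chosen so that they meet at pixel (400, 300).
l1 = line_through((0.0, 100.0), (200.0, 200.0))   # slope +0.5
l2 = line_through((0.0, 500.0), (200.0, 400.0))   # slope -0.5
print(vanishing_point(l1, l2))                    # -> [400. 300.]
```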

2025-05 · 7 min · 1336 words
[3D Student Splatting and Scooping 🔗](https://arxiv.org/abs/2503.10148)

Beyond Gaussians: How Student's t-Distribution and Negative Density Revolutionize Neural Rendering

In the rapidly evolving world of computer graphics and computer vision, few techniques have made as much noise recently as 3D Gaussian Splatting (3DGS). It offered a brilliant alternative to Neural Radiance Fields (NeRFs), allowing for real-time rendering of complex scenes by representing them as millions of 3D Gaussian ellipsoids. It was fast, high-quality, and explicit. But as with any foundational technology, once the dust settled, researchers began to ask: Is the Gaussian distribution actually the best primitive for the job? ...
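
For intuition on why the Student’s t-distribution named in the title is an interesting alternative, the snippet below compares its density to a Gaussian at a few points (illustrative values only): the t-distribution has heavier tails controlled by its degrees-of-freedom parameter and recovers the Gaussian as that parameter grows.

```python
import numpy as np
from scipy.stats import norm, t

xs = np.array([0.0, 2.0, 4.0])
print(norm.pdf(xs))        # Gaussian density: the tails vanish very quickly
print(t.pdf(xs, df=2))     # heavy-tailed t (df = 2): keeps mass far from the center
print(t.pdf(xs, df=1000))  # large df: nearly indistinguishable from the Gaussian
```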

2025-03 · 10 min · 2017 words