[EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing 🔗](https://arxiv.org/abs/2412.08988)

Mastering the Art of Emotional Dubbing: A Deep Dive into EmoDubber

Have you ever watched a dubbed movie where the voice acting felt completely detached from the actor’s face? Perhaps the lips stopped moving, but the voice kept going, or the character on screen was screaming in rage while the dubbed voice sounded mildly annoyed. This disconnect breaks immersion instantly. This challenge falls under the domain of Visual Voice Cloning (V2C). The goal is to take a text script, a video clip of a speaker, and a reference audio track, and then generate speech that matches the video’s lip movements while cloning the reference speaker’s voice. ...

2024-12 · 9 min · 1707 words
[EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision 🔗](https://arxiv.org/abs/2409.02224)

Feeling the Pressure: How EgoPressure Brings Touch to Egocentric Computer Vision

Introduction Imagine playing a piano in Virtual Reality. You see your digital hands hovering over the keys, but when you strike a chord, there is a disconnect. You don’t feel the resistance, and the system struggles to know exactly how hard you pressed. Or consider a robot attempting to pick up a plastic cup; without knowing the pressure it exerts, it might crush the cup or drop it. In the world of computer vision, we have become incredibly good at determining where things are (pose estimation) and what they are (object recognition). However, understanding the physical interaction—specifically touch contact and pressure—remains a massive challenge. This is particularly difficult in “egocentric” vision (first-person perspective), which is the standard view for AR/VR headsets and humanoid robots. ...

2024-09 · 8 min · 1601 words
[Efficient Motion-Aware Video MLLM 🔗](https://arxiv.org/abs/2503.13016)

Stop Sampling Frames: How Compressed Video Structures Make AI Faster and Smarter

Introduction If you have ever tried to build a computer vision system that understands video, you have likely encountered the “sampling dilemma.” Videos are essentially heavy stacks of images. To process a video using a Multimodal Large Language Model (MLLM), the standard approach is Uniform Frame Sampling. You extract one frame every second (or every few frames), encode each one as an image, stack them up, and feed them into the model. ...
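To make that baseline concrete, here is a minimal sketch of uniform frame sampling with OpenCV. The `sample_frames` helper, the one-frame-per-second default, and the FPS fallback are illustrative assumptions, not the paper's actual pipeline.

```python
import cv2

def sample_frames(video_path: str, every_n_seconds: float = 1.0):
    """Keep one frame every `every_n_seconds` seconds of video (illustrative baseline)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # assume 30 fps if metadata is missing
    step = max(int(round(fps * every_n_seconds)), 1)

    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                               # end of video (or read error)
            break
        if index % step == 0:
            frames.append(frame)                 # each kept frame is later encoded as image tokens
        index += 1
    cap.release()
    return frames
```

Each sampled frame is then encoded independently and the token stacks are concatenated as the MLLM's visual input, which is exactly the cost the paper aims to avoid.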

2025-03 · 12 min · 2351 words
[EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Rahman_EffiDec3D_An_Optimized_Decoder_for_High-Performance_and_Efficient_3D_Medical_CVPR_2025_paper.pdf)

Cutting the Fat: How EffiDec3D Revolutionizes 3D Medical Image Segmentation

Introduction In the world of medical artificial intelligence, precision is everything. A fraction of a millimeter can be the difference between a benign anomaly and a malignant tumor. Over the last few years, Deep Learning—specifically U-shaped architectures and Vision Transformers—has become the gold standard for automating medical image segmentation. However, this precision comes at a steep price. Modern State-of-the-Art (SOTA) models for 3D medical imaging, such as SwinUNETR or 3D UX-Net, are computationally massive. They require expensive GPUs with high memory, making them difficult to deploy in real-time clinical settings or on edge devices like portable ultrasound machines. ...

7 min · 1489 words
[ETAP: Event-based Tracking of Any Point 🔗](https://arxiv.org/abs/2412.00133)

Seeing the Unseen: How Event Cameras are Revolutionizing Point Tracking

Imagine trying to track a specific point on the blade of a rapidly spinning fan. Or perhaps you are trying to follow a bird diving into a dark shadow. If you use a standard video camera, you will likely run into two major walls: motion blur and dynamic range limitations. The fan blade becomes a smear, and the bird disappears into the darkness. ...

2024-12 · 12 min · 2409 words
[ESC: Erasing Space Concept for Knowledge Deletion 🔗](https://arxiv.org/abs/2504.02199)

True Forgetting: Deleting Deep Learning Knowledge with Erasing Space Concept (ESC)

Introduction In the era of GDPR and increasing privacy concerns, the “right to be forgotten” has become a critical requirement for technology companies. For deep learning, this poses a massive engineering challenge. If a user requests their data be removed from a trained AI model, how do we ensure the model actually “forgets” them? The standard approach is Machine Unlearning (MU). The goal is to update a model to look as if it never saw specific data, without having to retrain the whole thing from scratch (which is expensive and slow). However, recent research reveals a disturbing reality: most current unlearning methods are superficial. They might change the model’s final output, but the sensitive “knowledge” often remains hidden deep within the neural network’s feature extractor. ...

2025-04 · 7 min · 1486 words
[EBS-EKF: Accurate and High Frequency Event-based Star Tracking 🔗](https://arxiv.org/abs/2503.20101)

Navigating the Stars at 1000 Hz: How Event Cameras are Revolutionizing Spacecraft Attitude Control

Introduction For centuries, sailors navigated the open oceans by looking up at the stars. Today, spacecraft orbiting Earth and traveling through the solar system do exactly the same thing. By identifying specific star patterns, a satellite determines its precise orientation in space—known as its “attitude.” This process is handled by a device called a Star Tracker. Standard star trackers use Active Pixel Sensors (APS)—essentially the same technology found in your smartphone camera. They take a picture of the sky, process the frame to find stars, identify them against a catalog, and compute the spacecraft’s orientation. While highly accurate, they are slow. Most operate at 2-10 Hz (frames per second). If a satellite needs to spin quickly or stabilize against rapid vibrations, the star images blur and standard trackers fail, leading to a loss of navigation. ...

2025-03 · 8 min · 1599 words
[Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera 🔗](https://arxiv.org/abs/2412.12861)

Untangling the Chaos: How Dyn-HaMR Solves Hand Motion with Moving Cameras

Introduction In the rapidly evolving worlds of Augmented Reality (AR), Virtual Reality (VR), and robotics, understanding human movement is fundamental. We have become quite good at tracking bodies and hands when the camera is sitting still on a tripod. But the real world is dynamic. In egocentric scenarios—like wearing smart glasses or a GoPro—the camera moves with you. This creates a chaotic blend of motion. If a hand appears to move left in a video, is the hand actually moving left, or is the user’s head turning right? ...

2024-12 · 8 min · 1532 words
[DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding 🔗](https://arxiv.org/abs/2504.14920)

How to Teach AI to Focus: A Deep Dive into DyFo and Training-Free Visual Search

Introduction Imagine you are trying to find a specific friend in a crowded stadium. You don’t stare at the entire stadium at once and hope to instantly process every single face. Instead, your eyes dart around. You scan sections, focus on a group of people wearing the right color jersey, zoom in on a specific row, and filter out the noise. This cognitive process is known as visual search, and it is fundamental to how humans interact with the world. We dynamically adjust our focus to filter out irrelevant information and concentrate on what matters. ...

2025-04 · 10 min · 2119 words
[DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction 🔗](https://arxiv.org/abs/2412.04464)

Seeing Double: How Dual Point Maps Revolutionize 3D Object Reconstruction

Reconstructing a 3D object from a single 2D image is one of computer vision’s classic “ill-posed” problems. When you look at a photograph of a galloping horse, your brain instantly understands the 3D shape, the articulation of the limbs, and the parts of the animal that are hidden from view. For a computer, however, inferring this geometry from a grid of pixels is incredibly difficult, especially when the object is deformable—meaning it can bend, stretch, and move (like animals or humans). ...

2024-12 · 8 min · 1571 words
[DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery 🔗](https://arxiv.org/abs/2503.16964)

Taming the Wild: How DroneSplat Brings Robust 3D Reconstruction to Drone Imagery

Introduction Drones have revolutionized how we capture the world. From inspecting massive bridges to surveying urban landscapes and preserving cultural heritage, the ability to put a camera anywhere in 3D space is invaluable. However, turning those aerial photos into accurate, photorealistic 3D models is a computational nightmare, especially when the real world refuses to sit still. Traditional methods like photogrammetry and recent Neural Radiance Fields (NeRFs) have pushed the boundaries of what is possible. More recently, 3D Gaussian Splatting (3DGS) has taken the field by storm, offering real-time rendering speeds that NeRFs could only dream of. But there is a catch: most of these algorithms assume the world is perfectly static and that we have perfect camera coverage. ...

2025-03 · 8 min · 1654 words
[Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map 🔗](https://arxiv.org/abs/2410.23780)

Beyond Geometry: Teaching Autonomous Vehicles to Read Traffic Rules

Imagine you are driving down a busy urban street. You see a lane marked with a solid white line, but overhead, a blue sign indicates “Bus Lane: 7:00-9:00, 17:00-19:00.” You glance at the clock; it’s 10:30 AM. You confidently merge into the lane. This decision-making process—perceiving the geometry of the road, reading a sign, understanding the temporal rule, and associating that rule with a specific lane—is second nature to humans. For autonomous vehicles (AVs), however, it is a surprisingly complex challenge. ...
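As a toy illustration of the temporal reasoning in that scenario, the sketch below encodes the sign's two active windows and checks the 10:30 AM case. The `is_lane_restricted` helper and the rule format are hypothetical, not the benchmark's actual map representation.

```python
from datetime import time

# Hypothetical encoding of the sign "Bus Lane: 7:00-9:00, 17:00-19:00".
BUS_LANE_WINDOWS = [(time(7, 0), time(9, 0)), (time(17, 0), time(19, 0))]

def is_lane_restricted(now: time, windows=BUS_LANE_WINDOWS) -> bool:
    """Return True if the bus-lane restriction is active at `now`."""
    return any(start <= now <= end for start, end in windows)

# At 10:30 neither window is active, so merging into the lane is allowed.
print(is_lane_restricted(time(10, 30)))   # False
```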

2024-10 · 8 min · 1655 words
[DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_DriveGPT4-V2_Harnessing_Large_Language_Model_Capabilities_for_Enhanced_Closed-Loop_Autonomous_CVPR_2025_paper.pdf)

Can LLMs Actually Drive? Inside DriveGPT4-V2’s Closed-Loop Control System

Introduction The dream of autonomous driving has been fueled by rapid advancements in Artificial Intelligence. For years, the industry relied on modular pipelines—separate systems for detecting lanes, identifying pedestrians, planning routes, and controlling the steering wheel. However, the field is shifting toward end-to-end learning, where a single neural network takes raw sensor data and outputs driving commands. Simultaneously, we have witnessed the explosion of Large Language Models (LLMs) like GPT-4 and LLaMA. These models possess incredible reasoning capabilities and vast amounts of pretrained knowledge about the world. This raises the question: Can we put an LLM in the driver’s seat? ...

9 min · 1900 words
[Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration 🔗](https://arxiv.org/abs/2502.16652)

Dr. Splat: A Prescription for Faster, Semantic 3D Scene Understanding

Imagine walking into a room and asking a robot, “Find the red mug near the sink.” To us, this is effortless. To a computer vision system, it requires bridging the gap between 2D visual data, 3D spatial geometry, and natural language. This is the challenge of Open-Vocabulary 3D Scene Understanding. In recent years, 3D Gaussian Splatting (3DGS) has revolutionized how we represent 3D scenes. It offers high-quality rendering by representing scenes as millions of 3D Gaussian blobs. However, attaching semantic meaning (language) to these blobs has been a bottleneck. Existing methods rely on rendering 2D feature maps to “teach” the 3D model what it is looking at. This process is computationally expensive, slow to search, and often results in blurry or inaccurate semantic features. ...

2025-02 · 9 min · 1739 words
[Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features 🔗](https://arxiv.org/abs/2412.05826)

Solving the Evil Twin Problem in 3D Computer Vision with Doppelgangers++

Introduction Imagine you are trying to build a 3D model of a cathedral using hundreds of photographs taken by tourists. You feed these images into a computer, and the software begins matching features: the curve of a window here, the texture of a brick there. But there is a problem. The cathedral is symmetrical. The north face looks almost identical to the south face. To a human, context clues reveal that these are two different sides of the building. To an algorithm, they look like the same surface. The software, confused by this “visual aliasing,” stitches the north and south faces together. The resulting 3D model collapses in on itself, creating a distorted, impossible geometry. ...

2024-12 · 8 min · 1641 words
[Doppelgängers and Adversarial Vulnerability 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Kamberov_Doppelgangers_and_Adversarial_Vulnerability_CVPR_2025_paper.pdf)

When Machines See What We Can't: Understanding Adversarial Doppelgängers

Imagine you are looking at a picture of a cat. It’s a tabby cat. You are absolutely certain of it. Now, imagine a computer looks at the exact same picture and confidently tells you it’s a Persian cat. You squint. You zoom in. You check every pixel. To your human eyes, nothing has changed. This isn’t the typical story of “adversarial examples” where someone adds static noise to a photo of a panda to make an AI think it’s a gibbon. This is something subtler and more profound. This is the phenomenon of Adversarial Doppelgängers. ...

9 min · 1897 words
[Do computer vision foundation models learn the low-level characteristics of the human visual system? 🔗](https://arxiv.org/abs/2502.20256)

Artificial Eyes vs. Human Eyes: Do Foundation Models See Like Us?

In the rapidly evolving world of computer vision, we have witnessed a massive shift toward “Foundation Models.” Giants like DINOv2, OpenCLIP, and Segment Anything (SAM) are trained on billions of natural images, learning to recognize objects, segment scenes, and understand visual concepts with uncanny accuracy. These models are self-supervised; they learn by looking at the world, much like a human infant does during development. ...

2025-02 · 9 min · 1825 words
[DistinctAD: Distinctive Audio Description Generation in Contexts 🔗](https://arxiv.org/abs/2411.18180)

Beyond "He Looks": Generating Distinctive Audio Descriptions for Movies with DistinctAD

Imagine watching a movie with your eyes closed. You are relying entirely on a narrator to describe the action. Now, imagine a tense scene where a character slowly realizes they are being watched. The narrator says: “He looks.” A few seconds later: “He looks at something.” Then: “He looks again.” Frustrating, right? You miss the nuance—the widening of the eyes, the glance at the pill bottle, the shadowy figure in the doorway. This is the reality of many automated Audio Description (AD) systems today. While they can identify a person and a general action, they often fail to capture the specific, distinctive details that drive a narrative forward. ...

2024-11 · 9 min · 1890 words
[Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset 🔗](https://arxiv.org/abs/2504.08541)

Bridging Reality and Simulation: A Deep Dive into the Digital Twin Catalog (DTC)

In the rapidly evolving worlds of Augmented Reality (AR), Virtual Reality (VR), and robotics, one concept stands as the “holy grail”: the Digital Twin. A digital twin isn’t just a 3D model. A 3D model might be a hollow shell that looks roughly like a cup. A digital twin, however, is a virtual entity so precise that it is indistinguishable from its physical counterpart. It captures exact geometry, surface textures, how light interacts with the material (reflectance), and physical properties. ...

2025-04 · 8 min · 1642 words
[Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_Diffusion-based_Realistic_Listening_Head_Generation_via_Hybrid_Motion_Modeling_CVPR_2025_paper.pdf)

The Art of Listening: How Diffusion Models Are Revolutionizing Digital Avatars

In the world of digital human generation, we often focus on the speaker. We want avatars that can talk, lip-sync perfectly, and deliver speeches with emotion. But communication is a two-way street. Think about your last video call: while you were talking, what was the other person doing? They were nodding, smiling, frowning, or perhaps tilting their head in confusion. These non-verbal cues are essential for natural interaction. ...

8 min · 1612 words