Papers

[Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration 🔗](https://arxiv.org/abs/2502.16652)

Dr. Splat: A Prescription for Faster, Semantic 3D Scene Understanding

Imagine walking into a room and asking a robot, “Find the red mug near the sink.” To us, this is effortless. To a computer vision system, it requires bridging the gap between 2D visual data, 3D spatial geometry, and natural language. This is the challenge of Open-Vocabulary 3D Scene Understanding. In recent years, 3D Gaussian Splatting (3DGS) has revolutionized how we represent 3D scenes. It offers high-quality rendering by representing scenes as millions of 3D Gaussian blobs. However, attaching semantic meaning (language) to these blobs has been a bottleneck. Existing methods rely on rendering 2D feature maps to “teach” the 3D model what it is looking at. This process is computationally expensive, slow to search, and often results in blurry or inaccurate semantic features. ...

[Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features 🔗](https://arxiv.org/abs/2412.05826)

Solving the Evil Twin Problem in 3D Computer Vision with Doppelgangers++

Imagine you are trying to build a 3D model of a cathedral using hundreds of photographs taken by tourists. You feed these images into a computer, and the software begins matching features: the curve of a window here, the texture of a brick there. But there is a problem. The cathedral is symmetrical. The north face looks almost identical to the south face. To a human, context clues reveal that these are two different sides of the building. To an algorithm, they look like the same surface. The software, confused by this “visual aliasing,” stitches the north and south faces together. The resulting 3D model collapses in on itself, creating a distorted, impossible geometry. ...

[Doppelgängers and Adversarial Vulnerability 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Kamberov_Doppelgangers_and_Adversarial_Vulnerability_CVPR_2025_paper.pdf)

When Machines See What We Can't: Understanding Adversarial Doppelgängers

Imagine you are looking at a picture of a cat. It’s a tabby cat. You are absolutely certain of it. Now, imagine a computer looks at the exact same picture and confidently tells you it’s a Persian cat. You squint. You zoom in. You check every pixel. To your human eyes, nothing has changed. This isn’t the typical story of “adversarial examples” where someone adds static noise to a photo of a panda to make an AI think it’s a gibbon. This is something subtler and more profound. This is the phenomenon of Adversarial Doppelgängers. ...

[Do computer vision foundation models learn the low-level characteristics of the human visual system? 🔗](https://arxiv.org/abs/2502.20256)

Artificial Eyes vs. Human Eyes: Do Foundation Models See Like Us?

In the rapidly evolving world of computer vision, we have witnessed a massive shift toward “Foundation Models.” Giants like DINOv2, OpenCLIP, and Segment Anything (SAM) are trained on millions to billions of natural images, learning to recognize objects, segment scenes, and understand visual concepts with uncanny accuracy. Some of these models, such as DINOv2, are self-supervised; they learn by looking at the world, much like a human infant does during development. ...

[DistinctAD: Distinctive Audio Description Generation in Contexts 🔗](https://arxiv.org/abs/2411.18180)

Beyond "He Looks": Generating Distinctive Audio Descriptions for Movies with DistinctAD

Imagine watching a movie with your eyes closed. You are relying entirely on a narrator to describe the action. Now, imagine a tense scene where a character slowly realizes they are being watched. The narrator says: “He looks.” A few seconds later: “He looks at something.” Then: “He looks again.” Frustrating, right? You miss the nuance: the widening of the eyes, the glance at the pill bottle, the shadowy figure in the doorway. This is the reality of many automated Audio Description (AD) systems today. While they can identify a person and a general action, they often fail to capture the specific, distinctive details that drive a narrative forward. ...

[Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset 🔗](https://arxiv.org/abs/2504.08541)

Bridging Reality and Simulation: A Deep Dive into the Digital Twin Catalog (DTC)

In the rapidly evolving worlds of Augmented Reality (AR), Virtual Reality (VR), and robotics, one concept stands as the “holy grail”: the Digital Twin. A digital twin isn’t just a 3D model. A 3D model might be a hollow shell that looks roughly like a cup. A digital twin, however, is a virtual entity so precise that it is indistinguishable from its physical counterpart. It captures exact geometry, surface textures, how light interacts with the material (reflectance), and physical properties. ...

[Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_Diffusion-based_Realistic_Listening_Head_Generation_via_Hybrid_Motion_Modeling_CVPR_2025_paper.pdf)

The Art of Listening: How Diffusion Models Are Revolutionizing Digital Avatars

In the world of digital human generation, we often focus on the speaker. We want avatars that can talk, lip-sync perfectly, and deliver speeches with emotion. But communication is a two-way street. Think about your last video call: while you were talking, what was the other person doing? They were nodding, smiling, frowning, or perhaps tilting their head in confusion. These non-verbal cues are essential for natural interaction. ...

[DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_DiffCAM_Data-Driven_Saliency_Maps_by_Capturing_Feature_Differences_CVPR_2025_paper.pdf)

Beyond the Decision Boundary: How DiffCAM Unlocks AI Interpretability by Comparing Features

In the rapidly evolving landscape of Artificial Intelligence, Deep Neural Networks (DNNs) have achieved superhuman performance in tasks ranging from medical diagnosis to autonomous driving. However, these models suffer from a notorious flaw: they act as “black boxes.” We feed them data, and they give us an answer, but they rarely tell us why they reached that conclusion. In critical domains like healthcare and finance, “because the computer said so” is not an acceptable justification. This has given rise to the field of Explainable AI (XAI). One of the most popular tools in the XAI toolkit is the Saliency Map, a heatmap that highlights which parts of an image the model focused on to make its decision. ...

[DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness 🔗](https://arxiv.org/abs/2503.08257)

Solving the Butterfingers Problem: How Physics-Aware Diffusion Models are Mastering Universal Robotic Grasping

If you have ever watched a robot try to pick up an irregularly shaped object—say, a spray bottle or a stuffed animal—you likely noticed a hesitation. Unlike humans, who instinctively shape their hands to conform to an object’s geometry, robots often struggle with “dexterous grasping.” While simple parallel-jaw grippers (think of a claw machine) are reliable for boxes, the Holy Grail of robotics is a multi-fingered dexterous hand that can grasp anything. ...

[Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Fang_Detection-Friendly_Nonuniformity_Correction_A_Union_Framework_for_Infrared_UAV_Target_CVPR_2025_paper.pdf)

Seeing Through the Noise: How UniCD Revolutionizes Infrared Drone Detection

In the rapidly evolving world of surveillance and security, Unmanned Aerial Vehicles (UAVs) present a unique challenge. They are small, agile, and often difficult to spot. Infrared (thermal) imaging has become the go-to solution for detecting these targets, offering visibility day and night regardless of lighting conditions. However, there is a catch: the hardware itself often gets in the way. Thermal detectors are sensitive devices. Heat generated by the camera’s own optical lens and housing creates a phenomenon known as temperature-dependent low-frequency nonuniformity, often referred to as a “bias field.” Imagine trying to spot a tiny bird through a window that is foggy around the edges and smeared with grease in the center. That is essentially what a computer vision algorithm deals with when processing raw infrared footage. This bias field reduces contrast and obscures the faint heat signatures of drones. ...

[Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection 🔗](https://arxiv.org/abs/2503.07978)

AlignIns: A Fine-Grained Directional Defense Against Backdoors in Federated Learning

Federated Learning (FL) has revolutionized how we train machine learning models. By allowing devices to train locally and share only model updates rather than raw data, FL promises a sweet spot between data utility and user privacy. It is currently powering applications in healthcare, finance, and the predictive text on your smartphone. However, this decentralized architecture introduces a significant security flaw: the central server cannot see the training data. This blindness makes FL susceptible to backdoor attacks (a targeted form of poisoning attack). In a backdoor attack, a malicious client injects a “Trojan horse” into the global model. The model behaves normally on standard data, but when presented with a specific trigger (like a pixel pattern or a specific phrase), it misclassifies the input exactly how the attacker wants. ...

[DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos 🔗](https://arxiv.org/abs/2409.02095)

DepthCrafter: Mastering Temporal Consistency in Open-World Video Depth Estimation

Imagine trying to reconstruct a 3D world from a standard 2D video. For humans, this is intuitive; we understand that as a car moves forward, it gets closer, or that a tree passing by is distinct from the mountain in the background. For computers, however, this task—known as monocular video depth estimation—is notoriously difficult. While AI has made massive strides in estimating depth from single images (thanks to models like Depth Anything), applying these frame-by-frame methods to video results in a jarring “flickering” effect. The depth jumps around wildly because the model doesn’t understand that Frame 1 and Frame 2 are part of the same continuous reality. Existing solutions often rely on complex camera pose estimation or optical flow, which frequently break down in “open-world” videos where camera movements are unpredictable or dynamic objects (like running animals) dominate the scene. ...

[DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection 🔗](https://arxiv.org/abs/2503.13985)

How to Train AI on Missing Data: A Deep Dive into DefectFill

Imagine you are running a high-tech manufacturing line for semiconductor chips or precision automotive parts. You want to automate quality control using AI. To train a model to spot defects (like a scratch on a lens or a crack in a hazelnut), you generally need thousands of examples of those defects. But here is the catch: modern manufacturing is extremely good. Defects are rare—often occurring in fewer than 1 in 10,000 items. You simply cannot collect enough data to train a supervised deep learning model effectively. ...

[Deep Fair Multi-View Clustering with Attention KAN 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_Deep_Fair_Multi-View_Clustering_with_Attention_KAN_CVPR_2025_paper.pdf)

Breaking the Bias in AI Clustering with Attention KANs

In the world of machine learning, data rarely comes from a single source. Imagine a doctor diagnosing a patient: they don’t just look at a blood test. They look at X-rays, MRI scans, patient history, and genetic markers. This is Multi-View Data—different perspectives of the same underlying object. To make sense of this data without human labels, we use Multi-View Clustering (MVC). The goal is to group similar data points together by synthesizing information from all these different views. It is a powerful tool used in everything from bioinformatics to computer vision. ...

[Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection 🔗](https://arxiv.org/abs/2503.00643)

Beyond Euclidean Space: Monitoring Forests with Hyperbolic Deep Learning

Forests are the lungs of our planet. Effective forestry management is crucial not just for the timber industry, but for climate stability and ecological health. To manage a forest, you need to monitor it—measure growth, assess health, and identify damage. Traditionally, this has been done using satellite imagery or LiDAR. While useful, these methods have significant blind spots. Satellites often lack the resolution to see fine-grained changes in individual trees, and LiDAR, while great for structure, cannot capture the rich color and texture details needed to spot early signs of disease or blossoming. ...

[Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks 🔗](https://arxiv.org/abs/2503.23751)

How to Make AI Forget: A Deep Dive into Decoupled Distillation (DELETE)

In the era of large-scale artificial intelligence, models are voracious learners. They ingest massive datasets, training on everything from web-crawled images to sensitive facial data. But what happens when a model knows too much? Imagine a scenario where a user exercises their “right to be forgotten,” demanding their photos be removed from a facial recognition system. Or consider a model inadvertently trained on copyrighted material or poisoned data that creates a security “backdoor.” In these cases, we face the challenge of Machine Unlearning. ...

[Dataset Distillation with Neural Characteristic Function: A Minmax Perspective 🔗](https://arxiv.org/abs/2502.20653)

Shrinking Big Data: How Neural Characteristic Functions Revolutionize Dataset Distillation

In the era of deep learning, data is the new oil. But managing that oil is becoming an increasingly expensive logistical nightmare. Modern neural networks require massive datasets to train, leading to exorbitant storage costs and training times that can span weeks. This creates a high barrier to entry, often locking out students and researchers who don’t have access to industrial-grade computing clusters. What if you could take a massive dataset like ImageNet, with its millions of images, and “distill” it down to a tiny fraction of its size, while retaining almost all the information needed to train a model? ...

[DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds 🔗](https://arxiv.org/abs/2503.18402)

From Minutes to Seconds: Accelerating 3D Gaussian Splatting with DashGaussian

If you have been following the field of 3D scene reconstruction, you are likely familiar with the rapid evolution from Neural Radiance Fields (NeRFs) to 3D Gaussian Splatting (3DGS). While NeRFs stunned the world with photorealistic view synthesis, they were notoriously slow to train, often requiring hours or days. 3DGS revolutionized this by representing scenes with explicit Gaussian primitives, cutting optimization times down to tens of minutes. But for many real-time applications, “tens of minutes” is still too long. Whether for interactive content creation, digital humans, or large-scale mapping, the holy grail is optimization that happens in seconds, not minutes. ...

[DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection 🔗](https://arxiv.org/abs/2411.08227)

Can Your AI Trust Its Own Eyes? Solving the Consistency Problem in Multimodal OOD Detection

Imagine an autonomous vehicle navigating a busy city street. It has been trained on thousands of hours of driving footage—cars, pedestrians, cyclists, and traffic lights. Suddenly, a person wearing a giant, inflatable dinosaur costume runs across the crosswalk. The car’s camera sees a “pedestrian,” but the shape is wrong. The LiDAR sensor detects an obstacle, but the dimensions don’t match a human. The system is confused. This is the Out-of-Distribution (OOD) problem: the challenge of handling data that deviates significantly from what the model saw during training. ...

[DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos 🔗](https://arxiv.org/abs/2503.08344)

Beyond Static Scenes: How DIV-FF Unlocks Dynamic Egocentric Understanding

If you have ever strapped a GoPro to your head while cooking or working, you know the resulting footage is chaotic. The camera shakes, your hands obscure the view, objects move, and the environment changes state (an onion becomes chopped onions). For a computer vision system, making sense of this “egocentric” (first-person) view is a nightmare. Traditional 3D reconstruction methods, like Neural Radiance Fields (NeRFs), usually assume the world is a statue—rigid and unchanging. On the other hand, video understanding models might grasp the action “cutting,” but they have no concept of the 3D space where it’s happening. ...