[DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_DiffCAM_Data-Driven_Saliency_Maps_by_Capturing_Feature_Differences_CVPR_2025_paper.pdf)

Beyond the Decision Boundary: How DiffCAM Unlocks AI Interpretability by Comparing Features

In the rapidly evolving landscape of Artificial Intelligence, Deep Neural Networks (DNNs) have achieved superhuman performance in tasks ranging from medical diagnosis to autonomous driving. However, these models suffer from a notorious flaw: they act as “black boxes.” We feed them data, and they give us an answer, but they rarely tell us why they reached that conclusion. In critical domains like healthcare and finance, “because the computer said so” is not an acceptable justification. This has given rise to the field of Explainable AI (XAI). One of the most popular tools in the XAI toolkit is the Saliency Map—a heatmap that highlights which parts of an image the model focused on to make its decision. ...
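To make the idea of a saliency heatmap concrete, here is a minimal, generic CAM-style sketch: weight each channel of the last convolutional feature map by its importance for the predicted class and sum over channels. This illustrates the general concept only, not DiffCAM's difference-based formulation; the `features`, `class_weights`, and `cam_saliency` names are hypothetical.

```python
# Minimal CAM-style saliency sketch (illustrative only; NOT DiffCAM itself).
# Assumes a hypothetical `features` tensor of shape (C, H, W) from the last
# convolutional layer and a hypothetical per-class weight vector of shape (C,).
import numpy as np

def cam_saliency(features: np.ndarray, class_weights: np.ndarray) -> np.ndarray:
    """Weight each feature channel by its class importance and sum over channels."""
    heatmap = np.tensordot(class_weights, features, axes=1)  # (H, W)
    heatmap = np.maximum(heatmap, 0.0)                       # keep positive evidence
    if heatmap.max() > 0:
        heatmap /= heatmap.max()                             # normalize to [0, 1]
    return heatmap

# Toy usage with random numbers standing in for real activations.
rng = np.random.default_rng(0)
features = rng.standard_normal((512, 14, 14))
class_weights = rng.standard_normal(512)
saliency = cam_saliency(features, class_weights)
print(saliency.shape)  # (14, 14) map, upsampled to the image size in practice
```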

10 min · 1923 words
[DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness 🔗](https://arxiv.org/abs/2503.08257)

Solving the Butterfingers Problem: How Physics-Aware Diffusion Models are Mastering Universal Robotic Grasping

If you have ever watched a robot try to pick up an irregularly shaped object—say, a spray bottle or a stuffed animal—you likely noticed a hesitation. Unlike humans, who instinctively shape their hands to conform to an object’s geometry, robots often struggle with “dexterous grasping.” While simple parallel-jaw grippers (think of a claw machine) are reliable for boxes, the Holy Grail of robotics is a multi-fingered dexterous hand that can grasp anything. ...

2025-03 · 9 min · 1752 words
[Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Fang_Detection-Friendly_Nonuniformity_Correction_A_Union_Framework_for_Infrared_UAV_Target_CVPR_2025_paper.pdf)

Seeing Through the Noise: How UniCD Revolutionizes Infrared Drone Detection

In the rapidly evolving world of surveillance and security, Unmanned Aerial Vehicles (UAVs) present a unique challenge. They are small, agile, and often difficult to spot. Infrared (thermal) imaging has become the go-to solution for detecting these targets, offering visibility day and night regardless of lighting conditions. However, there is a catch: the hardware itself often gets in the way. Thermal detectors are sensitive devices. Heat generated by the camera’s own optical lens and housing creates a phenomenon known as temperature-dependent low-frequency nonuniformity, often referred to as a “bias field.” Imagine trying to spot a tiny bird through a window that is foggy around the edges and smeared with grease in the center. That is essentially what a computer vision algorithm deals with when processing raw infrared footage. This bias field reduces contrast and obscures the faint heat signatures of drones. ...

9 min · 1802 words
[Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection 🔗](https://arxiv.org/abs/2503.07978)

AlignIns: A Fine-Grained Directional Defense Against Backdoors in Federated Learning

Federated Learning (FL) has revolutionized how we train machine learning models. By allowing devices to train locally and share only model updates rather than raw data, FL promises a sweet spot between data utility and user privacy. It is currently powering applications in healthcare, finance, and the predictive text on your smartphone. However, this decentralized architecture introduces a significant security flaw: the central server cannot see the training data. This blindness makes FL susceptible to backdoor attacks (also known as poisoning attacks). In a backdoor attack, a malicious client injects a “Trojan horse” into the global model. The model behaves normally on standard data, but when presented with a specific trigger (like a pixel pattern or a specific phrase), it misclassifies the input exactly as the attacker intends. ...
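For context, below is a minimal sketch of vanilla federated averaging (FedAvg), the kind of aggregation step the teaser refers to: the server combines client updates without ever seeing raw data, which is exactly the blindness a backdoor attacker exploits. It is illustrative only and does not show AlignIns's direction-alignment inspection; the `fedavg`, `updates`, and `num_examples` names are hypothetical.

```python
# Minimal FedAvg aggregation sketch (illustrative only; not AlignIns).
import numpy as np

def fedavg(updates: list[np.ndarray], num_examples: list[int]) -> np.ndarray:
    """Average client model updates, weighted by each client's data size.

    The server only ever sees these update vectors, never the raw data,
    which is the blindness that backdoor attackers exploit.
    """
    weights = np.array(num_examples, dtype=float)
    weights /= weights.sum()
    return sum(w * u for w, u in zip(weights, updates))

# Toy usage: four clients, one of which could be malicious without the
# server being able to tell from the raw data it never sees.
rng = np.random.default_rng(1)
updates = [rng.standard_normal(10) for _ in range(4)]
global_update = fedavg(updates, num_examples=[100, 120, 80, 100])
print(global_update.shape)  # (10,)
```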

2025-03 · 9 min · 1909 words
[DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos 🔗](https://arxiv.org/abs/2409.02095)

DepthCrafter: Mastering Temporal Consistency in Open-World Video Depth Estimation

Imagine trying to reconstruct a 3D world from a standard 2D video. For humans, this is intuitive; we understand that as a car moves forward, it gets closer, or that a tree passing by is distinct from the mountain in the background. For computers, however, this task—known as monocular video depth estimation—is notoriously difficult. While AI has made massive strides in estimating depth from single images (thanks to models like Depth Anything), applying these frame-by-frame methods to video results in a jarring “flickering” effect. The depth jumps around wildly because the model doesn’t understand that Frame 1 and Frame 2 are part of the same continuous reality. Existing solutions often rely on complex camera pose estimation or optical flow, which frequently break down in “open-world” videos where camera movements are unpredictable or dynamic objects (like running animals) dominate the scene. ...

2024-09 · 10 min · 1924 words
[DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection 🔗](https://arxiv.org/abs/2503.13985)

How to Train AI on Missing Data: A Deep Dive into DefectFill

Imagine you are running a high-tech manufacturing line for semiconductor chips or precision automotive parts. You want to automate quality control using AI. To train a model to spot defects (like a scratch on a lens or a crack in a hazelnut), you generally need thousands of examples of those defects. But here is the catch: modern manufacturing is extremely good. Defects are rare—often occurring in fewer than 1 in 10,000 items. You simply cannot collect enough data to train a supervised deep learning model effectively. ...

2025-03 · 9 min · 1757 words
[Deep Fair Multi-View Clustering with Attention KAN 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_Deep_Fair_Multi-View_Clustering_with_Attention_KAN_CVPR_2025_paper.pdf)

Breaking the Bias in AI Clustering with Attention KANs

In the world of machine learning, data rarely comes from a single source. Imagine a doctor diagnosing a patient: they don’t just look at a blood test. They look at X-rays, MRI scans, patient history, and genetic markers. This is Multi-View Data—different perspectives of the same underlying object. To make sense of this data without human labels, we use Multi-View Clustering (MVC). The goal is to group similar data points together by synthesizing information from all these different views. It is a powerful tool used in everything from bioinformatics to computer vision. ...

9 min · 1779 words
[Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection 🔗](https://arxiv.org/abs/2503.00643)

Beyond Euclidean Space: Monitoring Forests with Hyperbolic Deep Learning

Forests are the lungs of our planet. Effective forestry management is crucial not just for the timber industry, but for climate stability and ecological health. To manage a forest, you need to monitor it—measure growth, assess health, and identify damage. Traditionally, this has been done using satellite imagery or LiDAR. While useful, these methods have significant blind spots. Satellites often lack the resolution to see fine-grained changes in individual trees, and LiDAR, while great for structure, cannot capture the rich color and texture details needed to spot early signs of disease or blossoming. ...

2025-03 · 8 min · 1561 words
[Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks 🔗](https://arxiv.org/abs/2503.23751)

How to Make AI Forget: A Deep Dive into Decoupled Distillation (DELETE)

In the era of large-scale artificial intelligence, models are voracious learners. They ingest massive datasets, training on everything from web-crawled images to sensitive facial data. But what happens when a model knows too much? Imagine a scenario where a user exercises their “right to be forgotten,” demanding their photos be removed from a facial recognition system. Or consider a model inadvertently trained on copyrighted material or poisoned data that creates a security “backdoor.” In these cases, we face the challenge of Machine Unlearning. ...

2025-03 · 10 min · 2092 words
[Dataset Distillation with Neural Characteristic Function: A Minmax Perspective 🔗](https://arxiv.org/abs/2502.20653)

Shrinking Big Data: How Neural Characteristic Functions Revolutionize Dataset Distillation

In the era of deep learning, data is the new oil. But managing that oil is becoming an increasingly expensive logistical nightmare. Modern neural networks require massive datasets to train, leading to exorbitant storage costs and training times that can span weeks. This creates a high barrier to entry, often locking out students and researchers who don’t have access to industrial-grade computing clusters. What if you could take a massive dataset like ImageNet—millions of images—and “distill” it down to a tiny fraction of its size, while retaining almost all the information needed to train a model? ...

2025-02 · 9 min · 1829 words
[DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds 🔗](https://arxiv.org/abs/2503.18402)

From Minutes to Seconds: Accelerating 3D Gaussian Splatting with DashGaussian

If you have been following the field of 3D scene reconstruction, you are likely familiar with the rapid evolution from Neural Radiance Fields (NeRFs) to 3D Gaussian Splatting (3DGS). While NeRFs stunned the world with photorealistic view synthesis, they were notoriously slow to train, often requiring hours or days. 3DGS revolutionized this by representing scenes with explicit Gaussian primitives, cutting optimization times down to tens of minutes. But for many real-time applications, “tens of minutes” is still too long. Whether for interactive content creation, digital humans, or large-scale mapping, the holy grail is optimization that happens in seconds, not minutes. ...

2025-03 · 7 min · 1485 words
[DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection 🔗](https://arxiv.org/abs/2411.08227)

Can Your AI Trust Its Own Eyes? Solving the Consistency Problem in Multimodal OOD Detection

Imagine an autonomous vehicle navigating a busy city street. It has been trained on thousands of hours of driving footage—cars, pedestrians, cyclists, and traffic lights. Suddenly, a person wearing a giant, inflatable dinosaur costume runs across the crosswalk. The car’s camera sees a “pedestrian,” but the shape is wrong. The LiDAR sensor detects an obstacle, but the dimensions don’t match a human. The system is confused. This is the Out-of-Distribution (OOD) problem: the challenge of handling data that deviates significantly from what the model saw during training. ...

2024-11 · 9 min · 1865 words
[DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos 🔗](https://arxiv.org/abs/2503.08344)

Beyond Static Scenes: How DIV-FF Unlocks Dynamic Egocentric Understanding

If you have ever strapped a GoPro to your head while cooking or working, you know the resulting footage is chaotic. The camera shakes, your hands obscure the view, objects move, and the environment changes state (an onion becomes chopped onions). For a computer vision system, making sense of this “egocentric” (first-person) view is a nightmare. Traditional 3D reconstruction methods, like Neural Radiance Fields (NeRFs), usually assume the world is a statue—rigid and unchanging. On the other hand, video understanding models might grasp the action “cutting,” but they have no concept of the 3D space where it’s happening. ...

2025-03 · 9 min · 1753 words
[DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction 🔗](https://arxiv.org/abs/2503.09491)

When More Data Isn't Always Better: Mastering Nanoparticle Prediction with DAMM-Diffusion

In the fight against cancer, Nanoparticles (NPs) represent a futuristic and highly promising weapon. These microscopic carriers can be designed to deliver drugs directly to tumor sites, leveraging the “leaky” blood vessels of tumors to accumulate exactly where they are needed—a phenomenon known as the Enhanced Permeability and Retention (EPR) effect. However, simply injecting nanoparticles isn’t enough. To maximize therapeutic outcomes, doctors need to know exactly how these particles will distribute within a tumor. Will they reach the core? Will they stay on the periphery? This distribution is heavily influenced by the Tumor Microenvironment (TME), specifically the layout of blood vessels and cell nuclei. ...

2025-03 · 8 min · 1548 words
[Cubify Anything: Scaling Indoor 3D Object Detection 🔗](https://arxiv.org/abs/2412.04458)

Beyond Point Clouds: Scaling Indoor 3D Object Detection with Cubify Anything

Imagine walking into a room. You don’t just see “chair,” “table,” and “floor.” You perceive a rich tapestry of items: a coffee mug on a coaster, a specific book on a shelf, a power strip tucked behind a cabinet. Humans understand scenes in high fidelity. However, the field of indoor 3D object detection has largely been stuck seeing the world in low resolution, focusing primarily on large, room-defining furniture while ignoring the clutter of daily life. ...

2024-12 · 9 min · 1864 words
[CrossOver: 3D Scene Cross-Modal Alignment 🔗](https://arxiv.org/abs/2502.15011)

Beyond Perfect Data: How CrossOver Aligns 3D Scenes with Missing Modalities

In the rapidly evolving world of Computer Vision, teaching machines to understand 3D spaces is a monumental challenge. We want robots to navigate construction sites, augmented reality glasses to overlay information on furniture, and digital assistants to understand complex spatial queries like “Find the kitchen with the island counter.” To do this, AI systems typically rely on multi-modal learning. They combine different types of data—RGB images, 3D point clouds, CAD models, and text descriptions—to build a robust understanding of the world. However, existing methods have a significant Achilles’ heel: they often assume the data is perfect. They require every object to be fully aligned across all modalities, with complete semantic labels. ...

2025-02 · 8 min · 1680 words
[Cross-modal Causal Relation Alignment for Video Question Grounding 🔗](https://arxiv.org/abs/2503.07635)

Beyond Shortcuts: How Causal Inference Improves Video Question Grounding

Imagine a student taking a history test. The question asks, “Why did the Industrial Revolution begin in Britain?” The student doesn’t actually know the answer, but they notice a pattern in previous tests: whenever the words “Britain” and “Revolution” appear, the answer is usually “Option C.” They pick C and get it right. Did the student learn history? No. They learned a statistical shortcut. ...

2025-03 · 8 min · 1666 words
[Cross-View Completion Models are Zero-shot Correspondence Estimators 🔗](https://arxiv.org/abs/2412.09072)

Why Your Inpainting Model is Secretly a Correspondence Expert: Unveiling ZeroCo

If you have been following Computer Vision research lately, you know that “Masked Image Modeling” (like MAE) has revolutionized how models learn representations. The idea is simple: hide parts of an image and ask the model to fill in the blanks. But what happens when you extend this to two images? This is called Cross-View Completion (CVC). In this setup, a model looks at a source image to reconstruct a masked target image. To do this effectively, the model must implicitly understand the 3D geometry of the scene—it needs to know which pixel in the source corresponds to the missing pixel in the target. ...

2024-12 · 7 min · 1389 words
[Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting 🔗](https://arxiv.org/abs/2504.20403)

Unlocking Photorealistic 3D Avatar Editing: A Deep Dive into TetGS

In the rapidly evolving landscape of AR/VR and the metaverse, the demand for personalized, photorealistic 3D avatars is skyrocketing. We all want a digital twin that not only looks like us but can also change outfits as easily as we do in the real world. While recent advances in 3D Gaussian Splatting (3DGS) have allowed for incredible real-time rendering of static scenes, editing these representations remains a massive headache. If you have ever tried to “edit” a point cloud, you know the struggle: it lacks structure. On the other hand, traditional meshes are easy to edit but often struggle to capture the fuzzy, intricate details of real-world clothing and hair. ...

2025-04 · 7 min · 1476 words
[Context-Aware Multimodal Pretraining 🔗](https://arxiv.org/abs/2411.15099)

Bridging the Gap - How Context-Aware Pretraining Unlocks Few-Shot Potential

In the rapidly evolving landscape of computer vision and multimodal learning, models like CLIP and SigLIP have set the standard. By training on massive datasets of image-text pairs, these models learn robust representations that perform remarkably well on “zero-shot” tasks—classifying images they’ve never seen before simply by matching them to text descriptions. But there is a catch. While these models are generalists, they often struggle when we need them to be specialists. When a downstream task involves specific, fine-grained categories or a distribution of images that differs significantly from the web-scraped training data, zero-shot performance can plateau. To fix this, practitioners usually turn to few-shot adaptation: giving the model a handful of example images (shots) to learn from. ...
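As a quick illustration of the zero-shot matching the teaser describes, here is a minimal CLIP-style sketch: embed the image and a text prompt per class, then pick the class with the highest cosine similarity. The `encode_image` and `encode_text` functions are hypothetical stubs standing in for any pretrained contrastive encoders (CLIP, SigLIP, ...).

```python
# Minimal CLIP-style zero-shot classification sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(2)

def encode_image(image) -> np.ndarray:        # hypothetical stub encoder
    return rng.standard_normal(512)

def encode_text(prompt: str) -> np.ndarray:   # hypothetical stub encoder
    return rng.standard_normal(512)

def zero_shot_classify(image, class_names: list[str]) -> str:
    """Pick the class whose text embedding is closest to the image embedding."""
    img = encode_image(image)
    img /= np.linalg.norm(img)
    scores = []
    for name in class_names:
        txt = encode_text(f"a photo of a {name}")
        txt /= np.linalg.norm(txt)
        scores.append(img @ txt)               # cosine similarity
    return class_names[int(np.argmax(scores))]

print(zero_shot_classify(None, ["dog", "cat", "car"]))
```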

2024-11 · 9 min · 1808 words