[DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction 🔗](https://arxiv.org/abs/2503.09491)

When More Data Isn't Always Better: Mastering Nanoparticle Prediction with DAMM-Diffusion

In the fight against cancer, nanoparticles (NPs) represent a futuristic and highly promising weapon. These microscopic carriers can be designed to deliver drugs directly to tumor sites, leveraging the “leaky” blood vessels of tumors to accumulate exactly where they are needed—a phenomenon known as the Enhanced Permeability and Retention (EPR) effect. However, simply injecting nanoparticles isn’t enough. To maximize therapeutic outcomes, doctors need to know exactly how these particles will distribute within a tumor. Will they reach the core? Will they stay on the periphery? This distribution is heavily influenced by the Tumor Microenvironment (TME), specifically the layout of blood vessels and cell nuclei. ...

2025-03 · 8 min · 1548 words
[Cubify Anything: Scaling Indoor 3D Object Detection 🔗](https://arxiv.org/abs/2412.04458)

Beyond Point Clouds: Scaling Indoor 3D Object Detection with Cubify Anything

Imagine walking into a room. You don’t just see “chair,” “table,” and “floor.” You perceive a rich tapestry of items: a coffee mug on a coaster, a specific book on a shelf, a power strip tucked behind a cabinet. Humans understand scenes in high fidelity. However, the field of indoor 3D object detection has largely been stuck seeing the world in low resolution, focusing primarily on large, room-defining furniture while ignoring the clutter of daily life. ...

2024-12 · 9 min · 1864 words
[CrossOver: 3D Scene Cross-Modal Alignment 🔗](https://arxiv.org/abs/2502.15011)

Beyond Perfect Data: How CrossOver Aligns 3D Scenes with Missing Modalities

In the rapidly evolving world of Computer Vision, teaching machines to understand 3D spaces is a monumental challenge. We want robots to navigate construction sites, augmented reality glasses to overlay information on furniture, and digital assistants to understand complex spatial queries like “Find the kitchen with the island counter.” To do this, AI systems typically rely on multi-modal learning. They combine different types of data—RGB images, 3D point clouds, CAD models, and text descriptions—to build a robust understanding of the world. However, existing methods have a significant Achilles’ heel: they often assume the data is perfect. They require every object to be fully aligned across all modalities, with complete semantic labels. ...

2025-02 · 8 min · 1680 words
[Cross-modal Causal Relation Alignment for Video Question Grounding 🔗](https://arxiv.org/abs/2503.07635)

Beyond Shortcuts: How Causal Inference Improves Video Question Grounding

Introduction: The “Cheating” Student Problem in AI. Imagine a student taking a history test. The question asks, “Why did the Industrial Revolution begin in Britain?” The student doesn’t actually know the answer, but they notice a pattern in previous tests: whenever the words “Britain” and “Revolution” appear, the answer is usually “Option C.” They pick C and get it right. Did the student learn history? No. They learned a statistical shortcut. ...

2025-03 · 8 min · 1666 words
[Cross-View Completion Models are Zero-shot Correspondence Estimators 🔗](https://arxiv.org/abs/2412.09072)

Why Your Inpainting Model is Secretly a Correspondence Expert: Unveiling ZeroCo

If you have been following Computer Vision research lately, you know that “Masked Image Modeling” (like MAE) has revolutionized how models learn representations. The idea is simple: hide parts of an image and ask the model to fill in the blanks. But what happens when you extend this to two images? This is called Cross-View Completion (CVC). In this setup, a model looks at a source image to reconstruct a masked target image. To do this effectively, the model must implicitly understand the 3D geometry of the scene—it needs to know which pixel in the source corresponds to the missing pixel in the target. ...
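
To make the cross-view completion objective concrete, here is a toy PyTorch sketch of the setup described above: target patches are masked, and the reconstruction cross-attends to the source view. Module names, sizes, and the masking ratio are illustrative assumptions; the actual CVC models are full ViT encoder-decoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(img: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split a (B, C, H, W) image into (B, N, C*p*p) flattened patches."""
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

class TinyCrossViewCompleter(nn.Module):
    """Toy stand-in: masked target tokens cross-attend to the source view, then predict pixels."""
    def __init__(self, patch_dim: int = 768, dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.decode = nn.Linear(dim, patch_dim)

    def forward(self, tgt_patches, src_patches, mask):
        tgt = self.embed(tgt_patches)
        tgt = torch.where(mask[..., None], self.mask_token.expand_as(tgt), tgt)  # hide masked patches
        src = self.embed(src_patches)
        ctx, attn = self.cross_attn(tgt, src, src)  # cross-attention weights relate target patches to source patches
        return self.decode(ctx), attn

src, tgt = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
src_p, tgt_p = patchify(src), patchify(tgt)        # (2, 196, 768) each
mask = torch.rand(2, 196) < 0.9                    # hide ~90% of the target patches

model = TinyCrossViewCompleter()
pred, attn = model(tgt_p, src_p, mask)
loss = F.mse_loss(pred[mask], tgt_p[mask])         # reconstruct only the masked patches
```

The cross-attention weights are where the model implicitly encodes which source patch explains each masked target patch, which is the correspondence signal the excerpt alludes to.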

2024-12 · 7 min · 1389 words
[Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting 🔗](https://arxiv.org/abs/2504.20403)

Unlocking Photorealistic 3D Avatar Editing: A Deep Dive into TetGS

In the rapidly evolving landscape of AR/VR and the metaverse, the demand for personalized, photorealistic 3D avatars is skyrocketing. We all want a digital twin that not only looks like us but can also change outfits as easily as we do in the real world. While recent advances in 3D Gaussian Splatting (3DGS) have allowed for incredible real-time rendering of static scenes, editing these representations remains a massive headache. If you have ever tried to “edit” a point cloud, you know the struggle: it lacks structure. On the other hand, traditional meshes are easy to edit but often struggle to capture the fuzzy, intricate details of real-world clothing and hair. ...

2025-04 · 7 min · 1476 words
[Context-Aware Multimodal Pretraining 🔗](https://arxiv.org/abs/2411.15099)

Bridging the Gap - How Context-Aware Pretraining Unlocks Few-Shot Potential

In the rapidly evolving landscape of computer vision and multimodal learning, models like CLIP and SigLIP have set the standard. By training on massive datasets of image-text pairs, these models learn robust representations that perform remarkably well on “zero-shot” tasks—classifying images they’ve never seen before simply by matching them to text descriptions. But there is a catch. While these models are generalists, they often struggle when we need them to be specialists. When a downstream task involves specific, fine-grained categories or a distribution of images that differs significantly from the web-scraped training data, zero-shot performance can plateau. To fix this, practitioners usually turn to few-shot adaptation: giving the model a handful of example images (shots) to learn from. ...
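
As a concrete picture of the zero-shot recipe mentioned above (classify an image by matching it against text descriptions), here is a minimal CLIP-style sketch using Hugging Face transformers. The checkpoint, label prompts, and image path are illustrative; this shows the generic contrastive baseline, not the paper's context-aware pretraining.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic contrastive vision-language checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # hypothetical classes
image = Image.open("example.jpg")                                      # hypothetical input image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```

Few-shot adaptation, as described in the excerpt, then tries to improve on this baseline by updating or augmenting the model with a handful of labeled examples per class.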

2024-11 · 9 min · 1808 words
[Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers 🔗](https://arxiv.org/abs/2507.04388)

Unlocking the Black Box: How CoIBA Interprets Vision Transformers Using a Comprehensive Information Bottleneck

In the rapidly evolving landscape of computer vision, the Vision Transformer (ViT) has emerged as a powerhouse. From self-driving cars to medical imaging, ViTs are achieving remarkable performance, often outperforming traditional Convolutional Neural Networks (CNNs). However, like many deep learning models, they suffer from a significant drawback: they act as “black boxes.” We feed an image in, and a classification comes out, but we often have little insight into why the model made that decision. ...

2025-07 · 10 min · 2052 words
[Compositional Caching for Training-free Open-vocabulary Attribute Detection 🔗](https://arxiv.org/abs/2503.19145)

Beyond the Label: How Compositional Caching Revolutionizes Attribute Detection Without Training

Introduction: The Complexity of “Simple” Description. In computer vision, identifying an object—say, a “car”—is a problem that has largely been solved. We have robust models that can spot a car in a crowded street with high accuracy. But what if we want to go deeper? What if we need to know if the car is rusty, wet, metallic, or vintage? This is the challenge of Attribute Detection. Unlike object classification, which deals with concrete nouns, attribute detection deals with adjectives. These properties shape how we perceive the world, but they are notoriously difficult for AI models to grasp. ...

2025-03 · 10 min · 1956 words
[Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models 🔗](https://arxiv.org/abs/2503.18337)

Beyond Weights: Fine-Tuning Transformers by Remixing Attention Heads

If you have ever tried to fine-tune a Large Language Model (LLM) or a massive Vision Transformer (ViT), you know the struggle: these models are heavy. Full-parameter fine-tuning is computationally expensive and memory-intensive. To solve this, the community turned to Parameter-Efficient Fine-Tuning (PEFT). The most famous example is LoRA (Low-Rank Adaptation), which freezes the pre-trained model and injects small, trainable rank decomposition matrices. Most of these methods focus on the linear projection layers—the weights (\(W_q, W_k, W_v\)) that transform your data. ...
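
The excerpt summarizes the LoRA recipe it contrasts against: freeze the pretrained projection and learn a small low-rank update beside it. A minimal PyTorch sketch of that idea (rank, scaling, and dimensions are illustrative assumptions, not details from either paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: the update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical usage: wrap the query projection of one attention block.
q_proj = nn.Linear(768, 768)
adapted = LoRALinear(q_proj, rank=8)
out = adapted(torch.randn(2, 16, 768))            # (batch, tokens, dim)
```

Per its title, Coeff-Tuning instead tunes how attention heads are combined rather than these projection weights, so the snippet only illustrates the baseline being contrasted.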

2025-03 · 8 min · 1520 words
[CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_CoSER_Towards_Consistent_Dense_Multiview_Text-to-Image_Generator_for_3D_Creation_CVPR_2025_paper.pdf)

Solved: The Janus Problem? How CoSER Brings Consistency to Text-to-3D Generation

Imagine typing “a bear dressed in medieval armor” into a computer, and seconds later, receiving a fully rotatable, high-quality 3D asset ready for a video game. This is the dream of Text-to-3D generation. While we have mastered 2D image generation (thanks to tools like Midjourney and Stable Diffusion), lifting this capability into three dimensions remains surprisingly difficult. A common failure mode is the “Janus problem”—named after the two-faced Roman god—where a generated model might have a face on both the front and the back of its head because the model doesn’t understand that the back view shouldn’t look like the front view. ...

8 min · 1584 words
[CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation 🔗](https://arxiv.org/abs/2406.10462)

The Devil is in the Data: How CoMM Is Fixing Multimodal AI Generation

If you’ve ever tried to get an AI to write a coherent picture book or a step-by-step tutorial with consistent illustrations, you’ve likely noticed a problem. While modern Multimodal Large Language Models (MLLMs) are great at describing a single image or generating a single picture from text, they often struggle to tell a continuous story. The characters change appearance between panels, the logic skips steps, or the text and images just don’t seem to talk to each other. ...

2024-06 · 7 min · 1460 words
[ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate 🔗](https://arxiv.org/abs/2503.21268)

Reaching New Heights: How AI and LiDAR are Mastering Rock Climbing Motion Capture

In the world of computer vision, teaching machines to understand human movement has been a longstanding goal. We have become quite good at tracking runners on a track, pedestrians on a sidewalk, or dancers in a studio. These are what researchers call “ground-based motions.” The physics are somewhat predictable: gravity pulls down, and feet interact with a flat ground plane. But what happens when humans leave the ground? Rock climbing presents a fascinating and incredibly difficult challenge for Human Motion Recovery (HMR). Climbers are not merely walking; they are solving vertical puzzles. Their bodies contort into extreme poses, limbs stretch to their limits, and their interaction with the environment is complex—hands and feet must find purchase on tiny holds while the body defies gravity. Most existing AI models, trained on walking or running data, fail spectacularly when tasked with analyzing a climber. They struggle to understand where the climber is in the “world” (global position) and often hallucinate poses that are physically impossible on a vertical wall. ...

2025-03 · 10 min · 1997 words
[Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning 🔗](https://arxiv.org/abs/2412.00175)

The Sound of Silence: How a Hidden Shortcut Broke Deepfake Detectors (And How to Fix It)

In the cat-and-mouse game of deepfake detection, we often assume that as generative models get better, detection models must simply become more complex to keep up. We rely on massive datasets of real and manipulated videos to train these detectors, trusting that the neural networks are learning to spot subtle artifacts—mismatched lip movements, unnatural blinking, or digital residue on the pixel level. But what if our models aren’t learning what we think they are learning? What if, instead of analyzing the complex interplay between audio and video, they are cheating? ...

2024-12 · 11 min · 2274 words
[CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation 🔗](https://arxiv.org/abs/2506.09343)

Read the Manual! Why Robots Need Instructions to Master Household Appliances

Imagine you’ve just bought a high-end espresso machine. It has four knobs, a lever, and a digital screen. You want to make a double-shot latte. Do you just start pushing buttons at random? Probably not. You pull out the user manual, find the “Getting Started” section, identify which button controls the steam wand, and follow the steps. Now, imagine a robot trying to do the same thing. Until now, most robotic research has relied on “common sense” or training data where a robot sees a handle and assumes it should be pulled. But sophisticated appliances don’t always follow common sense. A button on a microwave could start the heating process, or it could just set the clock. Without reading the manual, a robot is just guessing. ...

2025-06 · 9 min · 1708 words
[Change3D: Revisiting Change Detection and Captioning from a Video Modeling Perspective 🔗](https://arxiv.org/abs/2503.18803)

Treating Time as Time: How Change3D Revolutionizes Remote Sensing with Video Modeling

Change detection is one of the most fundamental tasks in computer vision for remote sensing. Whether we are assessing damage after a natural disaster, monitoring urban expansion, or tracking deforestation, the core goal remains the same: compare two images taken at different times and identify what is different. For years, the standard approach has been to treat this as a “Spot the Difference” game using static images. We take an image from Time A, an image from Time B, and ask a neural network to compare them. ...

2025-03 · 9 min · 1817 words
[Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhao_Can_Machines_Understand_Composition_Dataset_and_Benchmark_for_Photographic_Image_CVPR_2025_paper.pdf)

Rule of Thirds vs. AI: Can Machines Actually See Photographic Composition?

We often hear that AI can “see.” Computer vision models can identify a dog, a car, or a person in an image with superhuman accuracy. Generative models can create photorealistic scenes from scratch. But there is a subtle, artistic layer to photography that goes beyond just identifying objects: Composition. Composition is the art of arranging visual elements within a frame to create coherence and aesthetic appeal. It is why a photo taken by a professional looks “right,” while the same scene shot by an amateur might look cluttered or unbalanced. ...

8 min · 1603 words
[Can Generative Video Models Help Pose Estimation? 🔗](https://arxiv.org/abs/2412.16155)

Bridging the Gap - How Generative Video Models Solve Impossible Pose Estimation Problems

Introduction: The Human Ability to “Hallucinate” Geometry. Imagine you are standing in a classroom. You take a photo of the blackboard at the front. Then, you turn around and walk to the back of the room, taking a photo of a student’s desk. These two photos have zero overlap—there are no common visual features between them. If you feed these two images into a traditional computer vision algorithm and ask, “Where is the second camera located relative to the first?”, the algorithm will fail. It looks for matching pixels, keypoints, or textures. Finding none, it cannot mathematically compute the geometry. ...

2024-12 · 9 min · 1846 words
[CRISP: Object Pose and Shape Estimation with Test-Time Adaptation 🔗](https://arxiv.org/abs/2412.01052)

Bridging the Reality Gap: How CRISP Masters 3D Object Perception with Test-Time Adaptation

Imagine a robot arm tasked with cleaning up space debris. It sees a satellite floating in orbit. To grab it safely, the robot needs to know two things with high precision: where the satellite is (its pose) and what it looks like geometrically (its shape). In a controlled lab environment with perfect data, this is a solvable problem. But in the real world—or in space—lighting changes, sensors add noise, and objects might look slightly different than the 3D models the robot was trained on. This is known as the domain gap, and it is one of the biggest hurdles in computer vision today. ...

2024-12 · 8 min · 1644 words
[COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts 🔗](https://arxiv.org/abs/2504.10158)

When Models Leave the Lab: Benchmarking AI in the Wild with COUNTS

Imagine you are training a self-driving car system. You train it on thousands of hours of video footage taken in sunny California. The model achieves 99% accuracy in detecting pedestrians, other cars, and stop signs. Then, you deploy the car in a snowy Canadian town or a dimly lit tunnel. Suddenly, the system fails to recognize a pedestrian wearing a winter coat against a white background. This scenario illustrates one of the most persistent challenges in modern computer vision and Artificial Intelligence: Out-of-Distribution (OOD) generalization. ...

2025-04 · 10 min · 2007 words