[Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers 🔗](https://arxiv.org/abs/2507.04388)

Unlocking the Black Box: How CoIBA Interprets Vision Transformers Using a Comprehensive Information Bottleneck

In the rapidly evolving landscape of computer vision, the Vision Transformer (ViT) has emerged as a powerhouse. From self-driving cars to medical imaging, ViTs are achieving remarkable performance, often outperforming traditional Convolutional Neural Networks (CNNs). However, like many deep learning models, they suffer from a significant drawback: they act as “black boxes.” We feed an image in, and a classification comes out, but we often have little insight into why the model made that decision. ...

2025-07 · 10 min · 2052 words
[Compositional Caching for Training-free Open-vocabulary Attribute Detection 🔗](https://arxiv.org/abs/2503.19145)

Beyond the Label: How Compositional Caching Revolutionizes Attribute Detection Without Training

In computer vision, identifying an object—say, a “car”—is a problem that has largely been solved. We have robust models that can spot a car in a crowded street with high accuracy. But what if we want to go deeper? What if we need to know if the car is rusty, wet, metallic, or vintage? This is the challenge of Attribute Detection. Unlike object classification, which deals with concrete nouns, attribute detection deals with adjectives. These properties shape how we perceive the world, but they are notoriously difficult for AI models to grasp. ...

2025-03 · 10 min · 1956 words
[Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models 🔗](https://arxiv.org/abs/2503.18337)

Beyond Weights: Fine-Tuning Transformers by Remixing Attention Heads

If you have ever tried to fine-tune a Large Language Model (LLM) or a massive Vision Transformer (ViT), you know the struggle: these models are heavy. Full-parameter fine-tuning is computationally expensive and memory-intensive. To solve this, the community turned to Parameter-Efficient Fine-Tuning (PEFT). The most famous example is LoRA (Low-Rank Adaptation), which freezes the pre-trained model and injects small, trainable rank decomposition matrices. Most of these methods focus on the linear projection layers—the weights (\(W_q, W_k, W_v\)) that transform your data. ...
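To make that LoRA baseline concrete, here is a minimal sketch (PyTorch, with illustrative dimensions and rank that are not taken from the paper) of the frozen-weight-plus-low-rank-update recipe applied to a single projection layer:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear projection plus a trainable low-rank update (B @ A)."""
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                   # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))   # zero init: training starts at W
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical query projection of a ViT block (197 patch tokens, width 768).
q_proj = LoRALinear(768, 768, rank=8)
tokens = torch.randn(1, 197, 768)
print(q_proj(tokens).shape)  # torch.Size([1, 197, 768])
```

Zero-initializing B is the standard LoRA design choice: the adapted model starts out identical to the frozen one, and only the small A and B matrices receive gradients.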

2025-03 · 8 min · 1520 words
[CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_CoSER_Towards_Consistent_Dense_Multiview_Text-to-Image_Generator_for_3D_Creation_CVPR_2025_paper.pdf)

Solved: The Janus Problem? How CoSER Brings Consistency to Text-to-3D Generation

Imagine typing “a bear dressed in medieval armor” into a computer, and seconds later, receiving a fully rotatable, high-quality 3D asset ready for a video game. This is the dream of Text-to-3D generation. While we have mastered 2D image generation (thanks to tools like Midjourney and Stable Diffusion), lifting this capability into three dimensions remains surprisingly difficult. A common failure mode is the “Janus problem”—named after the two-faced Roman god—where a generated model might have a face on both the front and the back of its head because the generator doesn’t understand that the back view shouldn’t look like the front view. ...

8 min · 1584 words
[CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation 🔗](https://arxiv.org/abs/2406.10462)

The Devil is in the Data: How CoMM Is Fixing Multimodal AI Generation

If you’ve ever tried to get an AI to write a coherent picture book or a step-by-step tutorial with consistent illustrations, you’ve likely noticed a problem. While modern Multimodal Large Language Models (MLLMs) are great at describing a single image or generating a single picture from text, they often struggle to tell a continuous story. The characters change appearance between panels, the logic skips steps, or the text and images just don’t seem to talk to each other. ...

2024-06 · 7 min · 1460 words
[ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate 🔗](https://arxiv.org/abs/2503.21268)

Reaching New Heights: How AI and LiDAR are Mastering Rock Climbing Motion Capture

In the world of computer vision, teaching machines to understand human movement has been a longstanding goal. We have become quite good at tracking runners on a track, pedestrians on a sidewalk, or dancers in a studio. These are what researchers call “ground-based motions.” The physics are somewhat predictable: gravity pulls down, and feet interact with a flat ground plane. But what happens when humans leave the ground? Rock climbing presents a fascinating and incredibly difficult challenge for Human Motion Recovery (HMR). Climbers are not merely walking; they are solving vertical puzzles. Their bodies contort into extreme poses, limbs stretch to their limits, and their interaction with the environment is complex—hands and feet must find purchase on tiny holds while the body defies gravity. Most existing AI models, trained on walking or running data, fail spectacularly when tasked with analyzing a climber. They struggle to understand where the climber is in the “world” (global position) and often hallucinate poses that are physically impossible on a vertical wall. ...

2025-03 · 10 min · 1997 words
[Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning 🔗](https://arxiv.org/abs/2412.00175)

The Sound of Silence: How a Hidden Shortcut Broke Deepfake Detectors (And How to Fix It)

In the cat-and-mouse game of deepfake detection, we often assume that as generative models get better, detection models must simply become more complex to keep up. We rely on massive datasets of real and manipulated videos to train these detectors, trusting that the neural networks are learning to spot subtle artifacts—mismatched lip movements, unnatural blinking, or digital residue at the pixel level. But what if our models aren’t learning what we think they are learning? What if, instead of analyzing the complex interplay between audio and video, they are cheating? ...

2024-12 · 11 min · 2274 words
[CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation 🔗](https://arxiv.org/abs/2506.09343)

Read the Manual! Why Robots Need Instructions to Master Household Appliances

Imagine you’ve just bought a high-end espresso machine. It has four knobs, a lever, and a digital screen. You want to make a double-shot latte. Do you just start pushing buttons at random? Probably not. You pull out the user manual, find the “Getting Started” section, identify which button controls the steam wand, and follow the steps. Now, imagine a robot trying to do the same thing. Until now, most robotic research has relied on “common sense” or training data where a robot sees a handle and assumes it should be pulled. But sophisticated appliances don’t always follow common sense. A button on a microwave could start the heating process, or it could just set the clock. Without reading the manual, a robot is just guessing. ...

2025-06 · 9 min · 1708 words
[Change3D: Revisiting Change Detection and Captioning from a Video Modeling Perspective 🔗](https://arxiv.org/abs/2503.18803)

Treating Time as Time: How Change3D Revolutionizes Remote Sensing with Video Modeling

Change detection is one of the most fundamental tasks in computer vision for remote sensing. Whether we are assessing damage after a natural disaster, monitoring urban expansion, or tracking deforestation, the core goal remains the same: compare two images taken at different times and identify what is different. For years, the standard approach has been to treat this as a “Spot the Difference” game using static images. We take an image from Time A, an image from Time B, and ask a neural network to compare them. ...

2025-03 · 9 min · 1817 words
[Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhao_Can_Machines_Understand_Composition_Dataset_and_Benchmark_for_Photographic_Image_CVPR_2025_paper.pdf)

Rule of Thirds vs. AI: Can Machines Actually See Photographic Composition?

We often hear that AI can “see.” Computer vision models can identify a dog, a car, or a person in an image with superhuman accuracy. Generative models can create photorealistic scenes from scratch. But there is a subtle, artistic layer to photography that goes beyond just identifying objects: Composition. Composition is the art of arranging visual elements within a frame to create coherence and aesthetic appeal. It is why a photo taken by a professional looks “right,” while the same scene shot by an amateur might look cluttered or unbalanced. ...

8 min · 1603 words
[Can Generative Video Models Help Pose Estimation? 🔗](https://arxiv.org/abs/2412.16155)

Bridging the Gap - How Generative Video Models Solve Impossible Pose Estimation Problems

Imagine you are standing in a classroom. You take a photo of the blackboard at the front. Then, you turn around and walk to the back of the room, taking a photo of a student’s desk. These two photos have zero overlap—there are no common visual features between them. If you feed these two images into a traditional computer vision algorithm and ask, “Where is the second camera located relative to the first?”, the algorithm will fail. It looks for matching pixels, keypoints, or textures. Finding none, it cannot mathematically compute the geometry. ...

2024-12 · 9 min · 1846 words
[CRISP: Object Pose and Shape Estimation with Test-Time Adaptation 🔗](https://arxiv.org/abs/2412.01052)

Bridging the Reality Gap: How CRISP Masters 3D Object Perception with Test-Time Adaptation

Imagine a robot arm tasked with cleaning up space debris. It sees a satellite floating in orbit. To grab it safely, the robot needs to know two things with high precision: where the satellite is (its pose) and what it looks like geometrically (its shape). In a controlled lab environment with perfect data, this is a solvable problem. But in the real world—or in space—lighting changes, sensors add noise, and objects might look slightly different than the 3D models the robot was trained on. This is known as the domain gap, and it is one of the biggest hurdles in computer vision today. ...

2024-12 · 8 min · 1644 words
[COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts 🔗](https://arxiv.org/abs/2504.10158)

When Models Leave the Lab: Benchmarking AI in the Wild with COUNTS

Imagine you are training a self-driving car system. You train it on thousands of hours of video footage taken in sunny California. The model achieves 99% accuracy in detecting pedestrians, other cars, and stop signs. Then, you deploy the car in a snowy Canadian town or a dimly lit tunnel. Suddenly, the system fails to recognize a pedestrian wearing a winter coat against a white background. This scenario illustrates one of the most persistent challenges in modern computer vision and Artificial Intelligence: Out-of-Distribution (OOD) generalization. ...

2025-04 · 10 min · 2007 words
[CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering 🔗](https://arxiv.org/abs/2503.00413)

How to Teach AI New Tricks Without Forgetting the Old: A Deep Dive into CL-MoE

Imagine you are learning a second language. You spend months mastering French. Then, you switch gears to learn Spanish. A few months later, you try to speak French again, but you find yourself inserting Spanish words or, worse, you’ve forgotten the French grammar entirely. This phenomenon, known in psychology as catastrophic forgetting, is a massive headache for Artificial Intelligence, specifically for Multimodal Large Language Models (MLLMs). These models, like GPT-4V or Gemini, are incredibly powerful at understanding images and answering questions about them. However, the world changes constantly. We want these models to learn new types of data and tasks continuously without having to retrain them from scratch (which costs millions of dollars) and—crucially—without them forgetting what they learned previously. ...

2025-03 · 9 min · 1707 words
[CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_CH3Depth_Efficient_and_Flexible_Depth_Foundation_Model_with_Flow_Matching_CVPR_2025_paper.pdf)

Solving the Depth Estimation Trilemma: Inside CH3Depth

Depth estimation—the ability to look at a 2D image and understand the 3D geometry within it—is a cornerstone of computer vision. It is the prerequisite for autonomous driving, robot navigation, mixed reality, and content generation. However, building an “ideal” depth estimation model has historically been a game of trade-offs. You usually have to pick two of the following three: meticulous detail (can the model see the fine edges of leaves or the texture of a distant building?), temporal consistency (if applied to a video, does the depth map flicker, or is it stable over time?), and efficiency (can it run in real time on a robot, or does it take several seconds per frame?). Recent foundation models like Marigold or Depth Anything have pushed the boundaries of detail, but often at the cost of speed or video stability. In this post, we will explore CH3Depth, a new research paper that proposes a unified framework to solve this trilemma. Using a technique called Flow Matching combined with a novel sampling strategy, CH3Depth achieves state-of-the-art results in both image and video depth estimation while being significantly faster than its predecessors. ...

7 min · 1399 words
[CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Tian_CCIN_Compositional_Conflict_Identification_and_Neutralization_for_Composed_Image_Retrieval_CVPR_2025_paper.pdf)

When Pictures Clash with Words: Solving Compositional Conflicts in Image Retrieval with CCIN

Imagine you are shopping online for a shirt. You find a photo of a shirt that has the perfect cut and fabric, but it’s blue, and you really wanted it in grey. In a standard text search, describing exactly what you want is difficult (“shirt like this but grey”). This scenario is where Composed Image Retrieval (CIR) shines. CIR allows users to search using a combination of a reference image (the blue shirt) and a text instruction (“change blue to grey”). Ideally, the system understands that it should keep the shape and fabric from the image but swap the color based on the text. ...

9 min · 1775 words
[CASP: Compression of Large Multimodal Models Based on Attention Sparsity 🔗](https://arxiv.org/abs/2503.05936)

Breaking the 2-Bit Barrier: How Attention Sparsity Unlocks Extreme Compression for Multimodal Models

In the rapidly evolving world of Artificial Intelligence, Large Multimodal Models (LMMs) have emerged as the new titans. Models like LLaVA and GPT-4V can see, read, and reason, bridging the gap between visual and textual data. However, this capability comes at a steep price: computational resources. To put this into perspective, running a 70-billion parameter model like LLaVA-Onevision at standard 16-bit precision requires roughly 140GB of GPU memory. This effectively walls off these powerful models from consumer hardware and efficient edge deployment. To solve this, researchers turn to model compression, specifically quantization—reducing the precision of the model’s weights (e.g., from 16-bit floating point to 4-bit or 2-bit integers). ...
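As a quick sanity check on those numbers, here is the back-of-the-envelope weight-storage calculation (a rough sketch that counts parameter bytes only and ignores activations, the KV cache, and quantization overhead such as scales and zero points):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed just to store the model weights, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

params = 70e9  # roughly 70 billion parameters
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
# 16-bit: ~140.0 GB    4-bit: ~35.0 GB    2-bit: ~17.5 GB
```

At 2 bits per weight the parameter storage drops to roughly 17.5 GB, which is why the “2-bit barrier” in the title matters: that is approximately the point at which a 70B-class model could fit on a single 24 GB consumer GPU, provided the quantization does not wreck accuracy.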

2025-03 · 8 min · 1639 words
[CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design 🔗](https://arxiv.org/abs/2504.19478)

Solving the Tetris of Interior Design: How CASAGPT Uses Cuboids for Collision-Free Scene Synthesis

Imagine you are trying to furnish a virtual apartment. You place a stylish L-shaped sofa in the corner and a coffee table in the nook of the “L”. To you, this is a perfect, cozy arrangement. But to a computer using traditional 3D understanding, you might have just caused a catastrophe. Why? Because many computer vision models view objects not as complex shapes, but as simple bounding boxes. If the bounding box of the table touches the bounding box of the sofa—even if the actual objects don’t touch—the computer screams “Collision!” Conversely, models might generate scenes where objects physically overlap in reality because the coarse boxes allowed it. ...
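To see why a coarse bounding box triggers that false alarm, here is a toy 2D overlap check (hypothetical floor-plan coordinates, purely for illustration): the single box wrapped around an L-shaped sofa also covers the empty nook, so a coffee table placed in the nook “collides” even though no geometry touches.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned box in floor-plan coordinates (metres)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def boxes_overlap(a: Box, b: Box) -> bool:
    """Standard axis-aligned test: boxes intersect iff they overlap on both axes."""
    return a.x_min < b.x_max and b.x_min < a.x_max and a.y_min < b.y_max and b.y_min < a.y_max

sofa_bbox = Box(0.0, 0.0, 2.0, 2.0)   # one box around the whole L-shaped sofa, nook included
table     = Box(1.2, 1.2, 1.8, 1.8)   # coffee table sitting inside the empty nook

print(boxes_overlap(sofa_bbox, table))  # True -> flagged as a collision, yet nothing touches
```

Decomposing the sofa into a few tighter cuboids instead of one coarse box is exactly the kind of representation the post goes on to describe.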

2025-04 · 7 min · 1486 words
[CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction 🔗](https://arxiv.org/abs/2411.16170)

Unlocking Mobile Vision: How CARE Transformers Balance Speed and Accuracy

In the rapidly evolving world of Computer Vision, the Vision Transformer (ViT) has been a revolutionary force. By adapting the self-attention mechanisms originally designed for Natural Language Processing (NLP), ViTs have achieved state-of-the-art results in image classification, object detection, and segmentation. However, there is a catch. The very mechanism that makes Transformers so powerful—Self-Attention—is computationally expensive. Specifically, it has “quadratic complexity.” As the resolution of an image increases, the computational cost explodes. This makes standard Transformers notoriously difficult to deploy on resource-constrained devices like mobile phones, where battery life and latency are critical. ...
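To see why that quadratic term bites as resolution grows, here is a tiny illustrative calculation (assuming a standard ViT-style 16×16 patch embedding; the numbers are for intuition only and are not taken from the paper):

```python
def attention_size(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Token count and number of entries in the full self-attention matrix."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens ** 2

for side in (224, 448, 896):
    tokens, entries = attention_size(side, side)
    print(f"{side}x{side}: {tokens:>5} tokens -> {entries:>10,} attention entries")
# Doubling the resolution quadruples the tokens and multiplies the attention cost by ~16x.
```

That tokens-squared blow-up is the term linear-attention designs such as CARE set out to remove.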

2024-11 · 9 min · 1758 words
[CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image 🔗](https://arxiv.org/abs/2504.11230)

Breaking the Parts Barrier: How CAP-Net Masters Articulated Object Perception

Imagine you are a robot tasked with a seemingly simple household chore: opening a laptop. To a human, this is trivial. You identify the lid, find the edge, and lift. But to a robot, this is a geometric nightmare. The laptop is not a solid brick; it is an articulated object—a structure composed of rigid parts connected by joints. The lid moves relative to the base, changing the object’s overall shape. ...

2025-04 · 9 min · 1757 words