[LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis 🔗](https://arxiv.org/abs/2412.15214)

Beyond Flat Video: Mastering 3D Trajectory Control in Generative AI with LeviTor

In the rapidly evolving world of Generative AI, we have moved quickly from creating static images to generating full-motion video. Tools like Sora, Runway, and Stable Video Diffusion have shown us that AI can dream up dynamic scenes. However, for these tools to be useful in professional workflows—like filmmaking, game design, or VR—random generation isn’t enough. We need control. Specifically, we need to tell the AI exactly where an object should move. This concept, known as “drag-based interaction,” allows users to click a point on an image and drag it to a new location, signaling the model to animate that movement. ...

2024-12 · 9 min · 1745 words
[Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition 🔗](https://arxiv.org/abs/2409.16434)

The Hitchhiker's Guide to PEFT: Unlocking Vision Transformers with Tiny Updates

If you are working in Computer Vision today, you are likely living in the era of “Download, Pre-train, Fine-tune.” We have access to massive foundation models like Vision Transformers (ViT) or CLIP, trained on millions (or billions) of images. But there is a catch: these models are gigantic. Fine-tuning a billion-parameter model for a specific task—like classifying rare bird species or detecting defects in manufacturing—usually involves updating all the parameters (Full Fine-Tuning). This is computationally expensive and storage-heavy. If you have 50 downstream tasks, you need to store 50 copies of that massive model. ...

2024-09 · 7 min · 1440 words
[Less is More: Efficient Model Merging with Binary Task Switch 🔗](https://arxiv.org/abs/2412.00054)

How to Reduce Model Storage by 97%: The T-Switch Method

In the rapidly evolving landscape of Artificial Intelligence, we have moved from training models from scratch to a new paradigm: taking a massive pre-trained model and fine-tuning it for specific tasks. Whether it’s a vision model trained to recognize satellite imagery or a language model fine-tuned for legal advice, we are surrounded by “specialist” versions of “generalist” models. But this creates a logistical nightmare. If you want a system that can handle 50 different tasks, do you store 50 copies of a multi-gigabyte neural network? That is computationally expensive and storage-intensive. ...

2024-12 · 7 min · 1421 words
[Learning to Filter Outlier Edges in Global SfM 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Damblon_Learning_to_Filter_Outlier_Edges_in_Global_SfM_CVPR_2025_paper.pdf)

Cleaning Up the Mess—How Graph Neural Networks and Line Graphs are Revolutionizing 3D Reconstruction

Reconstructing a 3D world from a jumbled collection of 2D photographs is one of the “magic tricks” of computer vision. This process, known as Structure-from-Motion (SfM), powers everything from Google Earth 3D views to autonomous vehicle mapping and digital heritage preservation. While the results can be stunning, the process is mathematically fragile. Traditional methods, known as incremental SfM, build the 3D model one image at a time. They are accurate but slow, often taking hours or days for large datasets. The alternative is global SfM, which attempts to solve the entire puzzle at once. Global methods are fast and scalable, but they have a major Achilles’ heel: they are incredibly sensitive to noise. ...

9 min · 1903 words
[Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation 🔗](https://arxiv.org/abs/2504.02697)

Seeing Through the Heat: How MambaTM and Learned Phase Distortion Solve Atmospheric Turbulence

Imagine looking down a long stretch of asphalt on a hot summer day. The air shimmers, causing the scenery to wobble, blur, and distort. This phenomenon, known as atmospheric turbulence, is a chaotic combination of blurring and geometric warping caused by temperature variations affecting the refractive index of air. While this “heat haze” might look artistic to the naked eye, it is a nightmare for long-range imaging systems used in surveillance, remote sensing, and astronomy. ...

2025-04 · 10 min · 2110 words
[Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zou_Learning_Conditional_Space-Time_Prompt_Distributions_for_Video_Class-Incremental_Learning_CVPR_2025_paper.pdf)

Mastering Video Continual Learning by Teaching Models to Dream Up Prompts

Imagine teaching a child to recognize a dog. Once they learn that, you teach them to recognize a cat. Ideally, learning about the cat shouldn’t make them forget what a dog looks like. This is the essence of Continual Learning (CL). Humans are naturally good at this; artificial neural networks, however, are not. When deep learning models are trained on new data classes sequentially, they tend to suffer from “Catastrophic Forgetting”—they optimize for the new task and overwrite the weights necessary for the old ones. ...

10 min · 2010 words
[Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection 🔗](https://arxiv.org/abs/2503.21099)

Bridging the Gap Between Indoor and Outdoor 3D Object Detection with Class Prototypes

In the rapidly evolving world of computer vision, 3D object detection stands as a pillar for technologies like autonomous driving and embodied robotics. To navigate the world, machines must perceive it in three dimensions. However, the deep learning models powering this perception have a massive hunger for data—specifically, precise 3D bounding box annotations. Annotating 3D point clouds is notoriously labor-intensive and expensive. While 2D images are relatively easy to label, rotating a 3D scene to draw exact boxes around every chair, car, or pedestrian requires significant human effort. This bottleneck has led researchers to explore sparse supervision—a training setting where only a small fraction of objects in a scene are annotated. ...

2025-03 · 9 min · 1850 words
[Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene 🔗](https://arxiv.org/abs/2503.15019)

How to Train Your 4D Model: Learning Dynamic Scene Graphs from Static 2D Images

Imagine teaching a robot to understand the world. If you show it a photo of a kitchen, it might identify a “cup” and a “table.” But the real world isn’t a static photo; it is a continuous, dynamic stream of events. A person walks in, picks up the cup, and drinks from it. To truly perceive reality, an AI needs to understand not just what things are, but how they interact over time and space. ...

2025-03 · 9 min · 1709 words
[Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis 🔗](https://arxiv.org/abs/2412.20651)

Bridging the Gap: How Latent Drifting Adapts Stable Diffusion for Medical Imaging

In the last few years, the field of computer vision has been completely upended by Generative AI. Models like Stable Diffusion and DALL-E have demonstrated an uncanny ability to generate photorealistic images from simple text prompts. They “know” what a dog looks like, how a sunset reflects on water, and what an astronaut riding a horse resembles. This is achieved by training on massive datasets containing billions of image-text pairs (like LAION-5B). ...

2024-12 · 9 min · 1709 words
[Label Shift Meets Online Learning: Ensuring Consistent Adaptation with Universal Dynamic Regret 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Dai_Label_Shift_Meets_Online_Learning_Ensuring_Consistent_Adaptation_with_Universal_CVPR_2025_paper.pdf)

Taming the Data Stream: How Universal Dynamic Regret Solves Online Label Shift

In the idealized world of machine learning, data is static. We train a model on a dataset, validate it, and deploy it, assuming the world will behave exactly like our training set forever. But in the real world, data is a restless stream. Trends change, behaviors shift, and the distributions of classes we are trying to predict fluctuate wildly over time. Imagine a human activity recognition system running on a smartphone. During the morning commute, the model sees a lot of “walking” and “sitting” data. At the gym in the evening, the stream shifts to “running” and “jumping.” Later at night, it’s mostly “lying down.” The relationship between the sensor data (features) and the activity (labels) hasn’t changed—a jump is still a jump—but the frequency of those labels has shifted dramatically. ...

9 min · 1730 words
[LATEXBLEND: Scaling Multi-concept Customized Generation with Latent Textual Blending 🔗](https://arxiv.org/abs/2503.06956)

Mixing Memories: How LATEXBLEND Scales Personalized AI Art

The world of text-to-image generation has moved beyond simply typing “a cat” and getting a generic cat. Today, users want their cat—specifically, the fluffy tabby sitting on their couch right now. This is known as customized generation. While teaching an AI to recognize a single specific subject (like your pet or a specific toy) has become standard practice with tools like DreamBooth or LoRA, a significant bottleneck remains: scalability. What if you want to generate an image of your specific dog, playing with your specific cat, sitting on your specific sofa, in front of your specific house? ...

2025-03 · 7 min · 1314 words
[LP-Diff: Towards Improved Restoration of Real-World Degraded License Plate 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Gong_LP-Diff_Towards_Improved_Restoration_of_Real-World_Degraded_License_Plate_CVPR_2025_paper.pdf)

Unblurring the Unreadable: How LP-Diff Restores Real-World License Plates

We have all seen the trope in crime investigation shows: a grainy, pixelated video of a getaway car is played, an agent says “Enhance,” and suddenly the license plate is crystal clear. In reality, License Plate Image Restoration (LPIR) is incredibly difficult. Factors like high speeds, poor lighting, long distances, and camera shake combine to create severe degradation that confuses even the best Optical Character Recognition (OCR) systems. While Deep Learning has revolutionized image restoration, there is a hidden flaw in many existing approaches: they rely on synthetic data. Researchers typically take a high-quality image and artificially add blur or noise to train their models. But a Gaussian blur applied in software looks nothing like the complex, non-linear distortion caused by a vehicle moving at 60 mph on a rainy night. ...

8 min · 1571 words
[LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models 🔗](https://arxiv.org/abs/2501.18954)

Beyond Class Names: How LLMDet Uses Detailed Captions to Revolutionize Object Detection

For years, the field of object detection has been constrained by a “closed-set” mentality. Traditional models were trained to recognize a specific list of categories—typically the 80 classes found in the COCO dataset (like “person,” “car,” or “dog”). If you showed these models a “platypus” or a “drone,” they would remain silent or misclassify it because they simply didn’t have the vocabulary. This limitation led to the rise of Open-Vocabulary Object Detection (OVD). By training models on massive amounts of image-text pairs (using frameworks like CLIP), researchers created detectors that could find objects they had never seen during training, simply by prompting them with text. However, a significant gap remains. Most current OVD methods, such as GLIP, rely on short, region-level text—simple nouns or brief phrases like “a running dog.” ...

2025-01 · 8 min · 1584 words
[KAC: Kolmogorov-Arnold Classifier for Continual Learning 🔗](https://arxiv.org/abs/2503.21076)

Giving AI a Better Memory: Meet the Kolmogorov-Arnold Classifier

Deep learning models are exceptional at learning specific tasks. Train a model to classify dogs, and it will do it perfectly. But ask that same model to learn how to classify cars afterward, and you encounter a notorious problem: Catastrophic Forgetting. In the process of learning about cars, the model completely forgets what a dog looks like. This is the central challenge of Continual Learning (CL)—how can we teach machines to learn sequentially, task after task, just like humans do, without erasing previous knowledge? ...

2025-03 · 9 min · 1746 words
[Just Dance with pi! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Majhi_Just_Dance_with_pi_A_Poly-modal_Inductor_for_Weakly-supervised_Video_CVPR_2025_paper.pdf)

Can AI Hallucinate Depth and Pose to Catch Crimes? Inside PI-VAD

Imagine you are watching a security camera feed in a busy store. You see a customer pick up an item, look at it, and put it in their bag. Is this a normal shopping event, or is it shoplifting? To a human observer, the context matters. Did they look around nervously? Did they scan the item with a personal scanner? To a standard computer vision model relying solely on pixel data (RGB), the visual difference between “buying” and “stealing” is frustratingly subtle. Both involve reaching, grabbing, and bagging. ...

8 min · 1641 words
[Is this Generated Person Exist in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body 🔗](https://arxiv.org/abs/2411.14205)

Fixing AI Hallucinations: How HumanCalibrator Detects and Repairs Anatomical Nightmares

We are currently living through a golden age of visual synthesis. Text-to-image models like Stable Diffusion, Midjourney, and DALL-E have revolutionized how we create content, allowing us to conjure photorealistic scenes from a single sentence. However, if you have spent any time experimenting with these tools, you have likely encountered the “uncanny valley” of AI generation: the anatomy problem. You ask for a portrait of a guitarist, and the model generates a stunning lighting setup, perfect skin texture, and… three hands. Or perhaps a person standing in a field, missing an ear, or with legs that merge into the grass. These “abnormal human bodies” are not just minor glitches; they are structural failures that shatter the illusion of realism. ...

2024-11 · 8 min · 1644 words
[Interpreting Object-level Foundation Models via Visual Precision Search 🔗](https://arxiv.org/abs/2411.16198)

Unlocking the Black Box of Vision-Language Models with Visual Precision Search

Imagine an autonomous vehicle driving through a busy intersection. It suddenly brakes for a pedestrian. As an engineer or a user, you might ask: Did it actually see the pedestrian? Or did it react to a shadow on the pavement that looked like a person? In the era of deep learning, answering this question is notoriously difficult. We have entered the age of Object-level Foundation Models—powerful AI systems like Grounding DINO and Florence-2 that can detect objects and understand complex textual descriptions (e.g., “the guy in white”). While these models achieve incredible accuracy, they operate as “black boxes.” Their internal decision-making processes are vast, complex webs of parameters that are opaque to humans. ...

2024-11 · 8 min · 1563 words
[InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing 🔗](https://arxiv.org/abs/2505.24315)

How AI Learns to Grasp: Inside InteractAnything

Imagine you are building a virtual world. You have a 3D model of a chair and a 3D model of a human. Now, you want the human to sit on the chair. In traditional animation, this is a manual, tedious process. You have to drag the character, bend their knees, ensure they don’t clip through the wood, and place their hands naturally on the armrests. Now, imagine asking an AI to “make the person sit on the chair,” and it just happens. ...

2025-05 · 9 min · 1854 words
[INTERMIMIC: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions 🔗](https://arxiv.org/abs/2502.20390)

Mastering the Physics of Interaction: How InterMimic Teaches Virtual Humans to Handle the Real World

In the world of computer animation and robotics, walking is a solved problem. We can simulate bipedal locomotion with impressive fidelity. However, as soon as you ask a virtual character to interact with the world—pick up a box, sit on a chair, or push a cart—the illusion often breaks. Hands float inches above objects, feet slide through table legs, or the character simply flails and falls over. This is the challenge of Physics-Based Human-Object Interaction (HOI). Unlike standard animation, where characters move along predefined paths (kinematics), physics-based characters must use virtual muscles (actuators) to generate forces. They must balance, account for friction, and manipulate dynamic objects that have mass and inertia. ...

2025-02 · 8 min · 1495 words
[Instruction-based Image Manipulation by Watching How Things Move 🔗](https://arxiv.org/abs/2412.12087)

InstructMove - How Watching Videos Teaches AI to Perform Complex Image Edits

The field of text-to-image generation has exploded in recent years. We can now conjure hyper-realistic scenes from a simple sentence. However, a significant challenge remains: editing. Once an image is generated (or if you have a real photo), how do you change specific elements—like making a person smile or rotating a car—without destroying the rest of the image’s identity? ...

2024-12 · 9 min · 1807 words