CVPR 2025

[Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning 🔗](https://arxiv.org/abs/2503.07591)

Slash Your AI Training Costs: A New Paradigm for Visual Instruction Tuning

If you have been following the explosion of Large Vision-Language Models (LVLMs) like LLaVA, GPT-4V, or Gemini, you know that their ability to understand and reason about images is nothing short of impressive. However, behind every capable model lies a massive, expensive bottleneck: Visual Instruction Tuning (VIT). To train these models, researchers compile massive datasets of images paired with complex textual instructions (Question-Answer pairs). Creating these datasets usually involves feeding thousands of images into expensive proprietary models like GPT-4 to generate descriptions and QA pairs. This creates a dilemma for students and researchers with limited budgets: to build a high-quality dataset, you need money. To save money, you often have to settle for lower-quality data. ...

[Few-shot Implicit Function Generation via Equivariance 🔗](https://arxiv.org/abs/2501.01601)

Generative AI for Neural Weights: How Symmetry Solves the Few-Shot Problem

In the current landscape of Artificial Intelligence, we are accustomed to models that generate data: pixels for images, tokens for text, or waveforms for audio. But a new frontier is emerging: generating the models themselves. Imagine a system that doesn’t just output a 3D shape, but outputs the neural network weights that represent that shape. This is the promise of Implicit Neural Representations (INRs). INRs use simple Multi-Layer Perceptrons (MLPs) to represent complex continuous signals like 3D objects or gigapixel images. Because the signal is stored as a continuous function, it can be sampled at any resolution while staying compact to store. ...

[F3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Saha_F3OCUS_-_Federated_Finetuning_of_Vision-Language_Foundation_Models_with_Optimal_CVPR_2025_paper.pdf)

Balancing Act: Optimizing Federated Fine-Tuning for Vision-Language Models with F3OCUS

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) like LLaVA and BLIP have emerged as powerful tools capable of understanding and generating content based on both visual and textual inputs. These models hold immense promise for specialized fields such as healthcare, where a model might need to analyze a chest X-ray and answer a doctor’s natural language questions about it. However, deploying these massive “foundation models” in the real world presents a paradox. To make them useful for medical diagnosis, they must be fine-tuned on diverse, real-world medical data. Yet, strict privacy regulations (such as HIPAA and GDPR) often prevent hospitals from sharing patient data with a central server. ...

[FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images 🔗](https://arxiv.org/abs/2503.19207)

From Phone Photos to Animatable Avatars in Seconds: Deep Dive into FRESA

Imagine taking a few quick photos of yourself with your smartphone (front, back, maybe a side profile) and within seconds, having a fully 3D, digital double. Not just a static statue, but a fully rigged, animatable avatar wearing your exact clothes, ready to be dropped into a VR chatroom or a video game. For years, this has been the “holy grail” of 3D computer vision. The reality, however, has been a trade-off between quality and speed. You could either have high-quality avatars generated in a studio with expensive camera rigs (photogrammetry), or you could use neural networks that require hours of optimization per person to learn how a specific t-shirt folds. Neither is scalable for everyday users. ...

Grounding the Avatar: How Geometric Priors and Massive Data Solve Egocentric Motion Capture

If you have ever used a modern Virtual Reality (VR) headset, you have likely noticed something missing: your legs. Most current VR avatars are floating torsos with hands, ghosts drifting through a digital void. This isn’t a stylistic choice; it is a technical limitation. Tracking a user’s full body from a headset (egocentric motion capture) is incredibly difficult. The cameras on the headset can barely see the user’s lower body, often blocked by the chest or stomach (self-occlusion). When the cameras do see the legs, the perspective is distorted by fisheye lenses, and the rapid movement of the head makes the video feed chaotic. ...

[FICTION: 4D Future Interaction Prediction from Video 🔗](https://arxiv.org/abs/2412.00932)

Beyond 2D: Predicting 'Where' and 'How' Humans Interact in 3D Space

Imagine a robot assistant observing you in the kitchen. You are making tea. You’ve just boiled the water. A truly helpful assistant shouldn’t just recognize that you are currently “standing.” It should anticipate that in the next few seconds, you will walk to the cabinet, reach your arm upward to grab a mug, and then move to the fridge to get milk. ...

[FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation 🔗](https://arxiv.org/abs/2506.11543)

FIMA-Q: Unlocking Low-Bit Vision Transformers with Smarter Fisher Information Approximation

Vision Transformers (ViTs) have revolutionized computer vision, challenging the long-standing dominance of Convolutional Neural Networks (CNNs). By leveraging self-attention mechanisms, models like ViT, DeiT, and Swin Transformer have achieved remarkable results in classification and detection tasks. However, this performance comes with a hefty price tag: massive parameter counts and high computational overhead. To deploy these heavy models on edge devices such as smartphones or embedded systems, we need to compress them. One of the most popular methods for this is Post-Training Quantization (PTQ). PTQ converts high-precision floating-point weights (32-bit) into low-precision integers (like 4-bit or 8-bit) without requiring a full, expensive retraining of the model. ...
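To make the float-to-integer conversion concrete, here is a minimal sketch of plain round-to-nearest uniform quantization with a single shared scale. It is not the FIMA-Q method; the function names, the symmetric scheme, and the synthetic Gaussian weights are all assumptions chosen for illustration.

```python
import random


def quantize_symmetric(weights, num_bits):
    """Round floats to signed integers in [-qmax, qmax] using one shared scale.

    Illustrative baseline only (hypothetical helper, not from the paper).
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic stand-in for a layer's float32 weights (assumption, not real data).
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(1024)]

errors = {}
scales = {}
for bits in (8, 4):
    quantized, scale = quantize_symmetric(weights, bits)
    scales[bits] = scale
    errors[bits] = mean_abs_error(weights, dequantize(quantized, scale))
    print(f"{bits}-bit: scale={scale:.6f}, mean abs error={errors[bits]:.6f}")
```

Fewer bits mean a coarser grid and a larger rounding error, which is the gap that more careful PTQ methods try to close without retraining.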

[Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think 🔗](https://arxiv.org/abs/2503.00948)

Motion Modeling is Easier Than You Think: Unlocking Dynamic Video Generation with Model Merging

Image-to-Video (I2V) generation is one of the most exciting frontiers in computer vision. The premise is magical: take a single still photograph—a car on a road, a dog in the grass, a castle on a hill—and breathe life into it. You want the car to drive, the dog to roll over, and the camera to zoom out from the castle. However, if you have played with current I2V diffusion models, you might have encountered a frustrating reality. Often, the “video” is just the input image with a slight wobble, or a “zooming” effect that looks more like a 2D scale than 3D camera movement. Conversely, if the model does generate movement, it often ignores your text prompts completely, creating chaotic motion that has nothing to do with what you asked for. ...

[EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_EventPSR_Surface_Normal_and_Reflectance_Estimation_from_Photometric_Stereo_Using_CVPR_2025_paper.pdf)

How Event Cameras are Revolutionizing 3D Material Scanning

Creating realistic “digital twins” of real-world objects is a cornerstone of modern computer graphics, powering everything from movie VFX to immersive VR/AR experiences. To make a digital object look real, you need two things: its shape (surface normal) and its material properties (how shiny or rough it is). For years, this has been a tug-of-war between speed and quality. Traditional methods, like Photometric Stereo (PS), require capturing hundreds of High Dynamic Range (HDR) images under different lights. This is slow, data-heavy, and often fails on “tricky” materials—specifically, objects that are very shiny or metallic. ...

[Event fields: Capturing light fields at high speed, resolution, and dynamic range 🔗](https://arxiv.org/abs/2412.06191)

Event Fields: When High-Speed Vision Meets Light Field Imaging

Imagine trying to photograph a bullet speeding through the air. Now, imagine that after you’ve taken the photo, you decide you actually wanted to focus on the target behind the bullet, not the bullet itself. Traditionally, this is impossible. You would need a high-speed camera to freeze the motion, and a light field camera to change the focus. But high-speed cameras are data-hungry beasts, often requiring gigabytes of storage for a few seconds of footage, and light field cameras are notoriously slow or bulky. ...

[Event Ellipsometer: Event-based Mueller-Matrix Video Imaging 🔗](https://arxiv.org/abs/2411.17313)

Seeing the Unseen: High-Speed Polarization Video with Event Cameras

In the world of computer vision, we usually obsess over the intensity of light: how bright or dark a pixel is. But light carries another hidden layer of information: polarization. When light bounces off an object, its polarization state (the orientation of its oscillating electric field) changes. These changes encode rich details about the object’s shape, material composition, and surface texture that standard cameras simply miss. To capture this information fully, scientists use ellipsometry. This technique measures the “Mueller matrix,” a \(4 \times 4\) grid of numbers that completely describes how a material transforms polarized light. It is a powerful tool used in everything from biology to material science. ...
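As a rough illustration of what a Mueller matrix does, the sketch below applies one to a Stokes vector, the standard 4-element description of a beam's intensity and polarization. The Stokes formalism and the ideal horizontal polarizer are textbook optics rather than anything taken from the excerpt above, and the helper name is hypothetical.

```python
def apply_mueller(mueller, stokes):
    """Multiply a 4x4 Mueller matrix by a 4-element Stokes vector."""
    return [sum(m * s for m, s in zip(row, stokes)) for row in mueller]


# Mueller matrix of an ideal horizontal linear polarizer (textbook example).
HORIZONTAL_POLARIZER = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Stokes vectors are ordered as [S0, S1, S2, S3]; S0 is total intensity.
unpolarized = [1.0, 0.0, 0.0, 0.0]
vertical = [1.0, -1.0, 0.0, 0.0]

after_unpolarized = apply_mueller(HORIZONTAL_POLARIZER, unpolarized)
after_vertical = apply_mueller(HORIZONTAL_POLARIZER, vertical)

print(after_unpolarized)  # half the intensity passes, now horizontally polarized
print(after_vertical)     # vertically polarized light is blocked
```

A real material has a full matrix with up to 16 independent entries, which is why measuring it normally takes many separate polarization measurements.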

[EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events 🔗](https://arxiv.org/abs/2505.04657)

Beyond Frames: How Event Cameras Are Revolutionizing Continuous Video Super-Resolution

We live in a world dominated by video content, yet we are often limited by the hardware that captures it. Most videos are archived at fixed resolutions (like 1080p) and fixed frame rates (usually 30 or 60 fps). But what if you want to zoom in on a distant detail without it becoming a pixelated mess? Or what if you want to slow down a fast-moving action shot without it looking like a slideshow? ...

[Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras 🔗](https://arxiv.org/abs/2502.19630)

Seeing the Unseen: How Event Cameras Solve the 'Blind Time' Crisis in Autonomous Driving

Imagine you are driving down a highway at 60 miles per hour. For a split second, you close your eyes. In that brief moment, the car in front of you slams on its brakes. That split second, where you have no visual information, is terrifying. Now, consider an autonomous vehicle. These systems rely heavily on sensors like LiDAR and standard frame-based cameras. While sophisticated, these sensors have a fundamental limitation: they operate at a fixed frame rate, typically around 10 to 20 Hz. This means there is a gap, often up to 100 milliseconds, between every snapshot of the world. In the research community, this is known as “blind time.” ...
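The numbers above can be turned into a quick back-of-the-envelope check. The sketch below combines the 60 mph highway speed from the analogy with the 10 to 20 Hz sensor rates; pairing those two figures is an assumption made here for illustration, and the function names are hypothetical.

```python
METERS_PER_MILE = 1609.344
SECONDS_PER_HOUR = 3600.0


def blind_time_seconds(sensor_rate_hz):
    """Gap between two consecutive sensor snapshots."""
    return 1.0 / sensor_rate_hz


def blind_distance_meters(speed_mph, sensor_rate_hz):
    """Distance a vehicle covers during one gap between snapshots."""
    speed_mps = speed_mph * METERS_PER_MILE / SECONDS_PER_HOUR
    return speed_mps * blind_time_seconds(sensor_rate_hz)


for rate_hz in (10, 20):
    gap_ms = blind_time_seconds(rate_hz) * 1000.0
    distance = blind_distance_meters(60.0, rate_hz)
    print(f"{rate_hz} Hz: {gap_ms:.0f} ms gap, {distance:.2f} m travelled at 60 mph")
```

At 10 Hz the gap is 100 ms, during which a car at 60 mph covers roughly 2.7 m without any new measurement.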

[Estimating Body and Hand Motion in an Ego-sensed World 🔗](https://arxiv.org/abs/2410.03665)

EgoAllo: How Smart Glasses Can See Your Whole Body

Imagine wearing a pair of smart glasses. You are walking through your living room, reaching for a coffee mug, or typing on a keyboard. The glasses have cameras, but they are facing outward to map the world. They can see the mug, the table, and maybe your hands entering the frame. But they can’t see you, or at least not your torso, legs, or feet. This “invisibility” presents a massive challenge for Augmented Reality (AR) and robotics. If a computer system wants to understand your actions, it needs to know your full body pose. Is the user sitting or standing? Are they leaning forward? Where are their feet planted? ...

[Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways 🔗](https://arxiv.org/abs/2503.07026)

Unlearning to See: How EraDiff Teaches Diffusion Models to Erase Objects Properly

Imagine you have a perfect photo of a pepperoni pizza, but you want to remove just one specific slice to show the wooden plate underneath. You fire up a state-of-the-art AI inpainting tool, mask out the slice, and hit “generate.” Ideally, the AI should generate the texture of the wooden plate. But often, standard diffusion models will do something frustrating: they replace the pepperoni slice with… a cheese slice. Or perhaps a distorted “ghost” of the pepperoni remains. ...

[Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wu_Enhanced_Visual-Semantic_Interaction_with_Tailored_Prompts_for_Pedestrian_Attribute_Recognition_CVPR_2025_paper.pdf)

Beyond Static Labels - Tailoring Prompts for Smarter Pedestrian Recognition

Imagine scanning hours of security footage trying to locate a specific individual. You aren’t just looking for a face; you are looking for descriptors: “a woman wearing a red dress,” “a man with a backpack,” or “someone wearing glasses.” In Computer Vision, this task is known as Pedestrian Attribute Recognition (PAR). For years, this field was dominated by systems that simply looked at an image and tried to guess the tags. However, the rise of Vision-Language Models (like CLIP) has introduced a new paradigm: using text to help the computer “understand” the image better. ...

[ENERGYMOGEN: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space 🔗](https://arxiv.org/abs/2412.14706)

Mastering Motion: How Energy-Based Models Enable Complex AI Choreography

Imagine asking an AI to generate an animation of a person “walking forward.” By today’s standards, this is a solved problem. Modern diffusion models can generate a realistic walk cycle in seconds. But what happens if you increase the complexity? What if you ask for a person “walking forward AND waving both hands, but NOT turning around”? This is where standard generative models often stumble. Humans are masters of composition. We can effortlessly blend simple concepts (walking, waving, looking left) into a single, coherent behavior. We can also understand negative constraints (what not to do) just as easily as positive ones. ...

[Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_Enduring_Efficient_and_Robust_Trajectory_Prediction_Attack_in_Autonomous_Driving_CVPR_2025_paper.pdf)

How Cardboard Boxes Can Confuse Autonomous Cars: Inside the OMP-Attack

The promise of Autonomous Driving (AD) is built on trust—trust that the vehicle can perceive its environment, predict what others will do, and plan a safe route. But what if a few strategically placed cardboard boxes could shatter that trust? In the world of adversarial machine learning, researchers are constantly probing for weaknesses to build safer systems. A recent paper, Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework, uncovers a significant vulnerability in how self-driving cars predict the movement of other vehicles. The authors introduce a new method, the OMP-Attack, which uses simple physical objects to trick an autonomous vehicle (AV) into slamming on its brakes to avoid a phantom collision. ...

[End-to-End HOI Reconstruction Transformer with Graph-based Encoding 🔗](https://arxiv.org/abs/2503.06012)

How HOI-TG Solves the Global-Local Conflict in 3D Human-Object Reconstruction

In the rapidly evolving world of Computer Vision, reconstructing 3D humans from 2D images is a well-studied problem. But humans rarely exist in a vacuum. We hold phones, sit on chairs, ride bikes, and carry boxes. When you add objects to the equation, the complexity explodes. This field, known as Human-Object Interaction (HOI) reconstruction, faces a fundamental conflict. To reconstruct a 3D scene, you need to understand the global structure (where the person is relative to the object) and the local details (how the fingers wrap around a handle). Most existing methods struggle to balance these two, often prioritizing one at the expense of the other. ...

[Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility 🔗](https://arxiv.org/abs/2505.21377)

Dream3DVG: Bridging the Gap Between Text-to-3D and Vector Graphics

In the world of digital design, vector graphics are the gold standard for clarity and scalability. Unlike pixel-based images (raster graphics), which get blurry when you zoom in, vector graphics are defined by mathematical paths (lines, curves, and shapes) that remain crisp at any resolution. They are the backbone of logos, icons, and conceptual art. However, vector graphics have traditionally been shackled to a 2D plane. If you draw a vector sketch of a car, you can’t simply rotate it to see the back bumper; the drawing is fixed from that specific viewpoint. While recent advancements in AI have enabled “Text-to-3D” generation, applying these techniques to the sparse, abstract world of vector strokes has been notoriously difficult. When you try to force standard 3D generation methods to create line drawings, you often end up with a “tangle of wires”: messy, inconsistent lines that don’t look like a cohesive drawing when rotated. ...