CVPR 2025

[Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting 🔗](https://arxiv.org/abs/2505.21943)

Escaping the Trap of Point-to-Point Loss: How Point-to-Region Matching Solves Semi-Supervised Crowd Counting

Imagine looking at a photograph of a packed stadium or a bustling city square. Your task is to count every single person. In computer vision, this is the task of Crowd Counting, and it is critical for urban planning, safety monitoring, and traffic control. Deep learning has made massive strides in this field. However, there is a bottleneck: data annotation. To train a model to count people, humans currently have to manually place a dot on the head of every single person in thousands of training images. In a dense crowd, a single image might contain thousands of people. The labor cost is astronomical. ...

[PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes 🔗](https://arxiv.org/abs/2412.03451)

Beyond Points and Gaussians: Reconstructing Indoor Scenes with PlanarSplatting

If you look around the room you are sitting in right now, what do you see? Ideally, you see walls, a floor, a ceiling, maybe a desk or a bookshelf. Geometrically speaking, you are surrounded by planes. While humans instantly perceive these structured, flat surfaces, getting a computer to reconstruct them from 2D images is notoriously difficult. Traditional 3D reconstruction methods often output “point clouds” or “meshes” that look like melted wax—lumpy, noisy, and lacking the sharp geometric definition of a real wall or table. ...

[Pippo: High-Resolution Multi-View Humans from a Single Image 🔗](https://arxiv.org/abs/2502.07785)

From Selfie to Studio: How Pippo Generates High-Res 3D Avatars from a Single Photo

Imagine taking a quick, casual photo of yourself with your smartphone and, within moments, having a high-resolution, 360-degree video of your digital twin—complete with details of your back, the texture of your hair, and the folds in your clothes, all fully 3D consistent. This capability is the “holy grail” for applications ranging from the metaverse and gaming to virtual fashion and telepresence. ...

[PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset 🔗](https://arxiv.org/abs/2403.11116)

Diagnosing Hallucinations in Multimodal AI: Inside the PhD Benchmark

Imagine showing a photograph of a cat sleeping on a table to an Artificial Intelligence. You ask, “Is there a dog in this picture?” The AI confidently replies, “Yes, there is a dog sleeping on the table.” This phenomenon is known as visual hallucination. It is one of the most persistent and perplexing challenges in the field of Multimodal Large Language Models (MLLMs)—systems like LLaVA, Qwen-VL, or GPT-4V that can see and speak. While these models have demonstrated incredible capabilities, they frequently fabricate objects, misinterpret attributes, or agree with false premises provided in the text prompt. ...

[Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics 🔗](https://arxiv.org/abs/2503.20308)

Beyond Vertex Error: How to Make 3D Talking Heads Actually Look Real

If you have ever played a modern RPG or watched a dubbed movie using AI-generated lip-sync, you have likely experienced the “Uncanny Valley.” The character’s lips are moving, and technically they are hitting the right shapes for the sounds, but something feels off. The mouth might open perfectly for an ‘a’ vowel, but it lacks the energy of the shout, or the timing is just a few milliseconds too robotic. ...

[PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models 🔗](https://arxiv.org/abs/2412.18608)

Beyond Monolithic Meshes: How PartGen Decomposes and Reconstructs 3D Objects

The field of Generative AI has moved at a breakneck pace. We started with blurry 2D images, moved to high-fidelity photorealism, and have now arrived at the frontier of generating 3D assets from simple text prompts. Tools like DreamFusion and various mesh generators can create “a beagle in a detective’s outfit” in seconds. But there is a catch. Most current methods generate what we call “unstructured” assets. The beagle, the detective hat, and the magnifying glass are all fused into a single, continuous mesh or radiance field. For a game developer or an animator, this is a problem. You cannot simply take the hat off the dog, nor can you animate the legs independently, because the model doesn’t know where the leg ends and the body begins. In professional workflows, structure is just as important as appearance. ...

[Parallelized Autoregressive Visual Generation 🔗](https://arxiv.org/abs/2412.15119)

Breaking the Speed Limit of Autoregressive Image Generation with PAR

In the world of Generative AI, Autoregressive (AR) models are the heavy lifters. They are the architecture behind the Large Language Models (LLMs) that power ChatGPT and Claude. Their premise is simple but powerful: predict the next piece of data based on everything that came before it. When applied to text, they write one word at a time. When applied to computer vision, they paint an image one “token” (a compressed patch of an image) at a time. ...
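To make the "one token at a time" premise concrete, here is a minimal sketch of a sequential autoregressive loop over an image token grid. It is not the PAR method or any real model: `toy_next_token` is a hypothetical stand-in for a trained network, and the grid size and vocabulary size are arbitrary assumptions chosen for illustration.

```python
import random


def toy_next_token(prefix, vocab_size, rng):
    """Hypothetical stand-in for a trained AR model.

    A real model would return a learned distribution conditioned on the
    prefix; here we only bias the draw by the prefix so that each step
    visibly depends on everything generated so far.
    """
    favoured = sum(prefix) % vocab_size
    weights = [2.0 if t == favoured else 1.0 for t in range(vocab_size)]
    return rng.choices(range(vocab_size), weights=weights, k=1)[0]


def generate_tokens(height, width, vocab_size=16, seed=0):
    """Generate a height x width grid of image tokens in raster order."""
    rng = random.Random(seed)
    tokens = []
    # One model call per token: height * width strictly sequential steps.
    for _ in range(height * width):
        tokens.append(toy_next_token(tokens, vocab_size, rng))
    # Reshape the flat sequence back into rows of the token grid.
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


grid = generate_tokens(4, 4)
for row in grid:
    print(row)
```

The loop runs `height * width` times and each step waits for the previous one, which is why the number of sequential steps grows with image resolution.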

[Panorama Generation From NFoV Image Done Right 🔗](https://arxiv.org/abs/2503.18420)

Escaping the Visual Cheating Trap: How to Generate Geometrically Correct 360° Panoramas

Imagine you are standing in a beautiful cathedral. You take a photo with your standard smartphone camera. That photo captures a “Narrow Field of View” (NFoV)—essentially a small rectangle of the whole scene. Now, imagine asking an AI to take that small rectangle and hallucinate the rest of the cathedral—the ceiling, the floor, and everything behind you—to create a perfect 360-degree sphere that you can view in a VR headset. ...

[PGC: Physics-Based Gaussian Cloth from a Single Pose 🔗](https://arxiv.org/abs/2503.20779)

The Best of Both Worlds: Combining Physics and Gaussians for Realistic Digital Cloth

Digital clothing has always been a thorn in the side of computer graphics. If you play modern video games or watch visual effects breakdowns, you might notice that while faces are becoming indistinguishable from reality, clothing often lags behind. It either looks like a stiff plastic shell, or it moves weirdly, or it lacks that fuzzy, tactile “softness” that real fabric has. Traditionally, we’ve had to choose between two imperfect options: mesh-based simulations that move well but lack detailed texture, or volumetric captures that look photorealistic but break apart when they move. ...

[Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_Overcoming_Shortcut_Problem_in_VLM_for_Robust_Out-of-Distribution_Detection_CVPR_2025_paper.pdf)

Stop Cheating! How to Fix Shortcut Learning in Vision-Language Models

Imagine you are a student taking a test. You encounter a question showing a picture of a grassy field with a small, blurry animal in it. You aren’t 100% sure what the animal is, but you know that cows are usually on grass. So, you guess “cow.” You get it right. But what if the next picture is a grassy field with a boat in it? If you rely solely on the “grass” shortcut, you might still guess “cow.” ...

[Order-One Rolling Shutter Cameras 🔗](https://arxiv.org/abs/2403.11295)

Taming the Jello Effect - A Unified Theory for Rolling Shutter Geometry

If you have ever taken a photo of a spinning propeller or a fast-moving train out of a car window using your smartphone, you have likely witnessed the “Rolling Shutter” effect. Propellers look like warped boomerangs; vertical poles look slanted; cars look like they are leaning forward. This phenomenon, often called the “Jello effect,” occurs because most modern consumer cameras (CMOS sensors) do not capture the entire image at the exact same instant. Instead, they scan the scene line by line, usually from top to bottom. If the camera or the object moves during that scanning process, the geometry of the image breaks. ...

[OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_OpticalNet_An_Optical_Imaging_Dataset_and_Benchmark_Beyond_the_Diffraction_CVPR_2025_paper.pdf)

Breaking the Laws of Physics? How OpticalNet Uses AI to See the Invisible

For centuries, the quest to see the “tiny world” has been a driving force in science. From the rudimentary magnifying glasses of antiquity to the sophisticated microscopes of today, we have relentlessly pursued higher resolution. But there has always been a fundamental wall: the diffraction limit. Physics dictates that optical systems cannot resolve features significantly smaller than half the wavelength of the light used to illuminate them. For visible light, this limit is around 200 nanometers. This means viruses, DNA strands, and the intricate machinery of life often remain just out of focus, appearing as blurry blobs rather than distinct structures. ...
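The "around 200 nanometers" figure can be checked against the Abbe diffraction limit. As a rough worked example, assume violet light at the short end of the visible range (λ ≈ 400 nm) and a numerical aperture NA close to 1; both values are illustrative assumptions, not numbers taken from the paper:

$$
d = \frac{\lambda}{2\,\mathrm{NA}} \approx \frac{400\ \text{nm}}{2 \times 1.0} = 200\ \text{nm}
$$

Longer wavelengths or smaller apertures only make the smallest resolvable feature larger.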

[OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation 🔗](https://arxiv.org/abs/2412.00115)

Solving the Human Problem in AI Video: A Deep Dive into OpenHumanVid

If you have experimented with recent video generation models like Sora, Stable Video Diffusion, or MovieGen, you have likely noticed a recurring pattern. These models can generate breathtaking landscapes, cyberpunk cities, and surreal abstractions with ease. But the moment you ask for a video of a human speaking or performing a complex action, the cracks begin to show. Faces distort, hands morph into eldritch horrors, and movements defy the laws of physics. ...

[Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces 🔗](https://arxiv.org/abs/2503.19199)

Beyond Geometry: Teaching Robots to Understand Functionality in 3D Spaces

Imagine you are a robot navigating a kitchen. You scan the room and perfectly identify a refrigerator, a cabinet, and a sink. You know exactly where they are located in 3D space. But now, you are given a command: “Open the fridge.” Suddenly, your perfect geometric map is insufficient. You know where the fridge is, but do you know how to interact with it? Do you know which specific handle belongs to the fridge door? Do you understand that pulling that handle causes the door to swing open? Or consider a more complex command: “Turn on the ceiling light.” You can see the light fixture, but the switch is on a wall three meters away. To a standard 3D perception system, there is no physical link between that switch and that light. ...

[Open-Canopy: Towards Very High Resolution Forest Monitoring 🔗](https://arxiv.org/abs/2407.09392)

Scaling Up Forest Monitoring: Inside the Open-Canopy Benchmark for Very High Resolution Satellite Imagery

If you want to know how much carbon a forest stores, or how healthy an ecosystem is, you need to know the height of the trees. It sounds simple, but measuring canopy height at a global—or even national—scale is a logistical nightmare. You cannot send a team of researchers with tape measures into every hectare of woodland. Traditionally, we have relied on two extremes. On one end, we have Aerial Laser Scanning (ALS), or LiDAR flown on planes. It is incredibly accurate, creating dense 3D clouds of the forest structure, but it is prohibitively expensive and rarely updated. On the other end, we have satellite imagery. Satellites like Sentinel-2 pass over frequently and freely, but their resolution (10 to 30 meters per pixel) is often too coarse to distinguish individual trees or detect subtle logging activities. ...

[One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Jin_One-shot_3D_Object_Canonicalization_based_on_Geometric_and_Semantic_Consistency_CVPR_2025_paper.pdf)

Taming the 3D Wild - How One-Shot Canonicalization Aligns Objects Using LLMs and Geometry

Imagine walking into a library where every book is thrown onto the floor in a random pile. Finding “Moby Dick” would be a nightmare. Now, imagine a library where every book is shelved, spine out, upright, and categorized. This is essentially the problem of 3D Object Canonicalization. In computer vision and 3D generation, we often deal with “messy libraries.” We scrape 3D models from the internet, but they come in arbitrary orientations—some are upside down, some face left, others face right. To make this data useful for AI, we need to “canonicalize” it: align every object to a standard coordinate system (e.g., all cars face positive X, all chairs stand upright on Y). ...

[OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities 🔗](https://arxiv.org/abs/2412.16604)

OmniSplat: How to Master 3D Scene Reconstruction with 360° Images

Imagine trying to reconstruct a full 3D room from just two photographs. In the world of computer vision, this “sparse-view reconstruction” is the holy grail for Virtual Reality (VR) and Augmented Reality (AR). Recently, 3D Gaussian Splatting (3DGS) has revolutionized this field, offering real-time rendering speeds that older methods, like NeRFs, struggled to achieve. However, there is a catch. Most of these breakthroughs rely on “perspective” images—standard photos with a limited field of view. But if you want to capture a whole room quickly, you don’t take fifty narrow photos; you take one or two omnidirectional (360°) images. ...

[OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints 🔗](https://arxiv.org/abs/2501.03841)

Bridging the Gap: How OmniManip Connects VLM Reasoning to Precise Robot Action

The dream of general-purpose robotics is a machine that can walk into a messy kitchen, identify a teapot and a cup, and pour you a drink without having been explicitly programmed for that specific teapot or that specific cup. In recent years, we have seen massive leaps in Vision-Language Models (VLMs). These models (like GPT-4V) have incredible “common sense.” They can look at an image and tell you, “That is a teapot, you hold it by the handle, and you pour liquid from the spout.” However, knowing what to do is very different from knowing exactly how to do it in 3D space. ...

[Olympus: A Universal Task Router for Computer Vision Tasks 🔗](https://arxiv.org/abs/2412.09612)

From Jack-of-All-Trades to Master Conductor: How Olympus Redefines Multimodal AI

In the rapidly evolving world of Artificial Intelligence, there is a massive race to build the ultimate “All-in-One” model. We’ve seen Multimodal Large Language Models (MLLMs) like GPT-4 and LLaVA that can see, read, and reason. We’ve seen Generative models like Stable Diffusion and Sora that can create breathtaking images and videos. Naturally, the industry’s instinct has been to smash these capabilities together into a single, gigantic neural network—a “Jack-of-All-Trades” that can generate, edit, segment, and answer questions all at once. While models like Emu3 and Omni-Gen have made strides here, they face a significant hurdle: conflicting objectives. Training a model to strictly reason about an image (understanding) often conflicts with training it to dream up new pixels (generation). Furthermore, these massive omni-models are incredibly expensive to train and difficult to scale. If a better image generator comes out next month, you have to retrain your entire massive model to use it. ...

[Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding 🔗](https://arxiv.org/abs/2503.00361)

Taming Hallucinations in Vision-Language Models with the Octopus Framework

Imagine asking an AI to describe a picture of a soccer field. The model confidently replies, “A player in a green jersey is kicking the ball toward the goal.” It sounds perfect, except for one problem: there is no ball in the picture. This phenomenon is known as hallucination. Large Vision-Language Models (LVLMs), despite their incredible ability to understand images and text, frequently fabricate objects, attributes, or relationships that simply don’t exist. For casual use, this is annoying. For critical applications like medical imaging analysis or autonomous driving, it is dangerous. ...