[Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity 🔗](https://arxiv.org/abs/2412.06171)

Elementary, My Dear Watson: How Holmes-VAU Solves Video Anomalies Like a Detective

Introduction Imagine you are a detective reviewing CCTV footage of a busy city street. Hours of mundane traffic pass by: cars stopping at red lights, pedestrians crossing, rain falling. Suddenly, for three seconds, a car swerves erratically and clips a bus before speeding off. If you were a traditional computer vision model, you might flag a “spike” in an anomaly score at that timestamp. But you wouldn’t necessarily know why. Was it a fight? An explosion? A traffic accident? Furthermore, to understand that this was a “hit-and-run,” you need to watch the moments leading up to the swerve and the aftermath. You need context. ...

2024-12 · 9 min · 1745 words
[High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model 🔗](https://arxiv.org/abs/2504.01512)

From Flat to Form: How GS-RGBN Masters Single-Image 3D Reconstruction

Introduction One of the most captivating challenges in computer vision is the “Holy Grail” of 3D generation: taking a single, flat photograph of an object and instantly reconstructing a high-fidelity 3D model that looks good from every angle. Imagine snapping a photo of a toy on your desk and immediately importing it into a video game or a VR environment. While generative AI has made massive strides in 2D image creation, lifting that capability to 3D has proven significantly harder. The core problem is geometric ambiguity. A single image tells you what an object looks like from one specific angle, but it leaves the back, sides, and internal geometry completely up to interpretation. ...

2025-04 · 11 min · 2187 words
[High-Fidelity Lightweight Mesh Reconstruction from Point Clouds 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhang_High-Fidelity_Lightweight_Mesh_Reconstruction_from_Point_Clouds_CVPR_2025_paper.pdf)

Smart Meshing - How to Reconstruct High-Fidelity Lightweight 3D Models from Point Clouds

Introduction In the world of 3D computer vision and graphics, reconstructing a surface from a point cloud is a fundamental task. Whether you are scanning a room for AR applications or creating assets for a video game, the goal is often the same: take a cloud of disconnected dots and turn them into a watertight, smooth, and detailed 3D mesh. For years, the gold standard for this process has been Marching Cubes (MC). When combined with Neural Implicit Representations—specifically Signed Distance Functions (SDFs)—MC is incredibly reliable. However, it has a significant flaw: it is rigid. MC operates on a fixed-resolution grid. If you want fine detail, you need a high-resolution grid, which generates millions of tiny triangles, leading to massive file sizes and memory usage. If you want a lightweight file, you lower the grid resolution, but you immediately lose sharp edges and fine details. ...
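To make that resolution trade-off concrete, here is a minimal sketch (not from the paper) that runs scikit-image's Marching Cubes on a toy sphere SDF at two grid resolutions and compares triangle counts; the `sphere_sdf` helper, the sphere radius, and the 32³/128³ resolutions are illustrative assumptions.

```python
# A minimal sketch of the Marching Cubes resolution trade-off described above.
# Requires numpy and scikit-image; the sphere SDF is a toy stand-in for a
# learned neural implicit representation.
import numpy as np
from skimage.measure import marching_cubes

def sphere_sdf(resolution, radius=0.6):
    """Sample a signed distance function of a sphere on a regular grid."""
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius

for res in (32, 128):
    sdf = sphere_sdf(res)
    # Marching Cubes extracts the zero level set of the SDF as a triangle mesh.
    verts, faces, normals, _ = marching_cubes(sdf, level=0.0)
    print(f"grid {res}^3 -> {len(faces):,} triangles")

# Typical outcome: the 128^3 grid yields roughly 16x more triangles than the
# 32^3 grid, even though the underlying surface (a sphere) is equally simple
# in both cases -- extra resolution buys detail by paying in mesh size.
```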

10 min · 1983 words
[Hardware-Rasterized Ray-Based Gaussian Splatting 🔗](https://arxiv.org/abs/2503.18682)

High-Fidelity at High FPS — Mastering Hardware-Rasterized Ray-Based Gaussian Splatting

Introduction In the rapidly evolving world of 3D reconstruction and rendering, we are currently witnessing a tug-of-war between two critical factors: speed and quality. On one hand, we have 3D Gaussian Splatting (3DGS), which took the world by storm with its ability to render scenes in real-time using rasterization. On the other hand, we have high-fidelity approaches like Ray-Based Gaussian Splatting (RayGS), which offer superior visual quality—particularly for complex geometries and view-dependent effects—but suffer from computational heaviness that makes them struggle in real-time applications, especially in Virtual Reality (VR). ...

2025-03 · 9 min · 1837 words
[HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos 🔗](https://arxiv.org/abs/2501.02973)

Unlocking World-Space Hand Motion: How HaWoR Solves Egocentric 3D Reconstruction

Imagine wearing a VR headset or AR glasses. You reach out to grab a virtual cup of coffee. For the experience to feel real, the system needs to know exactly where your hand is—not just in the camera’s view, but in the actual 3D room. This sounds simple, but it is a surprisingly difficult computer vision problem. In “egocentric” (first-person) video, two things are moving at once: your hands and your head (the camera). Traditional methods struggle to separate these motions. If you turn your head left, it looks like your hand moved right. Furthermore, your hands frequently drop out of the camera’s view, causing tracking systems to “forget” where they are. ...

2025-01 · 8 min · 1604 words
[HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos 🔗](https://arxiv.org/abs/2411.19167)

Beyond Single-View: Inside HOT3D, the New Benchmark for Egocentric Hand-Object Interaction

Introduction The dexterity of the human hand is a defining trait of our species. Whether we are assembling furniture, typing code, or whisking eggs, we constantly interact with the physical world to manipulate objects. For Artificial Intelligence, understanding these interactions is the holy grail of embodied perception. If an AI can truly understand how hands and objects move together in 3D space, we unlock possibilities ranging from teaching robots manual skills to creating augmented reality (AR) interfaces that turn any surface into a virtual keyboard. ...

2024-11 · 8 min · 1559 words
[HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation 🔗](https://arxiv.org/abs/2411.18335)

360° Vision: Solving Depth Estimation in the Wild with HELVIPAD

Introduction Imagine you are a mobile robot tasked with navigating a crowded university campus. To move safely, you need to know exactly how far away every object is—not just the ones directly in front of you, but the pedestrians approaching from the side, the pillars behind you, and the walls curving around you. You need 360-degree spatial awareness. Historically, the gold standard for this kind of omnidirectional sensing has been LiDAR (Light Detection and Ranging). LiDAR is precise and naturally covers 360 degrees. However, it is expensive, bulky, and the resulting point clouds become sparse at a distance. As a result, researchers have turned to stereo vision: using synchronized cameras to estimate depth, much like human eyes do. ...

2024-11 · 8 min · 1632 words
[H-MoRe: Learning Human-centric Motion Representation for Action Analysis 🔗](https://arxiv.org/abs/2504.10676)

Beyond Optical Flow: How H-MoRe Revolutionizes Human Motion Analysis

Introduction In the world of computer vision, understanding human movement is a cornerstone task. Whether it’s for healthcare rehabilitation systems, security surveillance, or generating realistic video animations, the computer needs to know not just where a person is, but how they are moving. For years, researchers have relied on two primary tools: Optical Flow (tracking every pixel’s movement) and Pose Estimation (tracking skeleton joints). While both are useful, they have significant flaws. Optical flow is noisy—it tracks blowing leaves and passing cars just as attentively as the human subject. Pose estimation is precise but overly abstract—it reduces a complex human body to a stick figure, losing crucial shape information. ...

2025-04 · 7 min · 1401 words
[GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Han_GroundingFace_Fine-grained_Face_Understanding_via_Pixel_Grounding_Multimodal_Large_Language_CVPR_2025_paper.pdf)

Beyond the Selfie: How GroundingFace Teaches AI to See Micro-Expressions and Makeup

In the rapidly evolving world of Computer Vision, Multimodal Large Language Models (MLLMs) have achieved something formerly thought impossible: they can look at an image and describe it with near-human fluency. Models like GPT-4V or LLaVA can identify a person in a photo, tell you they are smiling, and perhaps describe their clothing. However, if you ask these general models to identify specific, fine-grained details—like the precise location of “crow’s feet” wrinkles, the style of eyeliner applied, or the exact boundaries of a skin blemish—they often fail. They lack fine-grained grounding, the ability to link specific textual concepts to precise pixels on a high-resolution face. ...

8 min · 1495 words
[Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Dampfhoffer_Graph_Neural_Network_Combining_Event_Stream_and_Periodic_Aggregation_for_CVPR_2025_paper.pdf)

Breaking the Speed Limit: How Hybrid Event Graphs Enable Microsecond Optical Flow

Introduction Imagine driving a car at high speed. You rely on your eyes to detect motion instantly. Now, imagine if your brain only processed visual information in snapshots taken every few milliseconds. In that brief blind spot between snapshots, a sudden obstacle could appear, and you wouldn’t react in time. This is the fundamental limitation of traditional, frame-based computer vision. Standard cameras capture the world as a series of still images. To calculate motion—specifically optical flow—algorithms compare one frame to the next. This introduces latency. You cannot detect motion until the next frame is captured and processed. For high-speed robotics, autonomous drones, or safety-critical systems, this delay (often tens of milliseconds) is an eternity. ...

9 min · 1721 words
[Gradient-Guided Annealing for Domain Generalization 🔗](https://arxiv.org/abs/2502.20162)

Aligning the Compass: How Gradient-Guided Annealing Solves the Domain Generalization Puzzle

Introduction Imagine you are training a robot to recognize cows. You show it thousands of pictures of cows standing in grassy fields. The robot gets a perfect score during training. Then, you take the robot to a snowy mountain range, show it a cow, and it stares blankly, identifying the object as a “rock.” Why did it fail? Because the robot didn’t learn what a cow looks like; it learned that “green background = cow.” When the green background disappeared (replaced by white snow), the model’s confidence collapsed. ...

2025-02 · 9 min · 1851 words
[Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion 🔗](https://arxiv.org/abs/2412.00505)

Perception Over Pixels: Solving the Image Compression Trilemma with Wasserstein Distortion

There is an old project management adage that applies frustratingly well to engineering: “Good, Fast, Cheap. Pick two.” In the world of image compression, this “impossible trinity” dictates the limits of our technology. You can have high visual fidelity (Good) and low file size (Cheap), but it will likely require computationally expensive, slow AI models to decode. Conversely, you can have a codec that is lightning fast and produces tiny files (like standard JPEGs at low quality), but the result will look blocky, blurry, and distinctly “digital.” ...

2024-12 · 10 min · 1951 words
[Goku: Flow Based Video Generative Foundation Models 🔗](https://arxiv.org/abs/2502.04896)

Inside Goku: How Rectified Flow and Joint Training are Revolutionizing Video Generation

Introduction The race for generative video dominance has been one of the most exciting developments in artificial intelligence over the last few years. While diffusion models have become the standard for generating stunning static images, applying them to video—with its added dimension of time—has introduced massive computational bottlenecks and stability issues. Most current video models treat video generation as an extension of image generation, often patching together spatial and temporal attention modules. However, a new contender has emerged from researchers at The University of Hong Kong and ByteDance. Named Goku, this family of models proposes a unified, industry-grade solution that handles both images and videos within a single framework. ...

2025-02 · 8 min · 1571 words
[Glossy Object Reconstruction with Cost-effective Polarized Acquisition 🔗](https://arxiv.org/abs/2504.07025)

3D Scanning Shiny Objects on a Budget: How Polarization and AI Solve the Specular Problem

Introduction If you have ever tried to perform 3D reconstruction using photogrammetry, you have likely encountered the “glossy object” nightmare. You take a series of photos of a ceramic vase or a metallic toy, feed them into your software, and the result is a melted, noisy blob. Why does this happen? Most standard 3D reconstruction algorithms assume that the world is Lambertian. In simple terms, they assume that a point on an object has the same color regardless of the angle from which you view it. But glossy and specular (mirror-like) surfaces break this rule. As you move your camera, the reflection of the light source moves across the surface. To the algorithm, this moving highlight looks like the geometry itself is shifting or disappearing, leading to catastrophic failure in the 3D mesh. ...
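To see why the Lambertian assumption matters, here is a small numpy sketch (illustrative background, not the paper's polarized-acquisition method) comparing a purely diffuse shading term with a Blinn-Phong specular term; the light direction, albedo, and shininess values are arbitrary assumptions.

```python
# Why glossy surfaces break photogrammetry: diffuse (Lambertian) shading ignores
# the viewing direction, while a Blinn-Phong specular highlight shifts as the
# camera moves around the object.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

n = normalize(np.array([0.0, 0.0, 1.0]))   # surface normal
l = normalize(np.array([0.3, 0.2, 1.0]))   # fixed light direction (assumed)

def shade(view_dir, albedo=0.7, ks=0.5, shininess=64):
    v = normalize(view_dir)
    h = normalize(l + v)                                  # Blinn-Phong half vector
    diffuse = albedo * max(0.0, float(n @ l))             # view-independent term
    specular = ks * max(0.0, float(n @ h)) ** shininess   # view-dependent highlight
    return diffuse, diffuse + specular

for cam in ([0.0, 0.0, 1.0], [0.5, 0.0, 1.0], [1.0, 0.0, 0.5]):
    d, g = shade(np.array(cam))
    print(f"camera {cam}: diffuse={d:.3f}  glossy={g:.3f}")

# The diffuse value is identical for every camera pose, which is what standard
# reconstruction assumes; the glossy value changes with viewpoint, which is
# exactly what makes the algorithm think the geometry itself is moving.
```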

2025-04 · 8 min · 1671 words
[GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities 🔗](https://arxiv.org/abs/2412.04244)

GigaHands: Bridging the Gap in AI Hand Understanding with Massive Scale Data

Introduction The human hand is an evolutionary masterpiece. Whether you are threading a needle, typing a blog post, or kneading dough, your hands perform a complex symphony of movements. For Artificial Intelligence and robotics, however, replicating this dexterity is one of the “grand challenges” of the field. We have seen Large Language Models (LLMs) revolutionize text by training on trillions of words. We have seen vision models master image recognition by looking at billions of pictures. But when it comes to bimanual (two-handed) object interaction, we hit a wall. The data just isn’t there. ...

2024-12 · 9 min · 1706 words
[Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis 🔗](https://arxiv.org/abs/2412.02168)

Beyond Prompts: Teaching AI the Physics of Photography

Introduction We are living in the golden age of AI image generation. Tools like Stable Diffusion and FLUX allow us to conjure detailed worlds from a single sentence. Yet, for all their magic, these models often fail at a task that is fundamental to professional photography: understanding the physical camera. Imagine you are a photographer. You take a photo of a mountain trail with a 24mm wide-angle lens. Then, without moving your feet, you switch to a 70mm zoom lens. What happens? The perspective compresses, the field of view narrows, but the scene—the specific rocks, the shape of the trees, the lighting—remains exactly the same. ...
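The lens behavior described here follows from simple pinhole geometry. The sketch below (not from the paper) computes the horizontal field of view for the 24 mm and 70 mm lenses mentioned in the excerpt, assuming a full-frame 36 mm sensor width.

```python
# Pinhole-camera field of view for the 24 mm vs. 70 mm example above.
# The 36 mm sensor width is an assumption (full-frame), chosen for illustration.
import math

def horizontal_fov_deg(focal_length_mm, sensor_width_mm=36.0):
    """fov = 2 * atan(sensor_width / (2 * focal_length))"""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))

for f in (24.0, 70.0):
    print(f"{f:.0f} mm lens -> {horizontal_fov_deg(f):.1f} deg horizontal FOV")

# 24 mm -> ~73.7 deg, 70 mm -> ~28.8 deg: the view narrows and perspective
# compresses, but the scene content itself should stay fixed -- the consistency
# a camera-controllable generator has to learn.
```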

2024-12 · 9 min · 1713 words
[Generative Omnimatte: Learning to Decompose Video into Layers 🔗](https://arxiv.org/abs/2411.16683)

Unlayering Reality: How Generative Omnimatte Decomposes Video with Diffusion Models

Introduction Video editing is fundamentally different from image editing for one frustrating reason: pixels in a video are flat. When you watch a movie, you see actors, shadows, and backgrounds, but the computer just sees a grid of changing colors. If you want to remove a person from a scene, you can’t just click “delete.” You have to fill in the background behind them, frame by frame. If you want to move a car slightly to the left, you have to hallucinate what the road looked like underneath it. ...

2024-11 · 9 min · 1764 words
[Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation 🔗](https://arxiv.org/abs/2412.15211)

How to Turn Inconsistent Photos into Perfect 3D Models: A Generative Relighting Approach

Introduction One of the most persistent challenges in 3D computer vision is the assumption of a static world. Traditional 3D reconstruction techniques, such as Photogrammetry or Neural Radiance Fields (NeRFs), generally assume that while the camera moves, the scene itself remains frozen. But in the real world, this is rarely true. If you are scraping a collection of photos of a famous landmark from the internet, those photos were taken at different times of day, under different weather conditions, and with different cameras. Even in a casual capture session where you walk around an object, the sun might go behind a cloud, or your own shadow might fall across the subject. ...

2024-12 · 8 min · 1655 words
[Generative Modeling of Class Probability for Multi-Modal Representation Learning 🔗](https://arxiv.org/abs/2503.17417)

Keeping it CALM - Bridging the Video-Text Gap with Class Anchors and Generative Modeling

Introduction In the evolving world of Artificial Intelligence, one of the most fascinating challenges is teaching machines to understand the world through multiple senses simultaneously—specifically, sight and language. This is the domain of Multi-Modal Representation Learning. We want models that can watch a video and understand a textual description of it, or vice versa. Current state-of-the-art methods often rely on Contrastive Learning (like the famous CLIP model). These models work by pulling the representations of a matching video and text pair closer together in a mathematical space while pushing non-matching pairs apart. While effective, this approach has a flaw: it assumes a rigid one-to-one mapping between a video and a sentence. ...
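For readers unfamiliar with the pull/push mechanics mentioned here, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss (illustrative background, not the paper's CALM objective); the embedding dimension, batch size, and temperature are assumptions.

```python
# CLIP-style contrastive objective: matching video/text pairs sit on the
# diagonal of the similarity matrix and are pulled together; all off-diagonal
# (non-matching) pairs are pushed apart.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product is a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(len(v))              # video i matches caption i
    # Symmetric cross-entropy over the video->text and text->video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings standing in for real video/text encoders.
video = torch.randn(8, 512)
text = torch.randn(8, 512)
print(clip_style_contrastive_loss(video, text).item())
```

The rigid one-to-one assumption the excerpt criticizes is visible in `targets`: the loss treats exactly one caption as correct for each video and every other caption as equally wrong.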

2025-03 · 8 min · 1675 words
[Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction 🔗](https://arxiv.org/abs/2412.06234)

Generative Densification: How to Add Detail to Feed-Forward 3D Models

If you have been following the explosion of 3D computer vision lately, you are likely familiar with 3D Gaussian Splatting (3D-GS). It has revolutionized the field by representing scenes as clouds of 3D Gaussians (ellipsoids), allowing for real-time rendering and high-quality reconstruction. However, there is a divide in how these models are used. On one side, we have per-scene optimization, where a model spends minutes or hours learning a single specific room or object. This produces incredible detail because the model can iteratively add more Gaussians (densification) where needed. ...

2024-12 · 8 min · 1592 words