[Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision 🔗](https://arxiv.org/abs/2506.03605)

Teaching Robots to Move: Mining 3D Trajectories from First-Person Video

Imagine asking a robot to “pick up the knife on the counter.” To a human, this is trivial. To a robot, it requires a complex understanding of 3D space, object affordance (where to grab), and the specific motion trajectory required to execute the action safely. For years, the gold standard for teaching robots these skills has been Imitation Learning—showing the robot examples of humans performing the task. However, this method has a massive bottleneck: data scarcity. Collecting high-quality 3D data usually requires expensive motion capture (MoCap) labs, instrumented gloves, and tedious setups. We simply cannot scale this up to cover every object and action in the real world. ...

2025-06 · 8 min · 1558 words
[GenVDM: Generating Vector Displacement Maps From a Single Image 🔗](https://arxiv.org/abs/2503.00605)

Revolutionizing 3D Detailing: How GenVDM Turns Flat Images into Geometric Stamps

If you have ever tried your hand at 3D sculpting—creating digital characters, monsters, or environments—you know the pain of detailing. Sculpting the basic silhouette of a dragon is one thing; sculpting every individual scale, horn, and skin pore is an entirely different battle. To solve this, professional artists don’t sculpt every detail from scratch. They use “stamps,” known technically as Vector Displacement Maps (VDMs). These are powerful tools that allow an artist to take a complex shape (like a nose, an ear, or a set of scales) and “stamp” it onto a base mesh instantly. ...

2025-03 · 10 min · 1960 words
[Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders 🔗](https://arxiv.org/abs/2412.09586)

Less is More: Solving Gaze Estimation with Frozen Foundation Models

Imagine walking into a crowded room. Almost instantly, you can tell who is talking to whom, who is looking at the clock waiting to leave, and who is staring at the delicious cake on the table. This ability—gaze following—is a fundamental building block of human social interaction. It allows us to infer intent, attention, and social dynamics in a split second. For computers, however, this task is surprisingly difficult. To understand where a person is looking, a model must grasp two distinct things: the person’s physical orientation (head pose, eye position) and the semantic context of the scene (where are the objects? how far away are they?). ...

2024-12 · 9 min · 1741 words
[GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2503.19458)

GaussianUDF: Bridging the Gap Between 3D Gaussians and Open Surface Reconstruction

In the world of 3D computer vision, reconstructing digital objects from 2D images is a fundamental quest. We want to take a few photos of an object—a T-shirt, a flower, a complex statue—and turn it into a perfect 3D model. For years, this field has been dominated by methods that assume objects are “watertight,” meaning they are closed volumes with a clearly defined inside and outside. Think of a sphere or a cube; you are either inside it or outside it. ...

2025-03 · 9 min · 1741 words
[Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding 🔗](https://arxiv.org/abs/2503.18578)

Beyond the Flat Universe: How Galaxy Walker Brings Geometric Awareness to AI Astronomy

When we look at a photograph on a screen, we are looking at a flat, 2D representation of reality. For decades, computer vision models have operated on this same premise. They treat images as flat grids of pixels and process features in Euclidean (flat) vector spaces. But the universe is not flat. From the spherical orbits of planets to the hyperbolic expansion of the cosmos and the warping of spacetime around black holes, the universe is defined by complex, non-Euclidean geometries. When we force astronomical data into standard, flat-space Vision-Language Models (VLMs), we lose critical structural information. ...

2025-03 · 9 min · 1806 words
[GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control 🔗](https://arxiv.org/abs/2503.03751)

How GEN3C Brings 3D Consistency and Precise Camera Control to Video Generation

We are currently witnessing a golden age of generative video. Models like Sora, Runway, and Stable Video Diffusion can hallucinate breathtaking scenes from a simple text prompt or a single image. However, if you look closely, cracks begin to appear—specifically when the camera starts moving. Imagine generating a video of a room. As the camera pans left, new objects appear. If you pan back to the right, do those original objects reappear exactly as they were? Often, they don’t. The vase on the table might change color, or a window might vanish entirely. Furthermore, trying to tell a video model to “move the camera 2 meters forward and pan 30 degrees right” is notoriously difficult. Most models treat camera parameters as abstract numbers, struggling to translate them into geometrically accurate pixel shifts. ...

2025-03 · 9 min · 1807 words
[Functionality understanding and segmentation in 3D scenes 🔗](https://arxiv.org/abs/2411.16310)

Fun3DU: How AI Finds the 'Needle in a Haystack' of 3D Scenes

Imagine you are a robot in a kitchen. You receive a simple command: “Turn on the microwave.” For a human, this would be trivial: look at the microwave, spot the “Start” button, and press it. But for an Artificial Intelligence, this is a monumental challenge. First, the AI must understand that “turn on” implies interacting with a specific button. Second, it must visually locate that tiny button within a complex 3D environment filled with other objects, shadows, and occlusions. Standard computer vision models are great at finding the microwave (the whole object), but they often fail spectacularly at finding the specific functional part (the button) needed to complete a task. ...

2024-11 · 9 min · 1710 words
[Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers 🔗](https://arxiv.org/abs/2503.03307)

Unlocking 6-DoF Motion: How Event Cameras Can See Rotation and Translation Without an IMU

Imagine trying to navigate a drone through a dense forest at high speed. A standard camera takes snapshots—click, click, click. If you move too fast between clicks, the world blurs, or you miss obstacles entirely. Enter the Event Camera. Instead of taking snapshots, it mimics the biological eye. It has pixels that work independently, firing a signal (an “event”) the instant they detect a change in brightness. This results in a continuous stream of data with microsecond latency, zero motion blur, and high dynamic range. ...

2025-03 · 8 min · 1679 words
[From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing 🔗](https://arxiv.org/abs/2411.11916)

Beyond Pixels: Using Multi-Agent AI to Generate and Edit Structured Diagrams

“Drawing is not what one sees but what one can make others see.” — Edgar Degas. In the world of scientific research, software engineering, and education, a picture is worth a thousand words—but only if that picture is accurate. While we have witnessed a revolution in generative AI with tools like Midjourney or DALL-E, there remains a glaring gap in the capability of these models: structured, logical diagrams. Ask a standard image generator to create a “neural network architecture with three layers,” and you will likely get a beautiful, artistic hallucination. The connections might go nowhere, the text will be illegible gibberish, and the logical flow will be nonexistent. On the other hand, asking a coding assistant to “write code for a plot” works for simple bar charts but often fails when the visual requirements become complex or unique, such as specific flowchart logic or intricate mind maps. ...

2024-11 · 9 min · 1788 words
[From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech 🔗](https://arxiv.org/abs/2503.16956)

Reading Lips like a Pro: How Hierarchical Learning Solves Video-to-Speech Synthesis

Imagine watching a video of a person speaking, but the sound is muted. You can see their lips moving, their facial expressions shifting, and their jaw moving. If you were asked to “dub” this video simply by watching it, could you do it? You might guess the words (lip-reading), but could you guess the sound of their voice? The pitch? The emotional inflection? This is the challenge of Video-to-Speech (VTS) synthesis. It’s a fascinating problem in computer vision and audio processing with applications ranging from restoring silent archival films to assisting people with speech disabilities. ...

2025-03 · 7 min · 1354 words
[FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis 🔗](https://arxiv.org/abs/2505.01172)

FreePCA: How Principal Component Analysis Unlocks Long Video Generation

The world of generative AI is moving fast. We’ve gone from blurry images to photorealistic portraits, and now, the frontier is video. Models like Sora and Runway Gen-2 have dazzled the internet, but behind the scenes, researchers face a stubborn hurdle: duration. Most open-source video diffusion models are trained on very short clips—often just 16 frames. When you ask these models to generate a longer video (say, 64 frames or more), they tend to break down. Objects morph bizarrely, the style shifts, or the video dissolves into noise. Training a model specifically for long videos requires massive computational resources and datasets that most labs simply don’t have. ...

2025-05 · 8 min · 1682 words
[FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling 🔗](https://arxiv.org/abs/2411.19942)

Why Your 3D Avatar's Skirt Looks Weird (And How 'FreeCloth' Fixes It)

If you have ever played a modern video game or worked with 3D animation, you might have noticed a peculiar trend: characters in tactical gear, tight superhero suits, or jeans look fantastic. But characters wearing long dresses, flowing skirts, or loose robes? They often look… odd. The fabric might stretch unnaturally between their legs, tear apart when they run, or look like a rigid plastic shell rather than flowing cloth. ...

2024-11 · 10 min · 2080 words
[Free-viewpoint Human Animation with Pose-correlated Reference Selection 🔗](https://arxiv.org/abs/2412.17290)

Breaking the Camera Angle Barrier in AI Video Generation

In the rapidly evolving world of Generative AI, animating human characters has become a frontier of intense research. We have seen impressive results where a single photograph of a person can be brought to life, driven by a video of a dancer or a speaker. Models like AnimateAnyone and MagicAnimate have set the standard for this “reference-based” animation. However, these models share a significant limitation: they are generally bound to the viewpoint of the original reference image. ...

2024-12 · 9 min · 1842 words
[FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation 🔗](https://arxiv.org/abs/2412.02690)

Solved: Why AI Struggles with Hands and How FoundHand Fixes It

If you have ever played with generative AI tools like Midjourney or Stable Diffusion, you have likely encountered the “hand problem.” You prompt for a photorealistic image of a person, and the face looks perfect, the lighting is cinematic, but the hands are a disaster. Extra fingers, impossible joints, or what looks like a bowl of spaghetti made of flesh. ...

2024-12 · 9 min · 1784 words
[ForestLPR: LiDAR Place Recognition in Forests Attending Multiple BEV Density Images 🔗](https://arxiv.org/abs/2503.04475)

Lost in the Woods? How ForestLPR Uses Tree Slices for Robust Robot Localization

Imagine you are hiking in a dense forest. You look around, and all you see are trees—trunks, branches, and leaves that look suspiciously similar to the trees you passed ten minutes ago. Now, imagine coming back to that same spot six months later. The leaves have fallen, the grass has grown, and the lighting is completely different. Could you recognize exactly where you are? This scenario highlights one of the hardest problems in robotics: Place Recognition in natural environments. ...

2025-03 · 8 min · 1656 words
[Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution 🔗](https://arxiv.org/abs/2412.15213)

Rethinking Generative AI: From Gaussian Noise to Direct Cross-Modal Evolution

For the past few years, the world of Generative AI has been dominated by a single, powerful narrative: Diffusion. Whether you are using DALL-E, Midjourney, or Stable Diffusion, the underlying process is conceptually similar. The model starts with a canvas of pure static (Gaussian noise) and, guided by your text prompt, iteratively denoises it until a coherent image emerges. It is a bit like carving a statue out of a block of marble, where the marble is random noise and the chisel is the text prompt. ...

2024-12 · 9 min · 1768 words
[FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute 🔗](https://arxiv.org/abs/2502.20126)

FlexiDiT: Smarter, Not Harder—Dynamic Compute for Diffusion Transformers

The landscape of generative AI has shifted dramatically with the adoption of Diffusion Transformers (DiTs). Models like Stable Diffusion 3 and Sora have demonstrated that replacing the traditional U-Net backbone with a Transformer architecture leads to scalable, high-fidelity results. However, this performance comes at a steep computational cost. Current diffusion models operate on a static paradigm: they allocate a fixed, heavy amount of compute to every single step of the denoising process. Whether the model is resolving the vague outline of a composition or refining the texture of a cat’s fur, it burns the same number of FLOPs (Floating Point Operations). ...

2025-02 · 7 min · 1488 words
[Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality 🔗](https://arxiv.org/abs/2412.16481)

Aligning Geometry with Hardware: How Flash3D Super-Scales Point Cloud Processing

In the rapidly evolving world of 3D deep learning, we are often forced to choose between two virtues: the geometric precision of the model and the computational efficiency of the hardware. Point clouds—the raw data format generated by LiDAR sensors in autonomous vehicles and robotics—are notoriously difficult to process. Unlike images, which are neat, dense grids of pixels, point clouds are sparse and irregular. To make sense of them, neural networks need to understand the spatial relationships between points (geometric locality). However, Graphics Processing Units (GPUs)—the workhorses of modern AI—prefer data that is dense, contiguous, and predictable (hardware locality). ...

2024-12 · 9 min · 1746 words
[FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement 🔗](https://arxiv.org/abs/2503.04919)

Beyond Bounding Boxes: How FirePlace Teaches AI to Arrange 3D Scenes

Imagine asking an AI to “place a book on the shelf.” To a human, this is a trivial task. You identify the shelf, find an empty spot on the flat surface, and place the book there upright or stacked. To a standard Multimodal Large Language Model (MLLM), however, this request is fraught with peril. The AI might understand the concept of a shelf and a book, but it lacks a fundamental understanding of 3D geometry. It might place the book floating six inches above the shelf, intersecting through the wood, or balancing precariously on the very edge. Why? Because most AI models treat objects as rough “bounding boxes”—cubes that encompass an object—rather than complex shapes with specific surfaces. If a shelf is treated as a solid box, you can’t put anything inside it. ...

2025-03 · 8 min · 1662 words
[FineVQ: Fine-Grained User Generated Content Video Quality Assessment 🔗](https://arxiv.org/abs/2412.19238)

Beyond the 5-Star Rating: How FineVQ Revolutionizes Video Quality Assessment with Multimodal AI

In the age of TikTok, YouTube Shorts, and Twitch, User-Generated Content (UGC) has become the dominant form of media consumption. Unlike professionally produced films shot on cinema cameras, UGC is wild and unpredictable. It is shot on smartphones, compressed by apps, streamed over spotty 5G connections, and viewed on varying screen sizes. For video platforms, understanding the quality of this content is a billion-dollar problem. If a recommendation algorithm pushes low-quality videos, users leave. However, traditional Video Quality Assessment (VQA) has a major blind spot: it usually reduces a video’s quality to a single scalar score—a “3.5 out of 5.” ...

2024-12 · 8 min · 1509 words