Papers

[X-Dyna: Expressive Dynamic Human Image Animation 🔗](https://arxiv.org/abs/2501.10021)

Breathing Life into Pixels: Deep Dive into X-Dyna's Dynamic Human Animation

The dream of “Harry Potter”-style moving photographs has been a driving force in computer vision for decades. We want to take a single static photo of a person and animate it using a driving video—making the subject dance, speak, or walk while preserving their identity. While recent advances in diffusion models have made this possible, there is a lingering “uncanny valley” effect in current state-of-the-art methods. You might see a person dancing perfectly, but their hair behaves like a solid helmet, their dress moves like rigid cardboard, and the background remains frozen in time. The person moves, but the dynamics—the physics of wind, gravity, and momentum—are missing. ...

[World-consistent Video Diffusion with Explicit 3D Modeling 🔗](https://arxiv.org/abs/2412.01821)

Beyond RGB: How WVD Brings Explicit 3D Consistency to Video Diffusion

The recent explosion in generative AI has given us models capable of dreaming up distinctive images and surreal videos from simple text prompts. We have seen tremendous progress with diffusion models, which have evolved from generating static portraits to synthesizing dynamic short films. However, if you look closely at AI-generated videos, you will often notice a subtle, nagging problem: the world doesn’t always stay “solid.” Objects might slightly warp as the camera moves; the geometry of a room might shift impossibly; or the background might hallucinate new details that contradict previous frames. This happens because most video diffusion models are learning pixel consistency over time, but they don’t inherently understand the 3D structure of the world they are rendering. They are excellent 2D artists, but poor 3D architects. ...

[WonderWorld: Interactive 3D Scene Generation from a Single Image 🔗](https://arxiv.org/abs/2406.09394)

Building Infinite 3D Worlds in Seconds: A Deep Dive into WonderWorld

Imagine you are playing a video game or designing a virtual environment. You snap a photo of a street corner, and you want that photo to instantly expand into a fully navigable, infinite 3D world. You want to walk down that street, turn the corner, and see new buildings, parks, and skies generated in real-time, exactly as you imagine them. For years, this has been the “holy grail” of computer vision and graphics. While we have seen massive leaps in generative AI (like Midjourney for 2D images) and 3D reconstruction (like NeRFs and Gaussian Splatting), combining them into a fast, interactive experience has remained elusive. Current methods are typically “offline”—meaning you provide an image, wait 30 minutes to an hour for a server to process it, and get back a static 3D scene. ...

[Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos 🔗](https://arxiv.org/abs/2411.08753)

How to Teach AI to Direct Movies: Using Language to Find the Best Camera Angle

Introduction

Imagine you are trying to learn how to repair a bicycle wheel or perfect a basketball jump shot. You find a video tutorial online, but it’s not just a standard video: it’s a multi-view experience recorded by five different cameras. One camera is strapped to the instructor’s head (egocentric), while four others are placed on tripods around the room (exocentric). This rich data is fantastic for capturing every detail, but it presents a massive cognitive load. As a viewer, you cannot watch five screens simultaneously. You need a director, someone (or something) to switch to the “best” view at every moment. When the mechanic is tightening a spoke, you want the close-up of their hands. When the basketball player is driving to the hoop, you want the wide angle showing the court. ...

[WISH: Weakly Supervised Instance Segmentation using Heterogeneous Labels 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Kweon_WISH_Weakly_Supervised_Instance_Segmentation_using_Heterogeneous_Labels_CVPR_2025_paper.pdf)

WISH: Unifying Weak Supervision for Instance Segmentation with the Segment Anything Model

Introduction

In the world of computer vision, data is the new oil, but refining that oil (specifically, annotating images) is incredibly expensive. This is particularly true for Instance Segmentation, the task of identifying and outlining every distinct object in an image at the pixel level. Unlike simple bounding boxes or image tags, creating a precise mask for every pedestrian, car, or cup in a dataset requires significant human effort and time. To solve this, researchers have turned to Weakly Supervised Instance Segmentation (WSIS). The goal of WSIS is to train models that can predict pixel-perfect masks while only using “cheap” labels during training. These cheap labels typically fall into three categories: ...

[Volumetrically Consistent 3D Gaussian Rasterization 🔗](https://arxiv.org/abs/2412.03378)

Fixing the Physics of Gaussian Splatting: A Volumetrically Consistent Approach

Introduction

In the fast-moving world of neural rendering, we are often forced to choose between two paths: physical accuracy or rendering speed. On one side, we have ray-tracing methods like NeRF (Neural Radiance Fields). They meticulously simulate light passing through a volume, integrating density along rays. They are physically grounded and produce stunningly realistic images, but they can be painfully slow to train and render. On the other side, we have the recent superstar: 3D Gaussian Splatting (3DGS). It is blazingly fast because it treats the scene as a collection of 3D ellipsoids and “splats” them onto the screen using rasterization. However, this speed comes with a hidden cost. To make the math work for real-time rasterization, 3DGS makes several mathematical approximations. It essentially flattens 3D shapes into 2D stains on your screen, breaking the laws of volume rendering physics. ...
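For reference, “integrating density along rays” is usually written as the standard emission-absorption volume rendering integral. The notation here is an assumption chosen for illustration and is not taken from the paper: a ray r(t) with direction d, density σ, color c, and near and far bounds t_n and t_f.

```latex
\[
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
\]
```

Rasterization pipelines typically replace this integral with front-to-back alpha compositing over depth-sorted primitives, C = Σ_i c_i α_i Π_{j<i} (1 − α_j), which is where approximations can enter.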

[Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution 🔗](https://arxiv.org/abs/2503.02261)

How "Volume Tells" Solves the 3D Microscopy Dilemma: De-noising and Super-Resolution without Ground Truth

Introduction

In the world of cell biology, seeing is understanding. 3D fluorescence confocal (FC) microscopy has become an indispensable tool for scientists, allowing them to peer inside living organisms and visualize the complex, volumetric dance of life at a cellular level. From studying how embryos develop to understanding neural connections, the ability to capture 3D data is revolutionary. However, this technology comes with a frustrating trade-off. To keep cells alive during long-term observation, scientists must keep the laser power low. Low power means less signal, which inevitably leads to noisy, grainy images. Furthermore, the physics of microscopes creates a specific problem known as anisotropic resolution. While the image might look sharp in the lateral plane (XY), the resolution along the depth axis (Z) is often terrible, sometimes 4.5 times worse. This results in 3D volumes that look like stacks of pancakes rather than continuous, solid objects. ...

[Visual Representation Learning through Causal Intervention for Controllable Image Editing 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Huang_Visual_Representation_Learning_through_Causal_Intervention_for_Controllable_Image_Editing_CVPR_2025_paper.pdf)

When Diffusion Meets Causality: Fixing Spurious Correlations in AI Image Editing

Introduction

Imagine you are using a generative AI tool to edit a photo of a young person. You adjust the “Age” slider to make them look older. The model successfully adds wrinkles and greys the hair, but strangely, it also puts a pair of glasses on the person’s face. You didn’t ask for glasses. You try again, and the same thing happens. Why does this happen? The answer lies in spurious correlations. In many training datasets (like CelebA), older people are statistically more likely to wear glasses than younger people. A standard deep learning model doesn’t understand biology or optics; it simply memorizes patterns. It learns that “Old” and “Glasses” often go together, so when you ask for one, it gives you both. ...
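To make the idea of a spurious correlation concrete, here is a minimal sketch that compares how often glasses appear among “old” and “young” records. The records are a synthetic toy table invented for this illustration (they are not CelebA statistics), and `glasses_rate` is a hypothetical helper name.

```python
# Synthetic toy records, invented for illustration only (NOT CelebA statistics).
# Each record is (is_old, wears_glasses).
records = (
    [(True, True)] * 6      # old, glasses
    + [(True, False)] * 4   # old, no glasses
    + [(False, True)] * 1   # young, glasses
    + [(False, False)] * 9  # young, no glasses
)


def glasses_rate(rows, is_old):
    """Fraction of rows in the given age group that wear glasses."""
    group = [glasses for old, glasses in rows if old == is_old]
    return sum(group) / len(group)


p_old = glasses_rate(records, True)     # 6 of 10 old records
p_young = glasses_rate(records, False)  # 1 of 10 young records
print(f"P(glasses | old)   = {p_old:.2f}")
print(f"P(glasses | young) = {p_young:.2f}")
```

A model trained only to fit such co-occurrence statistics has no reason to separate the two attributes, which is the failure the excerpt describes.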

[VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step 🔗](https://arxiv.org/abs/2504.01956)

From Flat Photos to 3D Worlds in a Blink: Understanding VideoScene

Imagine taking two casual photos of a room—perhaps one of the desk and one of the bookshelf—and instantly generating a fully navigable 3D video of the entire space. No expensive scanning equipment, no hours of processing time, and no “hallucinated” geometry where walls warp into furniture. This is the “Holy Grail” of computer vision: Sparse-view 3D reconstruction. While recent advancements in AI video generators (like Sora) are impressive, they struggle with this specific task. They often lack 3D consistency—meaning as the camera moves, the shape of the room might subtly morph. Furthermore, they are slow, requiring dozens of denoising steps to produce a single second of video. ...

[Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis 🔗](https://arxiv.org/abs/2405.21075)

Beyond Static Images: Evaluating How MLLMs Understand Long-Form Video with Video-MME

Introduction

In the race toward Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have taken center stage. We have seen models like GPT-4V and Gemini demonstrate incredible proficiency in understanding static images: describing complex scenes, reading handwriting, and even explaining memes. However, the real world is not a series of frozen snapshots; it is a dynamic, continuous flow of visual, auditory, and textual information. To truly approximate human-level perception, AI must master video analysis. But here lies a significant gap: while MLLM development has surged, the benchmarks used to test them have lagged behind. Most existing video benchmarks focus on short clips (often just a few seconds long) or lack the diverse data modalities (like subtitles and audio) that make video such a rich medium. ...

[Video Depth Anything: Consistent Depth Estimation for Super-Long Videos 🔗](https://arxiv.org/abs/2501.12375)

Solving the Flicker: How Video Depth Anything Masters Long-Form Depth Estimation

Introduction

In the world of computer vision, estimating depth from a single image (determining how far away every pixel is) has seen revolutionary progress. Models like Depth Anything V2 can look at a flat photograph and intuitively understand the 3D geometry of the scene with remarkable accuracy. However, a massive gap remains between understanding a static image and understanding a video. If you simply run a standard image depth model on a video, frame by frame, you encounter a phenomenon known as “flickering.” Because the model processes each frame in isolation, slight changes in lighting or camera angle cause the predicted depth to jump erratically. The result is a jittery, inconsistent mess that is unusable for robotics, augmented reality, or video editing. ...
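The flicker described above can be sketched with a toy experiment. It assumes a static scene and models per-frame inconsistency as a random rescaling of the true depth in each frame. That error model, along with the names `simulate_per_frame_depth` and `flicker_score`, is a hypothetical choice for illustration, not the paper's method or metric.

```python
import random


def simulate_per_frame_depth(true_depth, num_frames, scale_jitter, seed=0):
    """Predict each frame independently, with a random per-frame scale error."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        scale = 1.0 + rng.uniform(-scale_jitter, scale_jitter)
        frames.append([d * scale for d in true_depth])
    return frames


def flicker_score(depth_frames):
    """Mean absolute per-pixel depth change between consecutive frames."""
    total, count = 0.0, 0
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        for p, c in zip(prev, curr):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0


# A static scene: the true depth of four pixels never changes.
true_depth = [1.0, 2.0, 4.0, 8.0]
stable = simulate_per_frame_depth(true_depth, num_frames=10, scale_jitter=0.0)
jittery = simulate_per_frame_depth(true_depth, num_frames=10, scale_jitter=0.2)

print("consistent predictions:", flicker_score(stable))
print("independent per-frame predictions:", flicker_score(jittery))
```

Even though nothing in the scene moves, the independently predicted frames show a non-zero frame-to-frame change, which is what a viewer perceives as flicker.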

[VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models 🔗](https://arxiv.org/abs/2411.17451)

Can AI Judge AI? Inside VL-RewardBench and the Quest for Reliable Vision-Language Evaluators

Introduction

In the rapidly evolving world of Artificial Intelligence, we have reached a fascinating recursive milestone: we are increasingly relying on AI models to evaluate other AI models. As Large Vision-Language Models (LVLMs) like GPT-4o and Claude 3.5 Sonnet become more capable, human evaluation becomes prohibitively expensive and slow. To solve this, researchers use “Generative Reward Models” (GenRMs), essentially using a powerful LVLM as a judge to rank responses, provide feedback, and guide the training of newer models through Reinforcement Learning from Human Feedback (RLHF). ...

[VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge 🔗](https://arxiv.org/abs/2411.12915)

Can AI Doctors Use Tools? How VILA-M3 Beats Trillion-Parameter Models by Consulting Experts

Introduction: The Paradox of the Generalist

In the rapid evolution of Artificial Intelligence, we have seen the rise of massive “Generalist” Vision-Language Models (VLMs) like GPT-4o and Gemini. These models are incredibly impressive: they can write poetry, analyze charts, and even joke about a photograph. However, when it comes to high-stakes fields like healthcare, being a “jack of all trades” often means being a master of none. A generalist VLM might look at a chest X-ray and correctly identify the lungs, but fail to notice a subtle fracture or a developing tumor that a trained radiologist would spot instantly. Why? Because these models rely on memorized internet knowledge rather than deep, domain-specific visual expertise. They are prone to hallucinations, confidently stating medical facts that are simply wrong. ...

[VEU-Bench: Towards Comprehensive Understanding of Video Editing 🔗](https://arxiv.org/abs/2504.17828)

Can AI Speak the Language of Film? A Deep Dive into VEU-Bench and the Oscars Model

When we watch a movie, we don’t just see a sequence of images; we see a story told through a specific language. A “low-angle shot” makes a character look powerful. A “smash cut” creates sudden shock. A “match cut” draws a thematic connection between two different times or places. As humans, we intuitively understand this visual grammar. However, for Video Large Language Models (Vid-LLMs)—the AI systems designed to understand video content—this “grammar” of film has largely been a foreign language. While modern AI has become exceptionally good at identifying what is happening in a video (e.g., “a man is running”), it has historically struggled to understand how the video is constructed (e.g., “the camera is tracking the man with a handheld shake to imply urgency”). ...

[Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach 🔗](https://arxiv.org/abs/2405.02700)

Beyond the Score: How FINC Scalably Reveals What Generative Models Actually Create

Introduction

In the rapidly evolving landscape of Artificial Intelligence, generative models, from GANs to Diffusion models, have become incredibly adept at creating realistic images. When researchers release a new model, they typically attach a scorecard: a Fréchet Inception Distance (FID) or an Inception Score (IS). These metrics provide a single number indicating how “good” the generated images are compared to a reference dataset. But a single number hides a multitude of sins. Two models might have the same FID score but fail in completely different ways. One might refuse to generate cats; the other might generate only cats. One might memorize the training data; the other might hallucinate wildly. Quantitative scores cannot answer the qualitative question: “What specifically is this model generating differently compared to the reference data?” ...
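For context, this is how FID collapses a whole image distribution into one number. It fits a Gaussian to the Inception features of the reference images (mean μ_r, covariance Σ_r) and of the generated images (mean μ_g, covariance Σ_g), then takes the Fréchet distance between the two Gaussians. The subscripts r and g are notation chosen for this note.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Because only the first two moments of the features enter the formula, two generators with very different failure modes can end up with similar scores.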

[Unlocking Generalization Power in LiDAR Point Cloud Registration 🔗](https://arxiv.org/abs/2503.10149)

Why Less is More - Removing Cross-Attention to Solve LiDAR Generalization

In the rapidly evolving world of autonomous driving and robotics, sensors are the eyes of the machine. LiDAR (Light Detection and Ranging) stands out as a critical sensor, providing precise 3D maps of the environment. However, raw 3D points are just the starting point. To make sense of the world, a vehicle must “register” these point clouds—stitching together scans taken at different times or from different locations to calculate its own movement and map its surroundings. ...
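One common way to state the registration problem, assuming point correspondences (p_i, q_i) between the two scans are already known, is as a least-squares fit of a rotation R and translation t. This is a generic textbook formulation given for orientation and is not necessarily the objective used in the paper.

```latex
\[
(R^{*}, t^{*}) = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^{3}}
  \sum_{i=1}^{N} \lVert R\,p_i + t - q_i \rVert_2^2
\]
```

In practice the correspondences are unknown, and estimating them reliably is where learned features come in.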

[Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation 🔗](https://arxiv.org/abs/2412.01027)

InstaManip: Teaching AI to Edit Images by Example Using Group Self-Attention

Introduction: The Limits of Language in Image Editing

We are currently living through a golden age of text-to-image generation. Models like Midjourney, DALL-E, and Stable Diffusion have made it incredibly easy to conjure detailed worlds from a simple sentence. However, a significant gap remains between generating an image from scratch and editing an existing one precisely. Consider a specific scenario: You have a photo of a standard sedan, and you want to transform it into a Lamborghini. You type the instruction: “Make it a Lamborghini.” ...

[Universal Scene Graph Generation 🔗](https://arxiv.org/abs/2503.15005)

One Graph to Rule Them All - Unifying Vision, Text, and 3D with Universal Scene Graphs

Introduction

Imagine you are a robot walking into a room. You see a man sitting on a sofa. You hear someone say, “Peter is relaxing.” Your depth sensors tell you the sofa is against a wall. As humans, we process all this information seamlessly. We don’t create a separate mental model for what we see, another for what we hear, and a third for spatial depth. We integrate them into a single understanding of the scene: Peter is on the sofa against the wall. ...

[Unified Reconstruction of Static and Dynamic Scenes from Events 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Gao_Unified_Reconstruction_of_Static_and_Dynamic_Scenes_from_Events_CVPR_2025_paper.pdf)

Seeing the Unseen - How URSEE Reconstructs Static Worlds from Dynamic Event Cameras

Introduction

Imagine a camera that works like the human eye. It doesn’t take snapshots frame-by-frame; instead, it only reacts when something changes. If you stare at a perfectly still wall, your optic nerve stops firing signals about the wall (though your eyes make tiny, imperceptible movements to prevent this blindness). This is the principle behind Event Cameras (or Dynamic Vision Sensors). They are revolutionary pieces of technology that capture brightness changes asynchronously with microsecond precision. They excel at capturing high-speed motion (think catching a bullet in flight or a drone dodging obstacles) without the motion blur or low dynamic range of standard cameras. ...

[UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior 🔗](https://arxiv.org/abs/2501.13134)

Bridging the Gap: How UniRestore Unifies Human Vision and AI Perception

Imagine you are driving an autonomous vehicle through a thick, heavy fog. For you, the driver, the goal is Perceptual Image Restoration (PIR). You want the fog cleared from your vision so you can see the scenery, the road texture, and the world in high fidelity. You care about aesthetics and clarity. For the car’s computer, however, the goal is Task-Oriented Image Restoration (TIR). The AI doesn’t care if the trees look pretty; it cares about edge detection, object classification, and semantic segmentation. It needs to know exactly where the pedestrian is and where the lane marker ends. ...