[Glossy Object Reconstruction with Cost-effective Polarized Acquisition 🔗](https://arxiv.org/abs/2504.07025)

3D Scanning Shiny Objects on a Budget: How Polarization and AI Solve the Specular Problem

Introduction If you have ever tried to perform 3D reconstruction using photogrammetry, you have likely encountered the “glossy object” nightmare. You take a series of photos of a ceramic vase or a metallic toy, feed them into your software, and the result is a melted, noisy blob. Why does this happen? Most standard 3D reconstruction algorithms assume that the world is Lambertian. In simple terms, they assume that a point on an object has the same color regardless of the angle from which you view it. But glossy and specular (mirror-like) surfaces break this rule. As you move your camera, the reflection of the light source moves across the surface. To the algorithm, this moving highlight looks like the geometry itself is shifting or disappearing, leading to catastrophic failure in the 3D mesh. ...
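
As an illustrative aside (not from the paper), the view-dependence problem is easy to see in a few lines of code: a Lambertian shading term depends only on the surface normal and the light, while a Phong-style specular term changes with the camera direction, which is exactly the inconsistency that breaks multi-view matching. The vectors and shininess value below are arbitrary.

```python
# Toy sketch: Lambertian shading is view-independent, a Phong specular lobe is not.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

n = normalize(np.array([0.0, 0.0, 1.0]))        # surface normal
l = normalize(np.array([1.0, 0.0, 1.0]))        # light direction
views = [normalize(np.array([0.3, 0.0, 1.0])),  # camera direction 1
         normalize(np.array([-0.6, 0.0, 1.0]))] # camera direction 2

def lambertian(n, l):
    # Depends only on the normal and the light: same value from every viewpoint.
    return max(np.dot(n, l), 0.0)

def phong_specular(n, l, v, shininess=32):
    # Depends on the viewer: the highlight sits along the mirror direction of the light.
    r = 2.0 * np.dot(n, l) * n - l
    return max(np.dot(r, v), 0.0) ** shininess

for v in views:
    print(f"Lambertian: {lambertian(n, l):.3f}  Specular: {phong_specular(n, l, v):.3f}")
# The Lambertian term is identical for both views; the specular term is not,
# which is the "moving highlight" that confuses standard photogrammetry.
```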

2025-04 · 8 min · 1671 words
[GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities 🔗](https://arxiv.org/abs/2412.04244)

GigaHands: Bridging the Gap in AI Hand Understanding with Massive Scale Data

Introduction The human hand is an evolutionary masterpiece. Whether you are threading a needle, typing a blog post, or kneading dough, your hands perform a complex symphony of movements. For Artificial Intelligence and robotics, however, replicating this dexterity is one of the “grand challenges” of the field. We have seen Large Language Models (LLMs) revolutionize text by training on trillions of words. We have seen vision models master image recognition by looking at billions of pictures. But when it comes to bimanual (two-handed) object interaction, we hit a wall. The data just isn’t there. ...

2024-12 · 9 min · 1706 words
[Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis 🔗](https://arxiv.org/abs/2412.02168)

Beyond Prompts: Teaching AI the Physics of Photography

Introduction We are living in the golden age of AI image generation. Tools like Stable Diffusion and FLUX allow us to conjure detailed worlds from a single sentence. Yet, for all their magic, these models often fail at a task that is fundamental to professional photography: understanding the physical camera. Imagine you are a photographer. You take a photo of a mountain trail with a 24mm wide-angle lens. Then, without moving your feet, you switch to a 70mm zoom lens. What happens? The perspective compresses, the field of view narrows, but the scene—the specific rocks, the shape of the trees, the lighting—remains exactly the same. ...
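
A minimal sketch of the lens physics the post alludes to, assuming a full-frame sensor (36 mm wide): the horizontal field of view follows 2·atan(sensor_width / (2·focal_length)), so swapping a 24mm lens for a 70mm one without moving only tightens the framing of the exact same scene.

```python
# Field of view as a function of focal length for a fixed sensor.
# Assumption: full-frame sensor, 36 mm wide.
import math

SENSOR_WIDTH_MM = 36.0

def horizontal_fov_deg(focal_length_mm):
    return math.degrees(2 * math.atan(SENSOR_WIDTH_MM / (2 * focal_length_mm)))

for f in (24, 35, 50, 70):
    print(f"{f:>3} mm lens -> {horizontal_fov_deg(f):5.1f} deg horizontal FOV")
# 24 mm gives roughly a 74-degree view, 70 mm roughly 29 degrees: same scene,
# same camera position, very different crop. That scene consistency is what a
# camera-aware text-to-image model needs to preserve.
```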

2024-12 · 9 min · 1713 words
[Generative Omnimatte: Learning to Decompose Video into Layers 🔗](https://arxiv.org/abs/2411.16683)

Unlayering Reality: How Generative Omnimatte Decomposes Video with Diffusion Models

Introduction Video editing is fundamentally different from image editing for one frustrating reason: pixels in a video are flat. When you watch a movie, you see actors, shadows, and backgrounds, but the computer just sees a grid of changing colors. If you want to remove a person from a scene, you can’t just click “delete.” You have to fill in the background behind them, frame by frame. If you want to move a car slightly to the left, you have to hallucinate what the road looked like underneath it. ...

2024-11 · 9 min · 1764 words
[Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation 🔗](https://arxiv.org/abs/2412.15211)

How to Turn Inconsistent Photos into Perfect 3D Models: A Generative Relighting Approach

Introduction One of the most persistent challenges in 3D computer vision is the assumption of a static world. Traditional 3D reconstruction techniques, such as Photogrammetry or Neural Radiance Fields (NeRFs), generally assume that while the camera moves, the scene itself remains frozen. But in the real world, this is rarely true. If you are scraping a collection of photos of a famous landmark from the internet, those photos were taken at different times of day, under different weather conditions, and with different cameras. Even in a casual capture session where you walk around an object, the sun might go behind a cloud, or your own shadow might fall across the subject. ...

2024-12 · 8 min · 1655 words
[Generative Modeling of Class Probability for Multi-Modal Representation Learning 🔗](https://arxiv.org/abs/2503.17417)

Keeping it CALM - Bridging the Video-Text Gap with Class Anchors and Generative Modeling

Introduction In the evolving world of Artificial Intelligence, one of the most fascinating challenges is teaching machines to understand the world through multiple senses simultaneously—specifically, sight and language. This is the domain of Multi-Modal Representation Learning. We want models that can watch a video and understand a textual description of it, or vice versa. Current state-of-the-art methods often rely on Contrastive Learning (like the famous CLIP model). These models work by pulling the representations of a matching video and text pair closer together in a mathematical space while pushing non-matching pairs apart. While effective, this approach has a flaw: it assumes a rigid one-to-one mapping between a video and a sentence. ...
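
To make the baseline concrete, here is a sketch of the CLIP-style contrastive objective the post describes (the starting point CALM moves beyond, not CALM itself): matching video/text pairs in a batch are pulled together, all other pairings are pushed apart. Tensor names and the temperature value are illustrative.

```python
# CLIP-style symmetric contrastive loss over a batch of video/text embeddings.
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix; the diagonal holds the matching pairs.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Symmetric cross-entropy: video-to-text and text-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```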

2025-03 · 8 min · 1675 words
[Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction 🔗](https://arxiv.org/abs/2412.06234)

Generative Densification: How to Add Detail to Feed-Forward 3D Models

If you have been following the explosion of 3D computer vision lately, you are likely familiar with 3D Gaussian Splatting (3D-GS). It has revolutionized the field by representing scenes as clouds of 3D Gaussians (ellipsoids), allowing for real-time rendering and high-quality reconstruction. However, there is a divide in how these models are used. On one side, we have per-scene optimization, where a model spends minutes or hours learning a single specific room or object. This produces incredible detail because the model can iteratively add more Gaussians (densification) where needed. ...
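
For readers unfamiliar with the term, here is a toy sketch of what "densification" means in per-scene 3D-GS optimization (the baseline behavior this paper learns to imitate in a feed-forward way, not the paper's own module): Gaussians whose position gradients stay large are under-fitting a region, so they get cloned if small or split if large. All thresholds and array names are illustrative.

```python
# Toy clone/split densification pass over a set of 3D Gaussians.
import numpy as np

def densify(positions, scales, pos_grads, grad_thresh=0.0002, scale_thresh=0.01):
    new_positions, new_scales = [], []
    for p, s, g in zip(positions, scales, pos_grads):
        new_positions.append(p)
        new_scales.append(s)
        if np.linalg.norm(g) > grad_thresh:
            if s.max() < scale_thresh:
                # Clone: duplicate a small Gaussian to add capacity in place.
                new_positions.append(p.copy())
                new_scales.append(s.copy())
            else:
                # Split: add a smaller Gaussian nearby and shrink the footprint.
                new_positions.append(p + np.random.normal(scale=s))
                new_scales.append(s / 1.6)
    return np.array(new_positions), np.array(new_scales)

pos = np.random.rand(100, 3)
scl = np.random.rand(100, 3) * 0.02
grads = np.random.rand(100, 3) * 0.0004
pos, scl = densify(pos, scl, grads)
print(pos.shape)  # more Gaussians where the gradients were large
```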

2024-12 · 8 min · 1592 words
[Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision 🔗](https://arxiv.org/abs/2506.03605)

Teaching Robots to Move: Mining 3D Trajectories from First-Person Video

Imagine asking a robot to “pick up the knife on the counter.” To a human, this is trivial. To a robot, it requires a complex understanding of 3D space, object affordance (where to grab), and the specific motion trajectory required to execute the action safely. For years, the gold standard for teaching robots these skills has been Imitation Learning—showing the robot examples of humans performing the task. However, this method has a massive bottleneck: data scarcity. Collecting high-quality 3D data usually requires expensive motion capture (MoCap) labs, instrumented gloves, and tedious setups. We simply cannot scale this up to cover every object and action in the real world. ...

2025-06 · 8 min · 1558 words
[GenVDM: Generating Vector Displacement Maps From a Single Image 🔗](https://arxiv.org/abs/2503.00605)

Revolutionizing 3D Detailing: How GenVDM Turns Flat Images into Geometric Stamps

If you have ever tried your hand at 3D sculpting—creating digital characters, monsters, or environments—you know the pain of detailing. Sculpting the basic silhouette of a dragon is one thing; sculpting every individual scale, horn, and skin pore is an entirely different battle. To solve this, professional artists don’t sculpt every detail from scratch. They use “stamps,” known technically as Vector Displacement Maps (VDMs). These are powerful tools that allow an artist to take a complex shape (like a nose, an ear, or a set of scales) and “stamp” it onto a base mesh instantly. ...

2025-03 · 10 min · 1960 words
[Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders 🔗](https://arxiv.org/abs/2412.09586)

Less is More: Solving Gaze Estimation with Frozen Foundation Models

Imagine walking into a crowded room. Almost instantly, you can tell who is talking to whom, who is looking at the clock waiting to leave, and who is staring at the delicious cake on the table. This ability—gaze following—is a fundamental building block of human social interaction. It allows us to infer intent, attention, and social dynamics in a split second. For computers, however, this task is surprisingly difficult. To understand where a person is looking, a model must understand two distinct things: the person’s physical orientation (head pose, eye position) and the semantic context of the scene (where are the objects? how far away are they?). ...

2024-12 · 9 min · 1741 words
[GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting 🔗](https://arxiv.org/abs/2503.19458)

GaussianUDF: Bridging the Gap Between 3D Gaussians and Open Surface Reconstruction

Introduction In the world of 3D computer vision, reconstructing digital objects from 2D images is a fundamental quest. We want to take a few photos of an object—a T-shirt, a flower, a complex statue—and turn it into a perfect 3D model. For years, this field has been dominated by methods that assume objects are “watertight,” meaning they are closed volumes with a clearly defined inside and outside. Think of a sphere or a cube; you are either inside it or outside it. ...

2025-03 · 9 min · 1741 words
[Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding 🔗](https://arxiv.org/abs/2503.18578)

Beyond the Flat Universe: How Galaxy Walker Brings Geometric Awareness to AI Astronomy

When we look at a photograph on a screen, we are looking at a flat, 2D representation of reality. For decades, computer vision models have operated on this same premise. They treat images as flat grids of pixels and process features in Euclidean (flat) vector spaces. But the universe is not flat. From the spherical orbits of planets to the hyperbolic expansion of the cosmos and the warping of spacetime around black holes, the universe is defined by complex, non-Euclidean geometries. When we force astronomical data into standard, flat-space Vision-Language Models (VLMs), we lose critical structural information. ...

2025-03 · 9 min · 1806 words
[GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control 🔗](https://arxiv.org/abs/2503.03751)

How GEN3C Brings 3D Consistency and Precise Camera Control to Video Generation

Introduction We are currently witnessing a golden age of generative video. Models like Sora, Runway, and Stable Video Diffusion can hallucinate breathtaking scenes from a simple text prompt or a single image. However, if you look closely, cracks begin to appear—specifically when the camera starts moving. Imagine generating a video of a room. As the camera pans left, new objects appear. If you pan back to the right, do those original objects reappear exactly as they were? Often, they don’t. The vase on the table might change color, or a window might vanish entirely. Furthermore, trying to tell a video model to “move the camera 2 meters forward and pan 30 degrees right” is notoriously difficult. Most models treat camera parameters as abstract numbers, struggling to translate them into geometrically accurate pixel shifts. ...
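
To see why "move the camera 2 meters forward and pan 30 degrees right" is a geometric statement rather than an abstract conditioning number, here is a minimal sketch that builds the corresponding 4x4 camera-to-world transform. The conventions (y-up, camera looking down -z, full-frame of reference) are assumptions for illustration, not the paper's exact parameterization.

```python
# Compose a pan plus a forward translation into an explicit camera pose.
import numpy as np

def pan_and_advance(pose_c2w, pan_deg, forward_m):
    # Under the assumed y-up, -z-forward convention, a negative rotation about
    # the up axis turns the view toward +x, i.e. a pan to the right.
    theta = -np.radians(pan_deg)
    rot_y = np.array([[ np.cos(theta), 0, np.sin(theta), 0],
                      [ 0,             1, 0,             0],
                      [-np.sin(theta), 0, np.cos(theta), 0],
                      [ 0,             0, 0,             1]])
    # Translate along the camera's (new) forward axis, which is -z locally.
    trans = np.eye(4)
    trans[2, 3] = -forward_m
    return pose_c2w @ rot_y @ trans

pose = np.eye(4)  # camera at the origin, looking down -z
pose = pan_and_advance(pose, pan_deg=30, forward_m=2.0)
print(np.round(pose, 3))  # an explicit pose a geometry-aware generator can condition on
```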

2025-03 · 9 min · 1807 words
[Functionality understanding and segmentation in 3D scenes 🔗](https://arxiv.org/abs/2411.16310)

Fun3DU: How AI Finds the 'Needle in a Haystack' of 3D Scenes

Introduction Imagine you are a robot in a kitchen. You receive a simple command: “Turn on the microwave.” To you, a human, this is trivial. You look at the microwave, spot the “Start” button, and press it. But for an Artificial Intelligence, this is a monumental challenge. First, the AI must understand that “turn on” implies interacting with a specific button. Second, it must visually locate that tiny button within a complex 3D environment filled with other objects, shadows, and occlusions. Standard computer vision models are great at finding the microwave (the whole object), but they often fail spectacularly at finding the specific functional part (the button) needed to complete a task. ...

2024-11 · 9 min · 1710 words
[Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers 🔗](https://arxiv.org/abs/2503.03307)

Unlocking 6-DoF Motion: How Event Cameras Can See Rotation and Translation Without an IMU

Imagine trying to navigate a drone through a dense forest at high speed. A standard camera takes snapshots—click, click, click. If you move too fast between clicks, the world blurs, or you miss obstacles entirely. Enter the Event Camera. Instead of taking snapshots, it mimics the biological eye. It has pixels that work independently, firing a signal (an “event”) the instant they detect a change in brightness. This results in a continuous stream of data with microsecond latency, zero motion blur, and high dynamic range. ...
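
As background (standard event-camera behavior, not the paper's geometric solvers), the data an event camera produces can be modeled as a stream of (x, y, timestamp, polarity) tuples, emitted whenever a pixel's log-brightness changes by more than a contrast threshold. The sketch below uses synthetic values and hypothetical names.

```python
# Minimal model of an asynchronous event stream.
from dataclasses import dataclass
import numpy as np

@dataclass
class Event:
    x: int
    y: int
    t: float        # microsecond-scale timestamp
    polarity: int   # +1 brightness increased, -1 decreased

def events_from_log_frames(prev_log, curr_log, t, C=0.2):
    """Emit one event per pixel whose log-intensity changed by more than C."""
    diff = curr_log - prev_log
    ys, xs = np.nonzero(np.abs(diff) > C)
    return [Event(int(x), int(y), t, int(np.sign(diff[y, x]))) for y, x in zip(ys, xs)]

prev = np.log(np.random.rand(4, 4) + 1.0)
curr = prev.copy()
curr[1, 2] += 0.5  # one pixel gets brighter
print(events_from_log_frames(prev, curr, t=1e-6))
```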

2025-03 · 8 min · 1679 words
[From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing 🔗](https://arxiv.org/abs/2411.11916)

Beyond Pixels: Using Multi-Agent AI to Generate and Edit Structured Diagrams

Introduction “Drawing is not what one sees but what one can make others see.” — Edgar Degas. In the world of scientific research, software engineering, and education, a picture is worth a thousand words—but only if that picture is accurate. While we have witnessed a revolution in generative AI with tools like Midjourney or DALL-E, there remains a glaring gap in the capability of these models: structured, logical diagrams. Ask a standard image generator to create a “neural network architecture with three layers,” and you will likely get a beautiful, artistic hallucination. The connections might go nowhere, the text will be illegible gibberish, and the logical flow will be nonexistent. On the other hand, asking a coding assistant to “write code for a plot” works for simple bar charts but often fails when the visual requirements become complex or unique, such as specific flowchart logic or intricate mind maps. ...

2024-11 · 9 min · 1788 words
[From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech 🔗](https://arxiv.org/abs/2503.16956)

Reading Lips like a Pro: How Hierarchical Learning Solves Video-to-Speech Synthesis

Imagine watching a video of a person speaking, but the sound is muted. You can see their lips moving, their facial expressions shifting, and their jaw moving. If you were asked to “dub” this video simply by watching it, could you do it? You might guess the words (lip-reading), but could you guess the sound of their voice? The pitch? The emotional inflection? This is the challenge of Video-to-Speech (VTS) synthesis. It’s a fascinating problem in computer vision and audio processing with applications ranging from restoring silent archival films to assisting people with speech disabilities. ...

2025-03 · 7 min · 1354 words
[FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis 🔗](https://arxiv.org/abs/2505.01172)

FreePCA: How Principal Component Analysis Unlocks Long Video Generation

The world of generative AI is moving fast. We’ve gone from blurry images to photorealistic portraits, and now, the frontier is video. Models like Sora and Runway Gen-2 have dazzled the internet, but behind the scenes, researchers face a stubborn hurdle: duration. Most open-source video diffusion models are trained on very short clips—often just 16 frames. When you ask these models to generate a longer video (say, 64 frames or more), they tend to break down. Objects morph bizarrely, the style shifts, or the video dissolves into noise. Training a model specifically for long videos requires massive computational resources and datasets that most labs simply don’t have. ...

2025-05 · 8 min · 1682 words
[FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling 🔗](https://arxiv.org/abs/2411.19942)

Why Your 3D Avatar's Skirt Looks Weird (And How 'FreeCloth' Fixes It)

Introduction If you have ever played a modern video game or worked with 3D animation, you might have noticed a peculiar trend: characters in tactical gear, tight superhero suits, or jeans look fantastic. But characters wearing long dresses, flowing skirts, or loose robes? They often look… odd. The fabric might stretch unnaturally between their legs, tear apart when they run, or look like a rigid plastic shell rather than flowing cloth. ...

2024-11 · 10 min · 2080 words
[Free-viewpoint Human Animation with Pose-correlated Reference Selection 🔗](https://arxiv.org/abs/2412.17290)

Breaking the Camera Angle Barrier in AI Video Generation

Introduction In the rapidly evolving world of Generative AI, animating human characters has become a frontier of intense research. We have seen impressive results where a single photograph of a person can be brought to life, driven by a video of a dancer or a speaker. Models like AnimateAnyone and MagicAnimate have set the standard for this “reference-based” animation. However, these models share a significant limitation: they are generally bound to the viewpoint of the original reference image. ...

2024-12 · 9 min · 1842 words