[MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation 🔗](https://arxiv.org/abs/2505.04656)

MeshGen: A New Standard for High-Fidelity 3D Mesh and PBR Texture Generation

The race to bridge the gap between 2D images and 3D content creation is moving at breakneck speed. We have seen massive leaps in diffusion models that can conjure images from thin air, and naturally, researchers are applying these principles to the third dimension. However, generating a high-quality, production-ready 3D asset from a single image remains a formidable challenge. Current state-of-the-art methods generally fall into two camps: optimization-based methods (which are slow) and Large Reconstruction Models (LRMs) that predict 3D representations like NeRFs or Gaussians directly. While LRMs are fast, they often struggle when converting those volumetric representations into clean, editable meshes. Furthermore, “native” 3D diffusion models—which attempt to learn the distribution of 3D shapes directly—often produce overly simple, symmetric blobs that lack the sharp geometric details of the input image. ...

2025-05 · 9 min · 1808 words
[Matrix3D: Large Photogrammetry Model All-in-One 🔗](https://arxiv.org/abs/2502.07685)

Matrix3D: The All-in-One Generative Model Revolutionizing Photogrammetry

Introduction For decades, the field of computer vision has chased a specific dream: taking a handful of flat, 2D photographs and instantly converting them into a perfect, navigable 3D world. This process, known as photogrammetry, is the backbone of modern 3D content creation, mapping, and special effects. However, the traditional road to 3D reconstruction is bumpy. It usually involves a fragmented pipeline of disparate algorithms—one to figure out where the cameras were looking, another to estimate depth, and yet another to stitch it all together. ...

2025-02 · 8 min · 1618 words
[Material Anything: Generating Materials for Any 3D Object via Diffusion 🔗](https://arxiv.org/abs/2411.15138)

Material Anything: The New Standard for Automated 3D Material Generation

In the world of computer graphics, creating a 3D model is only half the battle. The shape—or geometry—gives an object its form, but the material gives it its soul. Is the object made of shiny gold, dull wood, or rusted iron? How does light bounce off its scratches? For years, creating these physically accurate materials has been a tedious bottleneck. Artists often use complex software like Substance 3D Painter to manually paint textures. While recent AI advancements have automated 3D geometry generation, they often fail at the next step: generating high-quality materials. Most AI models simply “paint” colors onto a shape, baking in lighting and shadows that make the object look fake when moved to a new environment. ...

2024-11 · 7 min · 1484 words
[ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping 🔗](https://arxiv.org/abs/2412.16212)

Solving the Invisible Hand: How ManiVideo Masters 3D Occlusion in Video Generation

If you have ever tried to draw a hand, you know the struggle. Getting the proportions right is hard enough, but the real nightmare begins when the fingers start curling, overlapping, and gripping objects. Suddenly, parts of the hand disappear behind the object, or behind other fingers. Now, imagine teaching an AI to not just draw this interaction, but to generate a temporally consistent video of it. ...

2024-12 · 8 min · 1658 words
[MangaNinja: Line Art Colorization with Precise Reference Following 🔗](https://arxiv.org/abs/2501.08332)

Bringing Sketches to Life: How MangaNinja Masters Line Art Colorization

The transition from a black-and-white sketch to a fully colored character is one of the most labor-intensive steps in animation and comic production. For decades, artists have manually filled in colors, ensuring that a character’s hair, eyes, and clothing remain consistent across thousands of frames. While automated tools have attempted to speed up this process, they often stumble when presented with a simple reality of animation: characters move. When a character turns their head, the camera zooms in, or the pose changes, the geometry of the line art changes drastically compared to the reference sheet. Traditional colorization algorithms frequently fail here, painting outside the lines or confusing the color of a shirt with the color of the skin. ...

2025-01 · 8 min · 1610 words
[MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps 🔗](https://arxiv.org/abs/2503.18223)

Decoding the Wild: How Multimodal AI is Revolutionizing Wildlife Monitoring in the Swiss Alps

Imagine attempting to document the daily lives of elusive mountain creatures—Red Deer, Wolves, or Mountain Hares—without ever setting foot in the forest. For decades, ecologists have relied on camera traps to act as their eyes in the wild. These motion-activated sensors capture millions of images and videos, offering an unprecedented glimpse into biodiversity. However, a new problem has emerged: we have too much data. With modern camera traps capable of recording high-definition video for weeks on end, researchers are drowning in footage. Manually annotating this data to understand not just what animal is present, but what it is doing (behavior), is a Herculean task. ...

2025-03 · 7 min · 1449 words
[MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking 🔗](https://arxiv.org/abs/2411.15459)

Can Mamba Beat Transformers? Understanding MambaVLT for Vision-Language Tracking

Imagine trying to track a friend in a crowded video. Sometimes you know what they look like (a visual reference), and sometimes you only know a description, like “the person wearing a red hat.” Now, imagine the video is long. Your friend changes pose, walks behind a tree, or takes off the hat. To keep tracking them effectively, you need memory. You need to remember their history to predict where they are now. ...

2024-11 · 9 min · 1764 words
[Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation 🔗](https://arxiv.org/abs/2504.03193)

Bridging the Gap: How Mamba Fuses Vision and Language Models for Robust Segmentation

Imagine training a self-driving car algorithm entirely inside the video game Grand Theft Auto V. The roads look realistic, the lighting is perfect, and the weather is controlled. Now, take that same car and drop it onto a rainy street in London at night. Does it crash? This scenario represents the core challenge of Domain Generalized Semantic Segmentation (DGSS). We want models that learn from a “source” domain (like a simulation or a sunny dataset) and perform flawlessly in “target” domains (real-world, bad weather, night) without ever seeing them during training. ...

2025-04 · 8 min · 1678 words
[Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters 🔗](https://arxiv.org/abs/2411.18197)

From Statue to Actor: How 'Make-It-Animatable' Rigs Characters in Under One Second

In the rapidly expanding worlds of video games, VR, and the metaverse, 3D content creation is booming. We have incredible tools to generate static 3D models from text or images, resulting in millions of digital assets. However, a significant bottleneck remains: movement. A static 3D model is essentially a digital statue. To make it move—to make it run, jump, or dance—it must undergo two complex processes: rigging (building a digital skeleton) and skinning (defining how the surface moves with that skeleton). Traditionally, this is the domain of skilled technical artists, taking hours of manual labor per character. Even existing automated tools often struggle with non-standard body shapes or characters that aren’t standing in a perfect “T-pose.” ...

2024-11 · 8 min · 1492 words
[MUSt3R: Multi-view Network for Stereo 3D Reconstruction 🔗](https://arxiv.org/abs/2503.01661)

Breaking the Pairwise Barrier: How MUSt3R Scales 3D Reconstruction to Arbitrary Views

Imagine dumping a folder of random photos—taken with different cameras, from different angles, with no metadata—into a system and getting a perfect, dense 3D model out the other side. This is the “Holy Grail” of geometric computer vision: unconstrained Structure-from-Motion (SfM). Recently, a method called DUSt3R (Dense Unconstrained Stereo 3D Reconstruction) made waves by solving this problem without the complex, handcrafted pipelines of traditional photogrammetry. It treated 3D reconstruction as a simple regression task. However, DUSt3R had a significant Achilles’ heel: it was fundamentally a pairwise method. It looked at two images at a time. If you wanted to reconstruct a scene from 100 images, you faced a combinatorial explosion of pairs to process and a messy global alignment problem to stitch them together. ...

2025-03 · 8 min · 1570 words
[MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving 🔗](https://arxiv.org/abs/2504.00379)

Bridging the Gap: How MPDrive Teaches Autonomous Vehicles to Speak 'Spatial'

Imagine you are driving down a busy highway. You see a car merging from the right, a truck braking in front of you, and a pedestrian waiting on a corner. Instantly, your brain maps these objects in 3D space, assigns them importance, and formulates a plan: “Slow down for the truck, watch the merging car.” You don’t think in raw GPS coordinates or pixel values. You think in terms of objects and relationships. ...

2025-04 · 9 min · 1877 words
[MLLM-as-a-Judge for Image Safety without Human Labeling 🔗](https://arxiv.org/abs/2501.00192)

Can AI Moderate Itself? Building a Zero-Shot Image Safety Judge without Human Labels

In the era of AI-Generated Content (AIGC), the volume of visual media being created and shared online is exploding. From social media feeds to generative art platforms, the flow of images is endless. But with this creativity comes a significant risk: the proliferation of harmful content, ranging from graphic violence to explicit material. For years, the standard solution has been human moderation. Platforms hire thousands of people to look at disturbing images and label them as “safe” or “unsafe” based on a rulebook (a safety constitution). This approach has two massive problems: it is expensive and slow to scale, and it takes a heavy psychological toll on the human annotators. ...

2025-01 · 9 min · 1773 words
[MITracker: Multi-View Integration for Visual Object Tracking 🔗](https://arxiv.org/abs/2502.20111)

Seeing Through Walls: How MITracker Solves Occlusion with Multi-View Fusion

Imagine you are watching a soccer game. If a player runs behind a referee, you don’t panic and assume the player has vanished from existence. Your brain uses context, trajectory, and perhaps a view from a different angle (if you were watching a multi-camera broadcast) to predict exactly where that player will emerge. Computer vision systems, however, often struggle with this exact scenario. In the world of Visual Object Tracking (VOT), losing sight of an object due to occlusion (being blocked by another object) is a primary cause of failure. Traditional trackers rely on a single camera view. If the target walks behind a pillar, the tracker is blind. ...

2025-02 · 10 min · 2063 words
[MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views 🔗](https://arxiv.org/abs/2412.06767)

How MAtCha Gaussians Solves 3D Reconstruction with Just a Few Images

The dream of computer vision is simple yet incredibly difficult to achieve: take a handful of photos of an object or a scene, and instantly generate a perfect, photorealistic 3D model. In recent years, we have seen an explosion in “neural rendering” techniques. Methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have revolutionized our ability to synthesize novel views. They can take a set of images and allow you to look at the scene from a new angle with startling clarity. However, there is a catch. While these methods produce beautiful images, the underlying 3D geometry they create is often messy, noisy, or blurry. They are designed to fool the eye, not to build a solid mesh. ...

2024-12 · 10 min · 1949 words
[MATCHA: Towards Matching Anything 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xue_MATCHA_Towards_Matching_Anything_CVPR_2025_paper.pdf)

One Feature to Rule Them All: Understanding MATCHA for Unified Image Correspondence

“In computer vision, there is only one problem: correspondence, correspondence, correspondence.” This famous quote by Takeo Kanade highlights a fundamental truth about how machines “see.” Whether a robot is navigating a room, an AI is editing a photo, or a system is tracking a moving car, the core task is almost always the same: identifying which pixel in Image A corresponds to which pixel in Image B. However, historically, we haven’t treated this as one problem. We have fragmented it into three distinct domains: ...

8 min · 1517 words
[MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors 🔗](https://arxiv.org/abs/2412.12392)

Can We Solve SLAM Without Calibration? A Deep Dive into MASt3R-SLAM

Visual Simultaneous Localization and Mapping (SLAM) is often considered the “Holy Grail” of spatial intelligence. Ideally, we want a robot or a pair of AR glasses to open its “eyes” (cameras), look at a scene, and immediately understand where it is and what the world looks like in 3D—without any manual setup. However, the reality of SLAM has traditionally been finicky. It usually requires hardware expertise, careful camera calibration (those checkerboard patterns), and reliable feature extractors. While “sparse” SLAM (tracking points) works well, “dense” SLAM (reconstructing the whole surface) remains computationally heavy and prone to drift. ...

2024-12 · 8 min · 1686 words
[MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations 🔗](https://arxiv.org/abs/2503.15871)

Why Video AIs Hallucinate: Disentangling Action and Scene with MASH-VLM

Imagine showing an AI a video of a person boxing. The catch? They are doing it inside a library. A typical Video Large Language Model (Video-LLM) might look at the bookshelves and quiet atmosphere and completely ignore the boxing, describing the scene as “students reading.” Or, it might see the boxing motion and hallucinate a “boxing ring” in the background, ignoring the books entirely. This phenomenon is known as Action-Scene Hallucination. It occurs when a model relies too heavily on the context of the scene to guess the action, or uses the action to incorrectly infer the scene. ...

2025-03 · 7 min · 1339 words
[MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation 🔗](https://arxiv.org/abs/2503.20519)

Breaking the 3D Barrier: A Deep Dive into MAR-3D and Progressive Auto-Regressive Generation

The transition from 2D image generation to 3D content creation is one of the most exciting, yet technically challenging, frontiers in modern AI. While models like Midjourney or Stable Diffusion can dream up photorealistic images in seconds, generating a high-quality, watertight 3D mesh that looks good from every angle is a significantly harder problem. The challenges are structural. Unlike images, which are neat grids of pixels, 3D data is unordered and sparse. Traditional methods often struggle to balance high geometric resolution with computational efficiency. Today, we are diving deep into MAR-3D, a novel framework presented by researchers from the National University of Singapore and collaborators. This paper introduces a “Progressive Masked Auto-regressor” that fundamentally changes how we approach high-resolution 3D generation. ...

2025-03 · 9 min · 1762 words
[Light3R-SfM: Towards Feed-forward Structure-from-Motion 🔗](https://arxiv.org/abs/2501.14914)

3D Reconstruction at Light Speed: Understanding Light3R-SfM

The dream of computer vision is to take a handful of photos scattered around a scene—a statue, a building, or a room—and instantly weave them into a perfect 3D model. This process is known as Structure-from-Motion (SfM). For decades, SfM has been a game of trade-offs. You could have high accuracy, but you had to pay for it with minutes or even hours of computation time, relying on complex optimization algorithms like Bundle Adjustment. Recently, deep learning entered the chat with models like DUSt3R and MASt3R, which improved robustness but still relied on slow, iterative optimization steps to align everything globally. ...

2025-01 · 12 min · 2346 words
[Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes 🔗](https://arxiv.org/abs/2501.05226)

How to Reconstruct 3D Clouds from a Single Image using Diffusion and Physics

Have you ever looked at a photograph of a cloud and wondered exactly what it looked like in three dimensions? It seems like a simple question, but for a computer, it is a nightmare scenario. Clouds are not solid objects; they are volumetric, semi-transparent, and scatter light in complex ways. Reconstructing a 3D object from a single 2D image is a classic “ill-posed” problem in computer vision. It’s ill-posed because a single 2D image is essentially a flat shadow of reality—infinite different 3D shapes could theoretically produce that exact same image depending on the lighting and angle. ...

2025-01 · 9 min · 1722 words