[Can Generative Video Models Help Pose Estimation? 🔗](https://arxiv.org/abs/2412.16155)

Bridging the Gap - How Generative Video Models Solve Impossible Pose Estimation Problems

The Human Ability to “Hallucinate” Geometry. Imagine you are standing in a classroom. You take a photo of the blackboard at the front. Then, you turn around and walk to the back of the room, taking a photo of a student’s desk. These two photos have zero overlap—there are no common visual features between them. If you feed these two images into a traditional computer vision algorithm and ask, “Where is the second camera located relative to the first?”, the algorithm will fail. It looks for matching pixels, keypoints, or textures. Finding none, it cannot mathematically compute the geometry. ...
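
To see why feature-based pipelines hit a wall here, consider a minimal two-view sketch with OpenCV (purely illustrative; the file names and placeholder intrinsics are made up). With no shared content, the matcher returns few or only spurious correspondences, and without a consistent set of at least five matches there is no essential matrix to decompose into a relative pose:

```python
# Illustrative sketch: classical two-view relative pose from feature matches.
# With zero visual overlap, this pipeline has nothing to work with.
import cv2
import numpy as np

img_a = cv2.imread("blackboard.jpg", cv2.IMREAD_GRAYSCALE)    # hypothetical photo 1
img_b = cv2.imread("student_desk.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical photo 2

orb = cv2.ORB_create(nfeatures=2000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

matches = []
if des_a is not None and des_b is not None:
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)

if len(matches) < 5:
    # No correspondences at all: the geometry is simply not computable.
    print("Not enough matches to estimate a relative pose.")
else:
    # Even if a few spurious matches survive, RANSAC finds no consistent model.
    K = np.eye(3)  # placeholder intrinsics
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    E, inlier_mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC)
    if E is not None:
        _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)
```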

2024-12 · 9 min · 1846 words
[CRISP: Object Pose and Shape Estimation with Test-Time Adaptation 🔗](https://arxiv.org/abs/2412.01052)

Bridging the Reality Gap: How CRISP Masters 3D Object Perception with Test-Time Adaptation

Imagine a robot arm tasked with cleaning up space debris. It sees a satellite floating in orbit. To grab it safely, the robot needs to know two things with high precision: where the satellite is (its pose) and what it looks like geometrically (its shape). In a controlled lab environment with perfect data, this is a solvable problem. But in the real world—or in space—lighting changes, sensors add noise, and objects might look slightly different than the 3D models the robot was trained on. This is known as the domain gap, and it is one of the biggest hurdles in computer vision today. ...

2024-12 · 8 min · 1644 words
[COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts 🔗](https://arxiv.org/abs/2504.10158)

When Models Leave the Lab: Benchmarking AI in the Wild with COUNTS

Imagine you are training a self-driving car system. You train it on thousands of hours of video footage taken in sunny California. The model achieves 99% accuracy in detecting pedestrians, other cars, and stop signs. Then, you deploy the car in a snowy Canadian town or a dimly lit tunnel. Suddenly, the system fails to recognize a pedestrian wearing a winter coat against a white background. This scenario illustrates one of the most persistent challenges in modern computer vision and Artificial Intelligence: Out-of-Distribution (OOD) generalization. ...

2025-04 · 10 min · 2007 words
[CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering 🔗](https://arxiv.org/abs/2503.00413)

How to Teach AI New Tricks Without Forgetting the Old: A Deep Dive into CL-MoE

Imagine you are learning a second language. You spend months mastering French. Then, you switch gears to learn Spanish. A few months later, you try to speak French again, but you find yourself inserting Spanish words or, worse, you’ve forgotten the French grammar entirely. This phenomenon, known in psychology as catastrophic forgetting, is a massive headache for Artificial Intelligence, specifically for Multimodal Large Language Models (MLLMs). These models, like GPT-4V or Gemini, are incredibly powerful at understanding images and answering questions about them. However, the world changes constantly. We want these models to learn new types of data and tasks continuously without having to retrain them from scratch (which costs millions of dollars) and—crucially—without them forgetting what they learned previously. ...

2025-03 · 9 min · 1707 words
[CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_CH3Depth_Efficient_and_Flexible_Depth_Foundation_Model_with_Flow_Matching_CVPR_2025_paper.pdf)

Solving the Depth Estimation Trilemma: Inside CH3Depth

Depth estimation—the ability to look at a 2D image and understand the 3D geometry within it—is a cornerstone of computer vision. It is the prerequisite for autonomous driving, robot navigation, mixed reality, and content generation. However, building an “ideal” depth estimation model has historically been a game of trade-offs. You usually have to pick two of the following three: Meticulous Detail: Can the model see the fine edges of leaves or the texture of a distant building? Temporal Consistency: If applied to a video, does the depth map flicker, or is it stable over time? Efficiency: Can it run in real-time on a robot, or does it take several seconds per frame? Recent foundation models like Marigold or Depth Anything have pushed the boundaries of detail, but often at the cost of speed or video stability. In this post, we will explore CH3Depth, a new research paper that proposes a unified framework to solve this trilemma. Using a technique called Flow Matching combined with a novel sampling strategy, CH3Depth achieves state-of-the-art results in both image and video depth estimation while being significantly faster than its predecessors. ...

7 min · 1399 words
[CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Tian_CCIN_Compositional_Conflict_Identification_and_Neutralization_for_Composed_Image_Retrieval_CVPR_2025_paper.pdf)

When Pictures Clash with Words: Solving Compositional Conflicts in Image Retrieval with CCIN

Imagine you are shopping online for a shirt. You find a photo of a shirt that has the perfect cut and fabric, but it’s blue, and you really wanted it in grey. In a standard text search, describing exactly what you want is difficult (“shirt like this but grey”). This scenario is where Composed Image Retrieval (CIR) shines. CIR allows users to search using a combination of a reference image (the blue shirt) and a text instruction (“change blue to grey”). Ideally, the system understands that it should keep the shape and fabric from the image but swap the color based on the text. ...
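
For a sense of the baseline this improves on, the simplest way to act on a composed query is late fusion: embed the reference image and the modification text with a CLIP-style dual encoder, add the two vectors, and rank the gallery by cosine similarity. A minimal sketch with hypothetical encoder stand-ins (a naive baseline, not CCIN's method, which targets exactly the conflicts this fusion glosses over):

```python
# Naive late-fusion CIR baseline: query = image embedding + text embedding.
# image_encoder, text_encoder, and gallery_embs are assumed stand-ins.
import torch
import torch.nn.functional as F

def compose_query(ref_image, instruction, image_encoder, text_encoder):
    img_emb = F.normalize(image_encoder(ref_image), dim=-1)   # keep cut and fabric
    txt_emb = F.normalize(text_encoder(instruction), dim=-1)  # "change blue to grey"
    return F.normalize(img_emb + txt_emb, dim=-1)             # one query vector

def retrieve(query_emb, gallery_embs, k=10):
    # gallery_embs: (N, D) pre-computed, L2-normalized product embeddings.
    scores = gallery_embs @ query_emb
    return scores.topk(k).indices
```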

9 min · 1775 words
[CASP: Compression of Large Multimodal Models Based on Attention Sparsity 🔗](https://arxiv.org/abs/2503.05936)

Breaking the 2-Bit Barrier: How Attention Sparsity Unlocks Extreme Compression for Multimodal Models

In the rapidly evolving world of Artificial Intelligence, Large Multimodal Models (LMMs) have emerged as the new titans. Models like LLaVA and GPT-4V can see, read, and reason, bridging the gap between visual and textual data. However, this capability comes at a steep price: computational resources. To put this into perspective, running a 70-billion-parameter model like LLaVA-OneVision at standard 16-bit precision requires roughly 140GB of GPU memory. This effectively walls off these powerful models from consumer hardware and efficient edge deployment. To solve this, researchers turn to model compression, specifically quantization—reducing the precision of the model’s weights (e.g., from 16-bit floating point to 4-bit or 2-bit integers). ...
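
The 140GB figure is simple arithmetic: number of parameters times bits per weight, divided by 8 to get bytes. The same back-of-the-envelope calculation (weights only; activations and the KV cache add more) shows why pushing to 2-bit is so attractive:

```python
# Weight-memory footprint of a 70B-parameter model at different precisions.
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9  # bytes -> gigabytes

for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
# 16-bit:  140.0 GB
#  4-bit:   35.0 GB
#  2-bit:   17.5 GB
```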

2025-03 · 8 min · 1639 words
[CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design 🔗](https://arxiv.org/abs/2504.19478)

Solving the Tetris of Interior Design: How CASAGPT Uses Cuboids for Collision-Free Scene Synthesis

Imagine you are trying to furnish a virtual apartment. You place a stylish L-shaped sofa in the corner and a coffee table in the nook of the “L”. To you, this is a perfect, cozy arrangement. But to a computer using traditional 3D understanding, you might have just caused a catastrophe. Why? Because many computer vision models view objects not as complex shapes, but as simple bounding boxes. If the bounding box of the table touches the bounding box of the sofa—even if the actual objects don’t touch—the computer screams “Collision!” Conversely, models might generate scenes where objects physically overlap in reality because the coarse boxes allowed it. ...
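
A tiny top-down sketch (illustrative coordinates, not the paper's code) makes the failure concrete: the coffee table sits in the empty nook of the "L", so it never touches the sofa's actual geometry, yet the two axis-aligned boxes overlap. Describing the sofa as two rectangular parts instead, which is the spirit of a cuboid-based representation, removes the false alarm:

```python
# Why coarse bounding boxes cry "collision": a 2D top-down toy example.
def aabb_overlap(a, b):
    # Boxes as (xmin, ymin, xmax, ymax).
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

sofa_bbox  = (0.0, 0.0, 2.0, 2.0)   # one box around the whole L-shape
table_bbox = (1.2, 1.2, 1.8, 1.8)   # coffee table placed in the empty nook

# The L-shaped sofa actually occupies only these two rectangles (its "cuboids"):
sofa_parts = [(0.0, 0.0, 2.0, 0.8), (0.0, 0.0, 0.8, 2.0)]

print(aabb_overlap(sofa_bbox, table_bbox))                   # True  -> false alarm
print(any(aabb_overlap(p, table_bbox) for p in sofa_parts))  # False -> no real contact
```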

2025-04 · 7 min · 1486 words
[CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction 🔗](https://arxiv.org/abs/2411.16170)

Unlocking Mobile Vision: How CARE Transformers Balance Speed and Accuracy

In the rapidly evolving world of Computer Vision, the Vision Transformer (ViT) has been a revolutionary force. By adapting the self-attention mechanisms originally designed for Natural Language Processing (NLP), ViTs have achieved state-of-the-art results in image classification, object detection, and segmentation. However, there is a catch. The very mechanism that makes Transformers so powerful—Self-Attention—is computationally expensive. Specifically, it has “quadratic complexity.” As the resolution of an image increases, the computational cost explodes. This makes standard Transformers notoriously difficult to deploy on resource-constrained devices like mobile phones, where battery life and latency are critical. ...
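
To see what “quadratic complexity” means in practice: an H×W image cut into p×p patches becomes N = HW/p² tokens, and self-attention scores every token against every other, so the dominant cost grows with N². A rough count (assuming a 16-pixel patch and a 384-dimensional embedding, and counting only the score and value-aggregation terms):

```python
# Rough FLOP count for one self-attention layer as image resolution grows.
def attention_flops(height, width, patch=16, dim=384):
    n_tokens = (height // patch) * (width // patch)
    return 2 * n_tokens * n_tokens * dim  # Q @ K^T plus the weighted sum over V

for res in (224, 448, 896):
    print(f"{res}x{res}: {attention_flops(res, res) / 1e9:6.2f} GFLOPs")
# 224x224:   0.03 GFLOPs
# 448x448:   0.47 GFLOPs
# 896x896:   7.55 GFLOPs   <- 4x the resolution, ~256x the attention cost
```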

2024-11 · 9 min · 1758 words
[CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image 🔗](https://arxiv.org/abs/2504.11230)

Breaking the Parts Barrier: How CAP-Net Masters Articulated Object Perception

Imagine you are a robot tasked with a seemingly simple household chore: opening a laptop. To a human, this is trivial. You identify the lid, find the edge, and lift. But to a robot, this is a geometric nightmare. The laptop is not a solid brick; it is an articulated object—a structure composed of rigid parts connected by joints. The lid moves relative to the base, changing the object’s overall shape. ...

2025-04 · 9 min · 1757 words
[CADDreamer: CAD Object Generation from Single-view Images 🔗](https://arxiv.org/abs/2502.20732)

From Pixels to Parts: How CADDreamer Generates Editable CAD Models from Single Images

The Gap Between AI Art and Engineering. In the last few years, generative AI has transformed how we visualize ideas. Tools like Midjourney or Stable Diffusion can conjure photorealistic scenes from a text prompt, and recent breakthroughs in 3D generation—like DreamFusion or Wonder3D—can turn a single 2D image into a rotating 3D asset. However, if you are an engineer, a product designer, or a game developer, you likely face a frustrating reality: generated 3D meshes are often useless for manufacturing. ...

2025-02 · 9 min · 1851 words
[Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Cheng_Breaking_the_Memory_Barrier_of_Contrastive_Loss_via_Tile-Based_Strategy_CVPR_2025_paper.pdf)

How to Train CLIP with Infinite Batch Sizes: Breaking the Memory Barrier

In the world of modern AI, specifically in Representation Learning, there is a recurring theme: bigger is usually better. This is particularly true for contrastive learning models like CLIP (Contrastive Language-Image Pre-training). The secret sauce behind these models isn’t just the architecture; it’s the data, and more importantly, how much data the model sees at once. Research has consistently shown that larger batch sizes lead to better performance. A larger batch provides a more diverse set of “negative” samples (images that don’t match the text), forcing the model to learn much sharper, more discriminative features. ...
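
For concreteness, here is the standard CLIP-style symmetric contrastive loss in a few lines (a generic sketch, not the paper's tile-based implementation). The B×B logits matrix is both the point and the problem: every image is scored against every text in the batch, which is what makes large batches rich in negatives and brutally memory-hungry:

```python
# Generic CLIP-style contrastive loss; the (B, B) logits matrix is what
# explodes in memory as the batch size B grows.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    img_embs = F.normalize(img_embs, dim=-1)
    txt_embs = F.normalize(txt_embs, dim=-1)
    logits = img_embs @ txt_embs.t() / temperature   # (B, B): every pair scored
    targets = torch.arange(len(img_embs))            # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(512, 768), torch.randn(512, 768))
```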

10 min · 1961 words
[Boost Your Human Image Generation Model via Direct Preference Optimization 🔗](https://arxiv.org/abs/2405.20216)

Crossing the Uncanny Valley: How HG-DPO Uses Real Images to Train Better Diffusion Models

We have all seen them: AI-generated portraits that look almost right, but something is off. Perhaps the skin texture is too plastic, the eyes lack a certain spark, or the anatomy twists in ways human bones simply shouldn’t. Despite the massive leaps in diffusion models like Stable Diffusion, generating truly photorealistic humans remains one of the hardest challenges in computer vision. The core issue often lies in how these models are fine-tuned. Typically, researchers use methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods train the model by showing it two generated images—one “good” and one “bad”—and telling it to prefer the good one. But there is a ceiling to this approach: if the model’s “good” image is still artificial and flawed, the model is only learning to be the “best of a bad bunch.” It isn’t learning what real looks like. ...
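
For readers new to DPO, the vanilla objective on a single preferred/rejected pair looks roughly like the sketch below (a generic formulation, not HG-DPO's pipeline; for diffusion models the log-probabilities are replaced by suitable likelihood surrogates). The loss rewards the fine-tuned model for widening the gap between the “good” and “bad” samples relative to a frozen reference model, which is exactly where the ceiling comes from: both samples usually come from the model itself.

```python
# Vanilla DPO loss on one preferred ("win") / rejected ("lose") pair.
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    # Margin: how much more the trained model prefers the winner than the
    # frozen reference model does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
```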

2024-05 · 8 min · 1540 words
[Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB 🔗](https://arxiv.org/abs/2411.19474)

Why Blurry LiDAR and RGB Are the Future of Handheld 3D Scanning

In the world of computer vision and robotics, 3D reconstruction is the holy grail. Whether it’s a robot navigating a warehouse, a VR headset mapping your living room, or a Mars rover scanning a dune, the ability to turn the real world into a digital 3D model is critical. For years, the gold standard for handheld scanning (like what you might find on a high-end smartphone) has been a combination of an RGB camera and a sparse LiDAR sensor. This setup works reasonably well in perfect conditions. But the real world isn’t perfect. We encounter dark rooms, white textureless walls, and black objects that absorb light. In these “challenging” scenarios, traditional RGB-based reconstruction fails because it can’t “see” features, and sparse LiDAR fails because it doesn’t capture enough data points to fill in the gaps. ...

2024-11 · 9 min · 1708 words
[BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing 🔗](https://arxiv.org/abs/2504.01786)

Can AI Master Blender? Inside BlenderGym and the Quest for Automated 3D Editing

The world of 3D graphics—the backbone of modern video games, blockbuster movies, and architectural visualization—is notoriously complex. Creating a photorealistic scene isn’t just about artistic vision; it requires technical mastery of sophisticated software like Blender, Maya, or Unreal Engine. An artist doesn’t just “draw” a 3D chair; they manipulate geometry nodes, adjust material shaders, tweak lighting coordinates, and wrangle with physics simulations. Because this process is so time-consuming and specialized, researchers have been racing to automate it using Artificial Intelligence. We’ve seen the rise of Vision-Language Models (VLMs) that can look at an image and understand what’s in it. The dream is simple: tell an AI, “Make the lights dimmer and turn that wooden table into glass,” and have it execute the task instantly. ...

2025-04 · 9 min · 1765 words
[Balanced Rate-Distortion Optimization in Learned Image Compression 🔗](https://arxiv.org/abs/2502.20161)

Balancing Act: How Multi-Objective Optimization Boosts Learned Image Compression

In the world of digital media, we are constantly fighting a tug-of-war. On one side, we want high-quality images that look crisp and true to life (Low Distortion). On the other, we want files that are small enough to stream, store, and share instantly (Low Rate). This trade-off is the heart of image compression. Traditional codecs like JPEG or HEVC solve this with hand-tuned engineering. But recently, Learned Image Compression (LIC)—using deep neural networks to compress images—has started to outperform these traditional methods. LIC models learn from data how to best represent an image. ...
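
That tug-of-war is usually baked into a single scalarized training objective: rate plus λ times distortion, with one fixed λ per trained model. A minimal sketch of this standard loss (common LIC practice, not the paper's balanced multi-objective formulation; the 255² factor is a frequent convention when pixels are normalized to [0, 1]):

```python
# Standard scalarized rate-distortion objective for training a LIC model.
import torch

def rate_distortion_loss(x, x_hat, likelihoods, lam=0.01):
    n, _, h, w = x.shape                                 # images in NCHW layout
    num_pixels = n * h * w
    bpp = -torch.log2(likelihoods).sum() / num_pixels    # rate: bits per pixel
    mse = torch.mean((x - x_hat) ** 2)                   # distortion
    return bpp + lam * 255 ** 2 * mse
```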

2025-02 · 9 min · 1754 words
[BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Liu_BWFormer_Building_Wireframe_Reconstruction_from_Airborne_LiDAR_Point_Cloud_with_CVPR_2025_paper.pdf)

From Sparse Clouds to Sharp Edges: Reconstructing 3D Buildings with BWFormer

Imagine trying to draw a precise blueprint of a house, but all you have is a grainy, satellite-like scan taken from a plane flying overhead. Some parts of the roof are missing, trees are blocking the walls, and the data is just a collection of scattered dots. This is the reality of reconstructing 3D building models from airborne LiDAR (Light Detection and Ranging) point clouds. Building reconstruction is a cornerstone technology for smart cities, autonomous driving, and Virtual Reality/Augmented Reality (VR/AR). While we have become good at capturing data, turning that raw, noisy data into clean, lightweight “wireframes” (skeletal representations of geometry) remains a massive challenge. ...

8 min · 1597 words
[BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance 🔗](https://arxiv.org/abs/2502.19694)

Cleaning Up the Streets: How BEVDiffuser Enhances Autonomous Driving Perception without Slowing It Down

Imagine driving down a highway at night in pouring rain. Your eyes strain to distinguish between a parked car on the shoulder and a shadow, or between a distant streetlight and an oncoming vehicle. Now, imagine you are a computer algorithm trying to do the same thing. In autonomous driving, the vehicle’s “brain” typically relies on a Bird’s-Eye-View (BEV) representation. This is a top-down, grid-like map of the surroundings generated from onboard cameras and LiDAR sensors. This map is the foundation for everything the car does next: detecting objects, predicting movement, and planning a path. ...
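
To make “top-down, grid-like map” concrete, here is a toy rasterization of LiDAR returns into a BEV occupancy grid (purely illustrative; real BEV features are learned tensors produced by the camera/LiDAR backbone, but they live on the same kind of grid):

```python
# Rasterize LiDAR returns (x, y in metres, ego vehicle at the origin)
# into a top-down occupancy grid.
import numpy as np

def points_to_bev(points_xy, grid_range=50.0, resolution=0.5):
    size = int(2 * grid_range / resolution)          # e.g. 200 x 200 cells
    bev = np.zeros((size, size), dtype=np.float32)
    cols = ((points_xy[:, 0] + grid_range) / resolution).astype(int)
    rows = ((points_xy[:, 1] + grid_range) / resolution).astype(int)
    keep = (cols >= 0) & (cols < size) & (rows >= 0) & (rows < size)
    bev[rows[keep], cols[keep]] = 1.0                # mark occupied cells
    return bev

bev = points_to_bev(np.random.uniform(-60, 60, size=(10000, 2)))
```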

2025-02 · 8 min · 1588 words
[BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction 🔗](https://arxiv.org/abs/2503.19340)

When Geometry Meets Generative AI: A Deep Dive into BADGR

Imagine you are standing in the middle of a room, holding a camera, and taking a 360° panoramic photo. Now, walk into the next room and take another one. Can you reconstruct the entire floor plan of the house—accurate to the centimeter—just from those two photos? This is the problem of Wide-Baseline Floor Plan Reconstruction, and it is notoriously difficult. Unlike a video feed where camera frames are millimeters apart, wide-baseline images are taken far apart (often in different rooms). The visual overlap is small, and traditional computer vision algorithms struggle to stitch these “islands” of data together into a coherent map. ...

2025-03 · 9 min · 1751 words
[Assessing and Learning Alignment of Unimodal Vision and Language Models 🔗](https://arxiv.org/abs/2412.04616)

Breaking Up with CLIP - How to Build Better Vision-Language Models with 94% Less Data

For the last few years, the recipe for building a Vision-Language Model (VLM) has been relatively static. If you wanted a model that understood how images and text relate—like OpenAI’s CLIP—you needed to collect a massive dataset of hundreds of millions of image-text pairs and train two neural networks (one for vision, one for text) from scratch. This process is computationally expensive, data-hungry, and often results in models that are “jacks of all trades, masters of none.” The vision encoder might be decent, and the text encoder might be passable, but neither is state-of-the-art compared to models dedicated to a single modality. ...

2024-12 · 9 min · 1911 words