![Cover image](https://deep-paper.org/en/paper/2501.03841/images/cover.png)
Bridging the Gap: How OmniManip Connects VLM Reasoning to Precise Robot Action
The dream of general-purpose robotics is a machine that can walk into a messy kitchen, identify a teapot and a cup, and pour you a drink without having been explicitly programmed for that specific teapot or that specific cup. In recent years, we have seen massive leaps in Vision-Language Models (VLMs). These models (like GPT-4V) have incredible “common sense.” They can look at an image and tell you, “That is a teapot, you hold it by the handle, and you pour liquid from the spout.” However, knowing what to do is very different from knowing exactly how to do it in 3D space. ...