Papers

[Scene-Centric Unsupervised Panoptic Segmentation 🔗](https://arxiv.org/abs/2504.01955)

Can AI Understand Complex Scenes Without Labels? Inside CUPS

Imagine you are teaching a child to recognize objects in a busy city street. You point to a car and say “car,” point to the road and say “road.” Eventually, the child learns. This is essentially how Supervised Learning works in computer vision: we feed algorithms thousands of images where every pixel is painstakingly labeled by humans. But what if you couldn’t speak? What if the child had to learn purely by observing the world? They might notice that a “car” is a distinct object because it moves against the background. They might realize the “road” is a continuous surface because of how it recedes into the distance. ...

[Scaling Vision Pre-Training to 4K Resolution 🔗](https://arxiv.org/abs/2503.19903)

Can AI See in 4K? Breaking the Resolution Barrier with PS3

Imagine driving down a highway. In the distance, you spot a sign. To read the text on it, your eyes naturally focus on that specific small area, perceiving it in high detail, while the rest of your peripheral vision remains lower resolution. You don’t process the entire landscape with the same microscopic intensity; that would overwhelm your brain. You prioritize. Current Artificial Intelligence, however, doesn’t work like that. Most modern Vision-Language Models (VLMs), like the ones powering GPT-4o or Gemini, struggle with high-resolution inputs. Standard vision encoders (like CLIP or SigLIP) are typically pre-trained at low resolutions, often around \(378 \times 378\) pixels. When these models face a 4K image, they either downsample it, turning that distant highway sign into a blurry mess, or they try to process the whole image in high resolution, which causes computational costs to explode quadratically. ...

[Scaling Inference Time Compute for Diffusion Models 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Ma_Scaling_Inference_Time_Compute_for_Diffusion_Models_CVPR_2025_paper.pdf)

Beyond Denoising: Unlocking the Power of Inference-Time Search in Diffusion Models

In the era of Generative AI, we have grown accustomed to a simple truth known as “Scaling Laws”: if you want a better model, you need to train it on more data, with more parameters, for a longer time. This recipe has driven the explosive success of Large Language Models (LLMs) and diffusion models alike. But recently, a new frontier has opened up in the LLM world. Researchers have discovered that you don’t always need a bigger model to get a smarter answer; sometimes, you just need to let the model “think” longer during inference. Techniques like Chain-of-Thought or Tree of Thoughts allow models to scale their performance after training is complete, simply by using more compute to generate the response. ...

[Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution 🔗](https://arxiv.org/abs/2502.07814)

Decoding the Weather with Diffusion - How Satellite Data Guides Super-Resolution

Weather forecasting is a game of scales. On a global level, we understand the movement of massive pressure systems and jet streams reasonably well. But as we zoom in to the level of a city, a farm, or a wind turbine, things get blurry. The data we rely on, often from reanalysis datasets like ERA5, is typically provided in low-resolution grids (e.g., 25 km x 25 km blocks). ...

[Samba: A Unified Mamba-based Framework for General Salient Object Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/He_Samba_A_Unified_Mamba-based_Framework_for_General_Salient_Object_Detection_CVPR_2025_paper.pdf)

Can Mamba Beat Transformers? Exploring Samba for Salient Object Detection

When you look at a photograph, your eyes don’t process every pixel with equal intensity. You instantly focus on the “important” parts—a person waving, a bright red car, or a cat sitting on a fence. This biological mechanism is what computer vision researchers call Salient Object Detection (SOD). For years, the field of SOD has been a tug-of-war between two dominant architectures: Convolutional Neural Networks (CNNs) and Transformers. CNNs are efficient but struggle to understand the “whole picture” (limited receptive fields). Transformers are masters of global context but are computationally heavy, with complexity growing quadratically as image resolution increases. ...

[SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer 🔗](https://arxiv.org/abs/2503.15934)

Can Mamba Paint Like Van Gogh? Exploring SaMam for Efficient Style Transfer

In the world of computer vision, Image Style Transfer (ST) is one of the most visually captivating tasks. It enables us to take a content image (like a photograph of a street) and a style image (like Starry Night), merging them so the photograph looks as if Van Gogh painted it himself. While the artistic results are stunning, the engineering behind them faces a significant bottleneck: the trade-off between generation quality and computational efficiency. ...

[SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity 🔗](https://arxiv.org/abs/2503.20354)

SURGEON: How to Adapt Deep Learning Models on the Edge Without Running Out of Memory

Imagine you have trained a state-of-the-art computer vision model to detect pedestrians for a self-driving car. It works perfectly in sunny California where it was trained. But when you deploy it to a rainy street in London, accuracy plummets. The visual conditions (the “distribution”) have shifted. To fix this, researchers use Test-Time Adaptation (TTA). Instead of freezing the model after training, TTA allows the model to continue learning from the new, incoming data (like the rainy London streets) in real-time. It effectively fine-tunes the model “online” to adapt to the current environment. ...

[STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection 🔗](https://arxiv.org/abs/2504.02823)

STING-BEE: Revolutionizing Airport Security with Vision-Language Models

Imagine standing in a bustling airport security line. As your bag disappears into the X-ray tunnel, a security officer stares intently at a monitor, deciphering a complex, pseudo-colored jumble of overlapping shapes. Their job is to identify threats (guns, knives, explosives) hidden amidst cables, laptops, and clothes. This task requires immense concentration, and human fatigue or distraction can lead to critical errors. While Artificial Intelligence has stepped in to assist with Computer-Aided Screening (CAS), current systems have a major limitation: they operate in a “closed-set” paradigm. This means they can only detect specific items they were explicitly trained on. If a threat is a novel variation, like a 3D-printed gun made of polymer or a dismantled explosive hidden inside a radio, traditional models often fail. Furthermore, general-purpose Large Multimodal Models (LMMs) like GPT-4, which excel at describing natural images, struggle significantly when presented with the translucent, overlapping nature of X-ray imagery. ...

[STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models 🔗](https://arxiv.org/abs/2408.16807)

STEREO: Bulletproofing Diffusion Models Against Concept Regeneration Attacks

The rise of Large-scale Text-to-Image Diffusion (T2ID) models, such as Stable Diffusion, has revolutionized digital creativity. With a simple text prompt, users can conjure photorealistic images, art, and designs. However, this power comes with significant risks. Trained on massive datasets scraped from the open internet, these models often inadvertently memorize and generate inappropriate content, ranging from NSFW material and copyrighted artistic styles to prohibited objects. To combat this, the field of Concept Erasure emerged. The goal is simple: modify the model so it refuses to generate specific “banned” concepts (like nudity or a specific artist’s style). Early methods showed promise, but researchers quickly discovered a glaring security hole. Even after a concept is “erased,” a clever adversary can bring it back. By using “jailbreak” prompts or injecting specific mathematical embeddings, attackers can bypass the erasure mechanisms, regenerating the very content the developers tried to hide. ...

[SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Reformulation and Split Optimization 🔗](https://arxiv.org/abs/2409.17993)

Cracking the Cross-Modal Code: How SSHNet Aligns Images from Different Sensors Without Labels

In computer vision, one of the most fundamental tasks is alignment. Whether it is a drone navigating via satellite maps, a robot fusing infrared and visible light data, or a medical system overlaying MRI and CT scans, the system must understand how two images relate to each other geometrically. This relationship is often described by a homography, a transformation that maps points from one perspective to another. When images come from the same sensor (e.g., two standard photos), finding this relationship is relatively straightforward. We can simply slide one image over the other until the pixels match. However, the problem becomes significantly harder in cross-modal scenarios. How do you align a black-and-white thermal image with a color satellite photo? The pixel intensities are completely different; a hot engine is bright white in thermal but might be a dark grey block in a visible photo. ...

[SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts 🔗](https://arxiv.org/abs/2503.06467)

How Large Multimodal Models Are Solving the Data Scarcity Problem in 3D Object Detection

If you have ever played around with computer vision, you know the drill: models are hungry. They have an insatiable appetite for data, specifically labeled data. In the world of 2D images, drawing a box around a cat is relatively easy. But in the realm of autonomous driving, where perception relies on 3D point clouds generated by LiDAR, the game changes. Labeling a 3D scene is notoriously difficult. Annotators must navigate a complex, sparse 3D space, rotating views to draw precise 3D bounding boxes around cars, pedestrians, and cyclists. It is slow, expensive, and prone to human error. ...

[SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos 🔗](https://arxiv.org/abs/2412.09401)

Can We Map the World in Real-Time Without Camera Poses? Deep Dive into SLAM3R

One of the holy grails in computer vision is the ability to take a simple video from a smartphone and instantly turn it into a highly detailed, dense 3D model of the environment. Imagine walking through a room, filming it, and having a digital twin ready on your screen by the time you stop recording. For decades, this has been a massive challenge. Traditional methods force us to choose between quality and speed. You could have a highly accurate model if you were willing to wait hours for offline processing (using Structure-from-Motion and Multi-View Stereo). Or, you could have real-time performance using SLAM (Simultaneous Localization and Mapping), but often at the cost of sparse, noisy, or incomplete geometry. ...

[SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_SKDream_Controllable_Multi-view_and_3D_Generation_with_Arbitrary_Skeletons_CVPR_2025_paper.pdf)

Beyond Human Poses: Generating 3D Creatures with Arbitrary Skeletons using SKDream

The field of generative AI has moved at a breakneck pace. We started with generating 2D images from text, moved to generating 3D assets, and are now pushing the boundaries of controllability. While text prompts like “a fierce dragon” are powerful, they leave a lot to chance. What if you want that dragon to be in a specific crouching pose? What if you want a tree with branches in exact locations? ...

[SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer 🔗](https://arxiv.org/abs/2503.04119)

Bridging the Semantic Gap in Neural Style Transfer: A Deep Dive into SCSA

Neural Style Transfer (NST) has been one of the most visually captivating applications of Deep Learning. The ability to take a photo of your local park and render it in the swirling, impressionistic strokes of Van Gogh’s The Starry Night feels like magic. Over the years, the field has evolved from slow, optimization-based methods to “Arbitrary Style Transfer” (AST): systems that can apply any style to any content image in real-time. ...

[SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation 🔗](https://arxiv.org/abs/2411.17646)

Making SAM2 Wiser: How to Teach a Segmentation Model to Understand Text and Time

The release of the Segment Anything Model (SAM) and its video successor, SAM2, marked a pivotal moment in computer vision. These models are incredibly powerful; given a point or a bounding box, they can segment an object with near-perfect accuracy and track it through a video. But there is a catch: SAM2 is “mute.” It doesn’t understand natural language. You cannot simply ask it to “segment the cat climbing the tree” or “track the red car turning left.” It requires explicit geometric prompts (clicks or boxes). Furthermore, while SAM2 is a master of pixel-matching, it lacks high-level reasoning about time and motion. It sees a video as a sequence of images to track, not as an event where actions unfold. ...

[SACB-Net: Spatial-awareness Convolutions for Medical Image Registration 🔗](https://arxiv.org/abs/2503.19592)

Beyond Shared Weights: How SACB-Net Adapts Convolutions for Medical Image Registration

In the world of medical imaging, alignment is everything. Whether a clinician is tracking the growth of a tumor over time or comparing a patient’s brain anatomy to a standard atlas, the images must be perfectly overlaid. This process is called Deformable Image Registration (DIR). For years, Deep Learning has revolutionized this field. Networks like VoxelMorph replaced computationally expensive iterative algorithms with fast, learning-based models. Most of these models rely on Convolutional Neural Networks (CNNs). However, standard CNNs have a fundamental trait that acts as a double-edged sword: spatial invariance. ...

[RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins 🔗](https://arxiv.org/abs/2504.13059)

RoboTwin: How Generative AI Creates Digital Twins to Train Dual-Arm Robots

In the world of robotics, we often marvel at videos of robots performing backflips or dancing. But ask a robot to coordinate two hands to place a pair of shoes neatly into a box, and you will likely see it struggle. Dual-arm coordination is the next frontier in robotic manipulation. While single-arm tasks (like picking and placing an object) have seen massive success, bimanual (two-armed) manipulation introduces exponential complexity. The arms must avoid colliding with each other, coordinate handovers, and handle objects that are too large or awkward for a single gripper. ...

[RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training 🔗](https://arxiv.org/abs/2411.17662)

RoboPEPP: Teaching AI to See Robot Poses Through Physical Intuition

Imagine a robot arm working in a bustling kitchen or on a manufacturing floor. To collaborate safely with humans or other machines, this robot needs to know exactly where it is in space relative to the camera watching it. This is known as robot pose estimation. Usually, this is straightforward because we can cheat: we ask the robot’s internal motor encoders what its joint angles are. But what if we can’t trust those sensors? What if we are observing a robot we don’t control? Or what if we simply want a purely vision-based redundancy system? ...

[Revisiting MAE pre-training for 3D medical image segmentation 🔗](https://arxiv.org/abs/2410.23132)

Spark3D: How Simple Masked Autoencoders are Revolutionizing 3D Medical Imaging

In the fast-paced world of Deep Learning, we often look for the “next big thing”—a new Transformer architecture, a complex loss function, or a revolutionary optimizer. However, sometimes the most significant breakthroughs come not from inventing something entirely new, but from taking a simple, powerful idea and engineering it to perfection. This is exactly what the authors of the paper “Revisiting MAE pre-training for 3D medical image segmentation” have achieved. They took the concept of Masked Autoencoders (MAEs)—a technique that has dominated natural language processing and computer vision—and rigorously adapted it for 3D medical imaging. ...

[Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition 🔗](https://arxiv.org/abs/2411.18941)

Can AI Tell the Difference Between Writing and Typing? How ProtoGCN Uses Prototypes to Master Fine-Grained Action Recognition

Imagine watching a silhouette of a person sitting at a desk. Their arms are moving. Are they writing a letter, or are they typing on a keyboard? To the casual observer, and indeed to many computer vision algorithms, these two actions look remarkably similar. The posture is the same; the active body parts (arms and hands) are the same. The difference lies in the subtle, fine-grained details of how those joints move relative to one another. ...