[Seeing More with Less: Human-like Representations in Vision Models 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Gizdov_Seeing_More_with_Less_Human-like_Representations_in_Vision_Models_CVPR_2025_paper.pdf)

Seeing More with Less: How Foveated Vision Optimizes AI Models

The human eye is a marvel of biological engineering, but it is also surprisingly economical. We do not perceive the world in uniform high definition. Instead, we possess a fovea—a small central region of high acuity—surrounded by a periphery that progressively blurs into low resolution. This mechanism allows us to process complex scenes efficiently, allocating limited biological resources (photoreceptors and optic nerve bandwidth) where they matter most. In contrast, modern Computer Vision (CV) models and Large Multimodal Models (LMMs) are brute-force processors. They typically ingest images at a uniform, high resolution across the entire Field of View (FOV). While effective, this approach is computationally expensive and bandwidth-heavy. ...

8 min · 1629 words
[SeedVR: Seeding Infinity in Diffusion Transformer Toward Generic Video Restoration 🔗](https://arxiv.org/abs/2501.01320)

SeedVR: Breaking the Speed and Resolution Limits of Video Restoration

Video restoration is a classic computer vision problem with a modern twist. We all have footage—whether it’s old family home movies, low-quality streams, or AI-generated clips—that suffers from blur, noise, or low resolution. The goal of Generic Video Restoration (VR) is to take these low-quality (LQ) inputs and reconstruct high-quality (HQ) outputs, recovering details that seem lost to time or compression. Recently, diffusion models have revolutionized this field. By treating restoration as a generative task, they can hallucinate realistic textures that traditional methods blur out. However, this power comes at a steep price: computational cost. ...

2025-01 · 8 min · 1677 words
[SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks 🔗](https://arxiv.org/abs/2503.06965)

Bridging the Gap Between Sky and Ground: A Deep Dive into SeCap for Cross-View Person Re-ID

In the evolving landscape of intelligent surveillance, we are witnessing a convergence of two distinct worlds: the ground and the sky. Traditional security systems rely heavily on CCTV cameras fixed at eye level or slightly above. However, the rapid proliferation of Unmanned Aerial Vehicles (UAVs), or drones, has introduced a new vantage point. This combination offers comprehensive coverage, but it also introduces a challenging matching problem known as Aerial-Ground Person Re-Identification (AGPReID). ...

2025-03 · 10 min · 1977 words
[Scene-Centric Unsupervised Panoptic Segmentation 🔗](https://arxiv.org/abs/2504.01955)

Can AI Understand Complex Scenes Without Labels? Inside CUPS

Imagine you are teaching a child to recognize objects in a busy city street. You point to a car and say “car,” point to the road and say “road.” Eventually, the child learns. This is essentially how Supervised Learning works in computer vision: we feed algorithms thousands of images where every pixel is painstakingly labeled by humans. But what if you couldn’t speak? What if the child had to learn purely by observing the world? They might notice that a “car” is a distinct object because it moves against the background. They might realize the “road” is a continuous surface because of how it recedes into the distance. ...

2025-04 · 8 min · 1663 words
[Scaling Vision Pre-Training to 4K Resolution 🔗](https://arxiv.org/abs/2503.19903)

Can AI See in 4K? Breaking the Resolution Barrier with PS3

Imagine driving down a highway. In the distance, you spot a sign. To read the text on it, your eyes naturally focus on that specific small area, perceiving it in high detail, while the rest of your peripheral vision remains lower resolution. You don’t process the entire landscape with the same microscopic intensity; that would overwhelm your brain. You prioritize. Current Artificial Intelligence, however, doesn’t work like that. Most modern Vision-Language Models (VLMs), like the ones powering GPT-4o or Gemini, struggle with high-resolution inputs. Standard vision encoders (like CLIP or SigLIP) are typically pre-trained at low resolutions, often around \(378 \times 378\) pixels. When these models face a 4K image, they either downsample it—turning that distant highway sign into a blurry mess—or they try to process the whole image in high resolution, which causes computational costs to explode quadratically. ...

2025-03 · 8 min · 1643 words
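To make the cost argument above concrete, here is a quick back-of-the-envelope sketch. It assumes, purely for illustration, a ViT-style encoder with \(14 \times 14\)-pixel patches and standard quadratic self-attention; the exact figures vary by encoder.

```python
# Rough token-count and attention-cost comparison for a ViT-style encoder.
# Assumptions (illustrative only): 14x14-pixel patches, full quadratic self-attention.

def num_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens for an image of the given size."""
    return (width // patch) * (height // patch)

low = num_tokens(378, 378)       # a typical CLIP/SigLIP pre-training resolution
high = num_tokens(3840, 2160)    # a 4K frame

print(f"378x378  -> {low:,} tokens")   # 729 tokens
print(f"4K frame -> {high:,} tokens")  # ~42,000 tokens
# Self-attention cost grows with the square of the token count:
print(f"attention cost ratio ~ {(high / low) ** 2:,.0f}x")
```

Under these assumptions, a single 4K frame carries roughly 58 times more tokens than the pre-training resolution, and the attention cost balloons by a factor of a few thousand, which is the blow-up the excerpt refers to.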
[Scaling Inference Time Compute for Diffusion Models 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Ma_Scaling_Inference_Time_Compute_for_Diffusion_Models_CVPR_2025_paper.pdf)

Beyond Denoising: Unlocking the Power of Inference-Time Search in Diffusion Models

In the era of Generative AI, we have grown accustomed to a simple truth known as “Scaling Laws”: if you want a better model, you need to train it on more data, with more parameters, for a longer time. This recipe has driven the explosive success of Large Language Models (LLMs) and diffusion models alike. But recently, a new frontier has opened up in the LLM world. Researchers have discovered that you don’t always need a bigger model to get a smarter answer; sometimes, you just need to let the model “think” longer during inference. Techniques like Chain-of-Thought reasoning or tree search allow models to scale their performance after training is complete, simply by using more compute power to generate the response. ...

8 min · 1684 words
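To ground what inference-time search looks like for a diffusion model, here is a minimal sketch of the simplest variant: sample several candidate initial noises, denoise each, and keep the sample a verifier scores highest. The `generate` and `score` callables are stand-ins for a full denoising pipeline and a quality verifier (e.g., a CLIP-similarity or aesthetic scorer); they, and the latent shape, are assumptions for illustration, not the paper's exact setup.

```python
# Best-of-N search over initial noises, scored by a verifier (illustrative sketch).
from typing import Callable
import torch

def best_of_n(
    prompt: str,
    generate: Callable[[str, torch.Tensor], torch.Tensor],  # denoising loop: (prompt, init noise) -> image
    score: Callable[[torch.Tensor, str], float],             # verifier: (image, prompt) -> quality score
    n_candidates: int = 8,
    seed: int = 0,
):
    """Spend extra inference compute by searching over noise candidates."""
    best_img, best_score = None, float("-inf")
    gen = torch.Generator().manual_seed(seed)
    for _ in range(n_candidates):
        noise = torch.randn(1, 4, 64, 64, generator=gen)  # latent-shaped noise (shape is illustrative)
        img = generate(prompt, noise)
        s = score(img, prompt)
        if s > best_score:
            best_img, best_score = img, s
    return best_img, best_score
```

More candidates means more compute at inference time and, up to a point, better samples; the paper studies how far this kind of search can be pushed and which verifiers to use.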
[Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution 🔗](https://arxiv.org/abs/2502.07814)

Decoding the Weather with Diffusion - How Satellite Data Guides Super-Resolution

Weather forecasting is a game of scales. On a global level, we understand the movement of massive pressure systems and jet streams reasonably well. But as we zoom in—to the level of a city, a farm, or a wind turbine—things get blurry. The data we rely on, often from reanalysis datasets like ERA5, is typically provided in low-resolution grids (e.g., 25 km × 25 km blocks). ...

2025-02 · 9 min · 1707 words
[Samba: A Unified Mamba-based Framework for General Salient Object Detection 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/He_Samba_A_Unified_Mamba-based_Framework_for_General_Salient_Object_Detection_CVPR_2025_paper.pdf)

Can Mamba Beat Transformers? Exploring Samba for Salient Object Detection

When you look at a photograph, your eyes don’t process every pixel with equal intensity. You instantly focus on the “important” parts—a person waving, a bright red car, or a cat sitting on a fence. This biological mechanism is what computer vision researchers call Salient Object Detection (SOD). For years, the field of SOD has been a tug-of-war between two dominant architectures: Convolutional Neural Networks (CNNs) and Transformers. CNNs are efficient but struggle to understand the “whole picture” (limited receptive fields). Transformers are masters of global context but are computationally heavy, with complexity growing quadratically as image resolution increases. ...

7 min · 1458 words
[SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer 🔗](https://arxiv.org/abs/2503.15934)

Can Mamba Paint Like Van Gogh? Exploring SaMam for Efficient Style Transfer

In the world of computer vision, Image Style Transfer (ST) is one of the most visually captivating tasks. It enables us to take a content image (like a photograph of a street) and a style image (like Starry Night), merging them so the photograph looks as if Van Gogh painted it himself. While the artistic results are stunning, the engineering behind them faces a significant bottleneck: the trade-off between generation quality and computational efficiency. ...

2025-03 · 8 min · 1565 words
[SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity 🔗](https://arxiv.org/abs/2503.20354)

SURGEON: How to Adapt Deep Learning Models on the Edge Without Running Out of Memory

Imagine you have trained a state-of-the-art computer vision model to detect pedestrians for a self-driving car. It works perfectly in sunny California where it was trained. But when you deploy it to a rainy street in London, accuracy plummets. The visual conditions—the “distribution”—have shifted. To fix this, researchers use Test-Time Adaptation (TTA). Instead of freezing the model after training, TTA allows the model to continue learning from the new, incoming data (like the rainy London streets) in real-time. It effectively fine-tunes the model “online” to adapt to the current environment. ...

2025-03 · 10 min · 2080 words
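To ground what "online fine-tuning" means here, below is a minimal, generic test-time adaptation loop in the spirit of entropy-minimization methods such as TENT. It is not SURGEON's memory-adaptive scheme, just the basic TTA pattern such methods build on; the `stream` of unlabeled test batches and a classification model containing normalization layers are assumed.

```python
# Generic entropy-minimization TTA loop (TENT-style), for illustration only.
# SURGEON's contribution is making adaptation memory-adaptive, which is not shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_on_stream(model: nn.Module, stream, lr: float = 1e-4, device: str = "cpu"):
    model.to(device).train()
    # Adapt only the affine parameters of normalization layers (a common, cheap choice);
    # assumes the model actually contains such layers.
    params = [p for m in model.modules()
              if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm))
              for p in m.parameters()]
    optimizer = torch.optim.SGD(params, lr=lr)

    for batch in stream:                   # unlabeled test batches arriving online
        batch = batch.to(device)
        logits = model(batch)
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
        optimizer.zero_grad()
        entropy.backward()                 # update the model using only its own predictions
        optimizer.step()
    return model
```

Even this stripped-down loop must cache activations for backpropagation, which is exactly the memory pressure on edge devices that SURGEON targets.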
[STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection 🔗](https://arxiv.org/abs/2504.02823)

STING-BEE: Revolutionizing Airport Security with Vision-Language Models

Imagine standing in a bustling airport security line. As your bag disappears into the X-ray tunnel, a security officer stares intently at a monitor, deciphering a complex, pseudo-colored jumble of overlapping shapes. Their job is to identify threats—guns, knives, explosives—hidden amidst cables, laptops, and clothes. This task requires immense concentration, and human fatigue or distraction can lead to critical errors. While Artificial Intelligence has stepped in to assist with Computer-Aided Screening (CAS), current systems have a major limitation: they operate in a “closed-set” paradigm. This means they can only detect specific items they were explicitly trained on. If a threat is a novel variation—like a 3D-printed gun made of polymer or a dismantled explosive hidden inside a radio—traditional models often fail. Furthermore, general-purpose Large Multimodal Models (LMMs) like GPT-4, which excel at describing natural images, struggle significantly when presented with the translucent, overlapping nature of X-ray imagery. ...

2025-04 · 9 min · 1852 words
[STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models 🔗](https://arxiv.org/abs/2408.16807)

STEREO: Bulletproofing Diffusion Models Against Concept Regeneration Attacks

The rise of Large-scale Text-to-Image Diffusion (T2ID) models, such as Stable Diffusion, has revolutionized digital creativity. With a simple text prompt, users can conjure photorealistic images, art, and designs. However, this power comes with significant risks. Trained on massive datasets scraped from the open internet, these models often inadvertently memorize and generate inappropriate content—ranging from NSFW material and copyrighted artistic styles to prohibited objects. To combat this, the field of Concept Erasure emerged. The goal is simple: modify the model so it refuses to generate specific “banned” concepts (like nudity or a specific artist’s style). Early methods showed promise, but researchers quickly discovered a glaring security hole. Even after a concept is “erased,” a clever adversary can bring it back. By using “jailbreak” prompts or injecting specific mathematical embeddings, attackers can bypass the erasure mechanisms, regenerating the very content the developers tried to hide. ...

2024-08 · 8 min · 1591 words
[SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Reformulation and Split Optimization 🔗](https://arxiv.org/abs/2409.17993)

Cracking the Cross-Modal Code: How SSHNet Aligns Images from Different Sensors Without Labels

In computer vision, one of the most fundamental tasks is alignment. Whether it is a drone navigating via satellite maps, a robot fusing infrared and visible light data, or a medical system overlaying MRI and CT scans, the system must understand how two images relate to each other geometrically. This relationship is often described by a homography—a transformation that maps points from one perspective to another. When images come from the same sensor (e.g., two standard photos), finding this relationship is relatively straightforward. We can simply slide one image over the other until the pixels match. However, the problem becomes significantly harder in cross-modal scenarios. How do you align a black-and-white thermal image with a color satellite photo? The pixel intensities are completely different; a hot engine is bright white in thermal but might be a dark grey block in a visible photo. ...

2024-09 · 9 min · 1855 words
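As a quick refresher on the homography itself: it is a \(3 \times 3\) matrix \(H\) that maps a point \((x, y)\) in one image to \((x', y')\) in the other, up to a scale factor that is removed by dividing through by the third homogeneous coordinate. A small numeric sketch follows; the matrix values are made up for illustration.

```python
import numpy as np

# A 3x3 homography (values are illustrative, not from the paper).
H = np.array([
    [1.05,  0.02, 30.0],   # mild scale/shear plus a 30-px horizontal shift
    [-0.01, 0.98, 12.0],
    [1e-4,  2e-5,  1.0],   # last row introduces perspective distortion
])

def warp_point(H: np.ndarray, x: float, y: float) -> tuple[float, float]:
    """Map a pixel (x, y) through H, dividing by the homogeneous coordinate."""
    xh, yh, w = H @ np.array([x, y, 1.0])
    return xh / w, yh / w

print(warp_point(H, 100.0, 50.0))  # where pixel (100, 50) lands in the other image
```

Estimating \(H\) when pixel intensities agree is a matter of matching appearances; SSHNet's challenge is estimating it when the two images share geometry but not appearance.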
[SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts 🔗](https://arxiv.org/abs/2503.06467)

How Large Multimodal Models Are Solving the Data Scarcity Problem in 3D Object Detection

If you have ever played around with computer vision, you know the drill: models are hungry. They have an insatiable appetite for data, specifically labeled data. In the world of 2D images, drawing a box around a cat is relatively easy. But in the realm of autonomous driving, where perception relies on 3D point clouds generated by LiDAR, the game changes. Labeling a 3D scene is notoriously difficult. Annotators must navigate a complex, sparse 3D space, rotating views to draw precise 3D bounding boxes around cars, pedestrians, and cyclists. It is slow, expensive, and prone to human error. ...

2025-03 · 10 min · 1932 words
[SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos 🔗](https://arxiv.org/abs/2412.09401)

Can We Map the World in Real-Time Without Camera Poses? Deep Dive into SLAM3R

One of the holy grails in computer vision is the ability to take a simple video from a smartphone and instantly turn it into a highly detailed, dense 3D model of the environment. Imagine walking through a room, filming it, and having a digital twin ready on your screen by the time you stop recording. For decades, this has been a massive challenge. Traditional methods force us to choose between quality and speed. You could have a highly accurate model if you were willing to wait hours for offline processing (using Structure-from-Motion and Multi-View Stereo). Or, you could have real-time performance using SLAM (Simultaneous Localization and Mapping), but often at the cost of sparse, noisy, or incomplete geometry. ...

2024-12 · 8 min · 1697 words
[SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_SKDream_Controllable_Multi-view_and_3D_Generation_with_Arbitrary_Skeletons_CVPR_2025_paper.pdf)

Beyond Human Poses: Generating 3D Creatures with Arbitrary Skeletons using SKDream

The field of generative AI has moved at a breakneck pace. We started with generating 2D images from text, moved to generating 3D assets, and are now pushing the boundaries of controllability. While text prompts like “a fierce dragon” are powerful, they leave a lot to chance. What if you want that dragon to be in a specific crouching pose? What if you want a tree with branches in exact locations? ...

9 min · 1781 words
[SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer 🔗](https://arxiv.org/abs/2503.04119)

Bridging the Semantic Gap in Neural Style Transfer: A Deep Dive into SCSA

Neural Style Transfer (NST) has been one of the most visually captivating applications of Deep Learning. The ability to take a photo of your local park and render it in the swirling, impressionistic strokes of Van Gogh’s The Starry Night feels like magic. Over the years, the field has evolved from slow, optimization-based methods to “Arbitrary Style Transfer” (AST)—systems that can apply any style to any content image in real-time. ...

2025-03 · 9 min · 1757 words
[SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation 🔗](https://arxiv.org/abs/2411.17646)

Making SAM2 Wiser: How to Teach a Segmentation Model to Understand Text and Time

The release of the Segment Anything Model (SAM) and its video successor, SAM2, marked a pivotal moment in computer vision. These models are incredibly powerful; given a point or a bounding box, they can segment an object with near-perfect accuracy and track it through a video. But there is a catch: SAM2 is “mute.” It doesn’t understand natural language. You cannot simply ask it to “segment the cat climbing the tree” or “track the red car turning left.” It requires explicit geometric prompts (clicks or boxes). Furthermore, while SAM2 is a master of pixel-matching, it lacks high-level reasoning about time and motion. It sees a video as a sequence of images to track, not as an event where actions unfold. ...

2024-11 · 7 min · 1361 words
[SACB-Net: Spatial-awareness Convolutions for Medical Image Registration 🔗](https://arxiv.org/abs/2503.19592)

Beyond Shared Weights: How SACB-Net Adapts Convolutions for Medical Image Registration

In the world of medical imaging, alignment is everything. Whether a clinician is tracking the growth of a tumor over time or comparing a patient’s brain anatomy to a standard atlas, the images must be perfectly overlaid. This process is called Deformable Image Registration (DIR). For years, Deep Learning has revolutionized this field. Networks like VoxelMorph replaced computationally expensive iterative algorithms with fast, learning-based models. Most of these models rely on Convolutional Neural Networks (CNNs). However, standard CNNs have a fundamental trait that acts as a double-edged sword: spatial invariance. ...

2025-03 · 9 min · 1732 words
[RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins 🔗](https://arxiv.org/abs/2504.13059)

RoboTwin: How Generative AI Creates Digital Twins to Train Dual-Arm Robots

In the world of robotics, we often marvel at videos of robots performing backflips or dancing. But ask a robot to coordinate two hands to place a pair of shoes neatly into a box, and you will likely see it struggle. Dual-arm coordination is the next frontier in robotic manipulation. While single-arm tasks (like picking and placing an object) have seen massive success, bimanual (two-armed) manipulation introduces exponential complexity. The arms must avoid colliding with each other, coordinate handovers, and handle objects that are too large or awkward for a single gripper. ...

2025-04 · 8 min · 1564 words