[Unlocking Generalization Power in LiDAR Point Cloud Registration 🔗](https://arxiv.org/abs/2503.10149)

Why Less is More - Removing Cross-Attention to Solve LiDAR Generalization

In the rapidly evolving world of autonomous driving and robotics, sensors are the eyes of the machine. LiDAR (Light Detection and Ranging) stands out as a critical sensor, providing precise 3D maps of the environment. However, raw 3D points are just the starting point. To make sense of the world, a vehicle must “register” these point clouds—stitching together scans taken at different times or from different locations to calculate its own movement and map its surroundings. ...
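
For readers new to the term, the simplest rigid form of registration can be sketched in a few lines: given two scans with known point correspondences, the classic Kabsch/SVD alignment recovers the rotation and translation between them. The snippet below is only an illustrative baseline of that idea; it is not the learned, cross-attention-free pipeline the paper proposes.

```python
# Minimal rigid registration with known correspondences (Kabsch/SVD).
# Illustrative baseline only -- not the paper's learned method.
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Estimate R, t such that R @ src_i + t ≈ dst_i for (N, 3) arrays."""
    src_c = src - src.mean(axis=0)            # center both clouds
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

# Toy usage: recover a known 90-degree yaw and a small translation.
rng = np.random.default_rng(0)
scan_a = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
scan_b = scan_a @ R_true.T + np.array([1.0, 2.0, 0.5])
R_est, t_est = rigid_align(scan_a, scan_b)    # ≈ R_true, (1, 2, 0.5)
```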

2025-03 · 8 min · 1626 words
[Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation 🔗](https://arxiv.org/abs/2412.01027)

InstaManip: Teaching AI to Edit Images by Example Using Group Self-Attention

Introduction: The Limits of Language in Image Editing We are currently living through a golden age of text-to-image generation. Models like Midjourney, DALL-E, and Stable Diffusion have made it incredibly easy to conjure detailed worlds from a simple sentence. However, a significant gap remains between generating an image from scratch and editing an existing one precisely. Consider a specific scenario: You have a photo of a standard sedan, and you want to transform it into a Lamborghini. You type the instruction: “Make it a Lamborghini.” ...

2024-12 · 10 min · 2068 words
[Universal Scene Graph Generation 🔗](https://arxiv.org/abs/2503.15005)

One Graph to Rule Them All - Unifying Vision, Text, and 3D with Universal Scene Graphs

Introduction Imagine you are a robot walking into a room. You see a man sitting on a sofa. You hear someone say, “Peter is relaxing.” Your depth sensors tell you the sofa is against a wall. As humans, we process all this information seamlessly. We don’t create a separate mental model for what we see, another for what we hear, and a third for spatial depth. We integrate them into a single understanding of the scene: Peter is on the sofa against the wall. ...
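
To make the idea of a single shared representation concrete, here is a hypothetical sketch of a unified scene graph as a set of (subject, predicate, object) triples, with observations from vision, language, and depth merged after a simple alias step; the node names and the alias-based grounding are illustrative assumptions, not the paper's actual mechanism.

```python
# Hypothetical sketch: one set of triples shared across modalities.
vision_triples = {("man", "sitting on", "sofa")}     # from the camera
text_triples   = {("Peter", "is", "relaxing")}       # from speech/text
depth_triples  = {("sofa", "against", "wall")}       # from depth sensing

# Cross-modal grounding: the "man" the camera sees is the "Peter"
# mentioned in speech, so both map to one node before merging.
alias = {"man": "Peter"}
unified = {
    (alias.get(s, s), p, alias.get(o, o))
    for s, p, o in vision_triples | text_triples | depth_triples
}
# -> three triples, all anchored to the same "Peter" and "sofa" nodes:
#    Peter -sitting on-> sofa, Peter -is-> relaxing, sofa -against-> wall
print(unified)
```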

2025-03 · 9 min · 1750 words
[Unified Reconstruction of Static and Dynamic Scenes from Events 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Gao_Unified_Reconstruction_of_Static_and_Dynamic_Scenes_from_Events_CVPR_2025_paper.pdf)

Seeing the Unseen - How URSEE Reconstructs Static Worlds from Dynamic Event Cameras

Introduction Imagine a camera that works like the human eye. It doesn’t take snapshots frame-by-frame; instead, it only reacts when something changes. If you stare at a perfectly still wall, your optic nerve stops firing signals about the wall (though your eyes make tiny, imperceptible movements to prevent this blindness). This is the principle behind Event Cameras (or Dynamic Vision Sensors). They are revolutionary pieces of technology that capture brightness changes asynchronously with microsecond precision. They excel at capturing high-speed motion—think catching a bullet in flight or a drone dodging obstacles—without the motion blur or low dynamic range of standard cameras. ...
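
For readers who have not worked with these sensors, an event stream is usually just a list of (timestamp, x, y, polarity) tuples rather than frames; the toy sketch below accumulates such events into a crude "event frame". It is only meant to show the data format, not URSEE's reconstruction method.

```python
# Toy event-camera data: each event is (timestamp_us, x, y, polarity).
# Accumulating events over a window gives a crude signed "event frame".
import numpy as np

H, W = 4, 4
events = [
    (10,  1, 2, +1),   # brightness went up at pixel (x=1, y=2)
    (35,  1, 2, -1),   # ...then back down
    (120, 3, 0, +1),
]

frame = np.zeros((H, W), dtype=np.int32)
for t_us, x, y, pol in events:
    frame[y, x] += pol           # signed accumulation per pixel

print(frame)                      # nonzero only where brightness changed;
                                  # a perfectly static pixel stays at 0
```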

8 min · 1634 words
[UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior 🔗](https://arxiv.org/abs/2501.13134)

Bridging the Gap: How UniRestore Unifies Human Vision and AI Perception

Imagine you are driving an autonomous vehicle through a thick, heavy fog. For you, the driver, the goal is Perceptual Image Restoration (PIR). You want the fog cleared from your vision so you can see the scenery, the road texture, and the world in high fidelity. You care about aesthetics and clarity. For the car’s computer, however, the goal is Task-Oriented Image Restoration (TIR). The AI doesn’t care if the trees look pretty; it cares about edge detection, object classification, and semantic segmentation. It needs to know exactly where the pedestrian is and where the lane marker ends. ...

2025-01 · 9 min · 1855 words
[UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics 🔗](https://arxiv.org/abs/2412.07774)

UniReal: Unifying Image Generation and Editing by Learning from Video Dynamics

Introduction In the rapidly evolving world of Generative AI, we have witnessed a fragmentation of tools. If you want to generate an image from scratch, you might use Stable Diffusion or Midjourney. If you want to change the style of an existing photo, you might look for a style-transfer adapter. If you want to insert a specific product into a background, you might need a specialized object-insertion model like AnyDoor. ...

2024-12 · 10 min · 1921 words
[UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing 🔗](https://arxiv.org/abs/2411.16781)

UniPose: Unifying Human Pose Comprehension, Generation, and Editing with LLMs

Introduction In the rapidly evolving landscape of computer vision and robotics, understanding human movement is fundamental. Whether it’s for Virtual Reality (VR), healthcare monitoring, or creating digital avatars, the ability for machines to perceive, describe, and replicate human body language is crucial. Traditionally, this field has been fragmented. If you wanted to estimate a 3D pose from an image, you used one specific model. If you wanted to generate a 3D animation from a text description like “a person running,” you used a completely different generative model. And if you wanted to edit a pose—say, taking a sitting character and making them cross their legs—that required yet another specialized system. ...

2024-11 · 10 min · 1969 words
[Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video 🔗](https://arxiv.org/abs/2503.21761)

How Uni4D Reconstructs 4D Worlds from Casual Video Without Training

Imagine recording a video of a busy street corner with your phone. You capture cars driving by, pedestrians crossing the street, and the static buildings towering above. To you, it’s just a video. But to a computer vision researcher, it is a complex puzzle of 3D geometry and time—a “4D” scene. Reconstructing a full 4D model (3D space + time) from a single, casual video is one of the Holy Grails of computer vision. Traditionally, this is incredibly difficult. You have to figure out where the camera is moving, what part of the scene is static background, what is moving, and how those moving objects change shape over time. ...

2025-03 · 8 min · 1668 words
[Understanding Multi-Task Activities from Single-Task Videos 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Shen_Understanding_Multi-Task_Activities_from_Single-Task_Videos_CVPR_2025_paper.pdf)

Cooking Dinner While Making Coffee: How AI Learns to Multitask from Single-Task Demos

Introduction Imagine your typical morning routine. You aren’t just a robot executing one program file named make_breakfast.exe. You turn on the stove to cook oatmeal, and while that simmers, you turn around to grind coffee beans. Maybe you pause to pack a lunch. You are interleaving steps from multiple different tasks into a single, continuous flow of activity. For humans, this is second nature. For Artificial Intelligence, specifically Computer Vision systems, this is a nightmare. ...

8 min · 1558 words
[UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion 🔗](https://arxiv.org/abs/2501.11515)

Beyond HDR: How UltraFusion Uses Generative Inpainting for 9-Stop Dynamic Range

Have you ever tried to take a photo of a cityscape at night? You are usually faced with a frustrating choice: expose for the bright neon lights and the buildings become black silhouettes, or expose for the buildings and the lights turn into blown-out white blobs. To solve this, modern cameras use High Dynamic Range (HDR) imaging. They take a burst of photos at different brightness levels and stitch them together. It works well for standard scenes—usually dealing with exposure differences of 3 to 4 “stops.” But what happens when the difference is extreme—say, 9 stops? Or when things are moving fast in the frame? ...
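
Since a "stop" is a doubling of exposure, the brightness ratio grows as a power of two, which is why 9 stops is so much harder than 3 or 4. A quick back-of-the-envelope check:

```python
# Each stop doubles the exposure ratio: k stops -> 2**k : 1 contrast.
for stops in (3, 4, 9):
    print(f"{stops} stops -> {2 ** stops}:1 brightness ratio")
# 3 stops -> 8:1, 4 stops -> 16:1, 9 stops -> 512:1
```

So a 9-stop scene spans roughly a 512:1 ratio between its brightest and darkest usable exposures, versus 8:1 to 16:1 for a typical HDR burst.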

2025-01 · 8 min · 1649 words
[USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting 🔗](https://arxiv.org/abs/2411.10504)

How USP-Gaussian Solves the "Cascading Error" Problem in High-Speed 3D Vision

Imagine you are trying to create a 3D model of a scene using a camera mounted on a high-speed train or a racing drone. Traditional cameras fail here—they suffer from massive motion blur due to fixed exposure times. This is where spike cameras come in. Inspired by the biological retina, these sensors capture light as a continuous stream of binary spikes (0s and 1s) at frequencies up to 40,000 Hz, theoretically eliminating motion blur. ...
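
To see why a stream of 0s and 1s can encode an image at all: brighter pixels fire more often, so simply counting spikes over a short window gives a rough intensity estimate. The sketch below uses a simplified random firing model to illustrate that textbook-style baseline; it is not USP-Gaussian's joint optimization.

```python
# Simplified model: each pixel fires a spike per tick with probability
# proportional to its brightness, so the spike rate recovers intensity.
import numpy as np

rng = np.random.default_rng(0)
H, W, T = 2, 2, 40_000                              # one second at 40 kHz
brightness = np.array([[0.10, 0.40],
                       [0.70, 0.95]])               # ground-truth intensities
spikes = rng.random((T, H, W)) < brightness         # binary spike stream

window = spikes[:4000]                              # 100 ms of spikes
estimate = window.mean(axis=0)                      # spike rate ≈ brightness
print(np.round(estimate, 2))                        # close to the array above
```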

2024-11 · 8 min · 1686 words
[UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units 🔗](https://arxiv.org/abs/2505.09393)

Motion Capture Without Cameras: How UMotion Fuses Uncertainty, Physics, and AI

Introduction For decades, accurate 3D human motion capture (MoCap) was restricted to Hollywood studios and high-end research labs. It required a controlled environment, dozens of cameras, and actors wearing suits covered in reflective markers. In recent years, the focus has shifted to “in-the-wild” motion capture—tracking movement anywhere, from a living room to a hiking trail, using wearable sensors. The most common solution involves Inertial Measurement Units (IMUs)—the same sensors found in your smartphone or smartwatch that track acceleration and rotation. ...

2025-05 · 9 min · 1864 words
[UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models 🔗](https://arxiv.org/abs/2412.11441)

The Invisible Trojan: Understanding UIBDiffusion and the Future of AI Security

Generative AI has fundamentally changed how we create digital content. At the forefront of this revolution are Diffusion Models (DMs), the engines behind tools like Stable Diffusion and DALL-E, which can conjure photorealistic images from simple text prompts. These models are powerful, but their strength relies on massive datasets scraped from the web. This reliance on external data creates a serious security vulnerability: Data Poisoning. ...

2024-12 · 9 min · 1725 words
[UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning 🔗](https://arxiv.org/abs/2506.07087)

Breaking Camouflage: How UCOD-DPL Masters Unsupervised Detection with Dynamic Learning

Introduction In nature, survival often depends on the ability to disappear. From the leaf-tailed gecko blending into tree bark to the arctic hare vanishing into the snow, camouflage is a sophisticated evolutionary mechanism for evading predators. In the world of Computer Vision, replicating the predator’s ability to spot these hidden creatures is known as Camouflaged Object Detection (COD). COD is significantly harder than standard object detection. The targets share similar textures, colors, and patterns with the background, making boundaries incredibly difficult to discern. While fully supervised deep learning methods have made strides in this area, they come with a heavy cost: they require massive datasets with pixel-perfect human annotations. Labeling a camouflaged object is laborious and expensive because the objects are, by definition, hard to see. ...

2025-06 · 8 min · 1635 words
[Type-R: Automatically Retouching Typos for Text-to-Image Generation 🔗](https://arxiv.org/abs/2411.18159)

Type-R: How AI Can Finally Spell Correctly in Generated Images

If you have ever played with text-to-image models like Stable Diffusion, DALL-E 3, or Flux, you are likely familiar with a very specific frustration. You type a prompt asking for a cool cyberpunk poster that says “FUTURE,” and the model generates a breathtaking image… with the text “FUTRE,” “FUTUUE,” or perhaps some alien hieroglyphics that look vaguely like English. While generative AI has mastered lighting, texture, and composition, it is notoriously bad at spelling. This phenomenon, often called the “spaghetti text” problem, renders many generated images unusable for professional graphic design without heavy manual editing. ...

2024-11 · 8 min · 1535 words
[Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks 🔗](https://arxiv.org/abs/2407.21121)

Taming the Sine Waves: A Deep Dive into Robust Training for Implicit Neural Representations

If you have been following the cutting edge of computer vision and signal processing, you have likely encountered Implicit Neural Representations (INRs). Unlike the familiar discrete grids of pixels or voxels, INRs represent data (like images, 3D shapes, or audio) as a continuous mathematical function, usually approximated by a neural network. The current superstar of INRs is the Sinusoidal Neural Network, popularized by the SIREN architecture. Instead of standard ReLU activations, these networks use sine waves. They are mathematically elegant and capable of capturing incredible high-frequency details. ...
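
For context, the defining building block of these networks is just a linear layer followed by a sine activation with a frequency scale (usually written ω₀). The sketch below assumes PyTorch, ω₀ = 30, and the commonly used SIREN initialization; it shows the basic idea only, not the robust training scheme the paper proposes.

```python
# A sine-activated layer in the SIREN style (common recipe, for illustration).
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features                     # first-layer init
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# Tiny INR: map 2D pixel coordinates in [-1, 1] to an RGB value.
inr = nn.Sequential(
    SineLayer(2, 64, is_first=True),
    SineLayer(64, 64),
    nn.Linear(64, 3),
)
coords = torch.rand(16, 2) * 2 - 1
rgb = inr(coords)                                             # shape (16, 3)
```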

2024-07 · 7 min · 1370 words
[Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better 🔗](https://arxiv.org/abs/2503.19904)

Stop the Flicker: How Tracktention Uses Point Tracking to Master Video Consistency

If you have ever tried to use a single-image AI model to process a video frame-by-frame, you are likely familiar with the “flicker” problem. Whether it is depth estimation, style transfer, or colorization, applying an image model to a video usually results in a jittery, inconsistent mess. The ground shakes, colors shift randomly, and objects change shape from one second to the next. This happens because standard image models have no concept of time. They don’t know that the chair in frame 10 is the same object as the chair in frame 11. ...

2025-03 · 9 min · 1843 words
[Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models 🔗](https://arxiv.org/abs/2502.07601)

Can AI Spot the Defect? Inside Anomaly-OneVision, the Specialist Visual Assistant

Imagine you are a quality control inspector on a factory line. Thousands of components pass by every hour. Your job isn’t just to spot a broken part; you have to explain why it’s broken. Is it a scratch? A dent? Is the soldering messy? Now, imagine trying to teach an AI to do this. While modern Multimodal Large Language Models (MLLMs) like GPT-4o are incredible at describing a sunset or reading a menu, they struggle significantly when asked to find a microscopic crack in a screw or a slight discoloration on a medical scan. They lack the “specialist” eye required for Anomaly Detection (AD). ...

2025-02 · 7 min · 1456 words
[Towards Scalable Human-aligned Benchmark for Text-guided Image Editing 🔗](https://arxiv.org/abs/2505.00502)

Beyond "Looks Good": How HATIE Automates Human-Like Evaluation for Image Editing

Introduction We are living in the golden age of generative AI. With the advent of diffusion models, we can conjure vivid worlds from a single sentence. But as the technology matures, the focus is shifting from simple generation (creating an image from scratch) to editing (modifying an existing image). Imagine you have a photograph of a living room and you want to “add a vase to the table” or “change the dog into a cat.” It sounds simple, but evaluating whether an AI model has done this job well is notoriously difficult. ...

2025-05 · 10 min · 2095 words
[Towards RAW Object Detection in Diverse Conditions 🔗](https://arxiv.org/abs/2411.15678)

Why Robots Should See in RAW: Unlocking Object Detection in Extreme Weather

When you snap a photo with your smartphone, a massive amount of processing happens instantly. The sensor captures a raw signal, but before it reaches your screen, an Image Signal Processor (ISP) compresses it, adjusts the colors, balances the white, and tone-maps the shadows. The result is an sRGB image—optimized for the human eye. But here is the critical question for computer vision researchers: Is an image optimized for human vision actually good for machine vision? ...
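
To make the ISP's role concrete, the toy sketch below applies two of the steps mentioned above, per-channel white balance and a gamma-style tone curve, to a linear RAW-like input. The gains and gamma value are invented for illustration, and a real ISP also demosaics, denoises, and sharpens; the paper's point is precisely that this human-oriented processing may discard information a detector could use.

```python
# Toy ISP: white balance + gamma tone curve on a linear RAW-like image.
import numpy as np

def toy_isp(raw_rgb: np.ndarray, wb_gains=(2.0, 1.0, 1.5), gamma=2.2):
    """raw_rgb: (H, W, 3) linear sensor values in [0, 1]."""
    img = raw_rgb * np.asarray(wb_gains)      # per-channel white balance
    img = np.clip(img, 0.0, 1.0)              # clipping discards highlight detail
    return img ** (1.0 / gamma)               # tone-map for human viewing

raw = np.random.default_rng(0).random((4, 4, 3)) * 0.2   # a dark scene
srgb = toy_isp(raw)                           # brighter, display-ready values
```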

2024-11 · 7 min · 1486 words