[UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics 🔗](https://arxiv.org/abs/2412.07774)

UniReal: Unifying Image Generation and Editing by Learning from Video Dynamics

In the rapidly evolving world of Generative AI, we have witnessed a fragmentation of tools. If you want to generate an image from scratch, you might use Stable Diffusion or Midjourney. If you want to change the style of an existing photo, you might look for a style-transfer adapter. If you want to insert a specific product into a background, you might need a specialized object-insertion model like AnyDoor. ...

2024-12 · 10 min · 1921 words
[UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing 🔗](https://arxiv.org/abs/2411.16781)

UniPose: Unifying Human Pose Comprehension, Generation, and Editing with LLMs

In the rapidly evolving landscape of computer vision and robotics, understanding human movement is fundamental. Whether it’s for Virtual Reality (VR), healthcare monitoring, or creating digital avatars, the ability for machines to perceive, describe, and replicate human body language is crucial. Traditionally, this field has been fragmented. If you wanted to estimate a 3D pose from an image, you used one specific model. If you wanted to generate a 3D animation from a text description like “a person running,” you used a completely different generative model. And if you wanted to edit a pose—say, taking a sitting character and making them cross their legs—that required yet another specialized system. ...

2024-11 · 10 min · 1969 words
[Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video 🔗](https://arxiv.org/abs/2503.21761)

How Uni4D Reconstructs 4D Worlds from Casual Video Without Training

Imagine recording a video of a busy street corner with your phone. You capture cars driving by, pedestrians crossing the street, and the static buildings towering above. To you, it’s just a video. But to a computer vision researcher, it is a complex puzzle of 3D geometry and time—a “4D” scene. Reconstructing a full 4D model (3D space + time) from a single, casual video is one of the Holy Grails of computer vision. Traditionally, this is incredibly difficult. You have to figure out where the camera is moving, what part of the scene is static background, what is moving, and how those moving objects change shape over time. ...

2025-03 · 8 min · 1668 words
[Understanding Multi-Task Activities from Single-Task Videos 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Shen_Understanding_Multi-Task_Activities_from_Single-Task_Videos_CVPR_2025_paper.pdf)

Cooking Dinner While Making Coffee: How AI Learns to Multitask from Single-Task Demos

Imagine your typical morning routine. You aren’t just a robot executing one program file named make_breakfast.exe. You turn on the stove to cook oatmeal, and while that simmers, you turn around to grind coffee beans. Maybe you pause to pack a lunch. You are interleaving steps from multiple different tasks into a single, continuous flow of activity. For humans, this is second nature. For Artificial Intelligence, specifically Computer Vision systems, this is a nightmare. ...

8 min · 1558 words
[UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion 🔗](https://arxiv.org/abs/2501.11515)

Beyond HDR: How UltraFusion Uses Generative Inpainting for 9-Stop Dynamic Range

Have you ever tried to take a photo of a cityscape at night? You are usually faced with a frustrating choice: expose for the bright neon lights and the buildings become black silhouettes, or expose for the buildings and the lights turn into blown-out white blobs. To solve this, modern cameras use High Dynamic Range (HDR) imaging. They take a burst of photos at different brightness levels and stitch them together. It works well for standard scenes—usually dealing with exposure differences of 3 to 4 “stops.” But what happens when the difference is extreme—say, 9 stops? Or when things are moving fast in the frame? ...

2025-01 · 8 min · 1649 words
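
Sidenote: the excerpt above mentions the classical approach of merging a bracketed burst. For intuition only, here is a minimal exposure-fusion sketch using OpenCV's Mertens merge — this is the conventional baseline, not UltraFusion's generative-inpainting method, and the file names are placeholders.

```python
import cv2

# Load a bracketed burst (placeholder file names) — same scene, different exposures.
exposures = [cv2.imread(p) for p in ["under.jpg", "mid.jpg", "over.jpg"]]

# Mertens exposure fusion weights each pixel by contrast, saturation, and
# well-exposedness, then blends — no camera response curve required.
merge = cv2.createMergeMertens()
fused = merge.process(exposures)  # float32 image, roughly in [0, 1]

# Clip and convert back to 8-bit for saving or display.
cv2.imwrite("fused.jpg", (fused.clip(0, 1) * 255).astype("uint8"))
```
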
[USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting 🔗](https://arxiv.org/abs/2411.10504)

How USP-Gaussian Solves the "Cascading Error" Problem in High-Speed 3D Vision

Imagine you are trying to create a 3D model of a scene using a camera mounted on a high-speed train or a racing drone. Traditional cameras fail here—they suffer from massive motion blur due to fixed exposure times. This is where spike cameras come in. Inspired by the biological retina, these sensors capture light as a continuous stream of binary spikes (0s and 1s) at frequencies up to 40,000 Hz, theoretically eliminating motion blur. ...

2024-11 · 8 min · 1686 words
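
Sidenote: to make the spike-camera description above concrete, here is a rough integrate-and-fire simulation of my own (not part of the USP-Gaussian pipeline): each pixel accumulates incoming light and emits a binary spike whenever its accumulator crosses a threshold, yielding the 0/1 stream the post builds on.

```python
import numpy as np

def simulate_spikes(frames: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Toy integrate-and-fire spike simulation.

    frames: (T, H, W) light intensities in [0, 1], one slice per tiny time step.
    Returns a (T, H, W) binary spike stream.
    """
    accumulator = np.zeros(frames.shape[1:], dtype=np.float32)
    spikes = np.zeros_like(frames, dtype=np.uint8)
    for t, frame in enumerate(frames):
        accumulator += frame              # integrate incoming light
        fired = accumulator >= threshold  # pixels whose charge crossed the threshold
        spikes[t] = fired                 # emit a binary spike (1) where fired
        accumulator[fired] -= threshold   # subtract the threshold to reset fired pixels
    return spikes

# Example: 100 time steps of a dim, noisy 4x4 scene at a very high virtual frame rate.
spikes = simulate_spikes(np.random.rand(100, 4, 4) * 0.3)
print(spikes.shape, spikes.mean())  # fraction of time steps that produced a spike
```
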
[UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units 🔗](https://arxiv.org/abs/2505.09393)

Motion Capture Without Cameras: How UMotion Fuses Uncertainty, Physics, and AI

For decades, accurate 3D human motion capture (MoCap) was restricted to Hollywood studios and high-end research labs. It required a controlled environment, dozens of cameras, and actors wearing suits covered in reflective markers. In recent years, the focus has shifted to “in-the-wild” motion capture—tracking movement anywhere, from a living room to a hiking trail, using wearable sensors. The most common solution involves Inertial Measurement Units (IMUs)—the same sensors found in your smartphone or smartwatch that track acceleration and rotation. ...

2025-05 · 9 min · 1864 words
[UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models 🔗](https://arxiv.org/abs/2412.11441)

The Invisible Trojan: Understanding UIBDiffusion and the Future of AI Security

Generative AI has fundamentally changed how we create digital content. At the forefront of this revolution are Diffusion Models (DMs), the engines behind tools like Stable Diffusion and DALL-E, which can conjure photorealistic images from simple text prompts. These models are powerful, but their strength relies on massive datasets scraped from the web. This reliance on external data creates a serious security vulnerability: Data Poisoning. ...

2024-12 · 9 min · 1725 words
[UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning 🔗](https://arxiv.org/abs/2506.07087)

Breaking Camouflage: How UCOD-DPL Masters Unsupervised Detection with Dynamic Learning

In nature, survival often depends on the ability to disappear. From the leaf-tailed gecko blending into tree bark to the arctic hare vanishing into the snow, camouflage is a sophisticated evolutionary mechanism for evading predators. In the world of Computer Vision, replicating the predator’s ability to spot these hidden creatures is known as Camouflaged Object Detection (COD). COD is significantly harder than standard object detection. The targets share similar textures, colors, and patterns with the background, making boundaries incredibly difficult to discern. While fully supervised deep learning methods have made strides in this area, they come with a heavy cost: they require massive datasets with pixel-perfect human annotations. Labeling a camouflaged object is laborious and expensive because the objects are, by definition, hard to see. ...

2025-06 · 8 min · 1635 words
[Type-R: Automatically Retouching Typos for Text-to-Image Generation 🔗](https://arxiv.org/abs/2411.18159)

Type-R: How AI Can Finally Spell Correctly in Generated Images

If you have ever played with text-to-image models like Stable Diffusion, DALL-E 3, or Flux, you are likely familiar with a very specific frustration. You type a prompt asking for a cool cyberpunk poster that says “FUTURE,” and the model generates a breathtaking image… with the text “FUTRE,” “FUTUUE,” or perhaps some alien hieroglyphics that look vaguely like English. While generative AI has mastered lighting, texture, and composition, it is notoriously bad at spelling. This phenomenon, often called the “spaghetti text” problem, renders many generated images unusable for professional graphic design without heavy manual editing. ...

2024-11 · 8 min · 1535 words
[Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks 🔗](https://arxiv.org/abs/2407.21121)

Taming the Sine Waves: A Deep Dive into Robust Training for Implicit Neural Representations

If you have been following the cutting edge of computer vision and signal processing, you have likely encountered Implicit Neural Representations (INRs). Unlike familiar discrete grids of pixels or voxels, INRs represent data (like images, 3D shapes, or audio) as a continuous mathematical function, usually approximated by a neural network. The current superstar of INRs is the Sinusoidal Neural Network, popularized by the SIREN architecture. Instead of standard ReLU activations, these networks use sine waves. They are mathematically elegant and capable of capturing incredible high-frequency details. ...

2024-07 · 7 min · 1370 words
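
Sidenote: for readers new to SIREN, here is a minimal sine-activated layer in PyTorch using the frequency scaling and initialization from the original SIREN paper — a sketch for intuition, not the robust training scheme the post discusses.

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """A linear layer followed by sin(omega_0 * x), as in SIREN."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                # First layer: uniform in [-1/n, 1/n] so inputs span many sine periods.
                bound = 1.0 / in_features
            else:
                # Hidden layers: keep pre-activations well-scaled for sin().
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A tiny INR: map 2D pixel coordinates to RGB values.
inr = nn.Sequential(
    SineLayer(2, 256, is_first=True),
    SineLayer(256, 256),
    nn.Linear(256, 3),
)
rgb = inr(torch.rand(1024, 2) * 2 - 1)  # coordinates in [-1, 1] -> predicted colors
```
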
[Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better 🔗](https://arxiv.org/abs/2503.19904)

Stop the Flicker: How Tracktention Uses Point Tracking to Master Video Consistency

If you have ever tried to use a single-image AI model to process a video frame-by-frame, you are likely familiar with the “flicker” problem. Whether it is depth estimation, style transfer, or colorization, applying an image model to a video usually results in a jittery, inconsistent mess. The ground shakes, colors shift randomly, and objects change shape from one second to the next. This happens because standard image models have no concept of time. They don’t know that the chair in frame 10 is the same object as the chair in frame 11. ...

2025-03 · 9 min · 1843 words
[Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models 🔗](https://arxiv.org/abs/2502.07601)

Can AI Spot the Defect? Inside Anomaly-OneVision, the Specialist Visual Assistant

Imagine you are a quality control inspector on a factory line. Thousands of components pass by every hour. Your job isn’t just to spot a broken part; you have to explain why it’s broken. Is it a scratch? A dent? Is the soldering messy? Now, imagine trying to teach an AI to do this. While modern Multimodal Large Language Models (MLLMs) like GPT-4o are incredible at describing a sunset or reading a menu, they struggle significantly when asked to find a microscopic crack in a screw or a slight discoloration on a medical scan. They lack the “specialist” eye required for Anomaly Detection (AD). ...

2025-02 · 7 min · 1456 words
[Towards Scalable Human-aligned Benchmark for Text-guided Image Editing 🔗](https://arxiv.org/abs/2505.00502)

Beyond "Looks Good": How HATIE Automates Human-Like Evaluation for Image Editing

We are living in the golden age of generative AI. With the advent of diffusion models, we can conjure vivid worlds from a single sentence. But as the technology matures, the focus is shifting from simple generation (creating an image from scratch) to editing (modifying an existing image). Imagine you have a photograph of a living room and you want to “add a vase to the table” or “change the dog into a cat.” It sounds simple, but evaluating whether an AI model has done this job well is notoriously difficult. ...

2025-05 · 10 min · 2095 words
[Towards RAW Object Detection in Diverse Conditions 🔗](https://arxiv.org/abs/2411.15678)

Why Robots Should See in RAW: Unlocking Object Detection in Extreme Weather

When you snap a photo with your smartphone, a massive amount of processing happens instantly. The sensor captures a raw signal, but before it reaches your screen, an Image Signal Processor (ISP) compresses it, adjusts the colors, balances the white, and tone-maps the shadows. The result is an sRGB image—optimized for the human eye. But here is the critical question for computer vision researchers: Is an image optimized for human vision actually good for machine vision? ...

2024-11 · 7 min · 1486 words
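
Sidenote: as background for the RAW-vs-sRGB question above, here is a deliberately simplified ISP sketch (white balance, highlight clipping, gamma tone mapping, 8-bit quantization). Real ISPs also demosaic, denoise, and apply proprietary tone curves; the gains and gamma below are illustrative values, not taken from the paper.

```python
import numpy as np

def toy_isp(raw: np.ndarray, wb_gains=(2.0, 1.0, 1.5), gamma: float = 2.2) -> np.ndarray:
    """Very rough RAW -> sRGB-like conversion.

    raw: (H, W, 3) linear sensor values in [0, 1] (already demosaiced for simplicity).
    wb_gains: per-channel white-balance multipliers (illustrative numbers).
    """
    img = raw * np.asarray(wb_gains)      # white balance
    img = np.clip(img, 0.0, 1.0)          # clip highlights (information is lost here)
    img = img ** (1.0 / gamma)            # gamma tone mapping brightens shadows for human eyes
    return (img * 255).astype(np.uint8)   # quantize to 8 bits (more information lost)

display_img = toy_isp(np.random.rand(4, 4, 3))
```
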
[Towards In-the-wild 3D Plane Reconstruction from a Single Image 🔗](https://arxiv.org/abs/2506.02493)

ZeroPlane: Bridging the Gap Between Indoor and Outdoor 3D Plane Reconstruction

When we look at the world, we don’t just see pixels; we see structure. We instinctively recognize the floor we walk on, the walls that surround us, and the roads we drive on. In computer vision, these structures are known as 3D planes. Recovering these planes from a single 2D image is a cornerstone capability for Augmented Reality (AR), robotics navigation, and 3D mapping. However, there has been a significant fragmentation in the field. Current state-of-the-art (SOTA) methods are typically “specialists”—they are trained on indoor datasets to reconstruct rooms, or outdoor datasets to reconstruct streets. If you take a model trained on a cozy living room and ask it to interpret a city street, it usually fails. This lack of generalizability is known as the domain gap. ...

2025-06 · 7 min · 1452 words
[Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text 🔗](https://arxiv.org/abs/2503.01261)

A Picture is Worth a Thousand Words: How Long-Text Alignment Revolutionizes Image Generation

The idiom “a picture is worth a thousand words” suggests that complex imagery conveys meaning more effectively than a brief description. However, in the world of Artificial Intelligence—specifically Vector Quantization (VQ) based image modeling—we have historically been feeding our models the equivalent of a few mumbled words and expecting them to understand a masterpiece. Current state-of-the-art image generation models often rely on a “codebook”—a library of discrete features learned from images. To improve these codebooks, researchers have recently started aligning them with text captions. The logic is sound: if the codebook understands the semantic link between the visual “cat” and the word “cat,” the generation quality improves. ...

2025-03 · 8 min · 1532 words
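
Sidenote: to ground the “codebook” terminology above, here is a bare-bones vector-quantization lookup — a generic sketch, not the paper's multi-hierarchical codebook-text alignment. Each continuous image feature is snapped to its nearest codebook entry; text alignment then tries to make those discrete entries semantically meaningful.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Snap each feature vector to its nearest codebook entry.

    features: (N, D) continuous encoder outputs.
    codebook: (K, D) learned discrete "visual words".
    Returns (quantized (N, D), indices (N,)).
    """
    dists = torch.cdist(features, codebook) ** 2  # pairwise squared distances (N, K)
    indices = dists.argmin(dim=1)                 # index of the closest code per feature
    return codebook[indices], indices

codebook = torch.randn(1024, 256)   # K=1024 codes of dimension 256
features = torch.randn(64, 256)     # e.g., an 8x8 grid of patch features
quantized, codes = quantize(features, codebook)
```
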
[Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhou_Towards_Explainable_and_Unprecedented_Accuracy_in_Matching_Challenging_Finger_Crease_CVPR_2025_paper.pdf)

Cracking the Code of Knuckles: A New Framework for Explainable, Cross-Pose Biometrics

In the realm of forensic science, every pixel counts. Consider the Victim Identification Programme within the Department of Homeland Security. They process millions of images and videos related to child abuse cases, searching for any clue to identify perpetrators. Often, the suspect’s face is hidden, and the only visible evidence is a hand holding a device or an object. This is where finger knuckle biometrics steps into the spotlight. Unlike fingerprints, which require a surface touch, knuckle patterns are clearly visible in standard photographs. However, automated identification has historically hit a wall. While recent AI has become adept at matching high-quality, straight-on images of hands, it fails spectacularly when the finger is bent, rotated, or captured from a distance—the exact scenarios found in real-world surveillance and forensic evidence. ...

9 min · 1801 words
[Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency 🔗](https://arxiv.org/abs/2312.04831)

Beyond the Blur: How ASUKA Fixes Hallucinations and Color Shifts in Generative Inpainting

Image inpainting—the art of filling in missing or damaged parts of an image—has undergone a revolution with the advent of generative AI. Models like Stable Diffusion and FLUX can miraculously reconstruct missing scenery or remove unwanted objects. However, if you have experimented with these tools, you have likely encountered two frustrating phenomena: the model inserting a random, bizarre object where there should be empty space, or the filled-in area having a slightly different color tone than the rest of the image, looking like a “smear.” ...

2023-12 · 7 min · 1347 words
[Towards Autonomous Micromobility through Scalable Urban Simulation 🔗](https://arxiv.org/abs/2505.00690)

Simulating the Sidewalk: How URBAN-SIM Scales Autonomous Micromobility

Imagine ordering a coffee or a small package to be delivered to your doorstep. In a futuristic city, a small robot navigates the chaotic urban jungle—dodging pedestrians, climbing curbs, and weaving through park benches—to bring you that item. This concept is known as micromobility. While we often hear about autonomous cars on highways, the “last mile” of autonomy—sidewalks, plazas, and public spaces—presents a radically different set of challenges. Unlike cars, which operate on structured lanes with clear rules, micromobility robots must handle “unstructured” environments. They face stairs, grass, uneven cobblestones, dense crowds, and unpredictable obstacles. ...

2025-05 · 9 min · 1821 words