[RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins 🔗](https://arxiv.org/abs/2504.13059)

RoboTwin: How Generative AI Creates Digital Twins to Train Dual-Arm Robots

In the world of robotics, we often marvel at videos of robots performing backflips or dancing. But ask a robot to coordinate two hands to place a pair of shoes neatly into a box, and you will likely see it struggle. Dual-arm coordination is the next frontier in robotic manipulation. While single-arm tasks (like picking and placing an object) have seen massive success, bimanual (two-armed) manipulation introduces exponential complexity. The arms must avoid colliding with each other, coordinate handovers, and handle objects that are too large or awkward for a single gripper. ...

2025-04 · 8 min · 1564 words
[RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training 🔗](https://arxiv.org/abs/2411.17662)

RoboPEPP: Teaching AI to See Robot Poses Through Physical Intuition

Imagine a robot arm working in a bustling kitchen or on a manufacturing floor. To collaborate safely with humans or other machines, this robot needs to know exactly where it is in space relative to the camera watching it. This is known as robot pose estimation. Usually, this is straightforward because we can cheat: we ask the robot’s internal motor encoders what its joint angles are. But what if we can’t trust those sensors? What if we are observing a robot we don’t control? Or what if we simply want a purely vision-based redundancy system? ...

2024-11 · 9 min · 1897 words
[Revisiting MAE pre-training for 3D medical image segmentation 🔗](https://arxiv.org/abs/2410.23132)

Spark3D: How Simple Masked Autoencoders are Revolutionizing 3D Medical Imaging

In the fast-paced world of Deep Learning, we often look for the “next big thing”—a new Transformer architecture, a complex loss function, or a revolutionary optimizer. However, sometimes the most significant breakthroughs come not from inventing something entirely new, but from taking a simple, powerful idea and engineering it to perfection. This is exactly what the authors of the paper “Revisiting MAE pre-training for 3D medical image segmentation” have achieved. They took the concept of Masked Autoencoders (MAEs)—a technique that has dominated natural language processing and computer vision—and rigorously adapted it for 3D medical imaging. ...

2024-10 · 9 min · 1866 words
[Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition 🔗](https://arxiv.org/abs/2411.18941)

Can AI Tell the Difference Between Writing and Typing? How ProtoGCN Uses Prototypes to Master Fine-Grained Action Recognition

Imagine watching a silhouette of a person sitting at a desk. Their arms are moving. Are they writing a letter, or are they typing on a keyboard? To the casual observer, and indeed to many computer vision algorithms, these two actions look remarkably similar. The posture is the same; the active body parts (arms and hands) are the same. The difference lies in the subtle, fine-grained details of how those joints move relative to one another. ...

2024-11 · 9 min · 1711 words
[Rethinking Personalized Aesthetics Assessment: Employing Physique Aesthetics Assessment as An Exemplification 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhong_Rethinking_Personalized_Aesthetics_Assessment_Employing_Physique_Aesthetics_Assessment_as_An_CVPR_2025_paper.pdf)

Beyond the Average: A New Paradigm for Personalized AI Aesthetics

“Beauty is in the eye of the beholder.” It is a phrase we have heard a thousand times, implying that aesthetic judgment is inherently subjective. Yet, in the world of Computer Vision and Artificial Intelligence, we have spent years teaching machines to understand beauty by averaging the opinions of the masses. This approach, known as Generic Aesthetics Assessment (GAA), works well for determining if a photo is generally “high quality”—is it in focus? Is the lighting good? Is the composition standard? ...

9 min · 1910 words
[Relative Pose Estimation through Affine Corrections of Monocular Depth Priors 🔗](https://arxiv.org/abs/2501.05446)

Fixing the Shift: How Affine Corrections Unlock Monocular Depth for Camera Pose Estimation

In the world of computer vision, we are currently witnessing a golden age of Monocular Depth Estimation (MDE). Thanks to deep learning and massive datasets, modern neural networks can look at a single, flat 2D image and predict a surprisingly accurate dense depth map. Models like MiDaS, Marigold, and Depth Anything have transformed 2D images into pseudo-3D representations. However, there is a gap between generating a pretty 3D visualization and actually using that data for rigorous geometric tasks. One of the most fundamental tasks in computer vision is Relative Pose Estimation: determining the rotation and translation between two cameras that took pictures of the same scene. Traditionally, this is done by matching feature points between images and using epipolar geometry. ...

2025-01 · 9 min · 1827 words
[Reference-Based 3D-Aware Image Editing with Triplanes 🔗](https://arxiv.org/abs/2404.03632)

Copy-Paste in 3D: Mastering Reference-Based Editing with Triplanes

Imagine you are editing a photo of a person. You want to give them the specific hairstyle from a celebrity photo you found online. In 2D image editing (like Photoshop or standard generative AI), this is becoming increasingly easy. But what if that person isn’t just a flat image? What if you are building a video game avatar, a VR experience, or a movie scene where the character needs to turn their head? ...

2024-04 · 9 min · 1784 words
[Reconstructing People, Places, and Cameras 🔗](https://arxiv.org/abs/2412.17806)

HSfM: Unifying Humans and Structure from Motion for a Metric 3D World

Imagine you have a handful of photos taken by different people at a party. You want to recreate that exact moment in 3D: the room layout, where everyone was standing, and where the photographers were standing. In computer vision, this is a classic divide. We have excellent algorithms for reconstructing the static scene (walls, furniture), known as Structure from Motion (SfM). We also have great models for reconstructing humans (poses, body shapes). But historically, these two fields have been like oil and water. SfM algorithms treat moving humans as “noise” to be filtered out, while human reconstruction methods often generate “floating” avatars with no sense of the floor or the room scale. ...

2024-12 · 8 min · 1696 words
[Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach 🔗](https://arxiv.org/abs/2503.08306)

Inside the Robot's Mind: How End-to-End Agents Learn Physics and Planning

Imagine teaching a robot to navigate a crowded office. In the past, this was a modular problem: one piece of software built a map, another located the robot on that map, and a third calculated a path. Today, the cutting edge of Embodied AI uses “end-to-end” (E2E) reinforcement learning. You feed the robot visual data (pixels), and it outputs motor commands (actions). It’s a “black box” approach that has yielded impressive results in simulation. ...

2025-03 · 9 min · 1749 words
[Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval 🔗](https://arxiv.org/abs/2412.11077)

Thinking Before Searching: How One-Stage Reflective Reasoning Solves Composed Image Retrieval

Imagine you are shopping online. You see a photo of a woman wearing a stunning floral dress, but you’d prefer it in a solid red color. You can’t just upload the image because the search engine will give you the floral dress again. You can’t just type “red dress” because you’ll lose the specific cut and style of the original image. This problem—combining a reference image with a textual modification to find a target image—is known as Composed Image Retrieval (CIR). It is one of the most practical yet challenging frontiers in computer vision. ...

2024-12 · 9 min · 1870 words
[Realistic Test-Time Adaptation of Vision-Language Models 🔗](https://arxiv.org/abs/2501.03729)

Anchoring Your Model: How StatA Makes Vision-Language Adaptation Realistic

Vision-Language Models (VLMs) like CLIP have revolutionized computer vision. By aligning images and text during pre-training, they allow us to perform “zero-shot” classification—identifying objects the model has never explicitly seen during training, simply by matching them to a textual description. But here is the catch: while zero-shot performance is impressive, it often isn’t enough for high-stakes, real-world applications. The distribution of data in the wild rarely matches the pristine conditions of training sets. To bridge this gap, researchers look to Test-Time Adaptation (TTA) and Transductive Learning. These techniques tweak the model on the fly using the incoming test data itself. ...

2025-01 · 10 min · 1982 words
[Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs 🔗](https://arxiv.org/abs/2504.12909)

Beyond the Single Brain—How Distributed MLPs Are Revolutionizing Real-Time Human Avatars

Creating photorealistic digital humans that can move and react in real-time is one of the “holy grail” challenges in computer graphics. Whether for the Metaverse, video games, or virtual reality telepresence, we want avatars that look real—down to the wrinkles on a shirt—and render at high frame rates. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful alternative to meshes and NeRFs (Neural Radiance Fields). However, current methods face a frustrating trade-off: you can either have a fast avatar with blurry details or a high-fidelity avatar that runs at a sluggish 10 frames per second (FPS). ...

2025-04 · 8 min · 1644 words
[Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures 🔗](https://arxiv.org/abs/2412.13183)

Double Trouble: How 'Double Unprojected Textures' Solves Real-Time Human Rendering

Imagine video calling a friend, but instead of staring at a flat 2D rectangle on your phone, you are looking at a photo-realistic 3D hologram of them. You can walk around them, see the back of their shirt, or watch them dance from any angle. This is the “Holy Grail” of telepresence and the metaverse. For years, achieving this required Hollywood-style motion capture studios with 50+ cameras and hours of processing time. But a new paper titled “Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures” is changing the game. ...

2024-12 · 7 min · 1318 words
[ReNeg: Learning Negative Embedding with Reward Guidance 🔗](https://arxiv.org/abs/2412.19637)

Beyond 'Bad Hands': How ReNeg Automates Negative Prompts for Better AI Art

If you have ever played with text-to-image (T2I) models like Stable Diffusion, you are likely familiar with the frustration of “prompt engineering.” You type a beautiful description, only to get an image with distorted faces, extra fingers, or a gloomy color palette. To fix this, the community developed a workaround: Negative Prompts. By typing words like “ugly, bad anatomy, low quality, blurry” into a negative prompt box, users tell the model what not to generate. While effective, this process is fundamentally a guessing game. It relies on trial and error, intuition, and copying “magic words” from other users. Why should we manually guess the right negative words when we can use AI to learn the perfect negative representation mathematically? ...

2024-12 · 8 min · 1531 words
[ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Sun_ROLL_Robust_Noisy_Pseudo-label_Learning_for_Multi-View_Clustering_with_Noisy_CVPR_2025_paper.pdf)

When Data Lies: Robust Multi-View Clustering in a Noisy World

In the ideal world of machine learning research, data is clean, labels are accurate, and every input matches its description perfectly. In the real world, however, data is messy. Sensors glitch, annotators make mistakes, and datasets are full of noise. Imagine you are training an AI to understand scenes using two “views”: an image from a camera and a caption from a text file. In a perfect dataset, a picture of a sheep is always paired with the text “A sheep on the grass.” But what happens if the data pipeline gets crossed? What if the picture of the sheep is paired with the text “A yellow boat on the beach”? ...

9 min · 1809 words
[RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness 🔗](https://arxiv.org/abs/2405.17220)

Can AI Teach Itself to Be Honest? Inside the RLAIF-V Framework

The rise of Multimodal Large Language Models (MLLMs)—AI that can see images and talk about them—has been nothing short of revolutionary. Models like GPT-4V and LLaVA have demonstrated an uncanny ability to understand the visual world. However, they share a persistent, critical flaw: hallucination. You have likely seen it happen. You show a model a picture of a kitchen, and it confidently describes a blender that isn’t there. Or you upload a chart, and it invents numbers that don’t exist. These “confident lies” make deploying these models in high-stakes environments risky. ...

2024-05 · 8 min · 1603 words
[RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars 🔗](https://arxiv.org/abs/2503.12886)

Real-Time Digital Twins: How RGBAvatar Revolutionizes Head Avatar Reconstruction

The dream of the “digital twin”—a photorealistic, animatable avatar that looks and moves exactly like you—has long been a holy grail in computer graphics. Whether for the metaverse, video games, or next-generation telepresence, the demand for high-fidelity head avatars is skyrocketing. However, creating these avatars typically involves a painful trade-off. You can have high quality, or you can have speed, but rarely both. Traditional methods might require expensive studio setups, while recent neural rendering techniques often take hours (or even days) of training time to learn a single face. ...

2025-03 · 8 min · 1631 words
[Question-Aware Gaussian Experts for Audio-Visual Question Answering 🔗](https://arxiv.org/abs/2503.04459)

Taming Temporal Dynamics: How QA-TIGER Revolutionizes Audio-Visual Question Answering

Imagine you are watching a video of an orchestra. A friend asks, “Which instrument started playing first?” To answer this, your brain performs a complex feat. You don’t just look at a few random snapshots; you perceive the continuous flow of time. You don’t just listen to the audio as a whole; you isolate specific sounds and synchronize them with visual movements. Most importantly, you know exactly what to look and listen for before you even process the scene because the question guides your attention. ...

2025-03 · 9 min · 1757 words
[QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers 🔗](https://arxiv.org/abs/2503.19718)

Beyond QUBO: How QuCOOP Unlocks Quantum Annealing for Computer Vision

If you have been following the development of Quantum Computing, you likely know that we are in the “NISQ” era (Noisy Intermediate-Scale Quantum). While universal, fault-tolerant quantum computers are still on the horizon, we currently have access to Quantum Annealers (like those from D-Wave). These machines are specialized hardware designed to find the lowest energy state of a system, making them potentially powerful tools for combinatorial optimization. However, there is a catch. Modern quantum annealers effectively speak only one language: QUBO (Quadratic Unconstrained Binary Optimization). ...

2025-03 · 7 min · 1484 words
[Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness 🔗](https://arxiv.org/abs/2503.09487)

Fixing Spurious Correlations in AI: A Deep Dive into Project-Probe-Aggregate (PPA)

The “Waterbird” Problem: Imagine you are training an AI to classify birds. You feed it thousands of images of waterbirds (like ducks) and landbirds (like sparrows). The model achieves 99% accuracy on your validation set. You are ready to deploy. But then, disaster strikes. You show the model a duck standing on a grassy field, and it confidently shouts “Landbird!” You show it a sparrow flying over a lake, and it predicts “Waterbird!” ...

2025-03 · 9 min · 1802 words