[Estimating Body and Hand Motion in an Ego-sensed World 🔗](https://arxiv.org/abs/2410.03665)

EgoAllo: How Smart Glasses Can See Your Whole Body

Imagine wearing a pair of smart glasses. You are walking through your living room, reaching for a coffee mug, or typing on a keyboard. The glasses have cameras, but they are facing outward to map the world. They can see the mug, the table, and maybe your hands entering the frame. But they can’t see you—or at least, not your torso, legs, or feet. This “invisibility” presents a massive challenge for Augmented Reality (AR) and robotics. If a computer system wants to understand your actions, it needs to know your full body pose. Is the user sitting or standing? Are they leaning forward? Where are their feet planted? ...

2024-10 · 13 min · 2742 words
[Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways 🔗](https://arxiv.org/abs/2503.07026)

Unlearning to See: How EraDiff Teaches Diffusion Models to Erase Objects Properly

Imagine you have a perfect photo of a pepperoni pizza, but you want to remove just one specific slice to show the wooden plate underneath. You fire up a state-of-the-art AI inpainting tool, mask out the slice, and hit “generate.” Ideally, the AI should generate the texture of the wooden plate. But often, standard diffusion models will do something frustrating: they replace the pepperoni slice with… a cheese slice. Or perhaps a distorted “ghost” of the pepperoni remains. ...

2025-03 · 9 min · 1722 words
[Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wu_Enhanced_Visual-Semantic_Interaction_with_Tailored_Prompts_for_Pedestrian_Attribute_Recognition_CVPR_2025_paper.pdf)

Beyond Static Labels - Tailoring Prompts for Smarter Pedestrian Recognition

Imagine scanning hours of security footage trying to locate a specific individual. You aren’t just looking for a face; you are looking for descriptors: “a woman wearing a red dress,” “a man with a backpack,” or “someone wearing glasses.” In Computer Vision, this task is known as Pedestrian Attribute Recognition (PAR). For years, this field was dominated by systems that simply looked at an image and tried to guess the tags. However, the rise of Vision-Language Models (like CLIP) has introduced a new paradigm: using text to help the computer “understand” the image better. ...

9 min · 1770 words
[ENERGYMOGEN: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space 🔗](https://arxiv.org/abs/2412.14706)

Mastering Motion: How Energy-Based Models Enable Complex AI Choreography

Imagine asking an AI to generate an animation of a person “walking forward.” By today’s standards, this is a solved problem. Modern diffusion models can generate a realistic walk cycle in seconds. But what happens if you increase the complexity? What if you ask for a person “walking forward AND waving both hands, but NOT turning around”? This is where standard generative models often stumble. Humans are masters of composition. We can effortlessly blend simple concepts—walking, waving, looking left—into a single, coherent behavior. We can also understand negative constraints (what not to do) just as easily as positive ones. ...

2024-12 · 10 min · 2049 words
[Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_Enduring_Efficient_and_Robust_Trajectory_Prediction_Attack_in_Autonomous_Driving_CVPR_2025_paper.pdf)

How Cardboard Boxes Can Confuse Autonomous Cars: Inside the OMP-Attack

The promise of Autonomous Driving (AD) is built on trust—trust that the vehicle can perceive its environment, predict what others will do, and plan a safe route. But what if a few strategically placed cardboard boxes could shatter that trust? In the world of adversarial machine learning, researchers are constantly probing for weaknesses to build safer systems. A recent paper, Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework, uncovers a significant vulnerability in how self-driving cars predict the movement of other vehicles. The authors introduce a new method, the OMP-Attack, which uses simple physical objects to trick an autonomous vehicle (AV) into slamming on its brakes to avoid a phantom collision. ...

9 min · 1910 words
[End-to-End HOI Reconstruction Transformer with Graph-based Encoding 🔗](https://arxiv.org/abs/2503.06012)

How HOI-TG Solves the Global-Local Conflict in 3D Human-Object Reconstruction

In the rapidly evolving world of Computer Vision, reconstructing 3D humans from 2D images is a well-studied problem. But humans rarely exist in a vacuum. We hold phones, sit on chairs, ride bikes, and carry boxes. When you add objects to the equation, the complexity explodes. This field, known as Human-Object Interaction (HOI) reconstruction, faces a fundamental conflict. To reconstruct a 3D scene, you need to understand the global structure (where the person is relative to the object) and the local details (how the fingers wrap around a handle). Most existing methods struggle to balance these two, often prioritizing one at the expense of the other. ...

2025-03 · 7 min · 1486 words
[Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility 🔗](https://arxiv.org/abs/2505.21377)

Dream3DVG: Bridging the Gap Between Text-to-3D and Vector Graphics

In the world of digital design, vector graphics are the gold standard for clarity and scalability. Unlike pixel-based images (raster graphics), which get blurry when you zoom in, vector graphics are defined by mathematical paths—lines, curves, and shapes—that remain crisp at any resolution. They are the backbone of logos, icons, and conceptual art. However, vector graphics have traditionally been shackled to a 2D plane. If you draw a vector sketch of a car, you can’t simply rotate it to see the back bumper; the drawing is fixed from that specific viewpoint. While recent advancements in AI have enabled “Text-to-3D” generation, applying these techniques to the sparse, abstract world of vector strokes has been notoriously difficult. When you try to force standard 3D generation methods to create line drawings, you often end up with a “tangle of wires”—messy, inconsistent lines that don’t look like a cohesive drawing when rotated. ...

2025-05 · 9 min · 1767 words
[EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing 🔗](https://arxiv.org/abs/2412.08988)

Mastering the Art of Emotional Dubbing: A Deep Dive into EmoDubber

Have you ever watched a dubbed movie where the voice acting felt completely detached from the actor’s face? Perhaps the lips stopped moving, but the voice kept going, or the character on screen was screaming in rage while the dubbed voice sounded mildly annoyed. This disconnect breaks immersion instantly. This challenge falls under the domain of Visual Voice Cloning (V2C). The goal is to take a text script, a video clip of a speaker, and a reference audio track, and then generate speech that matches the video’s lip movements while cloning the reference speaker’s voice. ...

2024-12 · 9 min · 1707 words
[EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision 🔗](https://arxiv.org/abs/2409.02224)

Feeling the Pressure: How EgoPressure Brings Touch to Egocentric Computer Vision

Imagine playing a piano in Virtual Reality. You see your digital hands hovering over the keys, but when you strike a chord, there is a disconnect. You don’t feel the resistance, and the system struggles to know exactly how hard you pressed. Or consider a robot attempting to pick up a plastic cup; without knowing the pressure it exerts, it might crush the cup or drop it. In the world of computer vision, we have become incredibly good at determining where things are (pose estimation) and what they are (object recognition). However, understanding the physical interaction—specifically touch contact and pressure—remains a massive challenge. This is particularly difficult in “egocentric” vision (first-person perspective), which is the standard view for AR/VR headsets and humanoid robots. ...

2024-09 · 8 min · 1601 words
[Efficient Motion-Aware Video MLLM 🔗](https://arxiv.org/abs/2503.13016)

Stop Sampling Frames: How Compressed Video Structures Make AI Faster and Smarter

If you have ever tried to build a computer vision system that understands video, you have likely encountered the “sampling dilemma.” Videos are essentially heavy stacks of images. To process a video using a Multimodal Large Language Model (MLLM), the standard approach is Uniform Frame Sampling. You extract one frame every second (or every few frames), encode each one as an image, stack them up, and feed them into the model. ...
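The uniform-sampling baseline the post contrasts against is simple enough to sketch in a few lines. The snippet below is a generic illustration of that baseline, not code from the paper; `sample_uniform` and its parameters are hypothetical names used only for this sketch.

```python
from typing import List, Sequence


def sample_uniform(frames: Sequence, fps: float, seconds_between_samples: float = 1.0) -> List:
    """Keep roughly one frame per `seconds_between_samples` seconds of video."""
    step = max(1, round(fps * seconds_between_samples))
    return [frames[i] for i in range(0, len(frames), step)]


# Example: a 10-second clip at 30 fps decoded into 300 frames yields 10 sampled
# frames, each of which would then be encoded as an independent image and
# stacked into the MLLM's input sequence.
```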

2025-03 · 12 min · 2351 words
[EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Rahman_EffiDec3D_An_Optimized_Decoder_for_High-Performance_and_Efficient_3D_Medical_CVPR_2025_paper.pdf)

Cutting the Fat: How EffiDec3D Revolutionizes 3D Medical Image Segmentation

In the world of medical artificial intelligence, precision is everything. A fraction of a millimeter can distinguish between a benign anomaly and a malignant tumor. Over the last few years, Deep Learning—specifically U-shaped architectures and Vision Transformers—has become the gold standard for automating medical image segmentation. However, this precision comes at a steep price. Modern State-of-the-Art (SOTA) models for 3D medical imaging, such as SwinUNETR or 3D UX-Net, are computationally massive. They require expensive GPUs with high memory, making them difficult to deploy in real-time clinical settings or on edge devices like portable ultrasound machines. ...

7 min · 1489 words
[ETAP: Event-based Tracking of Any Point 🔗](https://arxiv.org/abs/2412.00133)

Seeing the Unseen: How Event Cameras are Revolutionizing Point Tracking

Imagine trying to track a specific point on the blade of a rapidly spinning fan. Or perhaps you are trying to follow a bird diving into a dark shadow. If you use a standard video camera, you will likely run into two major walls: motion blur and dynamic range limitations. The fan blade becomes a smear, and the bird disappears into the darkness. ...

2024-12 · 12 min · 2409 words
[ESC: Erasing Space Concept for Knowledge Deletion 🔗](https://arxiv.org/abs/2504.02199)

True Forgetting: Deleting Deep Learning Knowledge with Erasing Space Concept (ESC)

In the era of GDPR and increasing privacy concerns, the “right to be forgotten” has become a critical requirement for technology companies. For deep learning, this poses a massive engineering challenge. If a user requests their data be removed from a trained AI model, how do we ensure the model actually “forgets” them? The standard approach is Machine Unlearning (MU). The goal is to update a model to look as if it never saw specific data, without having to retrain the whole thing from scratch (which is expensive and slow). However, recent research reveals a disturbing reality: most current unlearning methods are superficial. They might change the model’s final output, but the sensitive “knowledge” often remains hidden deep within the neural network’s feature extractor. ...

2025-04 · 7 min · 1486 words
[EBS-EKF: Accurate and High Frequency Event-based Star Tracking 🔗](https://arxiv.org/abs/2503.20101)

Navigating the Stars at 1000 Hz: How Event Cameras are Revolutionizing Spacecraft Attitude Control

For centuries, sailors navigated the open oceans by looking up at the stars. Today, spacecraft orbiting Earth and traveling through the solar system do exactly the same thing. By identifying specific star patterns, a satellite determines its precise orientation in space—known as its “attitude.” This process is handled by a device called a Star Tracker. Standard star trackers use Active Pixel Sensors (APS)—essentially the same technology found in your smartphone camera. They take a picture of the sky, process the frame to find stars, identify them against a catalog, and calculate the angle. While highly accurate, they are slow. Most operate at 2-10 Hz (frames per second). If a satellite needs to spin quickly or stabilize against rapid vibrations, standard star trackers blur the image and fail, leading to a loss of navigation. ...

2025-03 · 8 min · 1599 words
[Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera 🔗](https://arxiv.org/abs/2412.12861)

Untangling the Chaos: How Dyn-HaMR Solves Hand Motion with Moving Cameras

In the rapidly evolving worlds of Augmented Reality (AR), Virtual Reality (VR), and robotics, understanding human movement is fundamental. We have become quite good at tracking bodies and hands when the camera is sitting still on a tripod. But the real world is dynamic. In egocentric scenarios—like wearing smart glasses or a GoPro—the camera moves with you. This creates a chaotic blend of motion. If a hand appears to move left in a video, is the hand actually moving left, or is the user’s head turning right? ...

2024-12 · 8 min · 1532 words
[DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding 🔗](https://arxiv.org/abs/2504.14920)

How to Teach AI to Focus: A Deep Dive into DyFo and Training-Free Visual Search

Imagine you are trying to find a specific friend in a crowded stadium. You don’t stare at the entire stadium at once and hope to instantly process every single face. Instead, your eyes dart around. You scan sections, focus on a group of people wearing the right color jersey, zoom in on a specific row, and filter out the noise. This cognitive process is known as visual search, and it is fundamental to how humans interact with the world. We dynamically adjust our focus to filter out irrelevant information and concentrate on what matters. ...

2025-04 · 10 min · 2119 words
[DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction 🔗](https://arxiv.org/abs/2412.04464)

Seeing Double: How Dual Point Maps Revolutionize 3D Object Reconstruction

Reconstructing a 3D object from a single 2D image is one of computer vision’s classic “ill-posed” problems. When you look at a photograph of a galloping horse, your brain instantly understands the 3D shape, the articulation of the limbs, and the parts of the animal that are hidden from view. For a computer, however, inferring this geometry from a grid of pixels is incredibly difficult, especially when the object is deformable—meaning it can bend, stretch, and move (like animals or humans). ...

2024-12 · 8 min · 1571 words
[DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery 🔗](https://arxiv.org/abs/2503.16964)

Taming the Wild: How DroneSplat Brings Robust 3D Reconstruction to Drone Imagery

Drones have revolutionized how we capture the world. From inspecting massive bridges to surveying urban landscapes and preserving cultural heritage, the ability to put a camera anywhere in 3D space is invaluable. However, turning those aerial photos into accurate, photorealistic 3D models is a computational nightmare, especially when the real world refuses to sit still. Traditional methods like photogrammetry and recent Neural Radiance Fields (NeRFs) have pushed the boundaries of what is possible. More recently, 3D Gaussian Splatting (3DGS) has taken the field by storm, offering real-time rendering speeds that NeRFs could only dream of. But there is a catch: most of these algorithms assume the world is perfectly static and that we have perfect camera coverage. ...

2025-03 · 8 min · 1654 words
[Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map 🔗](https://arxiv.org/abs/2410.23780)

Beyond Geometry: Teaching Autonomous Vehicles to Read Traffic Rules

Imagine you are driving down a busy urban street. You see a lane marked with a solid white line, but overhead, a blue sign indicates “Bus Lane: 7:00-9:00, 17:00-19:00.” You glance at the clock; it’s 10:30 AM. You confidently merge into the lane. This decision-making process—perceiving the geometry of the road, reading a sign, understanding the temporal rule, and associating that rule with a specific lane—is second nature to humans. For autonomous vehicles (AVs), however, it is a surprisingly complex challenge. ...
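To make the temporal part of that rule concrete, here is a toy sketch (purely illustrative, not from the paper or its benchmark) of the kind of time-window check an AV's map layer would need to attach to the lane: the bus-only restriction binds only inside the posted windows, so at 10:30 the lane is open.

```python
from datetime import time

# Posted restriction windows from the example sign: 7:00-9:00 and 17:00-19:00.
BUS_LANE_WINDOWS = [(time(7, 0), time(9, 0)), (time(17, 0), time(19, 0))]


def bus_lane_restricted(now: time) -> bool:
    """Return True if the bus-only rule is in force at clock time `now`."""
    return any(start <= now < end for start, end in BUS_LANE_WINDOWS)


print(bus_lane_restricted(time(10, 30)))  # False: at 10:30 AM the lane is open to all traffic
```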

2024-10 · 8 min · 1655 words
[DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Xu_DriveGPT4-V2_Harnessing_Large_Language_Model_Capabilities_for_Enhanced_Closed-Loop_Autonomous_CVPR_2025_paper.pdf)

Can LLMs Actually Drive? Inside DriveGPT4-V2’s Closed-Loop Control System

The dream of autonomous driving has been fueled by rapid advancements in Artificial Intelligence. For years, the industry relied on modular pipelines—separate systems for detecting lanes, identifying pedestrians, planning routes, and controlling the steering wheel. However, the field is shifting toward end-to-end learning, where a single neural network takes raw sensor data and outputs driving commands. Simultaneously, we have witnessed the explosion of Large Language Models (LLMs) like GPT-4 and LLaMA. These models possess incredible reasoning capabilities and vast amounts of pretrained knowledge about the world. This raises the question: Can we put an LLM in the driver’s seat? ...

9 min · 1900 words