[InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing 🔗](https://arxiv.org/abs/2505.24315)

How AI Learns to Grasp: Inside InteractAnything

Imagine you are building a virtual world. You have a 3D model of a chair and a 3D model of a human. Now, you want the human to sit on the chair. In traditional animation, this is a manual, tedious process. You have to drag the character, bend their knees, ensure they don’t clip through the wood, and place their hands naturally on the armrests. Now, imagine asking an AI to “make the person sit on the chair,” and it just happens. ...

2025-05 · 9 min · 1854 words
[INTERMIMIC: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions 🔗](https://arxiv.org/abs/2502.20390)

Mastering the Physics of Interaction: How InterMimic Teaches Virtual Humans to Handle the Real World

In the world of computer animation and robotics, walking is a solved problem. We can simulate bipedal locomotion with impressive fidelity. However, as soon as you ask a virtual character to interact with the world—pick up a box, sit on a chair, or push a cart—the illusion often breaks. Hands float inches above objects, feet slide through table legs, or the character simply flails and falls over. This is the challenge of Physics-Based Human-Object Interaction (HOI). Unlike standard animation, where characters move along predefined paths (kinematics), physics-based characters must use virtual muscles (actuators) to generate forces. They must balance, account for friction, and manipulate dynamic objects that have mass and inertia. ...
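The "virtual muscles" mentioned above are typically driven by proportional-derivative (PD) control: a controller compares the current joint state to a target pose and outputs torques for the physics simulator. Below is a minimal NumPy sketch of that generic idea; the joint values and gains are toy numbers, and this is not InterMimic's actual controller.

```python
import numpy as np

def pd_torques(q, qdot, q_target, kp, kd):
    """Proportional-derivative control: turn a target pose into joint torques
    for a physics-based character (generic sketch, not InterMimic's controller)."""
    return kp * (q_target - q) - kd * qdot

# Toy example: 3 joints (e.g., hip, knee, ankle), angles in radians.
q        = np.array([0.10, -0.20, 0.05])   # current joint angles
qdot     = np.array([0.00,  0.30, 0.00])   # current joint velocities
q_target = np.array([0.40, -0.80, 0.10])   # pose we want the character to reach
kp, kd   = 200.0, 10.0                      # stiffness and damping gains

tau = pd_torques(q, qdot, q_target, kp, kd)
print(tau)  # torques the simulator would apply at each joint this timestep
```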

2025-02 · 8 min · 1495 words
[Instruction-based Image Manipulation by Watching How Things Move 🔗](https://arxiv.org/abs/2412.12087)

InstructMove - How Watching Videos Teaches AI to Perform Complex Image Edits

The field of text-to-image generation has exploded in recent years. We can now conjure hyper-realistic scenes from a simple sentence. However, a significant challenge remains: editing. Once an image is generated (or if you have a real photo), how do you change specific elements—like making a person smile or rotating a car—without destroying the rest of the image’s identity? ...

2024-12 · 9 min · 1807 words
[Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting 🔗](https://arxiv.org/abs/2503.16979)

Real-Time Holography? How Instant Gaussian Stream Transforms Dynamic 3D Video

The dream of “Holographic” communication—where you can view a remote event from any angle in real-time—has long been a staple of science fiction. In computer vision, this is known as Free-Viewpoint Video (FVV). The goal is to reconstruct dynamic 3D scenes from multiple camera feeds instantly. While recent technologies like 3D Gaussian Splatting (3DGS) have revolutionized how we render static scenes, handling dynamic scenes (videos where people and objects move) remains a massive computational bottleneck. Traditional methods require processing the entire video offline, which is useless for live interactions like virtual meetings or sports broadcasting. Even existing “streaming” methods often take over 10 seconds to process a single frame, creating an unacceptable lag. ...

2025-03 · 7 min · 1481 words
[Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning 🔗](https://arxiv.org/abs/2503.00513)

Bridging the Gap: How Inst3D-LMM Teaches AI to Understand 3D Scenes Like Humans

Imagine asking a robot to “pick up the red mug next to the laptop.” To us, this is a trivial request. To an AI, it is a geometric and semantic nightmare. The AI must identify objects in 3D space, understand what “red” and “mug” look like, and figure out the spatial relationship “next to.” While Large Language Models (LLMs) have mastered text, and Vision-Language Models (VLMs) have conquered 2D images, 3D scene understanding remains a frontier filled with challenges. Most current approaches awkwardly stitch together 2D image data and 3D point clouds, often losing the fine-grained details that make a scene coherent. They struggle to understand how objects relate to one another in physical space and are notoriously computationally expensive. ...

2025-03 · 9 min · 1711 words
[Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models 🔗](https://arxiv.org/abs/2411.14432)

Can AI See and Think? Unpacking Insight-V's Multi-Agent Visual Reasoning

In the rapid evolution of Artificial Intelligence, we have witnessed a shift from models that simply predict the next word to models that can solve complex logic puzzles. With the release of systems like OpenAI’s o1, text-based Large Language Models (LLMs) have demonstrated “System 2” thinking—the ability to deliberate, reason step-by-step, and self-correct before answering. However, there is a glaring gap in this progress: Vision. While Multimodal Large Language Models (MLLMs)—models that can see and talk—have become excellent at describing images (perception), they often struggle when asked to perform complex reasoning about what they see. If you show an AI a chart and ask for a deep economic analysis, or a geometric figure and ask for a multi-step proof, it frequently hallucinates or takes a shortcut to the wrong answer. ...

2024-11 · 11 min · 2272 words
[IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera 🔗](https://arxiv.org/abs/2410.08107)

Seeing the Unseen: Incremental 3D Reconstruction with Event Cameras and Gaussian Splatting

Imagine a drone flying at high speed through a dimly lit tunnel. A standard camera would likely fail in this scenario; the fast motion causes severe blur, and the low light results in grainy, unusable footage. This is the bottleneck for many robotics applications today. However, there is a different kind of sensor that thrives in exactly these conditions: the Event Camera. Event cameras are bio-inspired sensors that work differently from the cameras in our phones. Instead of capturing full frames at a fixed rate, they operate asynchronously, detecting changes in brightness at the pixel level. This gives them incredible advantages: microsecond latency, high dynamic range, and no motion blur. ...
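The change-driven sensing described above can be approximated from ordinary grayscale frames: a pixel emits an event whenever its log-brightness changes by more than a contrast threshold. Here is a minimal sketch of that simulation; the threshold and toy frames are illustrative, and this is not IncEventGS's reconstruction pipeline.

```python
import numpy as np

def events_from_frames(prev_frame, curr_frame, threshold=0.2):
    """Approximate event-camera output from two grayscale frames:
    emit +1/-1 events where log-brightness changed by more than `threshold`.
    Illustrative simulation only, not the sensor's asynchronous circuit."""
    eps = 1e-6
    delta = np.log(curr_frame + eps) - np.log(prev_frame + eps)
    events = np.zeros_like(delta, dtype=np.int8)
    events[delta >  threshold] =  1   # brightness increased -> positive event
    events[delta < -threshold] = -1   # brightness decreased -> negative event
    return events

# Toy frames with intensities in [0, 1]; in practice these come from a video stream.
prev = np.random.rand(4, 4)
curr = np.clip(prev + np.random.randn(4, 4) * 0.1, 0.0, 1.0)
print(events_from_frames(prev, curr))
```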

2024-10 · 10 min · 1948 words
[InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment 🔗](https://arxiv.org/abs/2503.18454)

Aligning Diffusion Models Faster and Better: A Deep Dive into InPO

If you have ever played with Text-to-Image (T2I) models like Stable Diffusion, you know the struggle: you type a prompt, get a weird result, tweak the prompt, get a slightly less weird result, and repeat. While these models are powerful, they aren’t naturally aligned with human aesthetic preferences or detailed instruction following. In the world of Large Language Models (LLMs), we solved this using Reinforcement Learning from Human Feedback (RLHF) and, more recently, Direct Preference Optimization (DPO). These methods take “winning” and “losing” outputs and teach the model to prefer the winner. ...
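For reference, the "prefer the winner" idea in DPO can be written as a loss over the log-probabilities that the policy and a frozen reference model assign to the winning and losing outputs. The sketch below shows the standard DPO objective with toy numbers; it is not the InPO objective itself.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO: push the policy to widen the margin between 'winning'
    and 'losing' outputs relative to a frozen reference model."""
    policy_margin = logp_win - logp_lose
    ref_margin = ref_logp_win - ref_logp_lose
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of 4 preference pairs (log-probabilities of whole outputs).
logp_win      = torch.tensor([-10.0, -12.0, -9.5, -11.0])
logp_lose     = torch.tensor([-11.0, -11.5, -10.0, -12.5])
ref_logp_win  = torch.tensor([-10.5, -12.2, -9.8, -11.3])
ref_logp_lose = torch.tensor([-10.8, -11.6, -9.9, -12.0])

print(dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose))
```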

2025-03 · 6 min · 1196 words
[Improving Personalized Search with Regularized Low-Rank Parameter Updates 🔗](https://arxiv.org/abs/2506.10182)

Teaching Old Models New Tricks—Personalized Search with POLAR

Imagine you have a digital photo album containing thousands of images. You want to find a picture of your specific pet, “Fido,” catching a frisbee. You type “Fido catching a frisbee” into the search bar. Vision-Language Models (VLMs) like CLIP differ from conventional object detectors because they can understand open-ended text. However, they have a major limitation: they know what a dog looks like, but they don’t know what your dog, Fido, looks like. ...

2025-06 · 9 min · 1832 words
[Improving Gaussian Splatting with Localized Points Management 🔗](https://arxiv.org/abs/2406.04251)

Fixing the Flaws in Gaussian Splatting with Localized Point Management

The arrival of 3D Gaussian Splatting (3DGS) marked a paradigm shift in neural rendering. Unlike Neural Radiance Fields (NeRFs), which rely on expensive ray marching through implicit volumes, 3DGS utilizes explicit point clouds—specifically, 3D Gaussians—to render scenes in real-time with photorealistic quality. However, despite its speed and visual fidelity, 3DGS has a “messy room” problem. The quality of the final render is heavily dependent on how the Gaussian points are distributed. If the initialization (usually via Structure from Motion) is poor, or if the optimization process fails to place points where they are needed, the model suffers from artifacts. You might see “floaters” (random blobs floating in space), blurred details in complex geometry, or erroneous depth estimations. ...

2024-06 · 9 min · 1735 words
[Implicit Correspondence Learning for Image-to-Point Cloud Registration 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_Implicit_Correspondence_Learning_for_Image-to-Point_Cloud_Registration_CVPR_2025_paper.pdf)

Beyond Matching: How Implicit Learning Solves Image-to-Point Cloud Registration

Imagine you are a robot navigating a city. You have a pre-built 3D map of the city (a point cloud), and you just took a picture with your onboard camera. To know where you are, you need to figure out exactly where that 2D picture fits inside that massive 3D world. This problem is known as Image-to-Point Cloud Registration. It sounds simple in theory—just line up the picture with the 3D model—but in practice, it is incredibly difficult. Why? Because you are trying to match two completely different types of data: a 2D grid of pixels (the image) and an unordered set of spatial coordinates (the point cloud). ...

10 min · 1995 words
[ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yang_ImagineFSL_Self-Supervised_Pretraining_Matters_on_Imagined_Base_Set_for_VLM-based_CVPR_2025_paper.pdf)

Dreaming of Data: How ImagineFSL Revolutionizes Few-Shot Learning with Synthetic Pretraining

In the world of deep learning, data is the fuel that powers the engine. But what happens when that fuel runs low? This is the core challenge of Few-Shot Learning (FSL)—teaching a model to recognize new concepts with only one or a handful of examples. Recently, Vision-Language Models (VLMs) like CLIP have shown incredible promise in this area. However, adapting these massive models to specific tasks with tiny datasets remains a hurdle. The community’s recent answer has been Generative AI. If we don’t have enough data, why not just generate it using Text-to-Image (T2I) models like Stable Diffusion? ...

8 min · 1641 words
[Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays 🔗](https://arxiv.org/abs/2312.02971)

Solving the Ghosting Problem - How to Reconstruct Images from Multiplexed Single-Photon Detectors

Imagine trying to take a picture in near-total darkness, where light is so scarce that you are counting individual particles—photons—as they hit the sensor. This is the realm of single-photon detectors. These devices are revolutionizing fields ranging from biological imaging and Lidar to quantum optics. Among these, Superconducting Nanowire Single-Photon Detectors (SNSPDs) are the gold standard. They offer incredible efficiency and precision. However, they have a major scaling problem: they need to be cooled to cryogenic temperatures. If you want a megapixel camera, you can’t easily run a million wires out of a deep-freeze cryostat without introducing too much heat. ...

2023-12 · 9 min · 1726 words
[Image Quality Assessment: From Human to Machine Preference 🔗](https://arxiv.org/abs/2503.10078)

Why AI Hates Your "Perfect" Image: The Shift to Machine-Centric Quality Assessment

Have you ever compressed an image so much that it looked blocky and pixelated, yet your phone still recognized the face in it perfectly? Conversely, have you ever taken a photo that looked fine to you, but your smart camera refused to focus or detect the object? For decades, the field of image processing has been obsessed with one question: “Does this look good to a human?” We built compression algorithms (like JPEG), cameras, and restoration filters designed to please the Human Visual System (HVS). But the world has changed. According to recent data, Machine-to-Machine (M2M) connections have surpassed Human-to-Machine connections. Today, the primary consumer of visual data isn’t you or me—it’s Artificial Intelligence. ...

2025-03 · 9 min · 1731 words
[ImViD: Immersive Volumetric Videos for Enhanced VR Engagement 🔗](https://arxiv.org/abs/2503.14359)

Beyond Static VR: Building True Immersion with ImViD and Volumetric Video

The dream of Virtual Reality (VR) has always been the “Holodeck” concept—the ability to step into a digital recording of the real world and experience it exactly as if you were there. You want to be able to walk around, lean in to see details, look behind you, and hear the soundscape change as you move. While we have made massive strides with technologies like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, we hit a wall when it comes to dynamic scenes. Most current datasets are either static (frozen in time), object-centric (looking at a single object from the outside), or silent (no audio). ...

2025-03 · 8 min · 1540 words
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition 🔗](https://arxiv.org/abs/2411.17440)

ConsisID: Cracking the Code of Identity Preservation in AI Video Generation

Imagine you want to direct a short film. You have a script, and you have a photo of your lead actor. In the traditional world, this requires cameras, lighting crews, and days of shooting. In the world of Generative AI, we have moved closer to doing this with just a text prompt. However, if you have ever tried to generate a video of a specific person using standard text-to-video models, you have likely encountered the “Identity Problem.” You might upload a photo of yourself and ask for a video of you playing basketball. The result? A person playing basketball who looks vaguely like you in one frame, like your cousin in the next, and like a complete stranger in the third. ...

2024-11 · 8 min · 1583 words
[ICP: Immediate Compensation Pruning for Mid-to-high Sparsity 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Luo_ICP_Immediate_Compensation_Pruning_for_Mid-to-high_Sparsity_CVPR_2025_paper.pdf)

Squeezing 7B Models onto Consumer GPUs: A Deep Dive into Immediate Compensation Pruning (ICP)

If you have ever tried to run a state-of-the-art Large Language Model (LLM) like Llama-2 or a vision model like Segment Anything (SAM) on a single consumer-grade GPU, you know the struggle. These models are massive. A 7-billion parameter model is often the upper limit of what a decent desktop GPU can handle for inference, let alone fine-tuning. To deploy these models efficiently, we often turn to pruning—the process of removing unnecessary weights to make the model smaller and faster. However, there is a catch. Current “one-shot” pruning methods (which are fast and don’t require expensive retraining) work great when you remove 20% or 30% of the weights. But if you try to push the sparsity to 50% or 70% to significantly reduce the model size, performance collapses. ...
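To make "removing weights" and "sparsity" concrete, the simplest baseline is one-shot magnitude pruning: zero out the smallest-magnitude entries of a weight matrix. The sketch below shows that naive baseline (the one whose accuracy collapses at high sparsity), not ICP's compensation scheme.

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """One-shot magnitude pruning: zero the smallest `sparsity` fraction of
    weights. The naive baseline that degrades badly at mid-to-high sparsity."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(8, 8)
w_pruned = magnitude_prune(w, sparsity=0.7)
print(f"zeros: {(w_pruned == 0).float().mean().item():.0%}")  # ~70% of entries removed
```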

9 min · 1839 words
[ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models 🔗](https://arxiv.org/abs/2503.19902)

Beyond Generation: How ICE Teaches AI to Understand What It Sees

In the rapidly evolving world of Generative AI, we have become accustomed to a specific direction of flow: Text-to-Image (T2I). You type “a futuristic city made of crystal,” and a diffusion model like Stable Diffusion paints it for you. These models are incredibly powerful, having ingested massive datasets that effectively encode a vast amount of “world knowledge.” They know what a city looks like, they know what crystal looks like, and they know how to combine them. ...

2025-03 · 10 min · 2011 words
[HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis 🔗](https://arxiv.org/abs/2503.16944)

HyperLoRA Explained: Instant Personalized LoRAs Without Fine-Tuning

In the rapidly evolving world of Generative AI, one desire stands out above almost all others: Personalization. We all want to put ourselves, our friends, or specific characters into new, imagined worlds. Whether it’s seeing yourself as an astronaut, a cyberpunk warrior, or an oil painting, the goal is high fidelity (it looks exactly like you) and high editability (you can change the background, lighting, and style). For a long time, we have been stuck between two extremes to achieve this: ...

2025-03 · 9 min · 1879 words
[HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset 🔗](https://arxiv.org/abs/2412.02317)

Automating Animation: How HumanRig Bridges the Gap Between AI 3D Generation and Motion

We are currently witnessing a “Cambrian Explosion” in the world of 3D content generation. Creating a detailed 3D humanoid character used to take an artist days; with the advent of text-to-image and image-to-3D models, it now takes seconds. But there is a massive bottleneck that sits between a static 3D model and a playable video game character: Rigging. Rigging is the digital equivalent of putting a skeleton inside a statue. It involves defining bones (skeleton construction) and telling the computer which parts of the “skin” (the mesh) should move with which bone (skinning). Without rigging, a 3D model is just a statue—it cannot walk, wave, or dance. ...
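The skinning step described above is commonly implemented as linear blend skinning (LBS): each mesh vertex moves by a weighted blend of its bones' transforms. The sketch below shows generic LBS with toy bones and weights standing in for what an automatic rigging system would predict; the blending itself is standard, not specific to HumanRig.

```python
import numpy as np

def linear_blend_skinning(vertices, bone_transforms, weights):
    """Linear blend skinning: v' = sum_b w[v, b] * (R_b @ v + t_b).
    vertices:        (V, 3)   rest-pose mesh vertices
    bone_transforms: (B, 3, 4) per-bone [rotation | translation] matrices
    weights:         (V, B)   skinning weights, each row sums to 1"""
    V_h = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (V, 4)
    per_bone = np.einsum('bij,vj->vbi', bone_transforms, V_h)              # (V, B, 3)
    return np.einsum('vb,vbi->vi', weights, per_bone)                      # (V, 3)

# Toy example: 2 vertices, 2 bones (identity bone and one translated by +1 in x).
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
bones = np.stack([np.eye(3, 4),
                  np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
print(linear_blend_skinning(verts, bones, weights))
```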

2024-12 · 10 min · 2044 words