[ICP: Immediate Compensation Pruning for Mid-to-high Sparsity 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Luo_ICP_Immediate_Compensation_Pruning_for_Mid-to-high_Sparsity_CVPR_2025_paper.pdf)

Squeezing 7B Models onto Consumer GPUs: A Deep Dive into Immediate Compensation Pruning (ICP)

If you have ever tried to run a state-of-the-art Large Language Model (LLM) like Llama-2 or a vision model like Segment Anything (SAM) on a single consumer-grade GPU, you know the struggle. These models are massive. A 7-billion parameter model is often the upper limit of what a decent desktop GPU can handle for inference, let alone fine-tuning. To deploy these models efficiently, we often turn to pruning—the process of removing unnecessary weights to make the model smaller and faster. However, there is a catch. Current “one-shot” pruning methods (which are fast and don’t require expensive retraining) work great when you remove 20% or 30% of the weights. But if you try to push the sparsity to 50% or 70% to significantly reduce the model size, performance collapses. ...
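To make the idea of pruning to a target sparsity concrete, here is a minimal PyTorch sketch of plain one-shot magnitude pruning (the baseline idea the article builds on, not ICP's compensation mechanism itself). The layer size and the threshold logic are illustrative assumptions, not details from the paper.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights so that roughly `sparsity`
    of the entries become exactly zero (one-shot, no retraining)."""
    k = int(weight.numel() * sparsity)            # number of weights to drop
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    mask = weight.abs() > threshold               # keep only the larger weights
    return weight * mask

# Toy example: a hypothetical 4096x4096 linear layer pruned to 50% and 70%.
w = torch.randn(4096, 4096)
for s in (0.5, 0.7):
    pruned = magnitude_prune(w, s)
    zeros = (pruned == 0).float().mean().item()
    print(f"target sparsity {s:.0%} -> {zeros:.2%} of weights are zero")
```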

9 min · 1839 words
[ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models 🔗](https://arxiv.org/abs/2503.19902)

Beyond Generation: How ICE Teaches AI to Understand What It Sees

Introduction In the rapidly evolving world of Generative AI, we have become accustomed to a specific direction of flow: Text-to-Image (T2I). You type “a futuristic city made of crystal,” and a diffusion model like Stable Diffusion paints it for you. These models are incredibly powerful, having ingested massive datasets that effectively encode a vast amount of “world knowledge.” They know what a city looks like, they know what crystal looks like, and they know how to combine them. ...

2025-03 · 10 min · 2011 words
[HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis 🔗](https://arxiv.org/abs/2503.16944)

HyperLoRA Explained: Instant Personalized LoRAs Without Fine-Tuning

Introduction In the rapidly evolving world of Generative AI, one desire stands out above almost all others: Personalization. We all want to put ourselves, our friends, or specific characters into new, imagined worlds. Whether it’s seeing yourself as an astronaut, a cyberpunk warrior, or an oil painting, the goal is high fidelity (it looks exactly like you) and high editability (you can change the background, lighting, and style). For a long time, we have been stuck between two extremes to achieve this: ...

2025-03 · 9 min · 1879 words
[HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset 🔗](https://arxiv.org/abs/2412.02317)

Automating Animation: How HumanRig Bridges the Gap Between AI 3D Generation and Motion

Introduction We are currently witnessing a “Cambrian Explosion” in the world of 3D content generation. Creating a detailed 3D humanoid character used to take an artist days; with the advent of text-to-image and image-to-3D models, it now takes seconds. But there is a massive bottleneck that sits between a static 3D model and a playable video game character: Rigging. Rigging is the digital equivalent of putting a skeleton inside a statue. It involves defining bones (skeleton construction) and telling the computer which parts of the “skin” (the mesh) should move with which bone (skinning). Without rigging, a 3D model is just a statue—it cannot walk, wave, or dance. ...

2024-12 · 10 min · 2044 words
[HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yang_HuPerFlow_A_Comprehensive_Benchmark_for_Human_vs._Machine_Motion_Estimation_CVPR_2025_paper.pdf)

Seeing Like a Human: Why AI Needs to Learn Our Visual Illusions

Introduction Imagine you are driving down a highway. Your eyes are constantly scanning the environment, tracking the speed of the car in front of you, the trees rushing past in your peripheral vision, and the slight drift of your own vehicle. You are performing a complex calculation known in computer vision as Optical Flow estimation—determining how pixels move from one moment to the next. For decades, computer vision researchers have been training AI models to master this task. They use “Ground Truth” data—mathematically perfect calculations of where every pixel actually moved. And modern AI is incredible at this; in many cases, it is far more precise than the human eye. ...

9 min · 1779 words
[How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions 🔗](https://arxiv.org/abs/2504.12284)

LatentAct: Teaching AI 'How to Do That' by Tokenizing Hand Interactions

Imagine teaching someone how to repair a bicycle. You rarely give them a list of coordinate geometries or vector rotations. Instead, you show them. You demonstrate how the hand should grip the wrench, the specific twisting motion required, and exactly where the fingers need to apply pressure. In the world of robotics and computer vision, this natural form of instruction—demonstrating the “how”—is incredibly difficult to replicate. Most current systems rely on precise 3D models of objects to plan interactions. But what happens when we want an agent to interact with everyday objects—things that are thin, transparent, deformable, or simply don’t have a pre-existing 3D scan? ...

2025-04 · 8 min · 1625 words
[HotSpot: Signed Distance Function Optimization with an Asymptotically Sufficient Condition 🔗](https://arxiv.org/abs/2411.14628)

Turning Up the Heat: Solving the Stability Crisis in Neural 3D Reconstruction

Introduction In the world of 3D computer vision and graphics, representing shapes accurately is half the battle. While point clouds and meshes are classic formats, Implicit Neural Representations have taken the field by storm. Specifically, Neural Signed Distance Functions (SDFs) have become the gold standard for representing watertight, high-fidelity surfaces. An SDF is a mathematical function that tells you, for any point in 3D space, how far you are from the surface of an object. If you are inside the object, the value is negative; if you are outside, it is positive; and if you are exactly on the surface, the value is zero. ...
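To make that sign convention concrete, here is a tiny NumPy sketch of the analytic SDF of a sphere (a hand-written toy, not the neural SDF the paper optimizes): the value is the distance to the surface, negative inside, zero on the surface, positive outside.

```python
import numpy as np

def sphere_sdf(points: np.ndarray, center, radius: float) -> np.ndarray:
    """Signed distance from each 3D point to a sphere's surface:
    negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - np.asarray(center, dtype=float), axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # at the center      -> -1.0 (inside)
                [1.0, 0.0, 0.0],   # on the surface     ->  0.0
                [2.0, 0.0, 0.0]])  # one unit outside   -> +1.0
print(sphere_sdf(pts, center=(0.0, 0.0, 0.0), radius=1.0))  # [-1.  0.  1.]
```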

2024-11 · 8 min · 1492 words
[Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity 🔗](https://arxiv.org/abs/2412.06171)

Elementary, My Dear Watson: How Holmes-VAU Solves Video Anomalies Like a Detective

Introduction Imagine you are a detective reviewing CCTV footage of a busy city street. Hours of mundane traffic pass by: cars stopping at red lights, pedestrians crossing, rain falling. Suddenly, for three seconds, a car swerves erratically and clips a bus before speeding off. If you were a traditional computer vision model, you might flag a “spike” in an anomaly score at that timestamp. But you wouldn’t necessarily know why. Was it a fight? An explosion? A traffic accident? Furthermore, to understand that this was a “hit-and-run,” you need to watch the moments leading up to the swerve and the aftermath. You need context. ...

2024-12 · 9 min · 1745 words
[High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model 🔗](https://arxiv.org/abs/2504.01512)

From Flat to Form: How GS-RGBN Masters Single-Image 3D Reconstruction

Introduction One of the most captivating challenges in computer vision is the “Holy Grail” of 3D generation: taking a single, flat photograph of an object and instantly reconstructing a high-fidelity 3D model that looks good from every angle. Imagine snapping a photo of a toy on your desk and immediately importing it into a video game or a VR environment. While generative AI has made massive strides in 2D image creation, lifting that capability to 3D has proven significantly harder. The core problem is geometric ambiguity. A single image tells you what an object looks like from one specific angle, but it leaves the back, sides, and internal geometry completely up to interpretation. ...

2025-04 · 11 min · 2187 words
[High-Fidelity Lightweight Mesh Reconstruction from Point Clouds 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhang_High-Fidelity_Lightweight_Mesh_Reconstruction_from_Point_Clouds_CVPR_2025_paper.pdf)

Smart Meshing - How to Reconstruct High-Fidelity Lightweight 3D Models from Point Clouds

Introduction In the world of 3D computer vision and graphics, reconstructing a surface from a point cloud is a fundamental task. Whether you are scanning a room for AR applications or creating assets for a video game, the goal is often the same: take a cloud of disconnected dots and turn them into a watertight, smooth, and detailed 3D mesh. For years, the gold standard for this process has been Marching Cubes (MC). When combined with Neural Implicit Representations—specifically Signed Distance Functions (SDFs)—MC is incredibly reliable. However, it has a significant flaw: it is rigid. MC operates on a fixed resolution grid. If you want high detail, you need a high-resolution grid, which generates millions of tiny triangles, leading to massive file sizes and memory usage. If you want a lightweight file, you lower the grid resolution, but you immediately lose sharp edges and fine details. ...
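That resolution trade-off is easy to see in a toy experiment: run standard Marching Cubes (here via scikit-image's `measure.marching_cubes`, an assumed stand-in for any MC implementation) on the same sphere SDF at two grid resolutions and compare triangle counts. This only illustrates the rigidity described above, not the paper's lightweight meshing method.

```python
import numpy as np
from skimage import measure  # scikit-image's Marching Cubes

def sphere_sdf_grid(resolution: int, radius: float = 0.8) -> np.ndarray:
    """Sample a sphere's SDF on a resolution^3 grid covering [-1, 1]^3."""
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.sqrt(x**2 + y**2 + z**2) - radius

# Same shape, two grid resolutions: triangle count grows roughly with res^2.
for res in (32, 128):
    verts, faces, _, _ = measure.marching_cubes(sphere_sdf_grid(res), level=0.0)
    print(f"{res}^3 grid -> {len(faces):,} triangles")
```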

10 min · 1983 words
[Hardware-Rasterized Ray-Based Gaussian Splatting 🔗](https://arxiv.org/abs/2503.18682)

High-Fidelity at High FPS — Mastering Hardware-Rasterized Ray-Based Gaussian Splatting

Introduction In the rapidly evolving world of 3D reconstruction and rendering, we are currently witnessing a tug-of-war between two critical factors: speed and quality. On one hand, we have 3D Gaussian Splatting (3DGS), which took the world by storm with its ability to render scenes in real-time using rasterization. On the other hand, we have high-fidelity approaches like Ray-Based Gaussian Splatting (RayGS), which offer superior visual quality—particularly for complex geometries and view-dependent effects—but carry a heavy computational cost that makes them struggle in real-time applications, especially in Virtual Reality (VR). ...

2025-03 · 9 min · 1837 words
[HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos 🔗](https://arxiv.org/abs/2501.02973)

Unlocking World-Space Hand Motion: How HaWoR Solves Egocentric 3D Reconstruction

Imagine wearing a VR headset or AR glasses. You reach out to grab a virtual cup of coffee. For the experience to feel real, the system needs to know exactly where your hand is—not just in the camera’s view, but in the actual 3D room. This sounds simple, but it is a surprisingly difficult computer vision problem. In “egocentric” (first-person) video, two things are moving at once: your hands and your head (the camera). Traditional methods struggle to separate these motions. If you turn your head left, it looks like your hand moved right. Furthermore, your hands frequently drop out of the camera’s view, causing tracking systems to “forget” where they are. ...

2025-01 · 8 min · 1604 words
[HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos 🔗](https://arxiv.org/abs/2411.19167)

Beyond Single-View: Inside HOT3D, the New Benchmark for Egocentric Hand-Object Interaction

Introduction The dexterity of the human hand is a defining trait of our species. Whether we are assembling furniture, typing code, or whisking eggs, we constantly interact with the physical world to manipulate objects. For Artificial Intelligence, understanding these interactions is the holy grail of embodied perception. If an AI can truly understand how hands and objects move together in 3D space, we unlock possibilities ranging from teaching robots manual skills to creating augmented reality (AR) interfaces that turn any surface into a virtual keyboard. ...

2024-11 · 8 min · 1559 words
[HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation 🔗](https://arxiv.org/abs/2411.18335)

360° Vision: Solving Depth Estimation in the Wild with HELVIPAD

Introduction Imagine you are a mobile robot tasked with navigating a crowded university campus. To move safely, you need to know exactly how far away every object is—not just the ones directly in front of you, but the pedestrians approaching from the side, the pillars behind you, and the walls curving around you. You need 360-degree spatial awareness. Historically, the gold standard for this kind of omnidirectional sensing has been LiDAR (Light Detection and Ranging). LiDAR is precise and naturally covers 360 degrees. However, it is expensive, bulky, and the resulting point clouds become sparse at a distance. As a result, researchers have turned to stereo vision: using synchronized cameras to estimate depth, much like human eyes do. ...

2024-11 · 8 min · 1632 words
[H-MoRe: Learning Human-centric Motion Representation for Action Analysis 🔗](https://arxiv.org/abs/2504.10676)

Beyond Optical Flow: How H-MoRe Revolutionizes Human Motion Analysis

Introduction In the world of computer vision, understanding human movement is a cornerstone task. Whether it’s for healthcare rehabilitation systems, security surveillance, or generating realistic video animations, the computer needs to know not just where a person is, but how they are moving. For years, researchers have relied on two primary tools: Optical Flow (tracking every pixel’s movement) and Pose Estimation (tracking skeleton joints). While both are useful, they have significant flaws. Optical flow is noisy—it tracks blowing leaves and passing cars just as attentively as the human subject. Pose estimation is precise but overly abstract—it reduces a complex human body to a stick figure, losing crucial shape information. ...

2025-04 · 7 min · 1401 words
[GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Han_GroundingFace_Fine-grained_Face_Understanding_via_Pixel_Grounding_Multimodal_Large_Language_CVPR_2025_paper.pdf)

Beyond the Selfie: How GroundingFace Teaches AI to See Micro-Expressions and Makeup

In the rapidly evolving world of Computer Vision, Multimodal Large Language Models (MLLMs) have achieved something formerly thought impossible: they can look at an image and describe it with near-human fluency. Models like GPT-4V or LLaVA can identify a person in a photo, tell you they are smiling, and perhaps describe their clothing. However, if you ask these general models to identify specific, fine-grained details—like the precise location of “crow’s feet” wrinkles, the style of eyeliner applied, or the exact boundaries of a skin blemish—they often fail. They lack fine-grained grounding, the ability to link specific textual concepts to precise pixels on a high-resolution face. ...

8 min · 1495 words
[Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Dampfhoffer_Graph_Neural_Network_Combining_Event_Stream_and_Periodic_Aggregation_for_CVPR_2025_paper.pdf)

Breaking the Speed Limit: How Hybrid Event Graphs Enable Microsecond Optical Flow

Introduction Imagine driving a car at high speed. You rely on your eyes to detect motion instantly. Now, imagine if your brain only processed visual information in snapshots taken every few milliseconds. In that tiny blind spot between snapshots, a sudden obstacle could appear, and you wouldn’t react in time. This is the fundamental limitation of traditional, frame-based computer vision. Standard cameras capture the world as a series of still images. To calculate motion—specifically optical flow—algorithms compare one frame to the next. This introduces latency. You cannot detect motion until the next frame is captured and processed. For high-speed robotics, autonomous drones, or safety-critical systems, this delay (often tens of milliseconds) is an eternity. ...

9 min · 1721 words
[Gradient-Guided Annealing for Domain Generalization 🔗](https://arxiv.org/abs/2502.20162)

Aligning the Compass: How Gradient-Guided Annealing Solves the Domain Generalization Puzzle

Introduction Imagine you are training a robot to recognize cows. You show it thousands of pictures of cows standing in grassy fields. The robot gets a perfect score during training. Then, you take the robot to a snowy mountain range, show it a cow, and it stares blankly, identifying the object as a “rock.” Why did it fail? Because the robot didn’t learn what a cow looks like; it learned that “green background = cow.” When the green background disappeared (replaced by white snow), the model’s confidence collapsed. ...

2025-02 · 9 min · 1851 words
[Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion 🔗](https://arxiv.org/abs/2412.00505)

Perception Over Pixels: Solving the Image Compression Trilemma with Wasserstein Distortion

There is an old project management adage that applies frustratingly well to engineering: “Good, Fast, Cheap. Pick two.” In the world of image compression, this “impossible trinity” dictates the limits of our technology. You can have high visual fidelity (Good) and low file size (Cheap), but it will likely require computationally expensive, slow AI models to decode. Conversely, you can have a codec that is lightning fast and produces tiny files (like standard JPEGs at low quality), but the result will look blocky, blurry, and distinctly “digital.” ...

2024-12 · 10 min · 1951 words
[Goku: Flow Based Video Generative Foundation Models 🔗](https://arxiv.org/abs/2502.04896)

Inside Goku: How Rectified Flow and Joint Training are Revolutionizing Video Generation

Introduction The race for generative video dominance has been one of the most exciting developments in artificial intelligence over the last few years. While diffusion models have become the standard for generating stunning static images, applying them to video—with its added dimension of time—has introduced massive computational bottlenecks and stability issues. Most current video models treat video generation as an extension of image generation, often patching together spatial and temporal attention modules. However, a new contender has emerged from researchers at The University of Hong Kong and ByteDance. Named Goku, this family of models proposes a unified, industry-grade solution that handles both images and videos within a single framework. ...

2025-02 · 8 min · 1571 words