[FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation 🔗](https://arxiv.org/abs/2412.02690)

Solved: Why AI Struggles with Hands and How FoundHand Fixes It

If you have ever played with generative AI tools like Midjourney or Stable Diffusion, you have likely encountered the “hand problem.” You prompt for a photorealistic image of a person, and the face looks perfect, the lighting is cinematic, but the hands are a disaster. Extra fingers, impossible joints, or what looks like a bowl of spaghetti made of flesh. ...

2024-12 · 9 min · 1784 words
[ForestLPR: LiDAR Place Recognition in Forests Attending Multiple BEV Density Images 🔗](https://arxiv.org/abs/2503.04475)

Lost in the Woods? How ForestLPR Uses Tree Slices for Robust Robot Localization

Imagine you are hiking in a dense forest. You look around, and all you see are trees—trunks, branches, and leaves that look suspiciously similar to the trees you passed ten minutes ago. Now, imagine coming back to that same spot six months later. The leaves have fallen, the grass has grown, and the lighting is completely different. Could you recognize exactly where you are? This scenario highlights one of the hardest problems in robotics: Place Recognition in natural environments. ...

2025-03 · 8 min · 1656 words
[Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution 🔗](https://arxiv.org/abs/2412.15213)

Rethinking Generative AI: From Gaussian Noise to Direct Cross-Modal Evolution

For the past few years, the world of Generative AI has been dominated by a single, powerful narrative: Diffusion. Whether you are using DALL-E, Midjourney, or Stable Diffusion, the underlying process is conceptually similar. The model starts with a canvas of pure static (Gaussian noise) and, guided by your text prompt, iteratively denoises it until a coherent image emerges. It is a bit like carving a statue out of a block of marble, where the marble is random noise and the chisel is the text prompt. ...

2024-12 · 9 min · 1768 words
[FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute 🔗](https://arxiv.org/abs/2502.20126)

FlexiDiT: Smarter, Not Harder—Dynamic Compute for Diffusion Transformers

The landscape of generative AI has shifted dramatically with the adoption of Diffusion Transformers (DiTs). Models like Stable Diffusion 3 and Sora have demonstrated that replacing the traditional U-Net backbone with a Transformer architecture leads to scalable, high-fidelity results. However, this performance comes at a steep computational cost. Current diffusion models operate on a static paradigm: they allocate a fixed, heavy amount of compute to every single step of the denoising process. Whether the model is resolving the vague outline of a composition or refining the texture of a cat’s fur, it burns the same number of FLOPs (Floating Point Operations). ...

2025-02 · 7 min · 1488 words
[Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality 🔗](https://arxiv.org/abs/2412.16481)

Aligning Geometry with Hardware: How Flash3D Super-Scales Point Cloud Processing

In the rapidly evolving world of 3D deep learning, we are often forced to choose between two virtues: the geometric precision of the model and the computational efficiency of the hardware. Point clouds—the raw data format generated by LiDAR sensors in autonomous vehicles and robotics—are notoriously difficult to process. Unlike images, which are neat, dense grids of pixels, point clouds are sparse and irregular. To make sense of them, neural networks need to understand the spatial relationships between points (geometric locality). However, Graphics Processing Units (GPUs)—the workhorses of modern AI—prefer data that is dense, contiguous, and predictable (hardware locality). ...

2024-12 · 9 min · 1746 words
[FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement 🔗](https://arxiv.org/abs/2503.04919)

Beyond Bounding Boxes: How FirePlace Teaches AI to Arrange 3D Scenes

The “Floating Book” Problem: Imagine asking an AI to “place a book on the shelf.” To a human, this is a trivial task. You identify the shelf, find an empty spot on the flat surface, and place the book there upright or stacked. To a standard Multimodal Large Language Model (MLLM), however, this request is fraught with peril. The AI might understand the concept of a shelf and a book, but it lacks a fundamental understanding of 3D geometry. It might place the book floating six inches above the shelf, intersecting through the wood, or balancing precariously on the very edge. Why? Because most AI models treat objects as rough “bounding boxes”—cubes that encompass an object—rather than complex shapes with specific surfaces. If a shelf is treated as a solid box, you can’t put anything inside it. ...

2025-03 · 8 min · 1662 words
[FineVQ: Fine-Grained User Generated Content Video Quality Assessment 🔗](https://arxiv.org/abs/2412.19238)

Beyond the 5-Star Rating: How FineVQ Revolutionizes Video Quality Assessment with Multimodal AI

In the age of TikTok, YouTube Shorts, and Twitch, User-Generated Content (UGC) has become the dominant form of media consumption. Unlike professionally produced films shot on cinema cameras, UGC is wild and unpredictable. It is shot on smartphones, compressed by apps, streamed over spotty 5G connections, and viewed on varying screen sizes. For video platforms, understanding the quality of this content is a billion-dollar problem. If a recommendation algorithm pushes low-quality videos, users leave. However, traditional Video Quality Assessment (VQA) has a major blind spot: it usually reduces a video’s quality to a single scalar score—a “3.5 out of 5.” ...

2024-12 · 8 min · 1509 words
[Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning 🔗](https://arxiv.org/abs/2503.07591)

Slash Your AI Training Costs: A New Paradigm for Visual Instruction Tuning

If you have been following the explosion of Large Vision-Language Models (LVLMs) like LLaVA, GPT-4V, or Gemini, you know that their ability to understand and reason about images is nothing short of impressive. However, behind every capable model lies a massive, expensive bottleneck: Visual Instruction Tuning (VIT). To train these models, researchers compile massive datasets of images paired with complex textual instructions (Question-Answer pairs). Creating these datasets usually involves feeding thousands of images into expensive proprietary models like GPT-4 to generate descriptions and QA pairs. This creates a dilemma for students and researchers with limited budgets: to build a high-quality dataset, you need money. To save money, you often have to settle for lower-quality data. ...

2025-03 · 9 min · 1708 words
[Few-shot Implicit Function Generation via Equivariance 🔗](https://arxiv.org/abs/2501.01601)

Generative AI for Neural Weights: How Symmetry Solves the Few-Shot Problem

In the current landscape of Artificial Intelligence, we are accustomed to models that generate data: pixels for images, tokens for text, or waveforms for audio. But a new frontier is emerging—generating the models themselves. Imagine a system that doesn’t just output a 3D shape, but outputs the neural network weights that represent that shape. This is the promise of Implicit Neural Representations (INRs). INRs use simple Multi-Layer Perceptrons (MLPs) to represent complex continuous signals like 3D objects or gigapixel images. They offer infinite resolution and compact storage. ...

2025-01 · 8 min · 1618 words
[F3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Saha_F3OCUS_-_Federated_Finetuning_of_Vision-Language_Foundation_Models_with_Optimal_CVPR_2025_paper.pdf)

Balancing Act: Optimizing Federated Fine-Tuning for Vision-Language Models with F3OCUS

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) like LLaVA and BLIP have emerged as powerful tools capable of understanding and generating content based on both visual and textual inputs. These models hold immense promise for specialized fields such as healthcare, where a model might need to analyze a chest X-ray and answer a doctor’s natural language questions about it. However, deploying these massive “foundation models” in the real world presents a paradox. To make them useful for medical diagnosis, they must be fine-tuned on diverse, real-world medical data. Yet, strict privacy regulations (like HIPAA/GDPR) often prevent hospitals from sharing patient data with a central server. ...

9 min · 1908 words
[FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images 🔗](https://arxiv.org/abs/2503.19207)

From Phone Photos to Animatable Avatars in Seconds: Deep Dive into FRESA

Imagine taking a few quick photos of yourself with your smartphone—front, back, maybe a side profile—and within seconds, having a fully 3D, digital double. Not just a static statue, but a fully rigged, animatable avatar wearing your exact clothes, ready to be dropped into a VR chatroom or a video game. For years, this has been the “holy grail” of 3D computer vision. The reality, however, has been a trade-off between quality and speed. You could either have high-quality avatars generated in a studio with expensive camera rigs (photogrammetry), or you could use neural networks that require hours of optimization per person to learn how a specific t-shirt folds. Neither is scalable for everyday users. ...

2025-03 · 8 min · 1644 words
[FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video 🔗](https://arxiv.org/abs/2503.23094)

Grounding the Avatar: How Geometric Priors and Massive Data Solve Egocentric Motion Capture

If you have ever used a modern Virtual Reality (VR) headset, you have likely noticed something missing: your legs. Most current VR avatars are floating torsos with hands, ghosts drifting through a digital void. This isn’t a stylistic choice; it is a technical limitation. Tracking a user’s full body from a headset (egocentric motion capture) is incredibly difficult. The cameras on the headset can barely see the user’s lower body, often blocked by the chest or stomach (self-occlusion). When the cameras do see the legs, the perspective is distorted by fisheye lenses, and the rapid movement of the head makes the video feed chaotic. ...

2025-03 · 9 min · 1718 words
[FICTION: 4D Future Interaction Prediction from Video 🔗](https://arxiv.org/abs/2412.00932)

Beyond 2D: Predicting 'Where' and 'How' Humans Interact in 3D Space

Imagine a robot assistant observing you in the kitchen. You are making tea. You’ve just boiled the water. A truly helpful assistant shouldn’t just recognize that you are currently “standing.” It should anticipate that in the next few seconds, you will walk to the cabinet, reach your arm upward to grab a mug, and then move to the fridge to get milk. ...

2024-12 · 8 min · 1629 words
[FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation 🔗](https://arxiv.org/abs/2506.11543)

FIMA-Q: Unlocking Low-Bit Vision Transformers with Smarter Fisher Information Approximation

Vision Transformers (ViTs) have revolutionized computer vision, challenging the long-standing dominance of Convolutional Neural Networks (CNNs). By leveraging self-attention mechanisms, models like ViT, DeiT, and Swin Transformer have achieved remarkable results in classification and detection tasks. However, this performance comes with a hefty price tag: massive parameter counts and high computational overhead. To deploy these heavy models on edge devices—like smartphones or embedded systems—we need to compress them. The most popular method for this is Post-Training Quantization (PTQ). PTQ converts high-precision floating-point weights (32-bit) into low-precision integers (like 4-bit or 8-bit) without requiring a full, expensive retraining of the model. ...

2025-06 · 8 min · 1543 words
[Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think 🔗](https://arxiv.org/abs/2503.00948)

Motion Modeling is Easier Than You Think: Unlocking Dynamic Video Generation with Model Merging

Image-to-Video (I2V) generation is one of the most exciting frontiers in computer vision. The premise is magical: take a single still photograph—a car on a road, a dog in the grass, a castle on a hill—and breathe life into it. You want the car to drive, the dog to roll over, and the camera to zoom out from the castle. However, if you have played with current I2V diffusion models, you might have encountered a frustrating reality. Often, the “video” is just the input image with a slight wobble, or a “zooming” effect that looks more like a 2D scale than 3D camera movement. Conversely, if the model does generate movement, it often ignores your text prompts completely, creating chaotic motion that has nothing to do with what you asked for. ...

2025-03 · 8 min · 1602 words
[EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Yu_EventPSR_Surface_Normal_and_Reflectance_Estimation_from_Photometric_Stereo_Using_CVPR_2025_paper.pdf)

How Event Cameras Are Revolutionizing 3D Material Scanning

Creating realistic “digital twins” of real-world objects is a cornerstone of modern computer graphics, powering everything from movie VFX to immersive VR/AR experiences. To make a digital object look real, you need two things: its shape (surface normal) and its material properties (how shiny or rough it is). For years, this has been a tug-of-war between speed and quality. Traditional methods, like Photometric Stereo (PS), require capturing hundreds of High Dynamic Range (HDR) images under different lights. This is slow, data-heavy, and often fails on “tricky” materials—specifically, objects that are very shiny or metallic. ...

8 min · 1508 words
[Event fields: Capturing light fields at high speed, resolution, and dynamic range 🔗](https://arxiv.org/abs/2412.06191)

Event Fields: When High-Speed Vision Meets Light Field Imaging

Imagine trying to photograph a bullet speeding through the air. Now, imagine that after you’ve taken the photo, you decide you actually wanted to focus on the target behind the bullet, not the bullet itself. Traditionally, this is impossible. You would need a high-speed camera to freeze the motion, and a light field camera to change the focus. But high-speed cameras are data-hungry beasts, often requiring gigabytes of storage for a few seconds of footage, and light field cameras are notoriously slow or bulky. ...

2024-12 · 9 min · 1837 words
[Event Ellipsometer: Event-based Mueller-Matrix Video Imaging 🔗](https://arxiv.org/abs/2411.17313)

Seeing the Unseen: High-Speed Polarization Video with Event Cameras

In the world of computer vision, we usually obsess over the intensity of light—how bright or dark a pixel is. But light carries another hidden layer of information: polarization. When light bounces off an object, its electromagnetic orientation changes. These changes encode rich details about the object’s shape, material composition, and surface texture that standard cameras simply miss. To capture this information fully, scientists use Ellipsometry. This technique measures the “Mueller matrix,” a \(4 \times 4\) grid of numbers that completely describes how a material transforms polarized light. It is a powerful tool used in everything from biology to material science. ...

2024-11 · 9 min · 1895 words
[EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events 🔗](https://arxiv.org/abs/2505.04657)

Beyond Frames: How Event Cameras Are Revolutionizing Continuous Video Super-Resolution

We live in a world dominated by video content, yet we are often limited by the hardware that captures it. Most videos are archived at fixed resolutions (like 1080p) and fixed frame rates (usually 30 or 60 fps). But what if you want to zoom in on a distant detail without it becoming a pixelated mess? Or what if you want to slow down a fast-moving action shot without it looking like a slideshow? ...

2025-05 · 9 min · 1748 words
[Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras 🔗](https://arxiv.org/abs/2502.19630)

Seeing the Unseen: How Event Cameras Solve the 'Blind Time' Crisis in Autonomous Driving

Imagine you are driving down a highway at 60 miles per hour. For a split second, you close your eyes. In that brief moment, the car in front of you slams on its brakes. That split second—where you have no visual information—is terrifying. Now, consider an autonomous vehicle. These systems rely heavily on sensors like LiDAR and standard frame-based cameras. While sophisticated, these sensors have a fundamental limitation: they operate at a fixed frame rate, typically around 10 to 20 Hz. This means there is a gap, often up to 100 milliseconds, between every snapshot of the world. In the research community, this is known as “blind time.” ...

2025-02 · 9 min · 1746 words