[Structure from Collision 🔗](https://arxiv.org/abs/2505.21335)

Cracking the Shell - How Collisions Reveal Invisible Internal Structures in NeRFs

Imagine looking at a pristine, opaque billiard ball sitting on a table. Now, imagine a ping-pong ball sitting next to it, painted to look exactly like that billiard ball. To a camera—and to standard computer vision algorithms—these two objects are identical. They share the same geometry and the same surface texture. However, if you were to drop both balls, their true nature would instantly reveal itself. The solid billiard ball would land with a heavy thud, barely deforming. The hollow ping-pong ball would bounce, vibrate, and deform upon impact. The motion betrays the structure. ...

2025-05 · 11 min · 2169 words
[SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving 🔗](https://arxiv.org/abs/2411.15482)

SplatFlow: Mastering Dynamic Scene Reconstruction Without Bounding Boxes

The race toward fully autonomous driving relies heavily on one critical resource: data. While real-world driving logs are invaluable, they are finite and often fail to capture the “long tail” of rare, dangerous edge cases. This is where simulation steps in. If we can create photorealistic, physics-compliant digital twins of the real world, we can train and test autonomous vehicles (AVs) in infinite variations of complex scenarios. However, reconstructing a dynamic urban environment from sensor data is notoriously difficult. Modern techniques like Neural Radiance Fields (NeRFs) and the more recent 3D Gaussian Splatting (3DGS) have revolutionized static scene reconstruction. They can render buildings and parked cars with breathtaking fidelity. But put a moving truck in the frame, and things fall apart. The moving object often appears as a ghostly, blurred trail, or artifacts corrupt the static background. ...

2024-11 · 10 min · 1948 words
[SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Tang_SpecTRe-GS_Modeling_Highly_Specular_Surfaces_with_Reflected_Nearby_Objects_by_CVPR_2025_paper.pdf)

SpecTRe-GS: Bringing Realistic Mirrors and Reflections to 3D Gaussian Splatting

If you have been following the rapid advancements in 3D computer vision, you have undoubtedly encountered 3D Gaussian Splatting (3DGS). It has revolutionized the field by offering real-time rendering speeds coupled with high-quality reconstruction. However, like any burgeoning technology, it has its Achilles’ heel. For 3DGS, that weakness is mirrors and shiny objects. Standard 3DGS struggles to render highly specular surfaces—materials like polished metal, glass, or mirrors that reflect their surroundings sharply. When you look at a mirror in a 3DGS scene, you often see a blurry, incoherent mess rather than a crisp reflection of the nearby teapot or toy car. ...

9 min · 1763 words
[SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models 🔗](https://arxiv.org/abs/2505.00788)

Beyond 2D: Teaching Large Multimodal Models to Understand 3D Space with SpatialLLM

Imagine you are crossing a busy street. You see a white van and a cyclist. Your brain instantly processes not just what these objects are, but where they are in three-dimensional space and where they are going. You instinctively know that the van is facing you (potentially dangerous) while the cyclist is moving parallel to you. This is 3D spatial reasoning, a capability so fundamental to human cognition that we rarely think about it. ...

2025-05 · 8 min · 1552 words
[Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_Spatial457_A_Diagnostic_Benchmark_for_6D_Spatial_Reasoning_of_Large_CVPR_2025_paper.pdf)

Can AI Really See in 3D? Inside Spatial457, the Benchmark Exposing 6D Reasoning Gaps

We are currently witnessing a golden age of Large Multimodal Models (LMMs). Systems like GPT-4o and Gemini have demonstrated an uncanny ability to interpret visual scenes, describe objects in poetic detail, and answer questions about images with human-like fluency. If you show these models a picture of a busy street, they can list the cars, the pedestrians, and the color of the traffic light. But there is a subtle, yet critical, difference between identifying what is in an image and understanding where it is in physical space. ...

8 min · 1585 words
[SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding 🔗](https://arxiv.org/abs/2504.05576)

Hearing the Unseen: How SoundVista Synthesizes 3D Audio from Visual Cues

Imagine walking through a virtual museum or a digital twin of a historical site. The visuals are photorealistic, thanks to recent advances in 3D reconstruction and NeRF (Neural Radiance Fields) technology. But close your eyes, and the illusion often breaks. The sound might be flat, static, or incorrectly spatialized. While we have mastered Novel-View Synthesis for eyes (creating new visual angles from sparse photos), Novel-View Acoustic Synthesis (NVAS)—generating accurate sound for a specific location in a room based on recordings from other spots—remains a massive challenge. Real-world sound is messy. Ambient noise isn’t just a single speaker; it’s the hum of a refrigerator, the distant traffic outside, the reflection of footsteps off a concrete wall, and the muffling effect of a sofa. ...

2025-04 · 8 min · 1511 words
[Sonata: Self-Supervised Learning of Reliable Point Representations 🔗](https://arxiv.org/abs/2503.16429)

Beyond Geometric Shortcuts: How Sonata Revolutionizes 3D Self-Supervised Learning

In the world of 2D computer vision, we are currently living in a golden age of self-supervised learning (SSL). Models like DINO and MAE have demonstrated that neural networks can learn robust, semantically rich representations of images without needing a single human-annotated label. You can take a pre-trained image model, freeze its weights, add a simple linear classifier on top (a process called “linear probing”), and achieve results that rival fully supervised training. ...

2025-03 · 10 min · 1978 words
[SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning 🔗](https://arxiv.org/abs/2412.04077)

Preserving World Knowledge: How SoMA Optimizes 'Minor' Components for Domain Generalization

Imagine training an autonomous vehicle in sunny California. The car performs flawlessly, detecting pedestrians, other vehicles, and traffic signs with high precision. Then, you ship that same car to London during a rainy, foggy night. Suddenly, the system falters. The “domain shift”—the difference between the sunny training data and the rainy real-world environment—causes the model to fail. This is the core challenge of Domain Generalization (DG): How do we build models that learn on one specific domain (source) but perform robustly on unseen, unpredictable domains (target)? ...

2024-12 · 7 min · 1462 words
[SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training 🔗](https://arxiv.org/abs/2412.09619)

SnapGen: How to Run High-Res Text-to-Image on Your Phone

The generative AI boom has brought us incredible tools like Stable Diffusion XL (SDXL) and Stable Diffusion 3 (SD3). These models can conjure photorealistic images from simple text prompts, but they come with a heavy cost: computational power. Typically, running these models requires massive, power-hungry GPUs found in cloud servers or high-end gaming PCs. For the average user, this means relying on cloud services, which introduces latency, subscription costs, and data privacy concerns. Running these models locally on a smartphone has been the “holy grail” of edge computing. While there have been attempts to shrink models through compression, they often result in low-resolution outputs (\(512 \times 512\) pixels) or significant drops in visual quality. ...

2024-12 · 8 min · 1495 words
[SmartCLIP: Modular Vision-language Alignment with Identification Guarantees 🔗](https://arxiv.org/abs/2507.22264)

Why CLIP Misses Details: Introducing SmartCLIP and Modular Alignment

If you have experimented with modern AI art generators or image search engines, you have likely interacted with CLIP (Contrastive Language-Image Pre-training). Since its release, CLIP has become the backbone of multimodal AI, serving as the bridge that allows computers to understand images through text. However, despite its massive success, CLIP has a fundamental problem: it struggles with details. If you describe a complex scene, CLIP tends to mash all the concepts together into a single entangled representation. Conversely, if you use a short caption, CLIP often discards visual information that isn’t explicitly mentioned in the text. ...

2025-07 · 8 min · 1609 words
[SkillMimic: Learning Basketball Interaction Skills from Demonstrations 🔗](https://arxiv.org/abs/2408.15270)

Hoops in the Matrix: How SkillMimic Teaches Physics-Based Characters to Play Basketball

If you’ve ever played a sports video game, you know that while the graphics look realistic, the underlying animation is often just a “playback” of a recorded motion. But in the world of robotics and physics-based simulation, we want something different: we want a digital character that actually “learns” to move its muscles to perform a task, adhering to the laws of physics. ...

2024-08 · 7 min · 1456 words
[Simulator HC: Regression-based Online Simulation of Starting Problem-Solution Pairs for Homotopy Continuation in Geometric Vision 🔗](https://arxiv.org/abs/2411.03745)

Simulator HC: How to "Cheat" at Math with AI to Solve Complex Geometric Vision Problems

If you have ever dabbled in 3D computer vision—building systems for Structure-from-Motion (SfM), Visual SLAM, or camera calibration—you know that at the bottom of every cool visualization lies a bedrock of nasty mathematics. Specifically, we often have to solve systems of polynomial equations. For decades, the field has relied on purely algebraic methods to solve these equations. While elegant, these methods struggle when problems get too complex, involving high degrees or many variables. They become computationally heavy and numerically unstable. ...

2024-11 · 9 min · 1789 words
[SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment 🔗](https://arxiv.org/abs/2503.09594)

SimLingo: Teaching Autonomous Cars to 'Dream' of Actions for Better Driving

For decades, the “holy grail” of autonomous driving has been a vehicle that doesn’t just navigate from point A to point B, but one that truly understands the world and can communicate with its passengers. We’ve seen incredible progress in Large Language Models (LLMs) that can reason about complex topics, and separate progress in autonomous driving systems that can navigate city streets. However, merging these two worlds has proven difficult. ...

2025-03 · 9 min · 1916 words
[Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models 🔗](https://arxiv.org/abs/2502.20134)

Show AND Tell: Bridging the Gap Between Heatmaps and Concepts in AI Explainability

The “Black Box” Problem: Imagine you are a doctor using an AI system to diagnose an X-ray. The AI predicts “Pneumonia” with 95% confidence. As a responsible practitioner, your immediate question isn’t just “Is it correct?” but rather “Why?” If the AI points to a specific shadow on the lung (the “Where”) but doesn’t tell you what it sees, you might be left guessing. Conversely, if the AI says it detects “fluid accumulation” (the “What”) but doesn’t tell you where it is, you can’t verify if it’s looking at the lung or an artifact in the background. ...

2025-02 · 9 min · 1813 words
[Shape Abstraction via Marching Differentiable Support Functions 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Park_Shape_Abstraction_via_Marching_Differentiable_Support_Functions_CVPR_2025_paper.pdf)

Beyond Boxes and Meshes: How Differentiable Support Functions are Revolutionizing 3D Shape Abstraction

In the world of computer vision and robotics, how a machine “sees” an object is just as important as the object itself. Imagine a robot trying to pick up a coffee mug. To us, it’s a simple cup. To a computer, it might be a dense cloud of millions of points, a heavy triangular mesh, or a complex neural radiance field. While these detailed representations are great for rendering pretty images, they are often terrible for physics: calculating a collision between a million-polygon mesh and a robotic hand is computationally expensive. This is where Shape Abstraction comes in. It simplifies complex objects into manageable primitives—like building a Lego version of a sculpture. ...

9 min · 1759 words
[Seurat: From Moving Points to Depth 🔗](https://arxiv.org/abs/2504.14687)

How Moving Dots Reveal the 3D World: A Deep Dive into Seurat

How do you know how far away an object is? If you close one eye and sit perfectly still, the world flattens. Depth perception becomes a guessing game based on shadows and familiar object sizes. But the moment you move your head, the world pops back into 3D. Nearby objects rush past your vision, while distant mountains barely budge. This phenomenon, known as motion parallax, is a fundamental way biological systems perceive geometry. ...

2025-04 · 11 min · 2155 words
[Self-Supervised Cross-View Correspondence with Predictive Cycle Consistency 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Baade_Self-Supervised_Cross-View_Correspondence_with_Predictive_Cycle_Consistency_CVPR_2025_paper.pdf)

Where is that Mug? Teaching AI to Match Objects Across Extreme Perspectives Without Labels

Imagine you are trying to teach a robot how to cook by having it watch a video of a human chef. The robot has its own camera (first-person, or “egocentric” view), but it is also watching a surveillance camera in the corner of the kitchen (third-person, or “exocentric” view). The human picks up a blue cup. To imitate this, the robot needs to know that the blue shape in the corner camera corresponds to the same object as the blue shape in its own camera. ...

9 min · 1726 words
[Seeing more with less: human-like representations in vision models 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Gizdov_Seeing_More_with_Less_Human-like_Representations_in_Vision_Models_CVPR_2025_paper.pdf)

Seeing More with Less: How Foveated Vision Optimizes AI Models

The human eye is a marvel of biological engineering, but it is also surprisingly economical. We do not perceive the world in uniform high definition. Instead, we possess a fovea—a small central region of high acuity—surrounded by a periphery that progressively blurs into low resolution. This mechanism allows us to process complex scenes efficiently, allocating limited biological resources (photoreceptors and optic nerve bandwidth) where they matter most. In contrast, modern Computer Vision (CV) models and Large Multimodal Models (LMMs) are brute-force processors. They typically ingest images at a uniform, high resolution across the entire Field of View (FOV). While effective, this approach is computationally expensive and bandwidth-heavy. ...

8 min · 1629 words
[SeedVR: Seeding Infinity in Diffusion Transformer Toward Generic Video Restoration 🔗](https://arxiv.org/abs/2501.01320)

SeedVR: Breaking the Speed and Resolution Limits of Video Restoration

Video restoration is a classic computer vision problem with a modern twist. We all have footage—whether it’s old family home movies, low-quality streams, or AI-generated clips—that suffers from blur, noise, or low resolution. The goal of Generic Video Restoration (VR) is to take these low-quality (LQ) inputs and reconstruct high-quality (HQ) outputs, recovering details that seem lost to time or compression. Recently, diffusion models have revolutionized this field. By treating restoration as a generative task, they can hallucinate realistic textures that traditional methods blur out. However, this power comes at a steep price: computational cost. ...

2025-01 · 8 min · 1677 words
[SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks 🔗](https://arxiv.org/abs/2503.06965)

Bridging the Gap Between Sky and Ground: A Deep Dive into SeCap for Cross-View Person Re-ID

In the evolving landscape of intelligent surveillance, we are witnessing a convergence of two distinct worlds: the ground and the sky. Traditional security systems rely heavily on CCTV cameras fixed at eye level or slightly above. However, the rapid proliferation of Unmanned Aerial Vehicles (UAVs), or drones, has introduced a new vantage point. This combination offers comprehensive coverage, but it introduces a massive computational headache known as Aerial-Ground Person Re-Identification (AGPReID). ...

2025-03 · 10 min · 1977 words