[Zero-Shot Monocular Scene Flow Estimation in the Wild 🔗](https://arxiv.org/abs/2501.10357)

Taming the Wild - A New Standard for Zero-Shot Monocular Scene Flow

Introduction Imagine you are looking at a standard video clip. It’s a 2D sequence of images. Your brain, processing this monocular (single-eye) view, instantly understands two things: the 3D structure of the scene (what is close, what is far) and the motion of objects (where things are moving in that 3D space). For computer vision models, replicating this human intuition is an incredibly difficult task known as Monocular Scene Flow (MSF). While we have seen massive leaps in static depth estimation and 2D optical flow, estimating dense 3D motion from a single camera remains an elusive frontier. ...

2025-01 · 8 min · 1609 words
[VGGT: Visual Geometry Grounded Transformer 🔗](https://arxiv.org/abs/2503.11651)

One Pass to Rule Them All: Understanding VGGT for Instant 3D Reconstruction

Introduction For decades, the field of computer vision has chased a specific “Holy Grail”: taking a handful of flat, 2D photos scattered around a scene and instantly transforming them into a coherent 3D model. Traditionally, this process—known as Structure-from-Motion (SfM)—has been a slow, mathematical grind. It involves detecting features, matching them across images, solving complex geometric equations to find camera positions, and then running iterative optimization algorithms like Bundle Adjustment to refine everything. While effective, it is computationally expensive and often brittle. ...

2025-03 · 9 min · 1886 words
[UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming 🔗](https://arxiv.org/abs/2307.16375)

Breaking the Distributed Bottleneck: How UniAP Unifies Parallel Training Strategies

If you have ever tried to train a massive Large Language Model (LLM) like Llama or a vision giant like ViT, you know the struggle: a single GPU simply doesn’t cut it. To train these behemoths, we need distributed learning across clusters of GPUs. But here is the catch: simply having a cluster isn’t enough. You have to decide how to split the model. Do you split the data? Do you split the layers? Do you split the tensors inside the layers? ...

2023-07 · 8 min · 1647 words
[The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition 🔗](https://arxiv.org/abs/2502.21201)

Can AI See the Chimp for the Trees? Mitigating Background Bias in Wildlife Monitoring

Introduction Imagine you are training a computer vision model to recognize a chimpanzee climbing a tree. You feed it thousands of hours of video footage. The model achieves high accuracy, and you are thrilled. But then, you test it on a video of an empty forest with no chimpanzee in sight, and the model confidently predicts: “Climbing.” Why does this happen? The model has fallen into a trap known as shortcut learning. Instead of learning the complex motion of the limbs or the texture of the fur, the model took the path of least resistance: it learned that “vertical tree trunks” usually equal “climbing.” It memorized the background, not the behavior. ...

2025-02 · 9 min · 1735 words
[TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion 🔗](https://arxiv.org/abs/2504.11773)

TacoDepth: Breaking the Speed Limit in Radar-Camera Depth Estimation

In the rapidly evolving world of autonomous driving and robotics, perception is everything. Vehicles need to know not just what is around them, but exactly how far away it is. While LiDAR sensors provide excellent depth data, they are expensive. A more cost-effective alternative is fusing data from cameras (rich visual detail) and mmWave Radar (reliable depth and velocity). However, Radar-Camera fusion has a major bottleneck: efficiency. Existing methods are often slow and computationally heavy, relying on complex, multi-stage processes that act like stumbling blocks for real-time applications. ...

2025-04 · 8 min · 1658 words
[Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models 🔗](https://arxiv.org/abs/2501.01423)

Breaking the Trade-off: How Aligning VAEs with Foundation Models Supercharges Diffusion Training

Introduction In the rapidly evolving world of generative AI, Latent Diffusion Models (LDMs) like Stable Diffusion and Sora have become the gold standard for creating high-fidelity images and videos. These models work their magic not by operating on pixels directly, but by working in a compressed “latent space.” This compression is handled by a component called a Visual Tokenizer, typically a Variational Autoencoder (VAE). For a long time, the assumption was simple: if we want better images, we need better tokenizers. Specifically, we assumed that increasing the capacity (dimensionality) of the tokenizer would allow it to capture more details, which would, in turn, allow the diffusion model to generate more realistic images. ...

2025-01 · 10 min · 1955 words
[Navigation World Models 🔗](https://arxiv.org/abs/2412.03572)

Can Robots Dream of Walking? Understanding Navigation World Models

Introduction How do you navigate a crowded room to reach the exit? You likely don’t just stare at your feet and react to obstacles the moment they touch your toes. Instead, you project a mental simulation. You imagine a path, predict that a person might step in your way, and adjust your trajectory before you even take a step. You possess an internal model of the world that allows you to simulate the future. ...

2024-12 · 9 min · 1736 words
[Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models 🔗](https://arxiv.org/abs/2409.17146)

Breaking the Cycle of Distillation: How Molmo Builds State-of-the-Art VLMs from Scratch

Introduction In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) have become ubiquitous. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet can describe complex images, interpret charts, and answer questions about the visual world with startling accuracy. However, these proprietary models are “walled gardens.” We interact with them via APIs, but we don’t know exactly how they were built or what data they were trained on. ...

2024-09 · 9 min · 1712 words
[MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos 🔗](https://arxiv.org/abs/2412.04463)

Taming the Chaos: How MegaSaM Solves 3D Reconstruction for Casual, Dynamic Videos

Imagine you are holding your smartphone, recording a video of your friend running down a beach or a car racing around a track. To you, the scene is clear. But to a computer trying to reconstruct that scene in 3D, it is a nightmare. For decades, Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) algorithms have relied on two golden rules: the scene must be static (rigid), and the camera must move enough to create parallax (the effect where close objects move faster than far ones). Casual videos break both rules constantly. We have moving objects, we rotate cameras without moving our feet, and we film scenes where “dynamic” elements (like people or cars) dominate the view. ...

2024-12 · 9 min · 1787 words
[FoundationStereo: Zero-Shot Stereo Matching 🔗](https://arxiv.org/abs/2501.09898)

FoundationStereo: Bringing Zero-Shot Generalization to Stereo Depth Estimation

In the rapid evolution of computer vision, we have seen “Foundation Models” transform how machines understand images. Models like Segment Anything (SAM) or DepthAnything have demonstrated an incredible ability to generalize: they can perform tasks on images they have never seen before without needing specific fine-tuning. However, one corner of computer vision has lagged behind in this zero-shot revolution: Stereo Matching. Stereo matching—the process of estimating depth by comparing two images taken from slightly different viewpoints—has historically relied on training deep networks on specific datasets. A model trained on driving scenes (like KITTI) usually fails when tested on indoor scenes (like Middlebury). It’s a classic case of overfitting to the domain. ...

2025-01 · 9 min · 1750 words
[Descriptor-In-Pixel: Point-Feature Tracking for Pixel Processor Arrays 🔗](https://openaccess.thecvf.com/content/CVPR2025/papers/Bose_Descriptor-In-Pixel__Point-Feature_Tracking_For_Pixel_Processor_Arrays_CVPR_2025_paper.pdf)

Smart Sensors: How Computing Inside the Pixel Enables 3000 FPS Feature Tracking

Computer vision has a bottleneck problem. In a traditional setup—whether it’s a smartphone, a VR headset, or a drone—the camera sensor acts as a “dumb” bucket. It captures millions of photons, converts them to digital values, and then sends a massive stream of raw data to an external processor (CPU or GPU) to figure out what it’s looking at. ...

7 min · 1380 words
[DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models 🔗](https://arxiv.org/abs/2503.01774)

Cleaning Up the Mess: How Single-Step Diffusion is Revolutionizing 3D Reconstruction

Introduction We are currently witnessing a golden age of neural rendering. Technologies like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have allowed us to turn a handful of 2D photographs into immersive, navigable 3D scenes. The results are often breathtaking—until you stray too far from the original camera path. As soon as you move the virtual camera to a “novel view”—an angle not seen during training—the illusion often breaks. You encounter “floaters” (spurious geometry hanging in the air), blurry textures, and ghostly artifacts. This happens because these regions are underconstrained; the 3D model simply doesn’t have enough data to know what should be there, so it guesses, often poorly. ...

2025-03 · 8 min · 1603 words
[Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World 🔗](https://arxiv.org/abs/2505.04788)

Solved via Relaxation - A New Global Approach to Vanishing Point Estimation

If you look down a long, straight hallway or stare at a skyscraper from the street, you intuitively understand perspective. Parallel lines in the real world—like the edges of a ceiling or the sides of a building—appear to converge at a specific spot in the distance. In Computer Vision, these are called Vanishing Points (VPs). Locating these points is crucial for tasks like camera calibration, 3D reconstruction, and autonomous navigation. In structured environments (like cities or indoors), we often rely on the Manhattan World assumption, which posits that the world is built on three mutually orthogonal axes (up-down, left-right, forward-backward). ...

2025-05 · 7 min · 1336 words
[3D Student Splatting and Scooping 🔗](https://arxiv.org/abs/2503.10148)

Beyond Gaussians: How Student's t-Distribution and Negative Density Revolutionize Neural Rendering

Introduction In the rapidly evolving world of computer graphics and computer vision, few techniques have made as much noise recently as 3D Gaussian Splatting (3DGS). It offered a brilliant alternative to Neural Radiance Fields (NeRFs), allowing for real-time rendering of complex scenes by representing them as millions of 3D Gaussian ellipsoids. It was fast, high-quality, and explicit. But as with any foundational technology, once the dust settled, researchers began to ask: Is the Gaussian distribution actually the best primitive for the job? ...

2025-03 · 10 min · 2017 words
[wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations 🔗](https://arxiv.org/abs/2006.11477)

Hearing the Unheard: How wav2vec 2.0 Revolutionized Speech Recognition with Self-Supervised Learning

Introduction: The Data Bottleneck In the world of deep learning, data is fuel. For years, the engine of Automatic Speech Recognition (ASR) has been fueled by thousands of hours of transcribed audio—humans painstakingly listening to recordings and typing out every word. While this supervised approach has yielded systems like Siri and Alexa, it has a fundamental flaw: it doesn’t scale. There are approximately 7,000 languages spoken worldwide. For the vast majority of them, collecting thousands of hours of transcribed audio is impossible. Even for major languages, the reliance on labeled data is inefficient. Consider how a human infant learns language. They don’t start by reading transcripts; they start by listening. They learn the structure of speech—the rhythm, the phonemes, the intonation—long before they attach meaning to words. ...

2020-06 · 10 min · 1939 words
[wav2vec: Unsupervised Pre-training for Speech Recognition 🔗](https://arxiv.org/abs/1904.05862)

wav2vec Explained: How Self-Supervised Learning Revolutionized Speech ASR

The field of Automatic Speech Recognition (ASR) has long struggled with a “data hunger” problem. To build a system that understands human speech effectively—like Siri or Alexa—you historically needed thousands of hours of audio that had been painstakingly transcribed by humans. This labeled data is expensive, slow to produce, and often unavailable for low-resource languages. Meanwhile, in the world of Natural Language Processing (NLP), models like BERT were shattering records by reading massive amounts of unlabeled text to learn the structure of language before ever seeing a specific task. ...

2019-04 · 8 min · 1685 words
[Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards 🔗](https://arxiv.org/abs/2501.07493)

Is Chatbot Arena Broken? How Adversaries Can Game LLM Leaderboards

Introduction In the rapidly evolving world of Artificial Intelligence, keeping score is hard. Traditional benchmarks—static lists of questions like the SATs or coding problems—are quickly becoming obsolete. Large Language Models (LLMs) are simply getting too smart for them, or worse, they have memorized the answers from their training data. To solve this, the AI community has turned to the “wisdom of the crowd.” Platforms like Chatbot Arena have become the gold standard for evaluating model performance. The premise is simple and elegant: pit two anonymous models against each other, have a human ask a question, and let the human vote on which answer is better. It feels fair, unbiased, and representative of real-world usage. ...

2025-01 · 10 min · 1967 words
[Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination 🔗](https://arxiv.org/abs/2504.12714)

Beyond Self-Play: How Changing Environments Teaches AI to Cooperate with Anyone

Imagine you are a chef who has perfected a specific soup recipe with your sous-chef. You know exactly when they will chop the onions, and they know exactly when you will stir the broth. You move like a well-oiled machine. Now, imagine you step into a stranger’s kitchen. The layout is different, the stove is in a weird spot, and your new partner chops vegetables at a completely different pace. ...

2025-04 · 8 min · 1687 words
[Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation 🔗](https://arxiv.org/abs/2503.01776)

Sparse is the New Dense: How CSR Beats Matryoshka for Adaptive Embeddings

In the era of Retrieval-Augmented Generation (RAG) and massive vector databases, the quality and efficiency of embeddings—those numerical vector representations of data—are paramount. We want embeddings that are rich in semantic meaning but also lightweight enough to search through millions of records in milliseconds. For a long time, the industry standard has been dense representations. Recently, Matryoshka Representation Learning (MRL) gained popularity (even being adopted by OpenAI) for its ability to create “nested” embeddings that can be truncated to different lengths. However, MRL comes with a heavy price: it requires expensive full-model retraining and often suffers from significant accuracy drops when the vectors are shortened. ...

2025-03 · 8 min · 1687 words
[VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models 🔗](https://arxiv.org/abs/2502.02492)

Teaching AI to Move: How VideoJAM Solves the Motion Problem in Generative Video

Introduction The field of generative AI has moved at a breakneck pace. We have gone from blurry, postage-stamp-sized GIFs to high-definition, cinematic video generation in a matter of years. Models like Sora, Kling, and Gen-3 can render lighting, textures, and compositions that are nearly indistinguishable from reality. However, there is a catch. While these models have mastered appearance (what things look like), they often fail spectacularly at motion (how things move). ...

2025-02 · 9 min · 1770 words