[D4RL: DATASETS FOR DEEP DATA-DRIVEN REINFORCEMENT LEARNING 🔗](https://arxiv.org/abs/2004.07219)

Beyond Online Training: Introducing D4RL for Real-World Offline Reinforcement Learning

The past decade has shown us the incredible power of large datasets. From ImageNet fueling the computer vision revolution to massive text corpora enabling models like GPT, it’s clear: data is the lifeblood of modern machine learning. Yet one of the most exciting fields—Reinforcement Learning (RL)—has largely been excluded from this data-driven paradigm. Traditionally, RL agents learn through active, online interaction with an environment—playing games, controlling robots, simulating trades—building policies through trial and error. This approach is powerful but often impractical, expensive, or dangerous in real-world contexts. We can’t let a self-driving car “explore” by crashing thousands of times or experiment recklessly in healthcare. ...
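
To make the offline setting concrete, here is a minimal sketch of the benchmark's usage pattern: a standard Gym task paired with a fixed dataset of logged transitions (the task name shown is one of D4RL's standard MuJoCo benchmarks).

```python
import gym
import d4rl  # registers the D4RL offline environments with Gym

# Each benchmark task bundles an environment with a static dataset;
# no online interaction is needed to obtain the data.
env = gym.make('halfcheetah-medium-v2')
data = env.get_dataset()  # dict of numpy arrays

print(data['observations'].shape)  # (N, obs_dim)
print(data['actions'].shape)       # (N, act_dim)
print(data['rewards'].shape)       # (N,)

# Convenience view with aligned next_observations for TD-style algorithms.
transitions = d4rl.qlearning_dataset(env)
```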

2020-04
[Conservative Q-Learning for Offline Reinforcement Learning 🔗](https://arxiv.org/abs/2006.04779)

Learning from the Past: How Conservative Q-Learning Unlocks Offline Reinforcement Learning

Imagine training a robot to cook a meal. The traditional approach in Reinforcement Learning (RL) is trial and error. The robot might try picking up an egg — sometimes succeeding, sometimes dropping it and making a mess. After thousands of attempts, it eventually learns. But what if we already have a massive dataset of a human chef cooking? Could the robot learn just by watching, without ever cracking an egg itself? ...
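
As a rough illustration of how CQL makes "learning by watching" safe, here is a simplified sketch of its objective: the usual Bellman error plus a conservative penalty that pushes Q-values down on arbitrary (likely out-of-distribution) actions and up on actions actually seen in the dataset. This omits the importance-sampling and policy-action terms of the full CQL(H) loss, and `q_net` is an assumed critic returning one Q-value per state-action pair.

```python
import torch

def cql_loss(q_net, obs, actions, bellman_target, num_random=10, alpha=1.0):
    # Standard TD regression toward the Bellman target on dataset transitions.
    q_data = q_net(obs, actions)                       # (batch,)
    bellman_error = ((q_data - bellman_target) ** 2).mean()

    # Conservative term: evaluate Q on random actions the dataset never took,
    # and penalize them relative to the dataset actions.
    batch, act_dim = actions.shape
    rand_actions = torch.empty(batch * num_random, act_dim).uniform_(-1.0, 1.0)
    obs_rep = obs.repeat_interleave(num_random, dim=0)
    q_rand = q_net(obs_rep, rand_actions).reshape(batch, num_random)

    penalty = torch.logsumexp(q_rand, dim=1).mean() - q_data.mean()
    return bellman_error + alpha * penalty
```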

2020-06
[NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review 🔗](https://arxiv.org/abs/2210.00379)

NeRF, Gaussian Splatting, and Beyond: A Guided Tour of Neural Radiance Fields

In March 2020, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” introduced a deceptively simple idea that reshaped how we think about 3D scene representation. From a set of posed 2D photos, a compact neural network could learn a continuous, view-consistent model of scene appearance and geometry, then synthesize photorealistic novel views. Over the next five years, NeRF inspired a torrent of follow-up work: faster training, better geometry, robust sparse-view methods, generative 3D synthesis, and application-focused systems for urban scenes, human avatars, and SLAM. ...
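
To ground the "deceptively simple idea", here is a compact sketch of NeRF's volume rendering for a single ray, assuming `nerf_mlp` is the learned network mapping 3D points and view directions to RGB color and density.

```python
import torch

def render_ray(nerf_mlp, origin, direction, near=2.0, far=6.0, n_samples=64):
    # Sample points along the ray between the near and far bounds.
    t = torch.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction           # (n_samples, 3)
    dirs = direction.expand(n_samples, 3)

    rgb, sigma = nerf_mlp(points, dirs)                # (n, 3), (n,)

    # Alpha compositing: each sample contributes according to its density
    # and how much light survives the samples in front of it.
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans

    return (weights[:, None] * rgb).sum(dim=0)         # composited RGB
```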

2022-10
[Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction 🔗](https://arxiv.org/abs/2302.07817)

TPVFormer: Reconstructing a 3D World from 2D Snapshots with Tri-Perspective View

For an autonomous vehicle to navigate our chaotic world, it needs more than just GPS and rules—it must see and understand its surroundings in rich, 3D detail. Beyond detecting cars and pedestrians, it should recognize the space they occupy, the terrain’s contours, the location of sidewalks, and the canopy of trees overhead. This is the essence of 3D Semantic Occupancy Prediction: building a complete, labeled 3D map of the environment. ...
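
A toy sketch of the tri-perspective-view lookup at the heart of TPVFormer: instead of storing a dense voxel grid, a 3D point's feature is assembled from its projections onto three orthogonal 2D feature planes. Plane names, shapes, and the summation are illustrative simplifications of the paper's formulation.

```python
import torch
import torch.nn.functional as F

def tpv_feature(points, plane_hw, plane_dw, plane_hd):
    # points: (N, 3) coordinates normalized to [-1, 1].
    # plane_*: (1, C, res, res) learned feature planes.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    def sample(plane, u, v):
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
        return feat[0, :, :, 0].t()                            # (N, C)

    # Project the point onto the top (x-y), side (y-z), and front (x-z) views
    # and combine the three plane features into one descriptor per point.
    return sample(plane_hw, x, y) + sample(plane_dw, y, z) + sample(plane_hd, x, z)
```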

2023-02
[Structured 3D Latents for Scalable and Versatile 3D Generation 🔗](https://arxiv.org/abs/2412.01506)

TRELLIS: Weaving High-Quality 3D Worlds with a Unified Latent Structure

Figure 1: High-quality 3D assets generated by TRELLIS in various formats from text or image prompts, demonstrating versatile generation: vivid appearances with 3D Gaussians or Radiance Fields, detailed geometries with meshes, and flexible editing.

The world of AI-generated content has been dominated by stunning 2D imagery. Models like DALL-E and Midjourney can conjure photorealistic scenes and fantastical art from a simple text prompt. But what about the third dimension? ...

2024-12
[Grounding Image Matching in 3D with MASt3R 🔗](https://arxiv.org/abs/2406.09756)

Beyond Pixels: How MASt3R Grounds 2D Image Matching in 3D Reality

Figure 1: MASt3R predicts dense pixel correspondences even under extreme viewpoint shifts, enabling precise camera calibration, pose estimation, and 3D reconstruction.

Image matching is one of the unsung heroes of computer vision. It’s the fundamental building block behind a huge range of applications—from the photogrammetry used to create 3D models in movies and video games, to navigation systems in self-driving cars and robots. The task sounds simple: given two images of the same scene, figure out which pixels in one correspond to which pixels in the other. ...
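
As a concrete example of what good correspondences buy you downstream, here is a sketch (using OpenCV's standard two-view geometry routines, not MASt3R's own code) of recovering relative camera pose from matched pixels such as those a matcher like MASt3R would produce.

```python
import cv2
import numpy as np

def pose_from_matches(pts1, pts2, K):
    # pts1, pts2: (N, 2) matched pixel coordinates in the two images.
    # K: (3, 3) camera intrinsics matrix.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    # Rotation and unit-scale translation of camera 2 relative to camera 1.
    return R, t
```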

2024-06
[LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation 🔗](https://arxiv.org/abs/2402.05054)

LGM: Creating High-Resolution 3D Models in 5 Seconds with Gaussian Splatting

Creating 3D content for games, virtual reality, and films has traditionally been a labor-intensive process, requiring skilled artists and hours of meticulous work. But what if you could generate a highly detailed 3D model from a single image or a line of text in mere seconds? This is the promise of generative AI for 3D — a rapidly evolving field that’s seen explosive growth. Early methods were revolutionary but slow, often taking minutes or even hours to optimize a single 3D asset. More recent feed-forward models brought generation time down to seconds, but at a cost: lower resolution and less geometric detail. The core challenge has been balancing speed and quality. Can we have both? ...
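
For readers unfamiliar with the output format, here is an illustrative sketch of the parameters of a single 3D Gaussian primitive, the representation LGM predicts in one feed-forward pass; the fields are simplified, and real implementations typically store spherical-harmonic color coefficients rather than plain RGB.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    position: np.ndarray  # (3,) center in world space
    scale: np.ndarray     # (3,) per-axis extent
    rotation: np.ndarray  # (4,) unit quaternion orientation
    opacity: float        # in [0, 1]
    color: np.ndarray     # (3,) RGB

# A generated asset is simply a large collection of such primitives,
# rasterized ("splatted") onto the image plane for fast rendering.
```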

2024-02
[DUSt3R: Geometric 3D Vision Made Easy 🔗](https://arxiv.org/abs/2312.14132)

How DUSt3R is Redefining 3D Reconstruction — No Camera Info Required

From Photos to 3D Models: A Simpler Path Forward

Creating a detailed 3D model from a collection of regular photos has long been considered one of the ultimate goals in computer vision. For decades, the standard approach has been a complex, multi-stage pipeline: first Structure-from-Motion (SfM) to estimate camera parameters and sparse geometry, followed by Multi-View Stereo (MVS) to produce dense surface models. This traditional pipeline is a monumental achievement, underpinning applications from Google Maps’ 3D view to cultural heritage preservation, robotics navigation, and autonomous driving. But it’s also fragile — each step depends on the success of the previous one, and an error at any point can cascade through the pipeline, causing the final reconstruction to fail. Calibration must be precise, the number of views sufficient, motion variations adequate, and surfaces well-textured — otherwise, reconstructions can collapse. ...
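
A conceptual sketch of the interface DUSt3R replaces that pipeline with (the `dust3r_net` callable and its outputs are hypothetical names for illustration): the network directly regresses a "pointmap" per image, i.e. a 3D point for every pixel expressed in the first camera's frame, plus a confidence map, with no intrinsics or poses required as input.

```python
import numpy as np

def reconstruct(img1, img2, dust3r_net):
    # Regress per-pixel 3D points for both images, both expressed
    # in the coordinate frame of the first camera.
    pointmap1, pointmap2, conf1, conf2 = dust3r_net(img1, img2)

    # Because the pointmaps share one frame, a joint point cloud falls out
    # directly; camera parameters can be recovered afterwards (e.g., focal
    # length by fitting a pinhole model to pointmap1).
    points = np.concatenate([pointmap1.reshape(-1, 3),
                             pointmap2.reshape(-1, 3)], axis=0)
    confidence = np.concatenate([conf1.reshape(-1), conf2.reshape(-1)])
    return points[confidence > 1.5]  # illustrative confidence threshold
```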

2023-12
[Zero-1-to-3: Zero-shot One Image to 3D Object 🔗](https://arxiv.org/abs/2303.11328)

Zero-1-to-3: How AI Can Imagine a 3D Object from a Single Photo

When you look at a photograph of a car, you don’t just see a flat, two-dimensional collection of pixels. Your brain, drawing on a lifetime of experience, instantly constructs a mental model of a three-dimensional object. You can effortlessly imagine what that car looks like from the side, from the back, or from above, even if you’ve never seen that specific model before. This ability to infer 3D structure from a single 2D view is a cornerstone of human perception. For artificial intelligence, however, it’s a monumental challenge. Traditionally, creating 3D models from images required multiple photographs from different angles, specialized depth-sensing cameras, or vast, expensive datasets of 3D models for training. These methods are powerful but limited; they don’t scale well and often fail on objects they haven’t been explicitly trained on. ...

2023-03
[A COMPREHENSIVE REVIEW OF YOLO ARCHITECTURES IN COMPUTER VISION: FROM YOLOV1 TO YOLOV8 AND YOLO-NAS 🔗](https://arxiv.org/abs/2304.00501)

From v1 to v8 and Beyond: The Complete Story of YOLO

In the world of computer vision, few algorithms have made an impact as significant and lasting as YOLO (You Only Look Once). From enabling self-driving cars to perceive the world around them to powering automated checkout systems, real-time object detection has become a cornerstone of modern AI. At the heart of this revolution is YOLO—a family of models celebrated for their incredible balance of speed and accuracy. Since its debut in 2015, YOLO has undergone an extraordinary evolution. Each new version has pushed the boundaries of what’s possible, introducing clever architectural changes and novel training techniques. This article will take you on a comprehensive journey through the entire history of YOLO, from the groundbreaking original all the way to the latest state-of-the-art versions like YOLOv8 and the AI-designed YOLO-NAS. ...
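
Not from the survey itself, but the easiest way to experience the speed-accuracy balance it describes is the ultralytics package, which wraps recent YOLO models behind a few lines of Python; the image path here is a placeholder.

```python
# pip install ultralytics
from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model (weights download on first use).
model = YOLO("yolov8n.pt")

# Run real-time-capable detection on a single image.
results = model("street.jpg")
for box in results[0].boxes:
    cls_id = int(box.cls)                  # predicted class index
    conf = float(box.conf)                 # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # box corner coordinates
    print(model.names[cls_id], f"{conf:.2f}", (x1, y1, x2, y2))
```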

2023-04