[D4RL: DATASETS FOR DEEP DATA-DRIVEN REINFORCEMENT LEARNING 🔗](https://arxiv.org/abs/2004.07219)

Beyond Online Training: Introducing D4RL for Real-World Offline Reinforcement Learning

The past decade has shown us the incredible power of large datasets. From ImageNet fueling the computer vision revolution to massive text corpora enabling models like GPT, it’s clear: data is the lifeblood of modern machine learning. Yet one of the most exciting fields—Reinforcement Learning (RL)—has largely been excluded from this data-driven paradigm. Traditionally, RL agents learn through active, online interaction with an environment—playing games, controlling robots, simulating trades—building policies through trial and error. This approach is powerful but often impractical, expensive, or dangerous in real-world contexts. We can’t let a self-driving car “explore” by crashing thousands of times, nor can we experiment recklessly in healthcare. ...

2020-04 · 7 min · 1291 words
[Conservative Q-Learning for Offline Reinforcement Learning 🔗](https://arxiv.org/abs/2006.04779)

Learning from the Past: How Conservative Q-Learning Unlocks Offline Reinforcement Learning

Imagine training a robot to cook a meal. The traditional approach in Reinforcement Learning (RL) is trial and error. The robot might try picking up an egg — sometimes succeeding, sometimes dropping it and making a mess. After thousands of attempts, it eventually learns. But what if we already have a massive dataset of a human chef cooking? Could the robot learn just by watching, without ever cracking an egg itself? ...

2020-06 · 6 min · 1185 words
[NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review 🔗](https://arxiv.org/abs/2210.00379)

NeRF, Gaussian Splatting, and Beyond: A Guided Tour of Neural Radiance Fields

In March 2020, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” introduced a deceptively simple idea that reshaped how we think about 3D scene representation. From a set of posed 2D photos, a compact neural network could learn a continuous, view-consistent model of scene appearance and geometry, then synthesize photorealistic novel views. Over the next five years NeRF inspired a torrent of follow-up work: faster training, better geometry, robust sparse-view methods, generative 3D synthesis, and application-focused systems for urban scenes, human avatars, and SLAM. ...
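For readers new to the formulation, here is a minimal NumPy sketch (toy shapes, not the review's code) of the idea underneath all of this: a learned field returns a color and density per 3D sample, and volume rendering composites the samples along each camera ray.

```python
import numpy as np

def volume_render(colors, sigmas, deltas):
    """Composite per-sample (color, density) along one ray using the
    standard NeRF quadrature. colors: (N, 3); sigmas, deltas: (N,)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity of each segment
    trans = np.cumprod(1.0 - alphas + 1e-10)        # accumulated transmittance
    trans = np.concatenate([[1.0], trans[:-1]])     # T_i = product over j < i
    weights = trans * alphas                        # contribution of sample i
    return (weights[:, None] * colors).sum(axis=0)  # expected ray color

# Toy usage: 64 samples along one ray with random stand-in field outputs.
rng = np.random.default_rng(0)
pixel = volume_render(rng.uniform(size=(64, 3)),    # per-sample RGB
                      rng.uniform(size=64),         # per-sample density
                      np.full(64, 0.03))            # sample spacing
```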

2022-10 · 12 min · 2455 words
[Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction 🔗](https://arxiv.org/abs/2302.07817)

TPVFormer: Reconstructing a 3D World from 2D Snapshots with Tri-Perspective View

For an autonomous vehicle to navigate our chaotic world, it needs more than just GPS and rules—it must see and understand its surroundings in rich, 3D detail. Beyond detecting cars and pedestrians, it should recognize the space they occupy, the terrain’s contours, the location of sidewalks, and the canopy of trees overhead. This is the essence of 3D Semantic Occupancy Prediction: building a complete, labeled 3D map of the environment. ...
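To make the output format concrete, here is an illustrative sketch (toy grid size and class set, not the paper's exact configuration) of what a semantic occupancy prediction is as a data structure:

```python
import numpy as np

# Illustrative semantic occupancy grid: one class ID per voxel around the
# ego vehicle. Grid resolution and class list here are toy assumptions.
EMPTY, ROAD, SIDEWALK, CAR, VEGETATION = range(5)
occupancy = np.full((200, 200, 16), EMPTY, dtype=np.uint8)  # (x, y, z) voxels

occupancy[:, :, 0] = ROAD                # label the ground plane as road
occupancy[90:110, 120:140, 0:3] = CAR    # a box of voxels occupied by a car
occupancy[0:20, :, 0] = SIDEWALK         # a strip of sidewalk at one edge
```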

2023-02 · 6 min · 1252 words
[Structured 3D Latents for Scalable and Versatile 3D Generation 🔗](https://arxiv.org/abs/2412.01506)

TRELLIS: Weaving High-Quality 3D Worlds with a Unified Latent Structure

Figure 1: High-quality 3D assets generated by TRELLIS in various formats from text or image prompts, demonstrating versatile generation, vivid appearances with 3D Gaussians or Radiance Fields, detailed geometries with meshes, and flexible editing.

The world of AI-generated content has been dominated by stunning 2D imagery. Models like DALL-E and Midjourney can conjure photorealistic scenes and fantastical art from a simple text prompt. But what about the third dimension? ...

2024-12 · 6 min · 1110 words
[Grounding Image Matching in 3D with MASt3R 🔗](https://arxiv.org/abs/2406.09756)

Beyond Pixels: How MASt3R Grounds 2D Image Matching in 3D Reality

Figure 1: MASt3R predicts dense pixel correspondences even under extreme viewpoint shifts, enabling precise camera calibration, pose estimation, and 3D reconstruction.

Image matching is one of the unsung heroes of computer vision. It’s the fundamental building block behind a huge range of applications—from the photogrammetry used to create 3D models in movies and video games, to navigation systems in self-driving cars and robots. The task sounds simple: given two images of the same scene, figure out which pixels in one correspond to which pixels in the other. ...
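As a minimal picture of what “matching” means computationally (generic mutual nearest-neighbor matching over per-pixel descriptors, not MASt3R's actual matching scheme), consider:

```python
import numpy as np

def mutual_nn_matches(feat_a, feat_b):
    """feat_a: (N, D) and feat_b: (M, D) per-pixel descriptors, assumed
    L2-normalized. Returns (K, 2) index pairs that are mutual nearest
    neighbors, a common way to keep only reliable correspondences."""
    sim = feat_a @ feat_b.T        # pairwise cosine similarity
    nn_ab = sim.argmax(axis=1)     # best match in B for each pixel of A
    nn_ba = sim.argmax(axis=0)     # best match in A for each pixel of B
    idx = np.arange(len(feat_a))
    keep = nn_ba[nn_ab] == idx     # keep only reciprocal agreements
    return np.stack([idx[keep], nn_ab[keep]], axis=1)
```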

2024-06 · 7 min · 1372 words
[LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation 🔗](https://arxiv.org/abs/2402.05054)

LGM: Creating High-Resolution 3D Models in 5 Seconds with Gaussian Splatting

Creating 3D content for games, virtual reality, and films has traditionally been a labor-intensive process, requiring skilled artists and hours of meticulous work. But what if you could generate a highly detailed 3D model from a single image or a line of text in mere seconds? This is the promise of generative AI for 3D — a rapidly evolving field that’s seen explosive growth. Early methods were revolutionary but slow, often taking minutes or even hours to optimize a single 3D asset. More recent feed-forward models brought generation time down to seconds, but at a cost: lower resolution and less geometric detail. The core challenge has been balancing speed and quality. Can we have both? ...

2024-02 · 6 min · 1174 words
[DUSt3R: Geometric 3D Vision Made Easy 🔗](https://arxiv.org/abs/2312.14132)

How DUSt3R is Redefining 3D Reconstruction — No Camera Info Required

From Photos to 3D Models: A Simpler Path Forward

Creating a detailed 3D model from a collection of regular photos has long been considered one of the ultimate goals in computer vision. For decades, the standard approach has been a complex, multi-stage pipeline: first Structure-from-Motion (SfM) to estimate camera parameters and sparse geometry, followed by Multi-View Stereo (MVS) to produce dense surface models. This traditional pipeline is a monumental achievement, underpinning applications from Google Maps’ 3D view to cultural heritage preservation, robotics navigation, and autonomous driving. But it’s also fragile — each step depends on the success of the previous one, and an error at any point can cascade through the pipeline, causing the final reconstruction to fail. Calibration must be precise, the number of views sufficient, motion variations adequate, and surfaces well-textured — otherwise, reconstructions can collapse. ...

2023-12 · 7 min · 1306 words
[Zero-1-to-3: Zero-shot One Image to 3D Object 🔗](https://arxiv.org/abs/2303.11328)

Zero-1-to-3: How AI Can Imagine a 3D Object from a Single Photo

When you look at a photograph of a car, you don’t just see a flat, two-dimensional collection of pixels. Your brain, drawing on a lifetime of experience, instantly constructs a mental model of a three-dimensional object. You can effortlessly imagine what that car looks like from the side, from the back, or from above, even if you’ve never seen that specific model before. This ability to infer 3D structure from a single 2D view is a cornerstone of human perception. For artificial intelligence, however, it’s a monumental challenge. Traditionally, creating 3D models from images required multiple photographs from different angles, specialized depth-sensing cameras, or vast, expensive datasets of 3D models for training. These methods are powerful but limited; they don’t scale well and often fail on objects they haven’t been explicitly trained on. ...

2023-03 · 7 min · 1284 words
[A COMPREHENSIVE REVIEW OF YOLO ARCHITECTURES IN COMPUTER VISION: FROM YOLOV1 TO YOLOV8 AND YOLO-NAS 🔗](https://arxiv.org/abs/2304.00501)

From v1 to v8 and Beyond: The Complete Story of YOLO

In the world of computer vision, few algorithms have made an impact as significant and lasting as YOLO (You Only Look Once). From enabling self-driving cars to perceive the world around them to powering automated checkout systems, real-time object detection has become a cornerstone of modern AI. At the heart of this revolution is YOLO—a family of models celebrated for their incredible balance of speed and accuracy. Since its debut in 2015, YOLO has undergone an extraordinary evolution. Each new version has pushed the boundaries of what’s possible, introducing clever architectural changes and novel training techniques. This article will take you on a comprehensive journey through the entire history of YOLO, from the groundbreaking original all the way to the latest state-of-the-art versions like YOLOv8 and the AI-designed YOLO-NAS. ...

2023-04 · 7 min · 1477 words
[Efficient Multi-modal Large Language Models via Progressive Consistency Distillation 🔗](https://arxiv.org/abs/2510.00515)

The Tortoise and the Hare of AI: How Gradual Learning Makes Visual AI Faster

Multi-modal Large Language Models (MLLMs) are reshaping how we interact with AI. Models like LLaVA can look at an image and hold a conversation about it—combining the seeing ability of computer vision with the reasoning power of large language models (LLMs). They’re like high-performance sports cars: incredible on the track, but they burn through fuel—in this case, computational resources—at a staggering rate. The main fuel drain? The sheer number of visual tokens. While a text prompt might be dozens of tokens, a single image is often broken into hundreds of them, and high-resolution images or multi-frame videos can explode this count further. This data flood creates a computational bottleneck—slowing inference speed and hogging memory. ...
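The numbers behind that bottleneck are easy to reproduce. A quick back-of-the-envelope calculation, assuming a ViT-style encoder with 14×14-pixel patches (the setup used by LLaVA-class models):

```python
# Visual tokens scale with image area: one token per 14x14 patch (assumed).
patch = 14
image_tokens = (336 // patch) ** 2       # one 336x336 image -> 576 tokens
video_tokens = 8 * (224 // patch) ** 2   # eight 224x224 frames -> 2048 tokens
print(image_tokens, video_tokens)        # versus a few dozen for a text prompt
```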

2025-10 · 6 min · 1157 words
[Large Reasoning Models Learn Better Alignment from Flawed Thinking 🔗](https://arxiv.org/abs/2510.00938)

RECAP: Teaching AI to Think Critically by Showing It Flawed Reasoning

Large Language Models (LLMs) are becoming increasingly powerful, particularly a new class called Large Reasoning Models (LRMs). These models don’t just spit out an answer—they think by generating a step-by-step chain of thought (CoT) before coming to a conclusion. This reflective reasoning lets them tackle complex problems in math, coding, and beyond with remarkable results. But there’s a crack in the armor. Recent research has revealed that these sophisticated reasoning abilities are surprisingly brittle. A model can be nudged toward generating harmful content simply by giving its thought process a flawed starting point—this is called CoT prefilling. For example, starting the model’s chain of thought with a phrase like “I know how to do it. First…” can be enough to bypass safety training, leading to unsafe outputs. This raises a critical question: Do these models truly understand safety principles, or are they just skilled at following any reasoning path they’re given—whether good or bad? ...
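To make the failure mode concrete, here is a schematic sketch of CoT prefilling (plain strings only, no real model or API; the unsafe request is left as a placeholder):

```python
# Schematic illustration of CoT prefilling; not tied to any real API.
harmful_request = "<unsafe request elided>"

# Normal decoding: the model opens its own chain of thought.
normal_prompt = f"User: {harmful_request}\nAssistant (thinking):"

# Prefilled decoding: an attacker seeds the thought channel with a
# compliant-sounding opener, and the model continues from there.
prefill = "I know how to do it. First,"
attacked_prompt = f"User: {harmful_request}\nAssistant (thinking): {prefill}"

# A brittle model tends to continue the injected reasoning path rather
# than re-evaluate safety, which is the failure mode RECAP targets.
```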

2025-10 · 6 min · 1214 words
[Apriel-1.5-15B-Thinker: Mid-training is all you need 🔗](https://arxiv.org/abs/2510.01141)

Mid-Training is All You Need: How a 15B Model Reached the AI Frontier

In the world of artificial intelligence, there’s a constant arms race. Tech giants are building ever-larger models with hundreds of billions—or even trillions—of parameters, pushing the boundaries of what’s possible. But this relentless pursuit of scale comes at a cost—literally. These colossal models require immense computational power, making them expensive to train and deploy, and often locking them away behind proprietary APIs. This creates a fundamental tension: how can we achieve state-of-the-art AI reasoning without a state-of-the-art budget? Can a smaller, more accessible model compete with the giants? ...

2025-10 · 6 min · 1267 words
[THE DRAGON HATCHLING: THE MISSING LINK BETWEEN THE TRANSFORMER AND MODELS OF THE BRAIN 🔗](https://arxiv.org/abs/2509.26507)

The Dragon Hatchling: A New AI Architecture Bridging Transformers and the Brain

Transformers gave us the large language models that changed everything. They are powerful, trainable at scale, and extremely effective in practice. Yet they remain — at least partly — a mystery: dense tensors, layer-normalized stacks, and attention matrices are excellent engineering abstractions, but they don’t look much like the massively parallel, locally interacting network of neurons and synapses that is the human brain. The paper “THE DRAGON HATCHLING: THE MISSING LINK BETWEEN THE TRANSFORMER AND MODELS OF THE BRAIN” introduces a new family of architectures — BDH and its GPU-friendly variant BDH-GPU — that aim to bridge this gap. BDH is a graph-first, biologically inspired language-and-reasoning architecture whose GPU-friendly instantiation matches Transformer-like performance while sporting interpretable, local dynamics that look a lot like neurons and synapses. This post unpacks the core ideas, the intuition, and the key empirical findings so you can understand how BDH sits between tensors and biology. ...

2025-09 · 12 min · 2407 words

Unfolding Time: How a Simple Neural Network Learned the Rules of Language

How does the human mind handle time? It’s a question that feels both simple and impossibly complex. So much of what we do—from understanding a melody to catching a ball to having a conversation—depends on processing sequences of events as they unfold. Language, in particular, is a river of information flowing through time. The meaning of a sentence isn’t just in the words themselves, but in their order. “Dog bites man” is ordinary news; “Man bites dog” is a headline. ...

8 min · 1648 words
[Designing Network Design Strategies Through Gradient Path Analysis 🔗](https://arxiv.org/abs/2211.04800)

Rethinking Neural Network Design: A Deep Dive into Gradient Path Analysis

When designing deep neural networks, we usually focus on how data flows forward through the model. We stack layers, implement complex feature fusion mechanisms, and add attention modules to transform an input into the desired output. This traditional “data path” perspective has brought us powerful architectures like ResNet, DenseNet, and Transformers. But what if this forward-focused view is only half the story? What if the key to building more efficient and more powerful networks is to examine how information flows backward? ...

2022-11 · 7 min · 1344 words
[Finetuned Language Models Are Zero-Shot Learners 🔗](https://arxiv.org/abs/2109.01652)

Just Tell the Model What to Do: How Instruction Tuning Unlocks Zero-Shot Learning

Large language models (LLMs) have shown astonishing capabilities: writing code, composing essays, and answering complex questions. Much of that success rests on few-shot learning—showing a model a few examples in the prompt and letting it generalize. But few-shot prompting has drawbacks: you need examples, and you often must engineer the prompt carefully. What if we could simply tell a model, in plain English, what we want it to do—and have it do it well without any examples? That’s the core question of “Finetuned Language Models Are Zero-Shot Learners” (Google Research). The paper shows that a surprisingly simple trick—instruction tuning—turns large pretrained models into strong zero-shot learners. The instruction-tuned model, FLAN (Finetuned Language Net), improves zero-shot performance across many tasks and even beats GPT-3 (175B) zero-shot on most evaluated datasets. ...
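For a feel of what instruction tuning means mechanically, here is a small sketch (hypothetical templates, merely in the spirit of FLAN's per-task instruction templates) that rewrites one labeled NLI example into natural-language training strings:

```python
# Hypothetical instruction templates for an NLI example (illustrative only).
example = {
    "premise": "A dog is running through a field.",
    "hypothesis": "An animal is outside.",
    "label": "entailment",
}

templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer: {label}",
    "{premise}\nBased on the sentence above, is it true that "
    '"{hypothesis}"? Answer: {label}',
]

# Each labeled example becomes several instruction-phrased training strings;
# finetuning on many tasks this way teaches the model to follow instructions.
for t in templates:
    print(t.format(**example), end="\n\n")
```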

2021-09 · 10 min · 1924 words

[Language Models are Few-Shot Learners 🔗](https://arxiv.org/abs/2005.14165)

GPT-3: The Dawn of Few-Shot Learning

The Fine-Tuning Treadmill: A Problem of Scale

For years, the dominant paradigm in Natural Language Processing (NLP) has been a two-step dance. First, pre-train a massive, general-purpose language model on a vast ocean of text data. These models, such as BERT or RoBERTa, learn intricate patterns of language—grammar, facts, reasoning abilities, and even some biases. The second step is to take this powerful but general model and specialize it for a specific task through fine-tuning. ...

2020-05 · 6 min · 1185 words
[Evaluating Large Language Models Trained on Code 🔗](https://arxiv.org/abs/2107.03374)

Inside Codex: The AI Pair Programmer That Powers GitHub Copilot

For decades, the idea of an AI that could write its own code has been a holy grail of computer science. We’ve seen glimpses of this future in science fiction, but in reality, teaching a machine the logic, creativity, and precision required for programming has been an immense challenge. When large language models (LLMs) like GPT-3 emerged, they revealed a surprising, albeit rudimentary, ability to generate simple code snippets from natural language prompts — even though they weren’t explicitly trained to code. ...

2021-07 · 6 min · 1215 words
[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity 🔗](https://arxiv.org/abs/2101.03961)

The Switch Transformer: A Trillion-Parameter AI Model That's Surprisingly Efficient

In the world of AI—and especially in Natural Language Processing (NLP)—the mantra for the past few years has been “bigger is better.” We’ve seen a parade of colossal language models like GPT-3, T5, and Megatron, each pushing the boundaries of size and performance. Scaling these models has unlocked incredible capabilities, from writing coherent essays to generating code. But it comes at a steep price: astronomical computational costs. Training these massive dense models, where every parameter is used for every single input, requires supercomputers and consumes enormous amounts of energy. ...
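The core trick is easy to sketch. Below is a minimal NumPy illustration of top-1 “switch” routing (toy dimensions, hypothetical names): every token is sent to exactly one expert, so adding experts grows the parameter count while per-token compute stays roughly flat.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

router_w = rng.normal(size=(d_model, n_experts))   # learned router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

x = rng.normal(size=(n_tokens, d_model))           # token activations
logits = x @ router_w
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
choice = probs.argmax(-1)                          # top-1 expert per token

y = np.empty_like(x)
for i, e in enumerate(choice):
    # Only one expert runs per token; its output is scaled by the router
    # probability so the routing decision stays differentiable.
    y[i] = probs[i, e] * (x[i] @ experts[e])
```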

2021-01 · 7 min · 1476 words