[README++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment 🔗](https://arxiv.org/abs/2305.14463)

Breaking the Language Barrier: Inside README++, the New Standard for Multilingual Readability

Imagine you are learning a new language. You’ve mastered the basics, and you want to practice reading. You pick up a news article, but the vocabulary is too dense. You try a children’s book, but the grammar is surprisingly complex. This frustration is a common hurdle in second language acquisition, and it highlights a critical task in Natural Language Processing (NLP): Readability Assessment. Readability assessment is the automated process of determining how difficult a text is to comprehend. For decades, this field relied on simple statistics—counting syllables per word or words per sentence. Today, Large Language Models (LLMs) promise to revolutionize this by “reading” the text and understanding its semantic depth. ...
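
To make the "simple statistics" era concrete, here is a minimal sketch of the classic Flesch Reading Ease formula, which scores a text from surface counts alone. The vowel-group syllable counter is a rough heuristic added for illustration, not something from the paper.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of consecutive vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Classic formula: 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat. It was warm."))  # high score = easy
print(flesch_reading_ease("Comprehensive readability estimation necessitates multilingual corpora."))  # low score = hard
```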

2023-05 · 8 min · 1652 words
[Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding 🔗](https://arxiv.org/abs/2406.19263)

Tree-of-Lens: Teaching AI to Understand Screen Layouts Like Humans Do

Graphical User Interfaces (GUIs) are the visual language of the digital world. Whether you are scrolling through Instagram, organizing files on Windows, or shopping on a mobile app, you rely on a complex arrangement of icons, text, and spatial relationships to make sense of the screen. For human users, this process is intuitive. We see a “Checkout” button and immediately understand it belongs to the “Shopping Cart” panel because of its proximity and grouping. However, for Multimodal Large Language Models (MLLMs) and accessibility tools, this remains a significant challenge. While AI has become very good at describing an image in general terms, it still struggles with the specific task of Screen Point-and-Read (ScreenPR). ...

2024-06 · 7 min · 1370 words
[RECALL: Membership Inference via Relative Conditional Log-Likelihoods 🔗](https://arxiv.org/abs/2406.15968)

Detecting LLM Training Data: A Deep Dive into RECALL

Large Language Models (LLMs) like GPT-4 and Llama are trained on trillions of tokens sourced from the vast expanse of the internet. While this scale enables impressive capabilities, it also creates a “black box” problem. We rarely know exactly what data these models were trained on. This opacity raises serious questions: Did the model memorize copyrighted books? Does it contain Personally Identifiable Information (PII)? Has the test set for a benchmark been contaminated because the questions were included in the training data? ...

2024-06 · 9 min · 1805 words
[Re-Reading Improves Reasoning in Large Language Models 🔗](https://arxiv.org/abs/2309.06275)

Why LLMs Need to Read Twice: The Simple 'RE2' Strategy for Better Reasoning

When you encounter a particularly tricky math word problem or a convoluted logic puzzle, what is the first thing you do? If you are like most humans, you read it again. You scan the text, identify the core question, and then re-read the details to understand how they fit together. This simple cognitive strategy—re-reading—is fundamental to human comprehension. However, Large Language Models (LLMs) like GPT-4 or LLaMA typically don’t do this. They process text linearly, reading from left to right, token by token. Once they pass a word, they generally don’t “look back” in the same way a human does when reconsidering the context of a whole sentence. ...
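
The core trick is simple enough to sketch in a few lines: the prompt presents the question, then explicitly repeats it before the answer is generated. This is a minimal approximation of an RE2-style prompt; the exact wording used in the paper's experiments may differ.

```python
def re2_prompt(question: str, trigger: str = "Let's think step by step.") -> str:
    # RE2-style input: state the question, then re-state it before eliciting the answer.
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        f"A: {trigger}"
    )

print(re2_prompt(
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?"
))
```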

2023-09 · 8 min · 1550 words
[Re-ReST: Reflection-Reinforced Self-Training for Language Agents 🔗](https://arxiv.org/abs/2406.01495)

Can Agents Teach Themselves? Mastering Self-Training with Reflection

The landscape of Large Language Models (LLMs) has shifted rapidly from simple chatbots to Language Agents—systems capable of reasoning, planning, and interacting with external environments to solve complex tasks. Whether it’s browsing the web to answer multi-hop questions or writing code to pass unit tests, agents represent the next frontier of AI utility. However, building these agents presents a significant bottleneck: Data. To make a generic LLM (like Llama-2 or Llama-3) act as a competent agent, we typically need to fine-tune it on high-quality “trajectories”—step-by-step examples of reasoning and acting. Historically, there have been two ways to get this data: human annotation (slow and expensive) or “distillation,” where we ask a massive, proprietary model like GPT-4 to generate examples for the smaller model to learn from. ...
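
For readers unfamiliar with what a "trajectory" looks like as training data, here is a toy ReAct-style example. The field names and schema are illustrative assumptions, not the exact format used by Re-ReST.

```python
# Illustrative only: one agent trajectory recorded as thought/action/observation steps.
trajectory = {
    "task": "Which city hosted the 1992 Summer Olympics, and in which country is it?",
    "steps": [
        {
            "thought": "I should look up the host city of the 1992 Summer Olympics.",
            "action": "search[1992 Summer Olympics]",
            "observation": "The 1992 Summer Olympics were held in Barcelona.",
        },
        {
            "thought": "Barcelona is in Spain, so I can answer.",
            "action": "finish[Barcelona, Spain]",
            "observation": "",
        },
    ],
}

# Fine-tuning data is typically flattened into (context, next-step) pairs from records like this.
print(trajectory["steps"][0]["action"])
```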

2024-06 · 9 min · 1782 words
[Re-Evaluating Evaluation for Multilingual Summarization 🔗](https://aclanthology.org/2024.emnlp-main.1085.pdf)

Lost in Evaluation — Why We Can't Trust English Metrics for Multilingual AI

The field of Natural Language Processing (NLP) is currently witnessing a massive expansion in accessibility. We are no longer just building models for English speakers; with the release of multilingual Large Language Models (LLMs) like BLOOM, Llama 2, and Aya-23, the goal is to create AI that speaks the world’s languages. However, building these models is only half the battle. The other half is determining whether they actually work. ...

8 min · 1568 words
[Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training 🔗](https://arxiv.org/abs/2508.11393)

Unlocking the Black Box—How Self-Training Can Teach Transformers to Explain Themselves

Artificial Intelligence has a trust problem. As Deep Learning models, particularly Transformers, continue to dominate fields ranging from movie sentiment analysis to complex scientific classification, they have become increasingly accurate—and increasingly opaque. We often know what a model predicts, but rarely why. If a model flags a financial transaction as fraudulent or diagnoses a medical scan as positive, “because the computer said so” is no longer an acceptable justification. We need rationales—highlights of the input text that explain the decision. ...
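
A common way to represent such a rationale is a binary mask over the input tokens: the model's prediction should be explainable from the highlighted words alone. The toy review below is my own example, not one from the paper.

```python
# A rationale as a token-level mask: 1 = highlighted as evidence, 0 = not part of the explanation.
tokens = ["the", "acting", "was", "wooden", "and", "the", "plot", "was", "dull"]
mask   = [0,     1,        0,     1,        0,     0,     1,      0,     1]

rationale = [tok for tok, keep in zip(tokens, mask) if keep]
print(rationale)  # ['acting', 'wooden', 'plot', 'dull'] -> the evidence for a negative prediction
```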

2025-08 · 8 min · 1676 words
[Rationale-Aware Answer Verification by Pairwise Self-Evaluation 🔗](https://arxiv.org/abs/2410.04838)

Getting It Right for the Right Reasons: Improving LLM Verification with REPS

Imagine you are a professor grading a math exam. You come across a student who has written the correct final answer, “42,” but their working out involves subtracting apples from oranges and dividing by the color blue. Do you give them full marks? In the world of Large Language Models (LLMs), the answer has traditionally been “yes.” Current methods for training LLMs to verify answers often focus solely on the final output. If the model guesses the right answer, it gets a reward, regardless of the logical leaps or hallucinations it took to get there. This creates a dangerous “Clever Hans” effect where models learn to act correctly but think incorrectly. ...
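
The failure mode described above boils down to outcome-only supervision: the verifier's label depends solely on whether the final answer matches the reference, so a nonsensical rationale attached to a lucky "42" still gets full credit. A tiny sketch of that labeling rule (my own illustration, not the paper's code):

```python
def outcome_only_label(final_answer: str, rationale: str, reference: str) -> int:
    # The rationale argument is accepted but never inspected.
    return int(final_answer.strip() == reference.strip())

bad_reasoning = "Subtract the apples from the oranges, then divide by the color blue."
print(outcome_only_label("42", bad_reasoning, "42"))  # 1 -> rewarded despite the nonsense
```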

2024-10 · 10 min · 1969 words
[Ranking Manipulation for Conversational Search Engines 🔗](https://arxiv.org/abs/2406.03589)

SEO for LLMs - How to Hack Conversational Search Rankings

For two decades, the internet economy has revolved around a single, crucial concept: the ranked list. When you search for “best blender” on Google, an entire industry known as Search Engine Optimization (SEO) works tirelessly to ensure their product lands on the first page of links. But the paradigm is shifting. We are moving from Search Engines to Conversational Search Engines. Tools like Perplexity.ai, Google’s Search Generative Experience (SGE), and ChatGPT Search don’t just give you a list of blue links. They read the websites for you, synthesize the information, and produce a natural language recommendation. Instead of “Here are 10 links about blenders,” the output is “The Smeg 4-in-1 is the best choice because…” ...

2024-06 · 8 min · 1660 words
[RaTEScore: A Metric for Radiology Report Generation 🔗](https://arxiv.org/abs/2406.16845)

Beyond Word Matching: How RaTEScore Teaches AI to Grade Medical Reports Like a Doctor

Artificial Intelligence is rapidly transforming healthcare. We are moving toward a future where “Generalist Medical AI” can look at an X-ray, an MRI, or a CT scan and draft a diagnostic report in seconds. This promises to reduce burnout for radiologists and speed up patient care. However, there is a massive bottleneck in this revolution: Trust. If an AI writes a report, how do we know it’s accurate? If we have two different AI models, how do we know which one is better? In general Natural Language Processing (NLP), we verify text using standard metrics. But in medicine, a “standard” metric can be dangerous. If an AI writes “No pneumothorax” instead of “Pneumothorax,” it has only changed one word—a small error to a computer, but a life-threatening error to a patient. ...
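
To see why surface overlap fails here, consider a bare unigram F1 between the two findings. This is not RaTEScore itself, just an illustration of the gap such a metric is designed to close.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    # Token-level F1: the kind of surface overlap many text-generation metrics reduce to.
    cand, ref = candidate.lower().split(), reference.lower().split()
    common = sum(min(cand.count(tok), ref.count(tok)) for tok in set(cand))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# One dropped negation flips the diagnosis, yet the overlap score barely moves.
print(unigram_f1("no pneumothorax is seen", "pneumothorax is seen"))  # ~0.86
```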

2024-06 · 8 min · 1662 words
[RWKV-CLIP: A Robust Vision-Language Representation Learner 🔗](https://arxiv.org/abs/2406.06973)

Beyond Transformers: How RWKV-CLIP Revolutionizes Vision-Language Models with RNN Efficiency

In the world of Artificial Intelligence, Contrastive Language-Image Pre-training (CLIP) was a watershed moment. By learning to associate images with their textual descriptions on a massive scale, CLIP enabled models to understand visual concepts with zero-shot capabilities that were previously unimaginable. If you show a standard computer vision model a picture of a specific breed of dog it wasn’t trained on, it fails. Show it to CLIP, and it understands. ...

2024-06 · 9 min · 1886 words
[RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models 🔗](https://arxiv.org/abs/2407.05131)

Fixing the Trust Issue: How RULE Makes Medical AI More Reliable

Artificial Intelligence is revolutionizing healthcare, particularly in the realm of medical imaging. We now have Medical Large Vision Language Models (Med-LVLMs) capable of looking at an X-ray or a retinal scan and answering clinical questions. However, there is a persistent “elephant in the room” with these models: Hallucinations. Even the best models sometimes generate medical responses that sound plausible but are factually incorrect. In a clinical setting, a factual error isn’t just a glitch—it’s a safety risk. ...

2024-07 · 7 min · 1356 words
[RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework 🔗](https://arxiv.org/abs/2410.19109)

How to Tame Your LLM: A Look at RSA-Control and Pragmatic Generation

We are living in the golden age of Large Language Models (LLMs). From GPT-4 to Llama, these models can write poetry, code, and essays with startling fluency. However, anyone who has spent time prompting these models knows they can be like unruly teenagers: talented but difficult to control. You might ask for a summary suitable for a five-year-old, and the model might return a text full of academic jargon. Worse, you might ask for a story, and the model might inadvertently produce toxic or biased content based on its training data. ...

2024-10 · 8 min · 1689 words
[RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs 🔗](https://arxiv.org/abs/2407.02552)

Breaking the Language Barrier: How RLOO Unlocks Multilingual Alignment

If you follow the rapid evolution of Large Language Models (LLMs), you are likely familiar with the “alignment” phase. After a model consumes terabytes of text to learn how to predict the next token (Pre-training) and learns to follow instructions (Supervised Fine-Tuning or SFT), it undergoes a final polish: Preference Optimization. This is the stage where models like ChatGPT or Claude learn to be helpful, harmless, and conversational, usually via techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). ...

2024-07 · 7 min · 1472 words
[RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets 🔗](https://aclanthology.org/2024.emnlp-main.1214.pdf)

Beyond One Object: Understanding RECANTFormer for Generalized Referring Expression Comprehension

Imagine you are asking a robot to help you in the kitchen. You say, “Pass me the red mug.” In a perfect world—or at least in the world of classic computer vision benchmarks—there is exactly one red mug on the counter. The robot identifies it, boxes it, and grabs it. But what if real life happens? What if the dishwasher is empty and there are no red mugs? Or what if you just had a party and there are three red mugs? ...

10 min · 2048 words
[REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering 🔗](https://arxiv.org/abs/2402.17497)

Trust Issues in AI: How REAR Teaches LLMs to Ignore Irrelevant Data

We often think of Large Language Models (LLMs) as vast repositories of knowledge, but they have a significant weakness: they cannot memorize everything, especially real-time events or niche domain knowledge. To solve this, the AI community widely adopted Retrieval-Augmented Generation (RAG). The concept is simple: when an LLM is asked a question, it first searches an external database (like Wikipedia) for relevant documents, then uses those documents to generate an answer. ...
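
Here is a minimal sketch of that vanilla retrieve-then-generate loop, the baseline that REAR builds on. `search_wikipedia` and `llm_generate` are hypothetical stand-ins, not a specific library's API.

```python
def search_wikipedia(query: str, k: int = 3) -> list[str]:
    # Hypothetical retriever stub: swap in a real search index or dense retriever.
    return [f"(retrieved passage {i + 1} for: {query})" for i in range(k)]

def llm_generate(prompt: str) -> str:
    # Hypothetical LLM stub: swap in a real model call.
    return "(answer generated from the retrieved passages)"

def rag_answer(question: str) -> str:
    passages = search_wikipedia(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)

print(rag_answer("Who won the 2022 FIFA World Cup?"))
```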

2024-02 · 10 min · 2062 words
[RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation 🔗](https://arxiv.org/abs/2406.05794)

RE-RAG: Giving RAG the Confidence to Say 'I Don't Know' (or 'I Know Better')

Retrieval-Augmented Generation (RAG) has become the backbone of modern AI knowledge systems. By combining a parametric memory (the weights of a Large Language Model) with non-parametric memory (an external database of documents), we can build systems that answer questions with up-to-date, specific information. But anyone who has built a RAG system knows the dirty secret: Retrievers are noisy. If a user asks a question and the retriever fetches irrelevant documents, the generator (the LLM) is placed in a difficult position. It might try to force an answer from the bad context, leading to hallucinations, or it might ignore the context entirely, leading to answers that lack citation. Standard retrievers provide a similarity score, but this score is relative—it tells you that Document A is better than Document B, but it doesn’t tell you if Document A is actually good. ...
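
A quick numeric illustration of that "relative, not absolute" problem, using toy similarity scores of my own rather than numbers from the paper:

```python
good_run = {"doc_A": 0.92, "doc_B": 0.55, "doc_C": 0.41}  # the top passage is probably on-topic
bad_run  = {"doc_A": 0.31, "doc_B": 0.22, "doc_C": 0.15}  # even the "best" passage is a weak match

for run in (good_run, bad_run):
    best = max(run, key=run.get)
    # The ranking picks doc_A both times. The score alone doesn't say whether doc_A is actually
    # good, which is exactly when the generator should be able to answer "I don't know."
    print(best, run[best])
```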

2024-06 · 8 min · 1691 words
[RAt: Injecting Implicit Bias for Text-To-Image Prompt Refinement Models 🔗](https://aclanthology.org/2024.emnlp-main.1144.pdf)

The Hidden Manipulator: How RAt Injects Implicit Bias into AI Art Generators

The rise of Text-to-Image (T2I) generation models like Stable Diffusion and Midjourney has revolutionized digital creativity. However, getting a high-quality image out of these models often requires “prompt engineering”—the art of crafting detailed, specific text descriptions. Because most users aren’t experts at writing these complex prompts, a new class of tools has emerged: Text-to-Image Prompt Refinement (T2I-Refine) services. These services take a user’s simple input (e.g., “a smart phone”) and expand it into a rich description (e.g., “a smart phone, intricate details, 8k resolution, cinematic lighting”). While helpful, this intermediate layer introduces a fascinating security and ethical question: Can a prompt refinement model be “poisoned” to secretly manipulate the output? ...

8 min · 1522 words
[RAR: Retrieval-augmented retrieval for code generation in low-resource languages 🔗](https://aclanthology.org/2024.emnlp-main.1199.pdf)

Cracking the Code for Low-Resource Languages: An Introduction to Retrieval-Augmented Retrieval (RAR)

If you have ever asked ChatGPT or GitHub Copilot to write a Python script or a JavaScript function, you know the results can be magically accurate. These models have been trained on billions of lines of code from popular languages, making them incredibly proficient at standard programming tasks. But what happens when you step off the beaten path? When you ask an LLM to generate code for low-resource languages—domain-specific languages (DSLs) like Microsoft Power Query M, OfficeScript, or complex Excel formulas—the performance drops significantly. These languages don’t have the massive repositories of open-source code required to train a model effectively. ...

8 min · 1638 words
[RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering 🔗](https://arxiv.org/abs/2407.13998)

Beyond Wikipedia: Benchmarking Long-Form RAG with LFRQA and RAG-QA Arena

Retrieval-Augmented Generation (RAG) has become the de facto architecture for building reliable AI systems. By connecting Large Language Models (LLMs) to external data sources, we give them a “memory” that is up-to-date and verifiable. However, a significant gap exists in how we evaluate these systems. Most current benchmarks rely on Wikipedia data and expect short, punchy answers (like “Paris” or “1984”). But in the real world, we use RAG to generate comprehensive reports, summarize financial trends, or explain complex biological mechanisms. When an LLM generates a nuanced, three-paragraph explanation, comparing it to a three-word ground truth using standard “exact match” metrics is like grading a history essay with a math answer key. ...
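
To make the "math answer key" complaint concrete: a SQuAD-style exact-match check gives a fully correct long-form answer a score of zero. A minimal illustration with my own toy example (real exact-match implementations add a bit more normalization, such as article removal):

```python
import re
import string

def exact_match(prediction: str, gold: str) -> bool:
    # SQuAD-style exact match after lowercasing and stripping punctuation.
    def normalize(s: str) -> str:
        return re.sub(f"[{re.escape(string.punctuation)}]", "", s.lower()).strip()
    return normalize(prediction) == normalize(gold)

gold = "Paris"
long_answer = ("The capital of France is Paris, which has been the country's political "
               "and cultural center for centuries.")

print(exact_match("Paris", gold))      # True
print(exact_match(long_answer, gold))  # False -> a correct, well-grounded answer scores zero
```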

2024-07 · 8 min · 1672 words