![Cover image](https://deep-paper.org/en/paper/2510.01123/images/cover.png)
# Beyond Chain-of-Thought: How Parallel Thinking and Self-Refinement Unlock Smarter LLMs
## Introduction: The High Cost of Thinking

For years, the go-to method for getting Large Language Models (LLMs) to solve complex reasoning problems has been to make them “think out loud.” By prompting them to generate a step-by-step Chain-of-Thought (CoT), we encourage them to break down complex problems, explore different approaches, and correct their own mistakes along the way. The informal rule has been simple: the more “thinking tokens” a model generates, the better its final answer. ...
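To make the idea concrete, here is a minimal sketch of how a Chain-of-Thought prompt differs from a direct-answer prompt. This is purely illustrative: the exact wording and the `build_prompts` helper are assumptions for this post, not the paper's prompts, and no particular LLM API is assumed.

```python
def build_prompts(question: str) -> dict[str, str]:
    """Contrast a direct prompt with a Chain-of-Thought (CoT) prompt.

    The CoT variant asks the model to reason step by step before
    answering, which typically produces far more "thinking tokens"
    than the direct variant.
    """
    direct = f"Question: {question}\nAnswer:"
    cot = (
        f"Question: {question}\n"
        "Let's think step by step, checking each intermediate result, "
        "and only then state the final answer.\n"
        "Reasoning:"
    )
    return {"direct": direct, "cot": cot}


if __name__ == "__main__":
    q = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"
    for name, prompt in build_prompts(q).items():
        print(f"--- {name} prompt ---\n{prompt}\n")
```

Fed to the same model, the CoT prompt trades extra generated tokens for (usually) higher accuracy, which is exactly the cost-versus-quality tension this section refers to.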