![Cover image](https://deep-paper.org/en/paper/2502.15771/images/cover.png)
# Stop Repeating Mistakes: How LLMs Can Learn from Feedback in Real Time
Large Language Models (LLMs) are incredibly powerful, yet they share a subtle weakness: complex, multi-step reasoning. Ask a model to solve an Olympiad-level math question or a competitive programming puzzle, and its first attempt is often wrong. The challenge isn't generating an answer; it's learning from failure effectively. Humans learn from mistakes. We rarely repeat the same error twice, because we internalize what went wrong. Could LLMs do something similar? Could they learn from feedback while they're being tested, improving with every iteration? ...