[Dream 7B: Diffusion Large Language Models 🔗](https://arxiv.org/abs/2508.15487)

Beyond Left-to-Right: Introducing Dream 7B, a Powerful New Diffusion LLM

For years, large language models (LLMs) have relied on a single fundamental idea: autoregression. Models such as GPT-4, LLaMA, and Qwen generate text one word at a time, moving from left to right—much like how a person might write a sentence. This approach has driven remarkable progress, but it also carries inherent limitations. When a model can only see the past, it struggles with tasks requiring global consistency, long-term planning, or satisfying complex constraints. ...

2025-08 · 7 min · 1473 words
[Linear Transformers are Versatile In-Context Learners 🔗](https://arxiv.org/abs/2402.14180)

Beyond Gradient Descent: How Transformers Discover Their Own Optimization Algorithms

Transformers have taken the world by storm, powering everything from ChatGPT to advanced code completion tools. One of their most magical abilities is in-context learning (ICL) — the power to learn from examples provided in their input prompt, without any weight updates. If you show a large language model a few examples of a task, it can often perform that task on a new example instantly. For a long time, how this works has been a bit of a mystery. Recent research has started to peel back the layers, suggesting that for simple tasks like linear regression, transformers internally run a form of gradient descent. Each attention layer acts like a single optimization step, refining an internal “solution” based on the data in the prompt. ...
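
To make that picture concrete, here is a small NumPy sketch (a toy illustration of the idea, not the paper's construction) of what "one layer as one optimization step" means: each pass performs a single gradient-descent step on the least-squares loss defined by the in-context examples, and the prediction for a held-out query improves layer by layer.

```python
# Toy sketch: one "layer" behaves like one gradient-descent step on an
# in-context linear regression task (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)        # hidden linear rule the prompt encodes
X = rng.normal(size=(8, 3))        # in-context examples (the "prompt")
y = X @ w_true
x_query = rng.normal(size=3)       # query the model must answer

def gd_step(w, lr=0.1):
    """One gradient step on the squared loss over the prompt examples."""
    grad = X.T @ (X @ w - y) / len(X)
    return w - lr * grad

w = np.zeros(3)                    # the model's implicit "solution"
for layer in range(12):            # each iteration ~ one attention layer
    w = gd_step(w)
    print(f"layer {layer + 1:2d}  query error = {abs(x_query @ (w - w_true)):.4f}")
```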

2024-02 · 7 min · 1311 words
[Transformers are RNNs - Fast Autoregressive Transformers with Linear Attention 🔗](https://arxiv.org/abs/2006.16236)

Making Transformers Fly - A Deep Dive into Linear Attention

Since their introduction in the landmark 2017 paper Attention Is All You Need, Transformers have taken the world of AI by storm. Models like BERT, GPT-3, and DALL·E have revolutionized natural language processing, computer vision, and beyond. They are the engines behind the generative AI boom—capable of writing code, creating art, and holding surprisingly coherent conversations. But these powerful models have a costly secret: a computational bottleneck that has, until recently, put a hard limit on how much information they can handle at once. The core of the Transformer, the self-attention mechanism, has a computational and memory complexity of \(O(N^2)\), where \(N\) is the length of the input sequence. This means that if you double the length of the text or the number of pixels in an image you’re processing, the cost doesn’t just double—it quadruples. For very long sequences—like high-resolution images, lengthy documents, or audio clips—this quadratic scaling becomes prohibitively expensive. ...
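
As a rough illustration of where the quadratic cost comes from, and how linear attention sidesteps it, here is a simplified NumPy sketch: the softmax version materializes an N-by-N weight matrix, while the linearized version (using the feature map phi(x) = elu(x) + 1 suggested in the paper) reassociates the products so that only d-by-d summaries are ever formed. This is a non-causal, single-head simplification, not the paper's full implementation.

```python
# Rough sketch contrasting the two attention formulations (single head, no masking).
import numpy as np

def softmax_attention(Q, K, V):
    # Materializes an N x N matrix: O(N^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, the feature map used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Reassociates (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V): linear in N.
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                    # d x d summary, independent of N
    z = Kp.sum(axis=0)               # normalizer
    return (Qp @ kv) / ((Qp @ z)[:, None] + eps)

N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```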

2020-06 · 8 min · 1560 words
[Test-Time Training with Self-Supervision for Generalization under Distribution Shifts 🔗](https://arxiv.org/abs/1909.13231)

Don’t Just Test — Train! Adapting to New Data on the Fly with Self-Supervision

You’ve spent weeks training a state-of-the-art image classifier. It achieves near-perfect accuracy on your test set, and you’re ready to deploy it. But when it encounters real-world data—a blurry photo from an old phone, an image taken on a foggy day, or a frame from a shaky video—its performance drops dramatically. Sound familiar? This is one of the most persistent challenges in machine learning: the distribution shift. Models that excel on clean, curated training data often buckle when faced with test data that, while semantically similar, has different statistical properties. The standard machine learning paradigm assumes that training and test data are drawn from the same independent and identically distributed (i.i.d.) source—an assumption that the real world frequently violates. ...

2019-09 · 8 min · 1587 words
[Beyond Model Adaptation at Test Time: A Survey 🔗](https://arxiv.org/abs/2411.03687)

When Models Meet Reality: The Ultimate Guide to Test‑Time Adaptation

Machine learning models are powerful pattern detectors. They excel when the world at test time looks like the world they saw during training. But in practice the world rarely cooperates. A self-driving car trained on sunny roads will struggle in snow. A medical imaging model trained in one hospital may fail on data from another. This mismatch—known as distribution shift—is one of the biggest obstacles to reliable, real-world AI. The traditional remedy is retraining: collect new data and update the model. But that’s often impractical or slow. Test-Time Adaptation (TTA) takes a different view: let the model adapt on the fly during inference. The recent survey “Beyond Model Adaptation at Test Time” organizes over 400 papers and shows that adaptation is not just about fine-tuning model weights. Researchers adapt many components of the prediction pipeline: the model, the inference procedure, normalization layers, the input samples themselves, and even the prompts fed to large foundation models. ...

2024-11 · 12 min · 2453 words
[Learning to (Learn at Test Time): RNNs with Expressive Hidden States 🔗](https://arxiv.org/abs/2407.04620)

RNNs Are Back? How Making Hidden States into Learners Unlocks Long-Context Potential

Recurrent Neural Networks (RNNs) have long lived in the shadow of Transformers. Transformers dominate modern sequence modeling because they can effectively use long contexts—predicting future tokens becomes easier the more history they have to condition on. The drawback is their quadratic complexity, which makes them slow and memory-hungry for long sequences. RNNs, in contrast, have linear complexity, but have historically struggled to take advantage of more context. As the 2020 OpenAI scaling law paper famously showed, classic RNNs like LSTMs could not scale or leverage long-context data as Transformers did. Modern RNN architectures such as Mamba seemed to be closing that gap, until researchers discovered that even the best RNNs plateau when sequence lengths grow very long. ...

2024-07 · 8 min · 1637 words
[TTT-UNet: Enhancing U-Net with Test-Time Training Layers for Biomedical Image Segmentation 🔗](https://arxiv.org/abs/2409.11299)

Beyond Static Models: How TTT-UNet Adapts on the Fly for Superior Medical Image Segmentation

Introduction: The Challenge of Seeing the Whole Picture

In medical diagnostics, clarity is everything. Medical image segmentation—the process of outlining organs, tissues, or cells in medical imagery—is central to understanding disease progression and guiding surgical decisions. Over the past decade, Convolutional Neural Networks (CNNs), particularly the famed U-Net architecture, have been instrumental in achieving precise segmentation across numerous applications. Yet despite their success, CNNs have a key limitation: they “see” the world through small, localized windows known as kernels. That makes them excellent at capturing fine textures but poor at understanding the global structure of images—the large-scale relationships between distant regions. Imagine trying to comprehend a whole-body CT scan through a magnifying glass. You’d see details beautifully, but miss how organs connect. ...

2024-09 · 7 min · 1350 words
[Unexpected Benefits of Self-Modeling in Neural Systems 🔗](https://arxiv.org/abs/2407.10188)

The Self-Awareness Paradox: How Teaching Neural Networks to Model Themselves Makes Them Simpler

What if making an AI self-aware didn’t just help it understand itself—but fundamentally changed it for the better? In cognitive science, we’ve long known that humans rely on self-models: our body schema that tracks limbs in space, and our metacognition, the ability to think about our own thoughts. Such predictive self-models help the brain control and adapt its behavior. But what happens when we give a similar ability to neural networks? ...

2024-07 · 7 min · 1446 words

Supercharge Your Transformer: How One Gradient Step at Test Time Makes In-Context Learning Way More Efficient

Introduction: The Adaptation Puzzle

Large language models (LLMs) and other foundation models have revolutionized AI. Their most striking ability is in-context learning (ICL)—you can show a model a few examples of a new task right in the prompt, and it can often figure out how to solve it without updating its internal weights. It’s like a student learning from a handful of practice problems just before an exam. But what happens when the exam questions are particularly tricky or on a topic the student barely studied? The model, like the student, might stumble. This is a classic case of distribution shift: the test data look different from the training data, and the pre-trained model fails to generalize. ...

10 min · 1991 words
[DeepPrune: Parallel Scaling without Inter-trace Redundancy 🔗](https://arxiv.org/abs/2510.08483)

Wasted Work: How DeepPrune Slashes LLM Reasoning Costs by Over 80%

Large Language Models (LLMs) have become remarkably good at complex reasoning tasks—solving advanced math problems, writing structured code, and answering graduate-level science questions. One of the central techniques powering this intelligence is parallel scaling, where a model generates hundreds of independent reasoning paths (or Chains of Thought, CoT) for the same problem and selects the most consistent final answer—typically through majority voting. Think of it as a giant brainstorming session: the model explores dozens of ways to solve a problem, then decides which solution seems most reliable. ...
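
The excerpt above describes the standard parallel-scaling baseline, so a minimal sketch of majority voting over sampled traces may help; `sample_trace` and `extract_answer` below are hypothetical stand-ins for an actual LLM call and answer parser, not part of DeepPrune itself.

```python
# Minimal sketch of majority voting (self-consistency) over parallel reasoning traces.
from collections import Counter

def sample_trace(problem: str, seed: int) -> str:
    """Placeholder for one independently sampled chain of thought ending in an answer."""
    answers = ["42", "42", "41", "42", "40"]
    return f"...reasoning about {problem}... Final answer: {answers[seed % len(answers)]}"

def extract_answer(trace: str) -> str:
    """Pull the final answer out of a trace."""
    return trace.rsplit("Final answer:", 1)[-1].strip()

def majority_vote(problem: str, n_traces: int = 5) -> str:
    answers = [extract_answer(sample_trace(problem, i)) for i in range(n_traces)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))   # -> "42"
```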

2025-10 · 7 min · 1480 words
[Vibe Checker: Aligning Code Evaluation with Human Preference 🔗](https://arxiv.org/abs/2510.07315)

More Than Just Correct: Why Your AI Coding Assistant Needs a 'Vibe Check'

If you’ve ever used an AI coding assistant like GitHub Copilot, you’ve probably engaged in what researchers now call “vibe coding.” You don’t just ask for code once—you have a conversation. You might start with a basic request, then refine it: “Okay, that works, but can you rewrite it using a for loop instead of recursion?” or “Add some comments, and make sure all lines are under 80 characters.” You keep tweaking until the code not only functions but also feels right. It passes your personal “vibe check.” That feeling of “rightness” goes beyond logic—it includes readability, consistency with project style, minimal edits, and following nuanced, non-functional requests. ...

2025-10 · 8 min · 1524 words
[Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs 🔗](https://arxiv.org/abs/2406.09136)

Beyond Chain-of-Thought: How CPO Makes LLMs Smarter Without Slowing Them Down

Large Language Models (LLMs) have become remarkably adept at tackling complex problems—from writing code to solving intricate math questions. A cornerstone of this success is their ability to “think out loud” through Chain-of-Thought (CoT) reasoning. By generating intermediate steps, an LLM can break down a problem logically and arrive at a more accurate solution. However, CoT has a fundamental limitation: it’s like someone blurting out the first train of thought that comes to mind. The reasoning follows a single, linear path that might not always be the best or most logical one. This can lead to errors in tasks requiring deeper, multi-step deliberation. ...

2024-06 · 8 min · 1518 words
[Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy 🔗](https://arxiv.org/abs/2410.09873)

The Secret to Faster Diffusion Models: How AdaptiveDiffusion Skips Steps Intelligently

Diffusion models like Midjourney, Stable Diffusion, and Sora have transformed how we create digital art, videos, and realistic images from simple text prompts. They power a new generation of creative tools—but they share one major limitation: speed. Generating a single high-resolution image with a model like SDXL can take tens of seconds, making real-time or interactive applications cumbersome. Why are they so slow? It all comes down to their core mechanism. Diffusion models start from pure noise (think of TV static) and gradually refine this noise into a coherent image through dozens or even hundreds of steps. At each step, a large neural network—called the noise predictor—estimates how much noise remains to be removed. Running this heavy network repeatedly dominates the computation time. ...
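
As a schematic of the loop described above (not the paper's step-skipping method), the sketch below runs a toy denoising loop in which a placeholder `noise_predictor` stands in for the large network whose repeated invocation dominates generation time.

```python
# Schematic denoising loop; `noise_predictor` is a stand-in for the heavy network.
import numpy as np

rng = np.random.default_rng(0)

def noise_predictor(x, t):
    """Placeholder for the large neural network (e.g., a U-Net) called at every step."""
    return 0.1 * x  # toy estimate of the remaining noise

def generate(shape=(64, 64, 3), num_steps=50):
    x = rng.normal(size=shape)                   # start from pure noise
    for t in reversed(range(num_steps)):         # each step removes a bit of noise
        predicted_noise = noise_predictor(x, t)  # the expensive call
        x = x - predicted_noise                  # simplified update rule
    return x

image = generate()
print(image.shape, float(image.std()))
```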

2024-10 · 6 min · 1251 words
[Robust Sparse Regression with Non-Isotropic Designs 🔗](https://arxiv.org/abs/2410.23937)

Taming Two Adversaries: A Breakthrough in Robust Sparse Regression

Introduction: The Messy Reality of Big Data

Linear regression is one of the foundations of modern statistics and machine learning. The idea is simple: fit a line (or a plane) that best captures the relationship between input variables and an output. But simplicity ends where real-world data begins — and real data is rarely clean or low-dimensional. In practice, we deal with enormous, high-dimensional datasets that are messy, noisy, and often contain outliers. ...

2024-10 · 10 min · 1973 words
[Random Policy Enables In-Context Reinforcement Learning within Trust Horizons 🔗](https://arxiv.org/abs/2410.19982)

Unlocking In-Context Reinforcement Learning with Random Data — A Deep Dive into State-Action Distillation (SAD)

Foundation models like GPT have demonstrated an astonishing ability called in-context learning—the capacity to adapt to new tasks purely from examples, without updating any model parameters. This breakthrough has reshaped modern machine learning across language, vision, and multimodal domains. Now, researchers are extending this power to decision-making systems, spawning a new frontier known as In-Context Reinforcement Learning (ICRL). The goal is simple but ambitious: build a pretrained agent that can enter a new, unseen environment and quickly learn how to act optimally by using its recent experiences—state, action, and reward tuples—as contextual hints. No gradient updates, no fine-tuning—just pure inference-driven learning. ...

2024-10 · 8 min · 1558 words
[Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision 🔗](https://arxiv.org/abs/2411.16579)

Training an LLM to Be Its Own Toughest Critic

Large Language Models (LLMs) have become astonishingly good at sounding human, but when it comes to complex, multi-step reasoning—say, solving a tricky math problem or debugging a program—they often stumble. One common fix is to simply give the model more “thinking time” during inference: let it generate multiple answers and choose the best. The problem? If the model isn’t good at judging its own work, it will just produce lots of wrong answers faster. Like asking a student who doesn’t understand algebra to solve a hundred equations—they’ll make the same mistakes over and over. ...

2024-11 · 9 min · 1773 words
[A Survey of Few-Shot Learning on Graphs: from Meta-Learning to Pre-Training and Prompt Learning 🔗](https://arxiv.org/abs/2402.01440)

Learning from Scraps – A Deep Dive into Few-Shot Learning on Graphs

Graph-structured data is everywhere. From social networks connecting billions of users to intricate molecular structures and vast knowledge graphs, our world is built on relationships. Graph Neural Networks (GNNs) have become the go-to tool for learning from this data, powering everything from recommendation engines to drug discovery. But these powerful models have an insatiable appetite — they thrive on data, particularly labeled data. They achieve state-of-the-art performance only when fed a mountain of labeled examples. What happens when those labels are scarce? What if you’re dealing with an emerging category, a rare disease, or a brand-new user on your platform? ...

2024-02 · 8 min · 1577 words
[A Tutorial on Meta-Reinforcement Learning 🔗](https://arxiv.org/abs/2301.08028)

Learning to Learn: A Deep Dive into Meta‑Reinforcement Learning

Meta-reinforcement learning (meta-RL) asks a deceptively simple question: instead of hand-designing how an agent learns, can we learn the learning process itself from data? Put another way, instead of designing a single algorithm to solve one task, can we design an algorithm that itself becomes a data-driven learning procedure—so that when faced with a new task it adapts rapidly and efficiently? ...

2023-01 · 11 min · 2132 words
[Domain Generalization through Meta-Learning: A Survey 🔗](https://arxiv.org/abs/2404.02785)

Learning to Generalize: How Meta-Learning Is Cracking the Code of Domain Generalization

Deep learning models are incredible. They can identify cats in photos, translate languages in real-time, and even help doctors diagnose diseases. But they have a critical weakness: they are often brittle. Train a model on pristine, studio-quality images, and it might fail spectacularly when shown a blurry, real-world photo taken on a smartphone. This is the out-of-distribution (OOD) problem, and it’s one of the biggest hurdles to building truly reliable and adaptive AI. ...

2024-04 · 9 min · 1735 words
[A Survey to Recent Progress Towards Understanding In-Context Learning 🔗](https://arxiv.org/abs/2402.02212)

How Do LLMs Learn on the Fly? A Deep Dive into In-Context Learning

Large Language Models (LLMs) like GPT-4 and Claude have a seemingly magical ability: show them just a few examples of a task in a prompt—say, two labeled sentences or snippets of code—and they can instantly perform that task on new data. This capability, known as In-Context Learning (ICL), allows them to translate languages, analyze sentiment, or even write algorithms with just a handful of demonstrations, all without any updates to their underlying weights. ...

2024-02 · 7 min · 1437 words