[Shortcut Learning in In-Context Learning: A Survey 🔗](https://arxiv.org/abs/2411.02018)

Beyond the Prompt: Unpacking Shortcut Learning in Large Language Models

Large Language Models (LLMs) such as GPT-3, LLaMA, Qwen2, and GLM have revolutionized how humans interact with technology. Among their many capabilities, In-Context Learning (ICL) stands out as particularly intriguing—it lets a model perform a new task simply by observing a few examples within a prompt, with no retraining required. It feels almost magical. But what if this “magic” sometimes hides a clever illusion? LLMs often take the path of least resistance. Instead of grasping the reasoning we expect, they find simple shortcuts that seem to work—until they don’t. This phenomenon, known as shortcut learning, reveals that these models can overfit to shallow patterns rather than genuine logic. It’s reminiscent of Clever Hans, the horse that seemed to understand arithmetic but was really just responding to subtle cues from its handler. ...

2024-11 · 7 min · 1488 words
[The Mystery of In-Context Learning: A Comprehensive Survey on Interpretation and Analysis 🔗](https://arxiv.org/abs/2311.00237)

Unlocking the Black Box: A Deep Dive into How LLMs Learn on the Fly

If you’ve ever used ChatGPT, Llama, or any other modern Large Language Model (LLM), you’ve witnessed a kind of magic. You can show it a few examples of a task—like translating phrases from English to French or classifying movie reviews as positive or negative—and it instantly gets it. Without any retraining or fine-tuning, it can perform the task on new inputs. This remarkable ability is called In-Context Learning (ICL), and it’s one of the key reasons why LLMs are so powerful and versatile. As shown in Figure 1, you can provide examples for translation, sentiment analysis, or even math, and the same model can handle them all—adapting its behavior dynamically based on the context given in the prompt. ...
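
To make the idea concrete, here is a minimal sketch of what a few-shot prompt might look like. Everything in it (the translation pairs, the `build_prompt` helper) is a hypothetical illustration rather than anything from the survey; the point is simply that the "learning" lives in the prompt, not in a gradient update.

```python
# Minimal illustration of in-context learning via a few-shot prompt.
# The examples and helper are hypothetical; any text-completion model
# would receive this string and be expected to continue the pattern.

def build_prompt(examples, query):
    """Concatenate worked examples and a new query into one prompt."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

print(build_prompt(examples, "good morning"))
# A capable LLM completes the last line with "bonjour" without any weight
# update; the task is inferred entirely from the context.
```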

2023-11 · 9 min · 1744 words
[ClusT3: Information Invariant Test-Time Training 🔗](https://arxiv.org/abs/2310.12345)

ClusT3: Adapting to the Unknown with Information-Invariant Clustering

You’ve trained a state-of-the-art image classifier. It hits 95% accuracy on the test set, and you’re ready to deploy it. Then it meets the real world—grainy photos, foggy mornings, and messy camera angles—and performance plummets. Your model, brilliant in the lab, proves brittle in the wild. This is the problem of domain shift, one of the most enduring challenges in modern machine learning. Models trained in one environment (the source domain) often fail when deployed in a new, unseen one (the target domain). How do we make our models robust to these shifts—without collecting massive labeled datasets for every possible scenario? ...

2023-10 · 8 min · 1596 words
[A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts 🔗](https://arxiv.org/abs/2303.15361)

When Your Model Meets the Real World — A Deep Dive into Test-Time Adaptation

Imagine training a state-of-the-art vision model that performs flawlessly in the lab, then deploying it in the wild and watching its accuracy collapse as the lighting, sensors, or environment change. This brittle behavior—the result of distribution shift between training and test data—has pushed researchers to ask: can a model learn while it’s being used? Test-Time Adaptation (TTA) answers with a resounding “yes.” Instead of trying to build a single model that handles every possible scenario, TTA adapts a pre-trained model to unlabeled test data at inference time. It keeps the model lightweight and private (no need to ship training data), and it leverages the very data the model will encounter in production. ...
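
One widely used TTA recipe is entropy minimization over a model's normalization layers, as popularized by Tent. The sketch below is a simplified illustration of that idea under the assumption of a PyTorch classifier with BatchNorm layers; it is not the survey's own algorithm, just one representative member of the family it reviews.

```python
import torch
import torch.nn as nn

def configure_for_tta(model: nn.Module):
    """Freeze everything except BatchNorm affine parameters (a Tent-style setup)."""
    model.train()  # BatchNorm then uses the current test batch's statistics
    adapt_params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.affine:
            m.requires_grad_(True)
            adapt_params += [m.weight, m.bias]
        else:
            for p in m.parameters(recurse=False):
                p.requires_grad_(False)
    return adapt_params

@torch.enable_grad()
def adapt_on_batch(model, optimizer, x_test):
    """One adaptation step: minimize prediction entropy on an unlabeled test batch."""
    logits = model(x_test)
    log_probs = logits.log_softmax(dim=1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()

# Example usage with any pretrained classifier that contains BatchNorm layers:
# params = configure_for_tta(model)
# optimizer = torch.optim.SGD(params, lr=1e-3)
# for x_test in test_loader:            # unlabeled test batches
#     preds = adapt_on_batch(model, optimizer, x_test).argmax(dim=1)
```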

2023-03 · 13 min · 2622 words
[A Survey on the Intersections of Meta-Learning, Online Learning, and Continual Learning 🔗](https://arxiv.org/abs/2311.05241)

Untangling the Web: A Guide to Meta, Online, and Continual Learning

If you’ve spent any time in the world of deep learning, you’re familiar with the standard recipe for success: gather a massive, static dataset, shuffle it thoroughly, and train a neural network for hours or days using mini-batch stochastic gradient descent (SGD). This offline, i.i.d. (independent and identically distributed) approach has powered incredible breakthroughs, from image recognition to language translation. But let’s be honest—it’s nothing like how humans learn. We don’t get a perfectly curated dataset of our entire lives upfront. We learn sequentially, from a continuous stream of experiences. We adapt to new information without completely forgetting the old. And perhaps most impressively, we get better at learning over time. We learn how to learn. ...

2023-11 · 9 min · 1822 words

Decoding the Brain's GPS: How Similarity Shapes Our Mental Maps

From birds migrating across continents to London cab drivers navigating 26,000 streets, the ability to find one’s way through the world is one of nature’s most extraordinary feats. At the heart of spatial navigation in mammals lies a specialized group of neurons in the hippocampus known as place cells. These neurons act like an internal “You Are Here” marker, firing only when an animal occupies a specific location in its environment. ...

7 min · 1309 words
[Enhancing Chess Reinforcement Learning with Graph Representation 🔗](https://arxiv.org/abs/2410.23753)

AlphaGateau: Training Chess Engines Faster and Smarter with Graphs

In 2017, the artificial intelligence community was mesmerized by AlphaZero. Developed by DeepMind, this single algorithm taught itself to play Go, Shogi, and Chess at a superhuman level—starting from nothing but the rules. It was a monumental achievement, demonstrating the raw power of deep reinforcement learning (RL). Yet behind the triumph was a significant limitation: AlphaZero and similar models are resource-intensive and structurally inflexible. These algorithms perceive a game board as a 2D grid of pixels, much like an image, and use Convolutional Neural Networks (CNNs) to process it—the same technology that powers modern image recognition. While effective, this design has drawbacks. A CNN trained to play Go on a 19×19 board cannot seamlessly play on a smaller 13×13 board; the architecture itself is hard-wired to a specific input size. This rigidity forces researchers to retrain models from scratch every time the game, or even just the board size, changes. ...

2024-10 · 8 min · 1621 words
[Improved Regret for Bandit Convex Optimization with Delayed Feedback 🔗](https://arxiv.org/abs/2402.09152)

The Power of Patience: How Blocking Updates Can Solve Delayed Bandit Feedback

Imagine you’re running an online advertising campaign. Every time a user visits a website, you must decide which ad to show them. Your goal is to maximize clicks, but you face two major challenges. First, you don’t know beforehand which ad a user will like—you only find out whether they clicked after the ad is shown. This is called bandit feedback: you only get information about the action you took, not the others you could have taken. Second, feedback can be delayed. A user might not click immediately, and the report confirming a click might take minutes, hours, or even days to arrive. ...
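
To make the two challenges tangible, here is a toy simulation of delayed bandit feedback. The paper itself studies bandit convex optimization and a blocking update scheme; this sketch uses a simple multi-armed, epsilon-greedy setting with made-up click rates and delays, purely to show how feedback for a chosen ad arrives rounds after the decision was made.

```python
import random
from collections import defaultdict

random.seed(0)

n_arms = 3
click_prob = [0.05, 0.10, 0.02]      # true click rates, unknown to the learner
counts = [1e-3] * n_arms             # pulls per arm (tiny prior avoids division by zero)
rewards = [0.0] * n_arms             # clicks observed per arm
pending = defaultdict(list)          # round -> list of (arm, reward) arriving then

T = 10_000
for t in range(T):
    # 1. Process feedback whose delay has elapsed. Bandit feedback: only the
    #    chosen arm's reward is ever revealed, and only after the delay.
    for arm, r in pending.pop(t, []):
        counts[arm] += 1
        rewards[arm] += r

    # 2. Choose an arm (epsilon-greedy on the feedback seen so far).
    if random.random() < 0.1:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: rewards[a] / counts[a])

    # 3. The environment schedules this round's feedback to arrive later.
    reward = 1.0 if random.random() < click_prob[arm] else 0.0
    delay = random.randint(1, 50)
    pending[t + delay].append((arm, reward))

print("estimated click rates:", [round(rewards[a] / counts[a], 3) for a in range(n_arms)])
```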

2024-02 · 8 min · 1529 words
[Fair Wasserstein Coresets 🔗](https://arxiv.org/abs/2311.05436)

Distilling Fairness: How Fair Wasserstein Coresets Tackle Bias in Big Data

We live in an era of data deluge. From social media feeds to scientific sensors, we generate and collect information faster than we can process it. For machine learning practitioners, this abundance is both a blessing and a curse. While large datasets can fuel highly accurate models, they also create computational bottlenecks—training models becomes slow, costly, and sometimes infeasible. A well-known remedy for this is dataset distillation or coreset generation. The idea is intuitive: instead of training on millions of records, we “distill” them into a few thousand representative samples that preserve the essence of the original. These smaller, weighted datasets—coresets—can dramatically reduce training time without sacrificing much performance. ...
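
A coreset, at its simplest, is a small weighted dataset that downstream training treats as a stand-in for the full one. The sketch below shows that bookkeeping with a naive uniform random selection; the selection step here is only a placeholder for the fairness-aware, Wasserstein-based construction the paper proposes, and the variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A large labeled dataset we would rather not train on directly.
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# Naive coreset: a small random subset with weights chosen so the subset
# stands in for the full dataset in expectation.
m = 2_000
idx = rng.choice(len(X), size=m, replace=False)
Xc, yc = X[idx], y[idx]
w = np.full(m, len(X) / m)      # each coreset point represents len(X)/m originals

# Downstream training only needs to accept per-sample weights, e.g. a
# weighted logistic-regression loss  sum_i w_i * loss(theta; x_i, y_i).
def weighted_log_loss(theta):
    p = 1.0 / (1.0 + np.exp(-Xc @ theta))
    eps = 1e-9
    return np.sum(w * -(yc * np.log(p + eps) + (1.0 - yc) * np.log(1.0 - p + eps)))

print(weighted_log_loss(np.zeros(20)))   # loss of an untrained model on the coreset
```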

2023-11 · 9 min · 1905 words

Beyond Feedforward: Teaching Neural Networks to Remember with Recurrent Multilayer Perceptrons

Imagine trying to create a perfect digital twin of a complex chemical reactor or a power grid. These systems are governed by countless interacting physical processes—many of which are too intricate or poorly understood to be captured by neat mathematical equations. When building a model from first principles is impossible, engineers turn to a powerful alternative: system identification. The idea is straightforward. Instead of explaining how the system works internally, we build a “black box” model that simply learns to behave like the real thing. By feeding the same inputs into our model and comparing its outputs to those of the actual system, we can train it to replicate the system’s response. This empirical approach is indispensable for tasks such as monitoring system health, predicting failures, and designing adaptive controllers. ...
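
The workflow can be sketched in a few lines: excite the real system with inputs, record its outputs, and fit a model to reproduce the input-to-output mapping. In the toy example below the hidden "plant" and the least-squares fit are invented for illustration; the post's actual model is a recurrent multilayer perceptron rather than this linear stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. Excite the unknown system and record its response.
#    (Here the "system" is a toy first-order plant we pretend not to know.)
u = rng.uniform(-1, 1, size=500)                 # input signal
y = np.zeros_like(u)
for t in range(1, len(u)):
    y[t] = 0.9 * y[t - 1] + 0.3 * u[t - 1]       # hidden dynamics

# 2. Fit a black-box model y[t] ~ f(y[t-1], u[t-1]) from input/output data only.
#    A linear regressor stands in for the recurrent network described in the post.
features = np.stack([y[:-1], u[:-1]], axis=1)
targets = y[1:]
coef, *_ = np.linalg.lstsq(features, targets, rcond=None)

# 3. The identified model can now be simulated in place of the real system.
print("recovered dynamics:", coef)               # approximately [0.9, 0.3]
```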

6 min · 1278 words
[LSTM: A Search Space Odyssey 🔗](https://arxiv.org/abs/1503.04069)

The Ultimate LSTM Showdown: A Deep Dive into 'A Search Space Odyssey'

If you’ve spent any time in the world of deep learning for sequential data, you’ve undoubtedly come across the Long Short-Term Memory network—better known as LSTM. Since their introduction, LSTMs have become the workhorse for tasks ranging from speech recognition and language translation to handwriting analysis and music generation. They are renowned for capturing long-range dependencies in data—a capability that their simpler predecessors, Simple Recurrent Networks (SRNs), often lacked. But here’s a question that might surprise you: what is an LSTM, really? It turns out “LSTM” isn’t a single, rigidly defined architecture. It’s more like a family of related models. Over the years, researchers have proposed numerous tweaks and variations: adding peephole connections, removing gates, coupling gates together, and more. As a result, the field has accumulated a kind of architectural zoo—where practitioners often rely on folklore or copy designs from well-known papers without fully understanding why certain components exist. ...

2015-03 · 8 min · 1556 words

Unlocking Continual Learning: How LSTMs Learned to Forget

Recurrent Neural Networks (RNNs) are the workhorses of sequence modeling. From predicting the next word in a sentence to forecasting stock prices, their ability to maintain an internal state, or “memory,” makes them uniquely suited for tasks where context is key. However, traditional RNNs have a notoriously short memory. When faced with long sequences, they struggle with a problem called the vanishing gradient, where the influence of past events fades away too quickly during training. ...

8 min · 1616 words
[A Theoretically Grounded Application of Dropout in Recurrent Neural Networks 🔗](https://arxiv.org/abs/1512.05287)

Why Your RNNs Overfit—and How to Fix It with Bayesian Dropout

Recurrent Neural Networks (RNNs) are the workhorses of modern sequence modeling. From translating languages and powering chatbots to analyzing video streams, their ability to process information that unfolds over time has transformed machine learning. Yet, for all their power, RNNs have a notorious weakness: they tend to overfit, especially when data is limited. For years, deep learning practitioners have fought overfitting with a simple but powerful technique known as dropout. During training, a certain fraction of neuron activations is randomly “dropped”—set to zero. This prevents neurons from becoming co-dependent and forces the model to learn robust, generalizable patterns. ...

2015-12 · 7 min · 1427 words
[Recurrent Neural Network Regularization 🔗](https://arxiv.org/abs/1409.2329)

The Simple Trick That Finally Made Dropout Work for RNNs

Recurrent Neural Networks (RNNs) are the workhorses of sequence modeling. From predicting the next word in a sentence to transcribing speech and translating languages, their ability to process information sequentially has transformed countless modern applications. But like all deep neural networks, they have an Achilles’ heel: overfitting. When a model overfits, it learns the training data too well—memorizing noise and quirks instead of general patterns. It performs spectacularly on training examples but falters on unseen data. In feedforward networks, the undisputed champion against overfitting is dropout. The idea is simple yet powerful: during training, randomly “turn off” a fraction of neurons to prevent co-dependency, forcing the network to learn more robust, distributed representations. ...
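
For readers who have never seen it written out, standard (inverted) dropout is just an element-wise random mask applied during training, with the surviving activations rescaled so their expected value is unchanged. The snippet below is a generic sketch of that mechanism, not code from the paper; where such a mask can safely be applied inside a recurrent network is the "simple trick" the full post discusses.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop=0.5, training=True):
    """Inverted dropout: zero a random fraction of activations during training
    and rescale the survivors so the expected activation stays the same."""
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

h = np.ones((2, 8))                  # a batch of hidden activations
print(dropout(h, p_drop=0.5))        # roughly half the entries zeroed, the rest scaled to 2.0
```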

2014-09 · 7 min · 1390 words

The Broken Wire in Photosynthesis: Unraveling the Mystery of the Inactive D2 Branch

Photosynthesis is the single most important biological process on Earth. It’s our planet’s natural solar panel, converting sunlight into the chemical energy that fuels almost all living organisms. At the heart of this process lies a molecular machine known as Photosystem II (PSII)—the system responsible for splitting water and releasing oxygen. PSII is both beautifully symmetric and deeply puzzling. Its core, called the reaction center (RC), is constructed with near-perfect C2 symmetry and contains two potential pathways for charge transport, referred to as the D1 and D2 branches. Structurally, these two branches are mirror images. Yet, only one of them—the D1 branch—is active in shuttling electrons. The D2 branch, despite being its molecular twin, appears to be inert—a “broken wire.” ...

7 min · 1432 words

When Neural Networks Evolve Themselves: A New Model for Open-Ended Evolution

In evolution—whether biological or computational—we use models to understand how variation and selection create complex systems. But most computational models have a key limitation: their evolutionary rules are set by the programmer. The modeler determines how often mutations happen, what kinds of changes are allowed, and how they are distributed. That’s like studying a forest by only observing trees you planted yourself. What if mutation—the very engine of evolution—could emerge from within the system itself? ...

8 min · 1510 words
[Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning 🔗](https://arxiv.org/abs/2407.16920)

Train-Attention: Teaching LLMs to Focus on What Matters for Continual Learning

Large Language Models (LLMs) are astonishing knowledge systems—but they have a chronic flaw: the world keeps changing, and their internal knowledge often does not. When we attempt to teach them new facts—like a recently elected leader or a novel scientific discovery—they frequently suffer from catastrophic forgetting, where learning new information causes the loss of prior knowledge. It’s like pouring water into a full cup—the new water displaces what was already there. ...

2024-07 · 3 min · 561 words
[Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers 🔗](https://arxiv.org/abs/2510.11471)

Learning to Learn, One Batch at a Time: A Deep Dive into Iterative Amortized Inference

Imagine you’re an astrophysicist tasked with modeling the motion of objects on different planets. You could build a separate simulator for each planet—one for Earth, one for Mars, one for Jupiter. But that would be wasteful. The laws of physics are universal; only a single parameter, the local strength of gravity, varies from planet to planet. A smarter strategy is to build one general model and adapt it to each planet by estimating its gravity from a few examples. ...

2025-10 · 7 min · 1387 words
[Longhorn: State Space Models Are Amortized Online Learners 🔗](https://arxiv.org/abs/2407.14207)

Longhorn: Reimagining State Space Models as Online Learners

For years, the Transformer has been the undisputed champion of sequence modeling, powering everything from large language models like GPT to breakthroughs in scientific and multimodal AI. Yet even kings have weaknesses—Transformers struggle with efficiency. Their computational cost grows quadratically with sequence length, meaning that processing a book is vastly more expensive than processing a sentence. This limitation has become a severe bottleneck as researchers push toward models that can understand entire codebases, long conversations, or even persistent streams of sensory data. ...

2024-07 · 8 min · 1521 words
[Meta-Reinforcement Learning with Self-Modifying Networks 🔗](https://arxiv.org/abs/2202.02363)

Learning to Learn: How Self-Modifying Networks Unlock True AI Adaptability

Deep Reinforcement Learning (RL) has produced remarkable achievements—AI systems have mastered video games, navigated simulated worlds, and even rivaled human experts. Yet these success stories hide a critical weakness: specialization. Most RL agents excel only within the narrow boundaries of their training environments. Change the rules, the context, or the objective, and their performance collapses. They do not learn how to learn. Humans, in contrast, thrive in changing environments. We can adapt instantly—learning a new game in minutes, driving safely in unexpected conditions, or mastering a new gadget without instruction. This ability to abstract and transfer learning principles is one of intelligence’s defining features. The question is: how can we build machines that share this flexibility? ...

2022-02 · 8 min · 1672 words