Introduction
Imagine you are taking a difficult math exam. On the first question, you struggle, make a guess, and get it wrong. But immediately after, you are shown the correct solution. When you encounter a similar problem five questions later, you recall that solution, apply the logic, and get it right. You are learning from experience.
Now, consider how we evaluate Large Language Models (LLMs). We typically use benchmarks like MMLU or GSM8K, where models answer thousands of questions. However, in these standard evaluations, every question is treated as an isolated event. The model doesn’t “remember” solving question #1 when it tackles question #100. It doesn’t get to learn from its successes or failures during the test.
This creates a significant gap between evaluation and reality. In real-world applications—such as coding agents or chatbots—models interact continuously with their environment. They should improve over time as they accumulate history.
Researchers from NVIDIA recently published “LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks” to address this disconnect. They propose a new framework that transforms static benchmarks into dynamic, sequential challenges. By giving models a memory of their past successes, LLM-Evolve tests whether an AI can improve its performance on the fly, without updating its weights.
In this post, we will break down how LLM-Evolve works, the mathematics behind the memory mechanism, and the surprising results when state-of-the-art models are put to the test.
The Problem: Static vs. Dynamic Evaluation
To understand why LLM-Evolve is necessary, we first need to look at the limitations of current benchmarks.
The “I.I.D.” Assumption
Most LLM benchmarks operate under the assumption that tasks are Independent and Identically Distributed (i.i.d.). This means:
- Independent: The result of one task does not influence another.
- Identically Distributed: The tasks are drawn from the same general pool of difficulty and type.
In a standard evaluation loop, the model is given a prompt (often with a few fixed examples, known as “few-shot demonstrations”) and asked to generate an answer. The model is then reset, and the next question is asked. The model has no persistent memory of the session.
The Real-World Mismatch
This static approach fails to capture the “agentic” nature of modern AI. When an LLM operates as an agent, it performs actions, receives feedback (e.g., a code compiler error or a user correction), and iterates. The core capability required here is evolution—the ability to leverage past interactions to solve new problems.
The researchers behind LLM-Evolve asked a critical question: Can we adapt existing, high-quality benchmarks to test this evolving capability without creating entirely new datasets?
The LLM-Evolve Framework
The core innovation of this paper is a framework that wraps around existing benchmarks (like MMLU or GSM8K) to enable sequential learning. It introduces two key components to the standard evaluation pipeline: Feedback and Demonstration Memory.
Visualizing the Pipeline
The process is best understood visually. The model goes through multiple “rounds” of evaluation.

As shown in Figure 1, the pipeline operates in a cycle:
- Generate: The LLM produces an answer for a current problem.
- Feedback: The environment evaluates the answer (True/False).
- Memory: If the answer is correct (Positive Feedback), the input-output pair is saved to a “Demonstration Memory.”
- Retrieve: When the model faces a new problem in the next round, it queries this memory to find the most relevant past successes to use as examples.
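For concreteness, the cycle can be written as a short evaluation loop. The following is a minimal sketch under my own assumptions, not the paper's implementation: `llm_generate`, `check_answer`, and `retrieve_demos` are hypothetical stand-ins for the model call, the benchmark's grader, and the dense retriever (sketched later in this post), and round 0 here simply runs without demonstrations rather than with a benchmark's fixed few-shot examples.

```python
# Minimal sketch of the LLM-Evolve cycle (hypothetical scaffolding, not the
# paper's code). Memory collected in one round only becomes available as
# demonstrations from the next round onward.

def run_llm_evolve(benchmark, llm_generate, check_answer, retrieve_demos,
                   num_rounds=3):
    memory = []                                         # verified (question, answer) pairs

    for round_idx in range(num_rounds + 1):             # round 0 = standard setting
        new_entries, num_correct = [], 0
        for question, gold in benchmark:
            demos = retrieve_demos(memory, question)    # Retrieve (empty in round 0)
            answer = llm_generate(question, demos)      # Generate
            if check_answer(answer, gold):              # Feedback (True / False)
                new_entries.append((question, answer))  # Memory: keep successes only
                num_correct += 1
        memory.extend(new_entries)                      # usable from the next round on
        print(f"Round {round_idx}: accuracy = {num_correct / len(benchmark):.3f}")
```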
The Mathematics of Memory
Let’s break down the method mathematically to see exactly how the prompt changes.
1. Standard Benchmark Setting (Round 0)
In a standard setting, the model \(p_{\theta}\) predicts output \(y\) based on an input \(x\). It is often guided by a fixed set of demonstrations provided by the benchmark developers.
\[
y \sim p_{\theta}\!\left(y \mid \{x_i^{demo}, y_i^{demo}\}_{i=1}^{k},\ x\right)
\]
Here, \(\{x_i^{demo}, y_i^{demo}\}\) represents that static set of examples. They never change, regardless of the input \(x\) or the model’s history.
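To make the contrast concrete, a static Round-0 prompt might be assembled as follows. This is a toy illustration; the demonstrations and formatting are invented, not taken from any particular benchmark.

```python
# Toy illustration of the standard (static) setting: the same fixed
# demonstrations are prepended to every question, regardless of history.

FIXED_DEMOS = [
    ("Q: What is 2 + 2?", "A: 4"),
    ("Q: What is 7 * 6?", "A: 42"),
]

def build_static_prompt(question: str) -> str:
    demo_block = "\n\n".join(f"{q}\n{a}" for q, a in FIXED_DEMOS)
    return f"{demo_block}\n\nQ: {question}\nA:"

# Every call produces the same demonstrations, no matter the question:
# print(build_static_prompt("What is 9 * 9?"))
```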
2. Building the Demonstration Memory
In LLM-Evolve, the system maintains a memory bank, denoted as \(\mathcal{D}\). This memory stores the history of interactions.
\[
\mathcal{D} = \{(x_i^{lm},\, y_i^{lm},\, f_i)\}_{i=1}^{N}
\]
Each entry in the memory consists of:
- \(x_i^{lm}\): The problem the model faced.
- \(y_i^{lm}\): The answer the model generated.
- \(f_i\): The binary feedback (True/False).
In this specific study, the researchers configured the system to save only positive experiences. If the model gets a question wrong, that attempt is discarded; if it gets it right, the success is stored to help with future problems.
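In code, a memory entry can be modeled as a small record holding exactly these three fields. The structure below is my own sketch, not the paper's data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemoryEntry:
    x_lm: str        # the problem the model faced
    y_lm: str        # the answer the model generated
    feedback: bool   # binary environment feedback (True / False)

def store(memory: List[MemoryEntry], question: str, answer: str, feedback: bool) -> None:
    # As configured in this study: discard failures, keep only successes.
    if feedback:
        memory.append(MemoryEntry(x_lm=question, y_lm=answer, feedback=True))
```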
3. Retrieval-Augmented Inference
This is the most critical step. In subsequent rounds (Round 1, 2, etc.), the model no longer relies on the generic, fixed examples. Instead, it looks into its own memory \(\mathcal{D}\).
The system uses a Dense Retriever (denoted as \(r_{\phi}\)), such as Contriever or BERT. It converts the new question \(x\) into a vector and searches the memory for the “nearest neighbors”—past problems \(x_j^{lm}\) that are semantically similar to the current one.
\[
\{(x_j^{lm},\, y_j^{lm})\}_{j=1}^{k} \;=\; \operatorname*{top-k}_{(x_j^{lm},\, y_j^{lm},\, f_j)\,\in\,\mathcal{D}} \; \operatorname{sim}\!\left(r_{\phi}(x),\, r_{\phi}(x_j^{lm})\right)
\]
The top-k most similar successful examples are retrieved and inserted into the prompt. This creates a dynamic few-shot prompt tailored specifically to the current problem, consisting of solutions the model itself has previously verified as correct.
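In practice, the retrieval step reduces to nearest-neighbor search over embeddings. Below is a minimal sketch assuming an `encode` callable that maps a string to a vector, standing in for \(r_{\phi}\) (e.g. a Contriever or BERT encoder); the function names are mine, not the paper's.

```python
import numpy as np

def retrieve_top_k(memory_entries, memory_embeddings, question, encode, k=4):
    """Return the k stored successes most similar to the current question.

    `encode` stands in for the dense retriever r_phi; `memory_embeddings`
    holds one vector per entry in `memory_entries`.
    """
    if not memory_entries:
        return []
    query = np.asarray(encode(question))
    embs = np.asarray(memory_embeddings)
    # Cosine similarity between the new question and every stored problem.
    sims = embs @ query / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [memory_entries[i] for i in top]
```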
4. Handling Multi-Turn Conversations
The framework isn’t limited to simple Q&A. For benchmarks like AgentBench, where solving a task involves a back-and-forth conversation (e.g., using a Linux terminal), the memory stores the entire trajectory.
\[
\mathcal{D} \;=\; \left\{\left(\{(x_i^{lm,t},\, y_i^{lm,t})\}_{t=1}^{T_i},\ f_i\right)\right\}_{i=1}^{N}
\]
This allows the model to recall complex workflows it has successfully executed in the past.
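For multi-turn tasks, the stored unit is therefore the whole interaction rather than a single question-answer pair. Here is a minimal sketch with hypothetical structures and an invented example task:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TrajectoryEntry:
    task: str                                                   # task description, used for retrieval
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (environment, model) messages

# Invented example of a successful OS-style task stored as a full trajectory.
entry = TrajectoryEntry(
    task="Count the files in /tmp",
    turns=[
        ("You have access to a bash terminal.", "ls /tmp | wc -l"),
        ("Output: 42", "There are 42 files in /tmp."),
    ],
)
```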
Experiments and Results
The researchers applied LLM-Evolve to three distinct benchmarks using a variety of open-source (Llama, Mistral, Qwen) and closed-source (GPT-3.5, GPT-4) models.
1. General Knowledge (MMLU)
MMLU is a massive benchmark covering 57 subjects, from elementary math to professional law.

Key Takeaways from Table 1:
- Consistent Gains: Every single model improved when allowed to learn from experience.
- The “First Round” Jump: The biggest improvement happens between the Standard setting and Round 1. Subsequent rounds (2 and 3) offer diminishing returns.
- Model Size Correlation: Interestingly, larger models benefit less. Llama2-7B gained nearly 6%, while the massive Qwen2-72B only gained 1.33%.
- Why? The authors hypothesize that larger models already store extensive problem-solving knowledge in their weights. Smaller models rely more heavily on the context provided in the prompt to “guide” them to the right answer.
2. Complex Reasoning (AgentBench & GSM8K)
The results become even more dramatic when the tasks require multi-step reasoning or tool use, rather than just knowledge retrieval.
AgentBench (OS Tasks) involves writing Linux scripts to solve operating system problems.

Key Takeaways from Table 2:
- Massive Gains: The gains here are much higher than on MMLU (up to 12.5%).
- Beating the Giant: Look at Llama3-70B. Its standard score (32.6%) is far below GPT-4’s standard score (43.8%). However, after 3 rounds of LLM-Evolve, Llama3-70B reaches 45.1%, surpassing the baseline GPT-4. This proves that a weaker model with a good memory mechanism can outperform a stronger model without one.
GSM8K evaluates grade-school math word problems.

Key Takeaways from Table 3:
- Highest Improvement: This benchmark saw the largest gains, with Llama3-8B improving by over 17%.
- Reasoning requires Context: Math problems rely heavily on following a correct process. Retrieving a similar, correctly solved math problem provides a template for the logic required, which is incredibly valuable for the model.
What Components Matter Most?
The researchers performed an ablation study (removing parts of the system to see what breaks) using Llama3-8B on MMLU.

Insights from Table 4:
- Retriever Choice: Switching from Contriever to BERT had almost no impact. As long as you have a decent retriever, the system works.
- Feedback Quality is Critical: When the researchers replaced “Ground Truth” feedback with feedback generated by another LLM (Llama-70B), performance dropped by about 2%. The memory is only as good as the accuracy of the examples stored in it. If the model “hallucinates” a success and stores it, it poisons its own well.
- Retrieval is Essential: Selecting stored examples at random (instead of retrieving the most relevant ones) still provided some benefit, but significantly less than the full retrieval approach.
Conclusion and Implications
The LLM-Evolve framework demonstrates that Large Language Models are not static entities; they possess a latent “evolving capability” that is often ignored by standard benchmarks.
By simply allowing a model to remember its successes, and to retrieve them when facing similar new problems, we can unlock performance gains of up to 17% without a single gradient update.
Why does this matter for the future?
- Deployment Strategies: It suggests that for real-world applications, building a dynamic “demonstration memory” (a database of successful user interactions) is a cost-effective way to improve performance over time, perhaps even more effective than constant fine-tuning.
- Evaluation Standards: We need to move beyond i.i.d. benchmarks. To truly judge an agent, we must judge how well it learns.
- Small vs. Large Models: The findings suggest that smaller, more efficient models equipped with a robust memory system could compete with much larger models, making AI more accessible and cheaper to run.
LLM-Evolve bridges the gap between static testing and dynamic reality, offering a glimpse into how AI agents might learn and adapt in the wild.