“To err is human.” But as it turns out, to err is also a fundamental property of Large Language Models (LLMs).

When we try to teach an LLM a new task using In-Context Learning (ICL), we typically show it a few “gold standard” examples—perfect questions paired with perfect answers. This is like teaching a student math by only showing them the answer key. While effective, it misses a crucial component of the learning process: analyzing mistakes.

Just as a human student learns more from understanding why they failed a calculus problem than from seeing the correct solution, LLMs can improve significantly by understanding their own error patterns. Recent research has attempted to capitalize on this by generating “principles” from mistakes—rules of thumb to guide the model. However, these methods often suffer from being too generic (applying the same rules to every problem) or too narrow (failing to cover the wide variety of possible errors).

In this post, we dive into a new framework called Retrieved In-Context Principles (RICP). This method proposes a dynamic Teacher-Student architecture where models don’t just learn from correct answers—they actively retrieve specific, customized principles derived from past failures to ensure they don’t make the same mistake twice.

The Problem: Generic Advice for Specific Problems

Standard In-Context Learning (ICL) relies on positive reinforcement: “Here is how you solve this.”

Recent advancements have tried to introduce negative feedback. Methods like LEAP or TPD use a “Teacher” model to analyze a “Student” model’s mistakes and write a list of principles (e.g., “Always double-check your arithmetic”). The Student then uses these principles during inference.

While this is a step in the right direction, it introduces two major limitations:

  1. Lack of Customization: These methods often feed the same fixed set of principles into the prompt for every single question. If a question requires logical deduction, telling the model to “check arithmetic” is noise that distracts it and wastes context-window space.
  2. Inadequate Error Coverage: Because the principles are fixed, they can only cover a small subset of general errors. They fail to address the long tail of specific, nuanced mistakes that occur in complex reasoning tasks.

RICP solves this by making the principles dynamic. Instead of a static list of rules, RICP retrieves the specific advice relevant to the current question, drawn from a vast library of categorized mistakes.

The RICP Architecture

At its core, RICP is a pipeline that transforms raw error data into actionable, retrieved wisdom. It operates using a Teacher-Student framework:

  • The Student (e.g., GPT-3.5 or Qwen): The model we want to improve.
  • The Teacher (e.g., GPT-4): A stronger model that acts as a critic and guide.

The workflow is circular and iterative. The Teacher watches the Student fail, writes a “lesson plan” (principles), and the Student uses that plan to succeed on new data.

Figure 1: Inference pipeline of RICP showing the loop between Student and Teacher models.

As shown in Figure 1 above, the process begins with training data. When the Student gets a training question wrong, the Teacher analyzes the mistake to generate principles. Later, at inference time, when a new question arrives, the relevant principles are retrieved to guide the Student toward the correct answer (the checkmark).

Let’s break this down into the three distinct stages of the methodology.

Stage 1: Insight Generation

The first step is the “post-mortem.” We need to generate a corpus of mistakes and the lessons learned from them.

We take a training dataset (\(D_{train}\)) containing questions (\(x\)) and correct answers (\(y\)). We ask the Student model to solve these questions. Naturally, it will get some wrong. We collect these failures into a “Negative Dataset” (\(D_{neg}\)).

\[ D_{neg} = \{\, (x,\ \hat{r},\ y) \mid (x, y) \in D_{train},\ \hat{y} \neq y \,\} \]

Here, \(\hat{r}\) represents the incorrect rationale (reasoning) the Student generated, and \(\hat{y}\) denotes its predicted final answer.
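
To make the bookkeeping concrete, here is a minimal Python sketch of this collection step. The `student_solve` helper is hypothetical; it stands in for whatever call returns the Student's rationale and final answer.

```python
# A minimal sketch of collecting the Negative Dataset.
# `student_solve` is a hypothetical helper that runs the Student model
# and returns (rationale, predicted_answer); swap in your own inference call.

def build_negative_dataset(train_set, student_solve):
    """Keep only the examples the Student gets wrong."""
    d_neg = []
    for question, gold_answer in train_set:
        rationale, predicted = student_solve(question)  # Student attempts the question
        if predicted != gold_answer:                    # keep only the failures
            d_neg.append({
                "question": question,     # x
                "rationale": rationale,   # \hat{r}: the flawed reasoning
                "prediction": predicted,  # \hat{y}: the wrong answer
                "answer": gold_answer,    # y: the ground truth
            })
    return d_neg
```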

Next, the Teacher model looks at each mistake. It is provided with the question, the Student’s wrong reasoning, and the correct answer. The Teacher is asked to produce two things:

  1. Reason (\(R\)): A high-level categorization of the error (e.g., “Calculation Error” or “Misinterpretation of Context”).
  2. Insight (\(I\)): Specific, detailed advice on how to avoid this error in the future.

This creates an “Insight Corpus”:

\[ C = \{\, (x,\ R,\ I) \mid (x,\ \hat{r},\ y) \in D_{neg} \,\} \]
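
A minimal sketch of this critique step is shown below, assuming an OpenAI-style chat client. The prompt wording is illustrative rather than the paper's exact template, and the naive reply parser is an assumption made to keep the example self-contained.

```python
# A sketch of the Teacher critique loop. The prompt template and the
# parser are illustrative; the paper's exact prompt may differ.
from openai import OpenAI

CRITIQUE_TEMPLATE = """You are a teacher reviewing a student's mistake.
Question: {question}
Student's reasoning: {rationale}
Correct answer: {answer}

1. Reason: a short, high-level category for the error.
2. Insight: specific advice on how to avoid this error in the future."""

def parse_reason_and_insight(reply: str):
    """Naive split of the Teacher's reply into its Reason and Insight lines."""
    reason = insight = ""
    for line in reply.splitlines():
        if line.lower().startswith("1. reason"):
            reason = line.split(":", 1)[-1].strip()
        elif line.lower().startswith("2. insight"):
            insight = line.split(":", 1)[-1].strip()
    return reason, insight

def build_insight_corpus(d_neg, model="gpt-4"):
    client = OpenAI()
    corpus = []
    for item in d_neg:
        prompt = CRITIQUE_TEMPLATE.format(
            question=item["question"],
            rationale=item["rationale"],
            answer=item["answer"],
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        reason, insight = parse_reason_and_insight(reply)
        corpus.append({"question": item["question"], "reason": reason, "insight": insight})
    return corpus
```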

Stage 2: Principle Formulation

Now we have a large pile of specific mistakes and insights. Simply dumping all of these into a prompt would be impossible due to context limits. We need to organize this knowledge.

RICP employs Hierarchical Clustering to organize these insights into two levels of abstraction: Task-Level and Question-Level.

Task-Level Principles

These are broad, universal rules that apply to the general type of task (e.g., “In math word problems, always verify units”).

To find these, the researchers cluster the high-level Reasons (\(R\)) generated in the previous step. They embed the text of the reasons and use K-means clustering.

\[ \{C_1^{R}, \dots, C_k^{R}\} = \text{K-means}\big(\{\, E(R) \mid (x, R, I) \in C \,\}\big) \]

where \(E(\cdot)\) denotes the text-embedding function.

By grouping similar reasons, the system identifies common pitfalls across the entire dataset. The Teacher then summarizes each cluster into a general principle.
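
Here is a sketch of what this clustering might look like in code, using sentence-transformers and scikit-learn as stand-in tooling (the paper does not prescribe specific libraries), with `summarize_cluster`, the call that asks the Teacher to condense a cluster into one principle, left as a caller-supplied hypothetical.

```python
# Task-level principle formulation: embed each Reason, cluster with K-means,
# then have the Teacher condense each cluster into one general principle.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def task_level_principles(corpus, summarize_cluster, k=5):
    reasons = [item["reason"] for item in corpus]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedder works
    embeddings = embedder.encode(reasons)
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)

    principles = []
    for cluster_id in range(k):
        members = [r for r, label in zip(reasons, labels) if label == cluster_id]
        principles.append(summarize_cluster(members))  # hypothetical Teacher call
    return principles
```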

Question-Level Principles

These are specific tips tailored to the semantic content of the question (e.g., “When calculating the area of a circle, remember to square the radius”).

To organize these, the researchers first retrieve questions similar to the current target question. However, the insights for these similar questions might be repetitive. To solve this, they perform K-means clustering on the Insights (\(I\)) themselves.

\[ \{C_1^{I}, \dots, C_n^{I}\} = \text{K-means}\big(\{\, E(I) \mid I \in \mathcal{I}_{sim} \,\}\big) \]

where \(\mathcal{I}_{sim}\) is the set of Insights attached to the retrieved similar questions.

This ensures that the advice retrieved is diverse and covers different aspects of the problem, rather than repeating the same advice five times.
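
A small sketch of this de-duplication idea follows, reusing the same embedder as above. Keeping the insight closest to each cluster centre is one simple way to enforce diversity; it is an implementation choice, not something prescribed by the paper.

```python
# Cluster the candidate insights and keep one representative per cluster,
# so the retrieved advice stays diverse instead of repeating itself.
import numpy as np
from sklearn.cluster import KMeans

def diverse_insights(insights, embedder, n=3):
    embeddings = embedder.encode(insights)
    n_clusters = min(n, len(insights))
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
    picked = []
    for centre in km.cluster_centers_:
        # representative = the insight nearest this cluster centre
        idx = int(np.argmin(np.linalg.norm(embeddings - centre, axis=1)))
        picked.append(insights[idx])
    return picked
```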

The entire flow, from the initial mistake to the formulation of hierarchical principles, is visualized below:

The 3-stage pipeline of RICP: Insight Generation, Principle Formulation, and Principle Utilization.

Stage 3: Principle Utilization

This is the inference stage—the “test.”

When a new question comes in:

  1. Task-Level Retrieval: The system injects the high-level principles relevant to the task type.
  2. Question-Level Retrieval: The system retrieves the \(m\) most similar previous questions from the database. It then clusters the insights associated with those questions and samples \(n\) distinct insights.
  3. Prompt Construction: The original question is combined with these retrieved principles (“Task-Level Principles” + “Question-Level Principles”) to form an enhanced prompt.

Crucially, the Teacher model is not needed during this inference stage. The principles have already been generated and stored. The retrieval is a lightweight process, making this method computationally efficient compared to methods that require a Teacher to intervene in real-time.
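
Putting the pieces together, a minimal inference-time sketch might look like this. Cosine similarity over normalized embeddings and the exact prompt layout are assumptions on my part, and `diverse_insights` is the helper sketched above.

```python
# Stage 3 sketch: no Teacher call needed. Retrieve the m most similar stored
# questions, sample n diverse insights from them, and build the enhanced prompt.
import numpy as np

def build_enhanced_prompt(question, corpus, task_principles, embedder, m=8, n=3):
    # 1. Question-level retrieval: find the m most similar stored questions.
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    stored = embedder.encode(
        [item["question"] for item in corpus], normalize_embeddings=True
    )
    top_m = np.argsort(stored @ q_emb)[-m:]

    # 2. Cluster their insights and keep n distinct ones.
    neighbour_insights = [corpus[i]["insight"] for i in top_m]
    question_principles = diverse_insights(neighbour_insights, embedder, n=n)

    # 3. Prompt construction: task-level + question-level principles + question.
    return "\n".join([
        "Task-Level Principles:", *task_principles,
        "Question-Level Principles:", *question_principles,
        "Question: " + question,
    ])
```

In practice the stored question embeddings would be computed once and cached; they are re-embedded here only to keep the sketch short.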

Experimental Results

The researchers evaluated RICP across seven benchmarks covering three distinct domains:

  1. Mathematical Reasoning: GSM8K, SVAMP, MathQA, AQuA.
  2. Commonsense Reasoning: CSQA, StrategyQA.
  3. Logical Reasoning: LogiQA.

They applied RICP on top of various prompting strategies, such as Standard Prompting, Zero-shot Chain-of-Thought (CoT), and Auto-CoT.

Performance Comparison

The results were consistent: RICP improved performance across the board.

Table 1: Performance of different models on mathematical reasoning with and without RICP.

As seen in Table 1, applying RICP (labeled “Ours”) yielded significant improvements. For example, on the challenging AQuA dataset using GPT-3.5-Turbo, the Zero-shot CoT accuracy jumped from 31.07% to 38.08%—a relative improvement of over 22%.

The method showed particularly strong results for “Zero-shot” baselines. This makes sense: methods like Few-shot CoT already include examples in the prompt, which helps guide the model. Zero-shot methods have no examples, so the addition of RICP principles provides much-needed guidance where there was previously none.

Ablation Studies: Do we need all the parts?

The researchers performed an ablation study to verify if both “Task-Level” and “Question-Level” principles were necessary, and if the clustering mechanism actually helped.

Figure 3: Ablation study showing performance drops when components like Task-Level Principles (TP), Question-Level Principles (QP), or Hierarchical Clustering (HC) are removed.

Figure 3 illustrates the findings:

  • w/o QP (Without Question-Level Principles): Removing the specific, retrieved advice caused a sharp performance drop (represented by the orange line). This confirms that generic advice isn’t enough; the model needs tips specific to the question at hand.
  • w/o TP (Without Task-Level Principles): Removing general advice (green line) also hurt performance, though sometimes less severely than removing QP.
  • w/o HC (Without Hierarchical Clustering): Removing the clustering step (blue line) significantly degraded results. This highlights that simply retrieving raw insights isn’t effective—they need to be organized and de-duplicated to be useful.

Customization is Key

One of the central claims of RICP is that “customized” retrieval is better than just randomly selecting principles. The researchers validated this by comparing their method against a baseline where principles were selected randomly from the pool.

Figure 5: Comparison showing customized retrieval significantly outperforming random selection.

The blue bars in Figure 5 represent “Ours Win”—instances where RICP outperformed random selection. Across benchmarks like GSM8K and AQuA, customized retrieval won the vast majority of the time. In fact, using random principles can sometimes be worse than using no principles at all, as irrelevant advice confuses the model.

Understanding the Errors

Finally, it is interesting to look at what kind of errors the model actually fixes.

Pie charts showing error type distribution for Commonsense and Mathematical reasoning.

In Mathematical Reasoning (Chart b), logical errors and calculation errors dominate. RICP helps significantly here by retrieving principles like “check your arithmetic” or “pay attention to unit conversion.” In Commonsense reasoning (Chart a), context errors are more prevalent.

Conclusion

The “Retrieved In-Context Principles” (RICP) framework represents a maturation in how we approach In-Context Learning. It moves beyond simply showing models what “good” looks like, and starts teaching them how to avoid “bad.”

By automating the role of a teacher—analyzing mistakes, categorizing them, and retrieving them only when relevant—RICP provides a way to customize LLM behavior without the massive computational cost of fine-tuning or the latency of human-in-the-loop intervention.

For students and practitioners in NLP, RICP highlights a vital lesson: Data curation matters. It’s not just about having principles; it’s about organizing them (clustering) and delivering them at the right time (retrieval). As models continue to evolve, systems that can self-reflect and learn from their own history of errors will likely become standard in the quest for robust AI reasoning.