Large Language Models (LLMs) have revolutionized how we approach learning new tasks. With in-context learning (ICL), they can absorb a pattern from just a few examples in a prompt and apply that pattern to new cases. Show GPT-4 several French-to-English translations, and it can translate a new French sentence—even without retraining. This ability to generalize quickly has made LLMs powerful and flexible.

But there’s a catch. When the task becomes more structured and fine-grained, such as Relation Extraction (RE), LLMs tend to falter. RE asks models to identify relationships between entities in text—like detecting that in “Steve Jobs co-founded Apple in Cupertino”, the relation between “Steve Jobs” and “Apple” is co-founded. These relationships power knowledge graphs, search engines, and intelligent QA systems. You might expect LLMs to handle this seamlessly, yet they struggle, especially under zero-shot (no examples) or few-shot (just a few examples) settings.

The paper “Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors” tackles this challenge head-on. Instead of focusing on perfecting prompts or picking better examples, the authors ask a deeper question:

What if we could fundamentally improve an LLM’s ability to perform in-context learning for relation extraction?

Their answer is MICRE (Meta In-Context learning of LLMs for Relation Extraction). The idea is simple yet transformative: rather than training an LLM to perform a single RE task, train it on how to learn new RE tasks from examples. Essentially, MICRE teaches the model how to learn in context.

Once trained this way, the model becomes highly effective in both zero- and few-shot RE scenarios, capable of understanding new relation setups without any task-specific fine-tuning.


Understanding Relation Extraction and Its Challenges

Relation extraction tasks generally fall into two categories:

  1. Relation Classification (RC): Given a sentence and an entity pair, classify the relationship between them. Example: Sentence: “Annabeth is an English name derived from Anna and Elizabeth.” Entities: (English, Elizabeth) → Relation: language_of_work_or_name

  2. Relational Triple Extraction (RTE): Given a sentence, extract all possible triples (subject, relation, object). Example: Sentence: “The Natra river is a tributary of the Lisava river in Romania.” Extracted triple: (Natra river, tributary, Lisava river)
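
As a quick mental model, the two tasks can be viewed as function signatures. The sketch below is purely illustrative shorthand (the type and function names are not from the paper):

```python
from typing import List, Tuple

# (subject, relation, object)
Triple = Tuple[str, str, str]

def relation_classification(sentence: str, subject: str, obj: str) -> str:
    """RC: given a sentence and an entity pair, return the relation label."""
    ...

def relational_triple_extraction(sentence: str) -> List[Triple]:
    """RTE: given a sentence, return every (subject, relation, object) triple it expresses."""
    ...
```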

Traditional supervised models can handle these tasks well—but only if given large annotated datasets. This dependency makes them costly and inflexible for new domains where relations differ.

LLMs promised a path out: with ICL, just show them a few examples and skip the retraining. However, RE’s structured nature poses difficulties. LLMs often miss subtle semantics or entity boundaries, resulting in weak generalization in zero- or few-shot setups. Previous studies tried to fix this with clever prompt formats or curated demonstrations, but these only patch the symptoms. MICRE targets the root cause—the LLM’s inherent ability to learn how to learn.


How MICRE Teaches LLMs to Learn

MICRE borrows its foundation from meta-learning—the concept of “learning to learn.” Instead of teaching a model how to solve one dataset, it teaches it how to infer the right patterns from examples across many datasets. Through repeated exposure to varied RE tasks, the LLM internalizes the process of in-context learning.

Figure 1: Overview of the MICRE meta-training workflow: examples are sampled from multiple datasets, formatted with tabular prompting, and the LLM is trained to predict the final output in an in-context learning setup. The output elements predicted by the LLM are highlighted in red.

Here’s how MICRE’s learning process unfolds:

  1. Collect Diverse Datasets: The model starts by accessing a collection of relation extraction datasets covering different domains—news, science, biomedicine, general NLP, and more.

  2. Sample a Task: During each training cycle, MICRE randomly picks one dataset. This forces the model to handle various styles and relation schemas, improving adaptability.

  3. Build In-Context Examples: From that dataset, it samples k+1 examples. The first k serve as demonstrations—the “context”—and the final one acts as the query.

  4. Set the Objective: The LLM receives all k demonstrations (input-output pairs) followed by the query input. It must predict the correct output (y_{k+1}) of the final example. The loss is computed by comparing its prediction with the true label.
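
Written out, and assuming a standard cross-entropy loss over the target tokens (the paper's exact notation may differ slightly), the objective for one sampled task is to minimize

\mathcal{L}(\theta) = -\log P_\theta\left(y_{k+1} \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1}\right)

where (x_i, y_i) are the k demonstrations and x_{k+1} is the query input.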

By repeating this procedure over thousands of batches and datasets, MICRE doesn’t just memorize relations—it learns the skill of recognizing and applying relationship logic within new contexts. At inference time, it uses this learned ability to reason effectively from just a few examples or even none.
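
To make the loop concrete, here is a minimal PyTorch-style sketch of one meta-training step. It follows steps 1–4 above under stated assumptions: `datasets`, `model`, `tokenizer`, `optimizer`, and the example fields are hypothetical stand-ins, not the authors' released implementation.

```python
import random

def format_example(ex, with_output=True):
    # One tabular-prompt block per example (exact format is an assumption; see next section).
    row = f'| {ex["relation"]} | {ex["subject"]} | {ex["object"]} |' if with_output else "|"
    return f'Sentence: {ex["sentence"]}\n| Predicate | Subject | Object |\n{row}\n'

def micre_training_step(datasets, model, tokenizer, optimizer, k=16):
    # Step 2 - sample a task: pick one RE dataset at random for this update.
    dataset = random.choice(datasets)

    # Step 3 - build in-context examples: k demonstrations plus one query.
    examples = random.sample(dataset, k + 1)
    demos, query = examples[:k], examples[k]

    # Step 4 - concatenate demonstrations (input + output) and the query input;
    # the model must generate the query's output y_{k+1}.
    prompt = "".join(format_example(ex, with_output=True) for ex in demos)
    prompt += format_example(query, with_output=False)
    target = f' {query["relation"]} | {query["subject"]} | {query["object"]} |'

    # Standard causal-LM cross-entropy, applied to the target tokens only.
    inputs = tokenizer(prompt + target, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100  # -100 = ignored by the loss

    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```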


Tabular Prompting: Unifying Structured Learning

A crucial design choice in MICRE is tabular prompting, a simple yet effective format that provides structure and clarity.

Instead of raw text formats, examples are presented in a miniature table, such as:

```
| Predicate | Subject | Object |
| co-founded | Steve Jobs | Apple |
```

This format helps in two ways:

  • Unified Representation: The same table structure supports both RC and RTE tasks, eliminating inconsistencies across datasets.
  • Guided Output Generation: In zero-shot settings, even without examples, the header communicates to the model what form its answer should take.

During training, MICRE alternates between two header orders—|Predicate|Subject|Object| and |Subject|Object|Predicate|—to make the model more robust and adaptable to different task styles.
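
A small sketch of how such a prompt might be assembled, assuming each demonstration is a dict with "sentence", "subject", "relation", and "object" keys (the function name and exact spacing are illustrative, not taken from the paper):

```python
import random

HEADER_ORDERS = [("Predicate", "Subject", "Object"),
                 ("Subject", "Object", "Predicate")]

def build_tabular_prompt(demonstrations, query_sentence, header=None):
    """Render demonstrations as filled table rows and the query as an empty one."""
    header = header or random.choice(HEADER_ORDERS)  # alternate header orders during training
    header_line = "| " + " | ".join(header) + " |"
    lines = []
    for d in demonstrations:
        values = {"Predicate": d["relation"], "Subject": d["subject"], "Object": d["object"]}
        lines += [f'Sentence: {d["sentence"]}',
                  header_line,
                  "| " + " | ".join(values[h] for h in header) + " |"]
    # Query: the sentence plus the bare header, which tells the model what to produce,
    # even in the zero-shot case with no demonstrations at all.
    lines += [f"Sentence: {query_sentence}", header_line]
    return "\n".join(lines)
```

In the zero-shot case, `demonstrations` is simply an empty list: the header row alone constrains the shape of the output.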


Inference: Putting MICRE to Work

Once meta-trained, the model becomes a general relation extraction learner. Unlike fine-tuned models, it doesn’t change its parameters during testing—only its prompts.

  • Few-Shot Inference: Provide a handful of examples (say, five relations) formatted in MICRE’s table style. Add your new test sentence at the end, and the model predicts the right relation or triple.

  • Zero-Shot Inference: When no examples are provided, MICRE iterates through each possible relation label in the target dataset. For each candidate relation r, it generates potential subjects and objects, selecting the combination with the highest conditional probability. This process allows the model to choose the most semantically appropriate relation—even in completely novel settings.
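
A rough sketch of what this candidate scoring could look like with a Hugging Face causal language model; the function names, prompt layout, and scoring details are assumptions rather than the paper's exact procedure:

```python
import torch

def completion_logprob(model, tokenizer, prompt, completion):
    """Total log-probability of `completion` given `prompt` under a causal LM."""
    enc = tokenizer(prompt + completion, return_tensors="pt")
    labels = enc["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100                    # score the completion tokens only
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss      # mean NLL per scored token
    return -loss.item() * int((labels != -100).sum())

def zero_shot_triple(model, tokenizer, sentence, candidate_relations):
    """For each candidate relation, let the model fill in a subject/object row,
    then keep the relation whose completed row is most likely."""
    best = None
    for relation in candidate_relations:
        prompt = (f"Sentence: {sentence}\n"
                  f"| Predicate | Subject | Object |\n"
                  f"| {relation} |")
        enc = tokenizer(prompt, return_tensors="pt")
        generated = model.generate(**enc, max_new_tokens=32)
        row = tokenizer.decode(generated[0, enc["input_ids"].shape[1]:],
                               skip_special_tokens=True)
        score = completion_logprob(model, tokenizer, prompt, row)
        if best is None or score > best[0]:
            best = (score, relation, row)
    return best  # (score, relation, generated subject/object row)
```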


The Experiments: MICRE vs. the World

MICRE was evaluated using open-source LLMs including GPT-2, T5, and LLaMA across 12 public RE datasets spanning diverse domains.

Figure 2: Statistics of the 12 public datasets used during MICRE’s meta-training phase.

The model was tested on unseen benchmarks FewRel and Wiki-ZSL, ensuring no overlap in relation labels between training and test sets.

Zero-Shot Relation Classification (RC)

MICRE shows impressive gains in zero-shot RC—especially with larger models.

Figure 3: Zero-shot RC results on Wiki-ZSL and FewRel. MICRE with LLaMA (7B) achieves state-of-the-art or competitive performance, especially as the number of unseen relations (m) increases; larger model scales lead to stronger in-context performance.

Key observations:

  • Performance improves dramatically with model scale.
  • Encoder-decoder architectures (like T5) often outperform decoder-only ones (like GPT-2) at similar sizes.
  • MICRE with LLaMA achieves the best or second-best recall in most cases, showing superior generalization.

Zero-Shot Relational Triple Extraction (RTE)

Extracting structured triples is harder than simple classification. Yet MICRE shines here as well.

Figure 4: Zero-shot RTE results. MICRE with larger models such as T5-3B and LLaMA-7B surpasses previous state-of-the-art approaches by a large margin.

Large models like T5-3B and LLaMA outperform previous best systems (e.g., ZETT) by significant margins, sometimes over 9 accuracy points, confirming that meta-training equips LLMs to understand complex relations effectively.


Few-Shot Mastery

When provided with just a few examples, MICRE’s advantages really take flight.

Figure 5: Few-shot RC and RTE results on the FewRel dataset. MICRE delivers top-tier performance, rivaling fine-tuned models despite minimal context.

Highlights:

  • On FewRel, MICRE with LLaMA scores an average F1 of 95.12 in RC—comparable to (and sometimes better than) the best fine-tuned competitors.
  • For RTE, MICRE achieves 58.29 average F1, exceeding the previous best by over 4 points.
  • Remarkably, MICRE achieves these results without task-specific training on the test datasets—just in-context reasoning powered by meta-learned behavior.

Why MICRE Works: Insights from Ablation Studies

To unravel MICRE’s success, the authors conducted experiments varying the number of examples and datasets used in meta-training.

Figure 6: Ablation results on the number of in-context training examples (k) and meta-training datasets (C). Performance generally improves with more examples and more datasets, with some saturation; the lower panel shows cases used for error analysis.

Key takeaways:

  • More in-context examples help—up to around k = 16, after which performance stabilizes (models can only process limited context windows efficiently).
  • Diverse training datasets improve generalization—exposure to more domains boosts low-shot performance, but the choice of datasets still matters.
  • Semantic labels matter. Replacing relation names with meaningless tokens (like R1, R2) caused steep drops, proving that MICRE learns from both structure and semantics.

Figure 7: When relation names are replaced with generic tokens (e.g., R1, R2), performance falls sharply, indicating that MICRE relies on the semantic link between meaningful labels and textual entities.


What MICRE Reveals About Learning

Through experiments and error analysis, the authors discovered that MICRE occasionally misclassifies or reverses relation direction—for example, confusing tributary with mouth_of_the_watercourse due to reciprocal semantics. Despite such challenges, MICRE’s predictions often remain semantically plausible, hinting that it genuinely understands relational context rather than merely memorizing patterns.

Error examples demonstrate that some mistakes arise from noisy datasets or overlapping relations, like predicting publisher instead of tracklist when multiple valid relations coexist. These cases indicate that MICRE can infer multiple potential relationships from a sentence—a promising feature for future multi-relation extraction research.


Conclusion: Teaching LLMs to Learn Better

MICRE represents a significant step forward in the quest to make LLMs better learners, not just better models. By introducing meta-training for in-context learning, the approach turns generic language models into flexible relation extractors capable of adapting instantly to new, unseen tasks.

Key insights:

  1. Meta-learning enhances ICL: Training LLMs to perform in-context learning directly improves their low-shot reasoning capabilities.
  2. Scale and diversity amplify gains: Larger, diverse training sets cultivate stronger, more generalized models.
  3. Semantic understanding remains vital: Recognizing relation names and their underlying meanings is key to high performance.

In short, MICRE shows that the next leap for LLMs won’t just be making them bigger—it will be making them better learners. By teaching models how to learn within context, we can unlock stronger performance on structured tasks like relation extraction and beyond, paving the way for more adaptive, intelligent systems across natural language understanding.