Introduction
We are living through an explosion of scientific research. In the field of Computation and Language alone, approximately 100 new papers are uploaded to arXiv every single day. For a researcher, student, or practitioner, keeping up with this torrent of information is no longer just difficult—it is humanly impossible.
The core question everyone asks is: “What is currently the state-of-the-art?”
To answer this, the community relies on Scientific Leaderboards. These are ranked lists that track how well different models perform on specific tasks (like translation or summarization) using specific datasets. Platforms like Papers With Code or NLP-progress have become the de facto homepages for researchers trying to benchmark their work.
However, there is a major bottleneck: these leaderboards are mostly curated manually. As the volume of papers grows exponentially, manual curation simply cannot keep up. We are left with leaderboards that are often outdated, incomplete, or missing entirely for niche sub-fields.
This brings us to a fascinating research paper: “Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards.” The researchers propose a system that uses Large Language Models (LLMs) to automatically read papers, extract performance data, and build or update leaderboards without human intervention.

As shown in Figure 1, the goal is to create a system that can digest new papers (Paper A and Paper B), extract the relevant stats, and either update existing rankings or—crucially—recognize when a new task has been invented and create a brand new leaderboard from scratch.
In this post, we will tear down their methodology, explore the “Cold Start” problem in automated curation, and analyze why LLMs are great at reading text but surprisingly bad at reading tables.
Background: The Anatomy of a Leaderboard
Before we can automate a leaderboard, we have to define what it actually is. In this research, a leaderboard is defined by a TDM Triple:
- Task: What is the AI trying to do? (e.g., Named Entity Recognition).
- Dataset: What data is it testing on? (e.g., CoNLL-2003).
- Metric: How do we measure success? (e.g., F1 Score).
When you combine a TDM Triple with a specific score from a paper, you get a TDMR Tuple (Task, Dataset, Metric, Result).
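To make this data model concrete, here is a minimal Python sketch of the two structures (the class and field names are my own illustration, not taken from the paper's code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TDMTriple:
    """A leaderboard is identified by a Task-Dataset-Metric combination."""
    task: str      # e.g. "Named Entity Recognition"
    dataset: str   # e.g. "CoNLL-2003"
    metric: str    # e.g. "F1 Score"

@dataclass(frozen=True)
class TDMRTuple:
    """One leaderboard entry: a TDM triple plus the score a paper reports for it."""
    triple: TDMTriple
    result: float  # e.g. 91.2 (illustrative score, not a real reported number)
    paper_id: str  # e.g. "1703.06345"

# One extracted entry might look like this:
entry = TDMRTuple(TDMTriple("Named Entity Recognition", "CoNLL-2003", "F1 Score"), 91.2, "1703.06345")
```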
The Limitation of Current Methods
Previous attempts to automate this process have relied on a “closed world” assumption. They assume that we already know the names of every possible task and dataset. In this scenario, the AI just looks for keywords from a pre-defined list.
But science is open-ended. Researchers constantly invent new tasks and new datasets. If an automated system encounters a paper about a brand-new task called “Quantum Sentiment Analysis,” and that task isn’t in its pre-defined list, the system will fail to track it. This paper addresses that gap by designing a system capable of handling undefined or partially defined environments.
The SCILEAD Dataset
To train and test their system, the authors created SCILEAD, a manually curated dataset derived from 43 NLP papers. Unlike community-sourced data (which often contains errors), this dataset provides a “Gold Standard” of perfectly annotated TDMR tuples.

Table 10 above gives you a look at the ground truth data. Notice how a single paper (color-coded) often contributes to multiple leaderboards. For example, the paper 1703.06345.pdf provides results for Named Entity Recognition (NER) in English, Spanish, and Dutch, as well as POS Tagging. A robust system must capture all of these.
The Core Method: An LLM-Based Framework
The researchers propose a three-stage framework to solve the leaderboard construction problem. They utilize Retrieval-Augmented Generation (RAG) to help LLMs find the right information within dense academic PDFs.

As illustrated in Figure 2, the pipeline flows as follows:
- TDMR Extraction: Finding the raw data in the paper.
- Normalization: Cleaning up the data and mapping it to standard names.
- Leaderboard Construction: Ranking the results.
Let’s break these down in detail.
Stage 1: TDMR Extraction via RAG
Scientific papers are long. You can’t simply paste a 15-page PDF into ChatGPT and ask for the results; it often exceeds context windows or confuses the model.
To solve this, the authors use a RAG approach:
- Parsing: They use a PDF processing tool to extract text and tables.
- Chunking: The text is split into chunks of roughly 500 tokens (about 2,000 characters), which are embedded and stored in a vector database.
- Retrieval: When the system wants to find results, it searches the vector database using a specific query: “Main task, datasets and evaluation metrics”. It retrieves only the chunks and tables most likely to contain this info.
- Extraction: These relevant chunks are fed into an LLM (like GPT-4 or Llama-3) with a prompt instructing it to extract the best-reported results for the proposed method.
The output at this stage is raw strings, like “NER,” “Conll03,” or “F-score.”
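Below is a minimal sketch of that retrieve-then-extract flow. The query string and the roughly 500-token chunking follow the description above; the word-overlap retrieval is only a stand-in for real vector search, and all function names and the prompt wording are illustrative rather than the authors' actual pipeline:

```python
from typing import List

RETRIEVAL_QUERY = "Main task, datasets and evaluation metrics"

def chunk_text(text: str, chunk_size_tokens: int = 500) -> List[str]:
    """Split the parsed paper into ~500-token chunks (whitespace tokens as a rough proxy)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size_tokens])
            for i in range(0, len(words), chunk_size_tokens)]

def retrieve_top_k(chunks: List[str], query: str, k: int = 5) -> List[str]:
    """Stand-in for vector search: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))[:k]

def build_extraction_prompt(paper_text: str) -> str:
    """Assemble the prompt that asks an LLM for the paper's best TDMR tuples."""
    context = "\n\n".join(retrieve_top_k(chunk_text(paper_text), RETRIEVAL_QUERY))
    return (
        "From the excerpts below, extract the task, dataset, evaluation metric and "
        "the best result reported for the paper's own proposed method, as JSON.\n\n"
        + context
    )

# The returned prompt would then be sent to an LLM such as GPT-4 or Llama-3.
```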
Stage 2: Normalization (The Brain of the Operation)
This is the most intellectually interesting part of the paper. The raw output from an LLM is messy. One paper might call a task “NER,” another “Named Entity Rec.,” and another “Entity Tagging.” If we don’t standardize (normalize) these, we can’t build a single leaderboard.
The authors tested three different settings to simulate real-world difficulty:
- Fully Pre-defined: The system is given a list of valid names (a taxonomy). It just has to map the extracted text to the closest match in the list.
- Partially Pre-defined: The system knows some tasks, but others are masked. It has to recognize when it sees something new.
- Cold Start: The hardest setting. The system starts with zero knowledge of existing leaderboards. It must build a taxonomy on the fly.
The Cold Start Algorithm
How do you organize a library if you don’t know the categories beforehand? You do it dynamically.

Algorithm 1 shows how this dynamic normalization works.
- The system maintains a set of known entities (\(S'_t\)), which starts empty in the Cold Start setting.
- When the LLM extracts a new term (\(l_t\)), it checks if it matches anything currently in the set.
- If it matches (e.g., “NER” matches “Named Entity Recognition”), it normalizes it.
- If it does not match (e.g., the system has never seen “Sentiment Analysis” before), it adds this new term to the set (\(S'_t\)).
This mimics how human researchers mentally categorize new papers. If we see a new term, we file it away as a new category; if we see a synonym for an old term, we group them.
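Here is roughly what that loop looks like in code. A simple alias dictionary stands in for the LLM's judgment of whether two names refer to the same thing, and `known_entities` plays the role of \(S'_t\): pre-filled, partially filled, or empty depending on the setting:

```python
from typing import Dict, Set

def normalize_entity(label: str, known_entities: Set[str],
                     aliases: Dict[str, str]) -> str:
    """Map an extracted label to a known entity, or register it as a new one."""
    # 1. Exact match against the current taxonomy
    if label in known_entities:
        return label
    # 2. Known synonym (e.g. "NER" -> "Named Entity Recognition")
    if label in aliases and aliases[label] in known_entities:
        return aliases[label]
    # 3. Nothing matches: treat it as a genuinely new task/dataset/metric
    known_entities.add(label)
    return label

# Cold Start: the taxonomy begins empty and grows as papers are processed.
taxonomy: Set[str] = set()
aliases = {"NER": "Named Entity Recognition"}
print(normalize_entity("Named Entity Recognition", taxonomy, aliases))  # new entry
print(normalize_entity("NER", taxonomy, aliases))                       # mapped to existing entry
print(normalize_entity("Sentiment Analysis", taxonomy, aliases))        # another new entry
```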
Stage 3: Leaderboard Construction
Once the data is extracted and normalized, the final step is aggregation. The system groups all tuples that share the same Task, Dataset, and Metric.
It then performs a sanity check on the Results. Since different papers report numbers differently (e.g., “0.91” vs. “91.0” vs. “91%”), the system standardizes these into a common percentage format. Finally, it sorts the papers by their score to determine the ranking.
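A minimal sketch of this aggregation step, assuming result strings like the ones above and that higher scores are always better (in practice the metric's direction would also need handling):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def standardize_result(raw: str) -> float:
    """Convert '0.91', '91.0' or '91%' into a common percentage scale."""
    value = float(raw.strip().rstrip("%"))
    return value * 100 if value <= 1.0 else value

def build_leaderboards(tuples: List[Tuple[str, str, str, str, str]]):
    """Group (task, dataset, metric, result, paper_id) tuples and rank by score."""
    boards: Dict[Tuple[str, str, str], List[Tuple[float, str]]] = defaultdict(list)
    for task, dataset, metric, result, paper_id in tuples:
        boards[(task, dataset, metric)].append((standardize_result(result), paper_id))
    # Sort each leaderboard from best to worst (assumes higher is better).
    return {tdm: sorted(entries, reverse=True) for tdm, entries in boards.items()}

example = [
    ("NER", "CoNLL-2003", "F1 Score", "0.912", "paperA"),
    ("NER", "CoNLL-2003", "F1 Score", "93.5%", "paperB"),
]
print(build_leaderboards(example))
# {('NER', 'CoNLL-2003', 'F1 Score'): [(93.5, 'paperB'), (91.2, 'paperA')]}
```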

Table 11 shows the output: clean, ranked lists where papers (represented by their IDs) are ordered by their performance on specific datasets.
Experiments & Results
The researchers evaluated their framework using several Large Language Models: Llama-2, Mixtral, Llama-3, and GPT-4 Turbo. They compared these against a baseline model called AxCell.
They used two primary ways to grade the AI:
- Exact Tuple Match (ETM): Did the AI get the entire combination (Task + Dataset + Metric + Result) perfectly right?
- Individual Item Match (IIM): Did the AI get specific parts right? (e.g., it got the Task right, but missed the Result).
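To make the distinction concrete, here is a simplified sketch of how the two checks could be scored against the gold annotations; the paper's exact matching rules may differ:

```python
from typing import List, Tuple

Tuple4 = Tuple[str, str, str, float]  # (task, dataset, metric, result)

def exact_tuple_match_f1(predicted: List[Tuple4], gold: List[Tuple4]) -> float:
    """ETM: a prediction only counts if all four elements match a gold tuple."""
    hits = len(set(predicted) & set(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def individual_item_recall(predicted: List[Tuple4], gold: List[Tuple4], index: int) -> float:
    """IIM (simplified, recall only): credit for one field, e.g. index=0 for Task."""
    gold_items = {g[index] for g in gold}
    found = {p[index] for p in predicted} & gold_items
    return len(found) / len(gold_items) if gold_items else 0.0
```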
1. The Difficulty of Perfection (ETM Scores)
Getting the full tuple correct is incredibly hard: if even one digit of the result is wrong, the entire tuple counts as a failure.

In Table 3, we see the Exact Tuple Match scores.
- GPT-4 Turbo is the clear winner, achieving an F1 score of 55.27% in the Fully Pre-defined setting.
- Performance Drop: Notice how the performance drops significantly for Llama-2 and Mixtral when moving from “Fully Pre-defined” to “Partially Pre-defined.” This confirms that handling unknown tasks is a major challenge for smaller models.
- GPT-4’s Resilience: GPT-4 remains relatively robust even in the harder setting, maintaining a recall of nearly 40%.
2. Why Do Models Fail? (IIM Scores)
To understand why the Exact Match scores were somewhat low (55% is good for this task, but not perfect), we need to look at the component parts. Are the models failing to identify the Task? Or are they failing to read the numbers?

Table 4 reveals the bottleneck. Look at the IIM-Result column (far right).
- Models are excellent at identifying Tasks (90%+ F1 scores).
- Models are great at identifying Metrics (80-90% F1 scores).
- Models are terrible at extracting Results. Even GPT-4 only manages a 69% F1 score for result extraction, and Llama-2 is down at 27%.
Why? Scientific papers present results in complex tables. They often have multiple columns for different variations of a model (e.g., “Model-Base”, “Model-Large”, “Ablation-1”). Distinguishing the best result of the proposed method from baseline results or ablation studies is a reasoning task that still challenges current LLMs.
3. Leaderboard Reconstruction Quality
Finally, how well did the constructed leaderboards compare to the real ones? The authors measured this using Leaderboard Recall (LR) (did we find the leaderboard?) and Average Overlap (AO) (is the ranking similar to the ground truth?).
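For intuition, here is one common formulation of Average Overlap: compare the top-d prefixes of the predicted and gold rankings and average their agreement over depths (the paper's exact variant may handle ties or unequal list lengths differently):

```python
from typing import List

def average_overlap(predicted: List[str], gold: List[str]) -> float:
    """Average, over depths d = 1..k, of the overlap between the two top-d prefixes."""
    k = min(len(predicted), len(gold))
    if k == 0:
        return 0.0
    total = 0.0
    for d in range(1, k + 1):
        overlap = len(set(predicted[:d]) & set(gold[:d]))
        total += overlap / d
    return total / k

# Identical rankings score 1.0; mistakes near the top are penalized more than those near the bottom.
print(average_overlap(["paperB", "paperA", "paperC"], ["paperB", "paperA", "paperC"]))  # 1.0
```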

Table 5 shows the results for all three settings, including the challenging Cold Start.
- GPT-4 successfully reconstructs 81.48% of the leaderboards even in the Cold Start setting (starting with zero knowledge).
- Paper Coverage (PC) is decent (~60%), meaning the system finds most relevant papers.
- Result Coverage (RC) is the weak point (~46%), again reflecting the difficulty of extracting specific numbers from tables.
4. The “Cold Start” Surprise
An interesting anomaly appears when comparing the “Partially Pre-defined” and “Cold Start” settings. You would expect “Cold Start” (knowing nothing) to be harder than “Partially Pre-defined” (knowing some things).
However, for GPT-4, the results are actually comparable, and sometimes slightly better, in the Cold Start setting. The authors suggest that in the partial setting, the model sometimes gets “confused” by trying to force a new task into an old, pre-defined bucket that looks similar (e.g., mapping “English NER” to “German NER”). In Cold Start, the model is free to create a new category immediately, which can sometimes lead to cleaner taxonomies.
Conclusion & Implications
This paper presents a significant step forward in automated meta-science. The introduction of SCILEAD provides a necessary benchmark for this task, and the LLM-based framework demonstrates that we can indeed automate the tracking of scientific progress.
Key Takeaways:
- LLMs can handle the taxonomy: Current models (especially GPT-4) are highly effective at understanding and categorizing scientific tasks and metrics, even without a pre-defined taxonomy to start from (Cold Start).
- Tables are the final boss: Extracting precise numerical results from complex LaTeX tables remains the biggest hurdle. The drop in accuracy from “Task Extraction” to “Result Extraction” is steep.
- Real-world application is feasible: Despite the imperfections, the system can successfully identify and reconstruct the majority of leaderboards.
For students and researchers, this implies a future where “State-of-the-Art” isn’t something you have to hunt for in a dozen PDFs, but a dashboard that updates itself the moment a paper is published. As multimodal LLMs (which can “see” the visual structure of tables) improve, we can expect the result extraction bottleneck to disappear, paving the way for fully autonomous scientific tracking.