Beyond Abstracts: Can AI Truly Understand Scientific Research?

If you have ever tried to read a dense scientific paper in a field you are only partially familiar with, you know the struggle. It is not just about reading the words; it is about understanding the context, deciphering the tables, interpreting the equations, and connecting the dots between the appendix and the methodology.

For Artificial Intelligence, this challenge is magnified. While Large Language Models (LLMs) like GPT-4 and Llama have shown impressive capabilities in summarizing text, scientific literature remains a final frontier. Most existing datasets used to test AI on science are surprisingly shallow—they often rely on abstracts or simple fact retrieval.

But what if we could test AI using the toughest questions possible? Questions asked by domain experts who have scrutinized every line of a paper?

This is the premise behind SCIDQA, a new deep reading comprehension dataset introduced by researchers from IIT Gandhinagar, Yale University, and the Allen Institute for AI. By leveraging the rigorous back-and-forth of the academic peer review process, they have created a benchmark that demands true reasoning, not just pattern matching.

In this post, we will dive deep into how SCIDQA was built, why it is different from what came before, and how modern LLMs perform when faced with the scrutiny of a peer reviewer.


The Problem with Current Scientific QA

Before we explore SCIDQA, we need to understand the gap it fills. The Natural Language Processing (NLP) community has created several datasets to help machines “read” science. However, they suffer from significant limitations:

  1. Surface-Level Information: Many datasets (like PubMedQA or QASPER) rely heavily on titles and abstracts. An AI can often answer these questions without reading the full paper.
  2. Synthetic Questions: Some datasets are generated automatically or by non-experts. These questions tend to be simple lookup tasks (e.g., “What is the accuracy rate?”) rather than deep reasoning tasks.
  3. Short Answers: The answers are often just a “yes/no” or a short span of text extracted directly from the document.

Real scientific engagement requires deep comprehension. When a reviewer critiques a paper, they might ask about the implications of a specific equation, point out a contradiction between a figure and a table, or ask how the method compares to a paper published three years ago.

SCIDQA (Scientific Document Question Answering) aims to replicate this level of depth.


Building SCIDQA: Mining the Peer Review Process

The genius of SCIDQA lies in its source material: OpenReview.

OpenReview is a platform used by top-tier machine learning conferences (like ICLR and NeurIPS) where the peer review process is public. Reviewers post comments and questions, and authors post detailed replies. This dialogue is a goldmine of expert-level Question-Answer (QA) pairs.

The Curation Pipeline

Creating a clean dataset from messy forum-style discussions is no trivial task. The researchers developed a multi-stage pipeline to transform raw discussions into a structured dataset.

Dataset curation pipeline for SCIDQA. LLM-based QA extraction from peer reviews is followed by comprehensive human expert annotation and editing.

As shown in Figure 2, the process involves several critical stages:

  1. Collection: They gathered 11,400 papers, along with their reviewer-author discussions, from top conferences.
  2. PDF-to-Text: Using a specialized tool called Nougat (Neural Optical Understanding for Academic Documents), they converted the scientific PDFs into text. This is crucial because standard PDF parsers often destroy the formatting of math and tables.
  3. Extraction: They used the PaLM language model to identify and extract QA pairs from the nested discussion threads (steps 2 and 3 are sketched in code just after this list).
  4. Annotation & Refinement: This is the most important step. Human experts (graduate students in NLP/ML) reviewed the data to ensure quality.
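
To make the conversion and extraction steps concrete, here is a minimal Python sketch of steps 2 and 3. It assumes the Nougat command-line tool is installed and that you supply your own `llm` callable; the prompt below is only illustrative, not the authors' exact wording.

```python
# Hedged sketch of pipeline steps 2-3, not the authors' code.
# Assumes the Nougat CLI is available (e.g. `pip install nougat-ocr`)
# and that `llm` is any text-generation callable you provide.
import subprocess
from pathlib import Path

def pdf_to_markdown(pdf_path: str, out_dir: str = "parsed") -> str:
    """Convert a scientific PDF to markdown-like text with Nougat."""
    subprocess.run(["nougat", pdf_path, "-o", out_dir], check=True)
    # Nougat writes one .mmd file per input PDF.
    return Path(out_dir, Path(pdf_path).stem + ".mmd").read_text()

EXTRACTION_PROMPT = """You are given a reviewer-author discussion from OpenReview.
Extract every question the reviewer asks and the author's corresponding answer.
Return one JSON object per pair with keys "question" and "answer".

Discussion:
{thread}
"""

def extract_qa_pairs(thread_text: str, llm) -> str:
    """Ask an LLM (the paper used PaLM; any capable model works for a demo)
    to pull QA pairs out of a nested discussion thread."""
    return llm(EXTRACTION_PROMPT.format(thread=thread_text))
```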

What Does a QA Pair Look Like?

To visualize the source data, look at Figure 1 below. It shows how a discussion on OpenReview translates into a dataset entry.

An instance in the SciDQA dataset. The question and answer corresponding to the paper are extracted from the reviewer-author discussion on OpenReview.

On the left, you see the “Reviewer” asking a specific technical question about binary masks versus soft-mask methods. The “Author” responds with a detailed justification involving inductive bias. In the SCIDQA dataset (center), this is standardized into a clear Question and Answer format, linked to specific evidence in the paper (like Table 1 or Figure 1).

The Art of Refinement

Raw data from the internet is rarely ready for machine learning. The authors had to solve three distinct problems to make SCIDQA reliable.

1. Decontextualization

In a forum, people speak in the first person (“Why did you do this?”, “We found that…”). For a general QA dataset, this is confusing. The model shouldn’t think it is the author.

The researchers rewrote questions and answers into the third person. They also added necessary context that might have been implicit in the conversation.

Rewriting QA pairs in a third-person narrative is crucial for models to recognize that questions seek factual answers.

As seen in Figure 4, a question asking “Do you claim…” is rewritten to “Do the authors claim…” This small shift ensures the AI understands it is an observer analyzing the text, not a participant in the debate.
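
To see what this normalization might look like in code, here is a toy rule-based sketch. In the actual dataset this editing was carried out by the human annotators, so the handful of regex rules below only illustrate the idea.

```python
# Illustrative decontextualization rules; purely a toy, not the paper's method.
import re

REWRITES = [
    (r"\bDo you claim\b", "Do the authors claim"),
    (r"\bWhy did you\b", "Why did the authors"),
    (r"\bWe found\b", "The authors found"),
    (r"\bour (method|model|results)\b", r"the authors' \1"),
]

def decontextualize(text: str) -> str:
    """Rewrite first-person review dialogue into a third-person question/answer."""
    for pattern, replacement in REWRITES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(decontextualize("Do you claim that binary masks add a useful inductive bias?"))
# -> "Do the authors claim that binary masks add a useful inductive bias?"
```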

2. Reference Editing

Scientific papers are full of citations like “[12]” or “(Smith et al., 2020).” If a question asks, “How does this compare to [12]?”, an AI might just look for the string “[12]” in the text without understanding the content. This is a “shortcut” that inflates performance scores.

References in question and answer texts are uniformly renumbered to preclude the LM from leveraging specific reference markers as shortcuts.

To prevent this, the researchers anonymized citations (e.g., changing them to [r1], [r2]) and included the full bibliographic reference in the question text (see Figure 5). This forces the model to understand which paper is being discussed based on the title and authors, not just a number.
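
Here is a minimal sketch of what such anonymization could look like, assuming citations appear as bracketed numbers. The function, schema, and example bibliography entry are illustrative, not the authors' code.

```python
# Hedged sketch: map numeric citation markers like "[12]" to neutral ids
# ("[r1]", "[r2]", ...) and append the full reference so the model must rely
# on content rather than marker strings.
import re

def anonymize_references(text: str, bibliography: dict[str, str]) -> str:
    """`bibliography` maps original markers like "12" to full reference strings."""
    mapping: dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        original = match.group(1)
        if original not in mapping:
            mapping[original] = f"r{len(mapping) + 1}"
        return f"[{mapping[original]}]"

    text = re.sub(r"\[(\d+)\]", substitute, text)
    footnotes = [f"[{new}] {bibliography.get(old, 'reference unavailable')}"
                 for old, new in mapping.items()]
    return text + "\nReferences: " + "; ".join(footnotes)

# Placeholder example only; the citation text is not a real dataset entry.
question = "How does the proposed method compare to [12]?"
print(anonymize_references(question, {"12": "Smith et al., 2020"}))
```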

3. Version Control

During peer review, papers change! Authors upload revised PDFs to address reviewer concerns. A question might refer to “Table 3,” but in the revised version, that might become “Table 4.”

We present scenarios where the initial and the revised manuscript versions are most appropriate for answering the reviewer’s question.

As illustrated in Figure 6, the researchers carefully tracked whether a question was best answered by the initial submission or the camera-ready version. If the authors added a new table to answer a question, the final version is the correct source. If the answer was already in the text but the reviewer missed it, the initial version is used.
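
As a rough illustration, that decision rule can be reduced to a single flag per QA pair. The field names below are hypothetical and not the dataset's actual schema.

```python
# Minimal sketch of the version-selection rule described above.
from dataclasses import dataclass

@dataclass
class QARecord:
    question: str
    answer: str
    answer_added_in_revision: bool  # did the authors add new content (e.g. a table)?

def source_version(record: QARecord) -> str:
    """Pick which manuscript version the answer should be grounded in."""
    return "revised" if record.answer_added_in_revision else "initial"
```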


How Does SCIDQA Compare?

Is this dataset actually harder or different from existing ones? Let’s look at the statistics.

Comparison of the related datasets. Note the answer length and source.

Table 1 highlights the differences:

  • Source: SCIDQA is based on Full-Text, unlike QASPER or PubMedQA, which often rely on abstracts.
  • Answer Length: The average answer length is over 100 words—significantly longer than other datasets. This indicates that the answers require explanation, not just factoid extraction.
  • Multiple Documents: It is the only dataset in this list that explicitly requires reasoning across multiple documents (the main paper + referenced papers).

Experimental Setup: Testing the AI

To see how well current technology handles this deep reasoning, the researchers set up four different “exam conditions” for various Large Language Models (LLMs), ranging from open-source models like Llama 2/3 to proprietary giants like GPT-4o.

1. Closed-Book (The “Memory” Test)

In this setting, the model is given only the question. It must rely on its internal training data. Since these are famous papers, the model might have “read” them during training.

Priming LLMs with Questions (closed-book).

2. Title and Abstract (The “Skim” Test)

Here, the model gets the question plus the paper’s title and abstract. This mimics a researcher who only reads the summary before trying to answer a deep question.

Open-Domain Question Answering - Priming with Question and Title/Abstract.
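
The first two settings boil down to what goes into the prompt. The templates below are hedged illustrations of that structure, not the prompts used in the paper.

```python
# Illustrative prompt templates for the Closed-Book and Title/Abstract settings.
CLOSED_BOOK_PROMPT = """Answer the following question about a machine learning paper.
Question: {question}
Answer:"""

TITLE_ABSTRACT_PROMPT = """Answer the question using the paper's title and abstract.
Title: {title}
Abstract: {abstract}
Question: {question}
Answer:"""

# Example usage with placeholder inputs.
prompt = TITLE_ABSTRACT_PROMPT.format(
    title="(paper title here)",
    abstract="(paper abstract here)",
    question="Why do the authors prefer binary masks over soft-mask methods?",
)
```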

3. RAG and Full-Text (The “Open Book” Test)

This is the most realistic scenario. The model is given the actual content of the paper.

  • RAG (Retrieval-Augmented Generation): The system searches for the most relevant chunks of text in the paper and feeds them to the LLM.
  • Full-Text: The model attempts to process the entire paper.

RAG setup ranks paper subsections. Full-text passes segments to base-LLM.

As shown in Figure 9/10 (combined image above), processing full text is tricky because papers are long. For models with limited context windows, the researchers chunked the paper, generated answers for each chunk, and then used a powerful model (Llama 3.1 70B) to select the best answer. For modern “long-context” models (like GPT-4o or Gemini 1.5), they fed the whole paper in at once.
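
A compact sketch of that chunk-then-select strategy is below. It assumes generic `llm` and `judge` callables (the paper used Llama 3.1 70B as the selector); the chunk size and prompt wording are placeholders. The RAG variant would simply swap the uniform chunks for retrieved subsections.

```python
# Hedged sketch of the chunked full-text strategy, not the authors' exact setup.
def chunk(text: str, max_words: int = 3000) -> list[str]:
    """Split a long paper into roughly equal word-count segments."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def answer_from_full_text(paper_text: str, question: str, llm, judge) -> str:
    # Generate one candidate answer per segment.
    candidates = [
        llm(f"Paper excerpt:\n{segment}\n\nQuestion: {question}\nAnswer:")
        for segment in chunk(paper_text)
    ]
    # Ask a stronger model to pick the best-supported candidate.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    choice = judge(
        f"Question: {question}\nCandidate answers:\n{numbered}\n"
        "Reply with the number of the best-supported answer."
    )
    return candidates[int(choice.strip()) - 1]
```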


Results: Who is the Smartest Scientist?

The results reveal a stark reality: deep scientific comprehension is still very hard for AI.

The Leaderboard

Average scores for all configurations.

Table 3 provides the high-level summary. Here are the key takeaways:

  1. GPT-4o dominates: The proprietary model from OpenAI consistently outperforms open-source alternatives, achieving the highest scores across almost all metrics.
  2. Context Matters (Usually): For most models, having access to the text (RAG or Full-Text) improves performance compared to the Closed-Book setting.
  3. The “Hallucination” Trap: Interestingly, for the strongest models (like GPT-4o), the performance gap between “Closed-Book” and “Full-Text” is surprisingly small (around 5 points). This suggests two possibilities:
     • The model has memorized these papers during training (contamination).
     • The model is good at making up convincing-sounding scientific answers even without the source text.

Deep Dive: Full-Text Performance

RAG setup prompts and Full-text evaluation detailed scores.

Looking closer at the Full-Text results in Table 5, we see that even the best open-source models (like Llama 3.1 70B) struggle to match GPT-4o.

However, there is a fascinating nuance. While GPT-4o scores highly on surface metrics (like ROUGE, which measures word overlap), human evaluation and “LLM-as-a-Judge” metrics paint a more complex picture.
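
To make "word overlap" concrete, here is a tiny example with the rouge-score package (assuming `pip install rouge-score`); the two sentences are invented for illustration and show why ROUGE rewards wording rather than correctness.

```python
# ROUGE-L measures longest-common-subsequence overlap between texts.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Binary masks add an inductive bias that improves sparsity."
prediction = "The authors argue binary masks introduce a helpful inductive bias."
print(scorer.score(reference, prediction)["rougeL"].fmeasure)
```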

The researchers noted that simply finding the right paragraph isn’t enough. SCIDQA requires multi-modal reasoning.

  • 14% of questions require reading tables.
  • 10% require understanding equations.
  • 7% require interpreting figures.

Most text-only LLMs fail miserably at these questions because they cannot “see” the figures or parse complex LaTeX equations effectively.

Human vs. AI

Are the machines beating us yet? The authors conducted a small study comparing GPT-4 answers against human-written answers.

  • 32% of the time, it was a tie.
  • 29% of the time, humans were preferred (mostly because GPT-4 made factual errors).
  • 21% of the time, GPT-4 was preferred (mostly because the human annotator wasn’t an expert in that specific niche sub-field).

This shows that while AI is competent, a domain expert is still the gold standard for accuracy.


Conclusion and Implications

SCIDQA represents a significant step forward in evaluating AI. By moving away from synthetic questions and abstracts, it forces us to confront the limitations of current models.

The study shows that while LLMs are becoming excellent linguistic mimics, their ability to perform deep, multi-step reasoning over complex scientific documents is still evolving. They struggle with the specific nuances that peer reviewers care about—methodological flaws, subtle comparisons to prior work, and interpretation of experimental data in tables.

For students and researchers in AI, SCIDQA serves as a new “North Star.” Solving this dataset won’t just mean we have better chatbots; it will mean we have AI assistants capable of truly helping scientists navigate the explosive growth of human knowledge.

The future of scientific discovery might just depend on how well machines can answer the questions posed in these peer reviews.


Note: All images and tables referenced in this article are sourced directly from the SCIDQA research paper.