In the fast-paced world of Natural Language Processing (NLP), we are often dazzled by the sheer scale of modern models. From GPT-4 to LLaMA, the headlines focus on parameter counts, now measured in the billions and even trillions, and on dominance over standardized leaderboards. But there is a quiet, persistent problem in the field: the “black box” nature of evaluation.

We know that models fail. We see them hallucinate, miss obvious details, or misinterpret simple questions. However, looking at a global accuracy score on a benchmark like SQuAD or SuperGLUE doesn’t tell us why they fail. Is it the syntax? Is it the vocabulary? Is it semantic ambiguity?

This blog post breaks down a fascinating research paper, “A linguistically-motivated evaluation methodology for unraveling model’s abilities in reading comprehension tasks,” which proposes a new way to audit these models. Instead of just checking if an answer is right or wrong, the researchers use a “linguistic microscope” to identify specific complexity factors that cause models—regardless of their size—to stumble.

The Problem with Black-Box Evaluation

Standard evaluation in Question Answering (QA) typically works like a multiple-choice exam. The model is given a document and a question, and it produces an answer. We compare that answer to a “gold standard” reference using metrics like exact match or ROUGE scores.
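To make the mechanics concrete, here is a minimal sketch of the exact-match side of this scoring (my own illustration, not the paper’s evaluation code); ROUGE builds on the same reference-comparison idea but gives partial credit for overlapping words.

```python
def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the gold reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

gold = "The armies"
print(exact_match("the armies", gold))                  # 1.0: identical after normalization
print(exact_match("the French armies who lost", gold))  # 0.0: arguably correct, but no credit
```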

While this ranks models effectively, it creates a knowledge gap. If a 7-billion parameter model fails a question that a 175-billion parameter model gets right, we assume “bigger is better.” But what if both models fail the same type of question consistently? That points to a fundamental linguistic weakness in the architecture itself, rather than a lack of capacity.

The researchers in this study argue that we need to partition evaluation data not just randomly, but by linguistic complexity. To do this, they leverage the concept of Semantic Frames.

Background: What are Semantic Frames?

To understand the methodology, we first need to understand the tool being used: Frame Semantics. Developed in the Berkeley FrameNet project, a “frame” is a schematic representation of a situation.

Think of the concept of Selling. A “Selling” event necessarily involves a few core participants: a Seller, a Buyer, the Goods, and the Money. In FrameNet terms, “Selling” is the Frame, a word like “sold” or “vended” is the Lexical Unit (LU) or “trigger,” and the participants are the Frame Elements (FEs).

The researchers used the CALOR corpus, a French dataset meticulously annotated with these frames. By analyzing how models handle questions related to these frames, they could pinpoint exactly where the comprehension breaks down.

Figure 1: Example of a sentence annotated with two semantic frames

As shown in Figure 1 above, a single sentence can contain multiple frames. The blue box highlights an Attack frame (triggered by “assaults”) and the orange box highlights a Losing frame (triggered by “lost”). The questions generated for this study ask about specific elements, such as “Who lost?” (Answer: The armies).
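To make this structure concrete, here is a minimal sketch of how the annotations behind Figure 1 might be represented in code. It is my own illustration, not the CALOR annotation format: the English sentence and the Frame Element names are invented stand-ins, while the frames, triggers, and the “Who lost?” question come from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    """One semantic frame evoked in a sentence."""
    frame: str                                               # e.g. "Attack" or "Losing"
    lexical_unit: str                                        # the trigger word evoking the frame
    elements: dict[str, str] = field(default_factory=dict)   # Frame Element name -> text span

# Invented English stand-in for the (French) sentence of Figure 1
sentence = "The armies lost many soldiers during the assaults."

annotations = [
    # FE names ("Loser", "Thing_lost") are illustrative, not official FrameNet labels
    FrameAnnotation("Losing", "lost", {"Loser": "The armies", "Thing_lost": "many soldiers"}),
    FrameAnnotation("Attack", "assaults"),  # elements omitted for brevity
]

# A question in the study targets one Frame Element of one frame:
question = "Who lost?"
answer = annotations[0].elements["Loser"]  # -> "The armies"
```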

The Methodology: A Two-Step Approach

The authors devised a clever two-step method to correlate linguistic complexity with model failure.

Step 1: Sorting by Difficulty (The ROVER Method)

First, they needed to determine which questions were objectively “hard” without relying on their own biases. They employed a voting system using 7 different models ranging from small (CamemBERT, T5) to large (GPT-3.5, Mixtral).

They grouped questions into partitions based on the level of agreement:

  • Easy: All models agree on the answer.
  • Hard: The models disagree, each producing a different answer.

Intermediate partitions capture the levels of partial agreement between these two extremes.

They call this system ROVER (Recognizer Output Voting Error Reduction). The intuition is simple: if every model, regardless of architecture, struggles to agree on an answer, the example likely contains an intrinsic linguistic difficulty.
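Here is a minimal sketch of the agreement idea, assuming answers are compared after simple string normalization and combined by majority vote; the paper’s actual ROVER combination may be more sophisticated, and the system names below are placeholders.

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Crude normalization so that superficially different strings can vote together."""
    return " ".join(answer.lower().split())

def agreement_level(answers: dict[str, str]) -> int:
    """Size of the largest group of systems giving the same normalized answer:
    7 means all seven systems agree (easy), 1 means every system answers differently (hard)."""
    votes = Counter(normalize(a) for a in answers.values())
    return max(votes.values())

def rover_answer(answers: dict[str, str]) -> str:
    """Majority-vote combination of the systems' outputs."""
    votes = Counter(normalize(a) for a in answers.values())
    return votes.most_common(1)[0][0]

# Hypothetical outputs of the seven systems for one question
answers = {f"system_{i}": "the armies" for i in range(1, 7)} | {"system_7": "the soldiers"}
print(agreement_level(answers))  # 6 -> a high-agreement ("easy") partition
print(rover_answer(answers))     # "the armies"
```

Sorting questions by this agreement level is what produces the easy-to-hard partitions referred to later in the post.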

Figure 3: Performance in Hscore according to the agreement number with the ROVER systems’ combination method

Figure 3 illustrates this perfectly. The x-axis represents the partition (level of agreement), and the y-axis represents the human evaluation score. As you can see, performance drops linearly as model disagreement increases. The “hard” partition (left side) is difficult for everyone, including GPT-3.5.

Step 2: Defining the “Why” (Complexity Factors)

Once the questions were sorted into difficulty bins, the researchers applied their linguistic hypothesis. They defined seven specific Complexity Factors to see if they correlated with the “hard” bins.

These factors investigate different aspects of the language:

  1. Bias (\(f_{bias}\)): Is the frame rare in the training data?
  2. Coreference (\(f_{coref}\)): Does finding the answer require resolving a pronoun (e.g., figuring out who “he” refers to)?
  3. Trigger Nature (\(f_{trigger}\)): Is the trigger word a noun (harder) or a verb (easier)?
  4. Trigger in Question (\(f_{LU\,\text{in}\,q}\)): Is the word triggering the frame present in the question itself?
  5. Syntactic Distance (\(f_{dist}\)): How far apart are the clue and the answer in the sentence structure?
  6. Number of Frame Elements (\(f_{nbFEs}\)): How many “participants” (context clues) are annotated in the text?
  7. Entropy (\(f_{entropy}\)): How ambiguous is the trigger word? (Does the word “run” usually mean “run a company” or “run a race”?)

Figure 2: Example of some complexity factors considered

Figure 2 visualizes some of these factors. For example, under Dist (Distance), a direct link between “drink” and “water” is easier than a sentence where “excavations” and the details about them are separated by complex clauses.
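The entropy factor lends itself to a short sketch. Assuming the corpus annotations let us count which lexical units trigger each frame (the paper’s exact formulation may differ), a frame’s ambiguity can be measured as the Shannon entropy of that trigger distribution; the counts below are made up, but the trigger words for Installing and Request echo the examples discussed later in the post.

```python
import math
from collections import Counter

def trigger_entropy(trigger_counts: Counter) -> float:
    """Shannon entropy (in bits) of the distribution of lexical units triggering a frame.
    0.0 means the frame is always evoked by the same word; higher values mean more ambiguity."""
    total = sum(trigger_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in trigger_counts.values())

# Hypothetical counts extracted from an annotated corpus
installing = Counter({"install": 40, "installation": 12})
request = Counter({"ask": 15, "demand": 11, "order": 9, "beg": 5, "solicit": 4})

print(f"Installing: {trigger_entropy(installing):.2f} bits")  # low entropy -> easier frame
print(f"Request:    {trigger_entropy(request):.2f} bits")     # high entropy -> harder frame
```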

The Experiment: Models and Metrics

The team tested this methodology on a variety of models. Note the significant difference in size, from CamemBERT (335M parameters) to GPT-3.5 (175B parameters).

Table 1: Description of the 7 models used in our experiments with their performance in terms of ROUGE-L, Hscore and Hcorrect scores. The last line indicates the performance of the systems’ combination through the ROVER method.

A Note on Metrics

An important side finding of this paper, shown in Table 1, is the failure of the ROUGE metric (a standard automated score).

  • ROUGE-L favored the smaller models like CamemBERT because they were fine-tuned to extract exact text spans.
  • Human Evaluation (Hscore) favored the Large Language Models (LLMs) like GPT-3.5 because they provided correct answers that didn’t necessarily match the exact wording of the reference.

This confirms that for generative AI, we cannot rely solely on string-matching metrics.
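To see why string matching penalizes generative models, here is a simplified, LCS-based ROUGE-L sketch (not the official implementation): a verbatim extracted span scores perfectly, while a free-form but correct answer is heavily penalized.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """F-measure over the LCS of prediction and reference tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the armies"
print(rouge_l_f1("the armies", reference))                         # 1.0: extracted span
print(rouge_l_f1("it was the French armies who lost", reference))  # ~0.44: correct but rephrased
```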

Key Results: What Actually Makes Reading Hard?

The core contribution of this paper is the validation of the complexity factors. The researchers calculated a \(\delta\) (delta) score: the performance drop a model suffers when a specific complexity factor is present.
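Read this way, each cell of Table 2 is a difference of average scores between the examples that exhibit a factor and those that do not. Here is a minimal sketch under that reading; note that the paper may compare against the full corpus rather than the complement, and that it additionally tests statistical significance, which is skipped here.

```python
from statistics import mean

def delta_score(scores: list[float], has_factor: list[bool]) -> float:
    """Performance drop when a complexity factor is present: mean score on examples WITH
    the factor minus mean score on examples WITHOUT it (negative = the factor hurts)."""
    with_factor = [s for s, f in zip(scores, has_factor) if f]
    without_factor = [s for s, f in zip(scores, has_factor) if not f]
    return mean(with_factor) - mean(without_factor)

# Hypothetical per-question Hscore values and a boolean coreference flag
hscores   = [1.0, 0.5, 1.0, 0.0, 1.0, 0.5, 0.0, 1.0]
has_coref = [False, True, False, True, False, True, True, False]
print(delta_score(hscores, has_coref))  # -0.75 on this toy data
```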

Table 2: Validation results for complexity factors across models, showing \(\delta\) values in each cell with statistically significant differences in bold. ‘Size’ indicates the proportion of each partition \(E_f\) relative to the total corpus.

Table 2 provides the “heatmap” of failure. Here is what the data tells us:

1. Size Solves Syntax, but not Semantics

Look at the column \(f_{coref}\) (Coreference) and \(f_{dist}\) (Distance).

  • Small Models (T5, MT5): They suffer massive performance drops (-9 to -15 points) when they have to resolve pronouns or handle long syntactic distances.
  • Large Models (GPT-3.5, Mixtral): They are barely affected.
  • Takeaway: Scaling up models seems to effectively solve syntactic complexity and coreference resolution.

2. The Semantic Bottleneck

Now, look at the columns \(f_{nbFEs}\) (Number of Frame Elements) and \(f_{entropy}\) (Entropy).

  • All Models Fail: Almost every model, regardless of size, shows a statistically significant performance drop here.
  • Entropy: This factor measures ambiguity. If a specific frame (like “Request”) can be triggered by many different words, models struggle to identify the context.
  • Number of FEs: Surprisingly, having fewer arguments (context clues) makes the task harder. If a sentence is rich with participants (Agent, Time, Location, Theme), the model has more anchors to ground its understanding. When the context is sparse (e.g., just an Agent and an Action), ambiguity rises, and performance falls.

3. The Ambiguity of Frames

The researchers broke down performance by specific Frames. They found that not all concepts are created equal.

Figure 4: Performance of ROVER according to each frame, sorted by Hscore. The number of occurrences of each frame in the corpus is given in brackets.

As Figure 4 displays, some frames yield consistently high scores, while others are pitfalls. The “Installing” frame is easy because it is almost always triggered by the words “install” or “installation.” Conversely, the “Request” frame is difficult because it has high entropy—it can be triggered by “ask,” “beg,” “demand,” “order,” “solicit,” etc., requiring the model to disentangle subtle nuances.

Visualizing the “Hard” Questions

To further prove that semantic factors are the true drivers of difficulty, the researchers analyzed the probability of these factors appearing in the “Hard” partitions (where models disagreed).

Table 3: Probability of having the \(f_{nbFEs}\) and \(f_{entropy}\) factors according to the agreement partitions of increasing complexity, P6 to P1

Table 3 shows a clear trend. As we move from P6 (easiest, total agreement) to P1 (hardest, total disagreement), the probability of encountering high Entropy (\(f_{entropy}\)) and low context (\(f_{nbFEs}\)) increases.
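As a sketch of this analysis (my own reconstruction, with made-up records), each number in Table 3 can be read as a conditional probability of a factor given an agreement partition:

```python
def factor_probability(examples: list[dict], partition: str, factor: str) -> float:
    """Estimate P(factor present | example belongs to the given agreement partition)."""
    in_partition = [ex for ex in examples if ex["partition"] == partition]
    return sum(1 for ex in in_partition if ex[factor]) / len(in_partition)

# Hypothetical per-example records: agreement partition plus boolean factor flags
examples = [
    {"partition": "P6", "f_entropy": False, "f_nbFEs": False},
    {"partition": "P6", "f_entropy": False, "f_nbFEs": True},
    {"partition": "P1", "f_entropy": True,  "f_nbFEs": True},
    {"partition": "P1", "f_entropy": True,  "f_nbFEs": False},
]
print(factor_probability(examples, "P1", "f_entropy"))  # 1.0 on this toy data
print(factor_probability(examples, "P6", "f_entropy"))  # 0.0 on this toy data
```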

This allows the researchers to confirm that semantic ambiguity is the primary reason for model failure in modern LLMs.

Figure 5: Hscore on 4 partitions of the evaluation corpus according to combinations of complexity factors

Figure 5 cements this finding. The purple bars represent questions with no semantic complexity factors—performance is high across the board. The blue bars represent questions with both complexity factors—performance plummets.

Generalization: Does this apply to English?

The initial experiments were performed on CALOR (French). A skeptic might ask: “Is this just a quirk of the French dataset?”

To verify this, the researchers extended their methodology to NaturalQA, a massive English benchmark. Since NaturalQA doesn’t have manual Semantic Frame annotations, they used a clever proxy: they prompted ChatGPT to predict the frames and elements, creating a “silver standard” annotation.
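The paper does not reproduce its exact prompt, so the following is a purely hypothetical sketch of how such a silver annotation could be requested through the OpenAI chat API; the prompt wording, model name, and JSON output format are all my assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are a FrameNet annotator.
For the sentence below, list every semantic frame it evokes.
For each frame, give the trigger (lexical unit) and the frame elements with their text spans.
Answer as JSON: [{{"frame": ..., "trigger": ..., "elements": {{...}}}}]

Sentence: {sentence}"""

def silver_annotate(sentence: str) -> str:
    """Ask the chat model for a 'silver standard' frame annotation of one sentence."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the exact ChatGPT version is not specified here
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(silver_annotate("The armies lost many soldiers during the assaults."))
```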

Table 4: Manual evaluation (in %) of ChatGPT’s frame predictions across 50 random sentences

As Table 4 shows, ChatGPT was surprisingly good at this linguistic annotation, achieving high accuracy. With these annotations in hand, they ran the analysis on English models.

Table 5: Validation results for \(f_{nbFEs}\) and \(f_{entropy}\) across models on NaturalQA. ‘Size’ indicates the proportion of each partition \(E_f\) relative to the total corpus.

The results (Table 5) mirrored the French study. Models like LLaMA-3 and GPT-3.5 still suffered significant performance drops when facing high entropy or low context (Nb FEs). This confirms that semantic complexity is a cross-lingual, universal challenge for current AI architectures.

Conclusion and Implications

This research moves us beyond the “bigger is better” narrative. While scaling up parameters helps models handle complex sentence structures (syntax) and keep track of pronouns (coreference), it does not solve the fundamental problem of semantic ambiguity.

The methodology introduced here—using model disagreement (ROVER) combined with linguistic properties (Frames)—offers a powerful way to “stress test” models. It reveals that current state-of-the-art models struggle when:

  1. The situation described has high ambiguity (High Entropy).
  2. The context provided is sparse (Low number of Frame Elements).

For students and researchers entering the field, the implication is clear: To build the next generation of NLU (Natural Language Understanding) systems, we don’t just need more GPUs. We need training data and architectures that specifically address semantic ambiguity and context grounding. We need to teach models not just to predict the next word, but to understand the “Frame” of the reality they are describing.