Introduction
Imagine you are tutoring a student in calculus. They effortlessly solve a complex Gaussian integral, showing a deep understanding of advanced mathematical concepts. Impressed, you ask them a follow-up question: “What is 17 times 8?” The student stares blankly and answers, “106.”
You would be baffled. In human cognition, capabilities are generally hierarchical; if you have mastered advanced calculus, it is taken for granted that you have mastered basic arithmetic. This is the essence of consistency.
However, Large Language Models (LLMs) like GPT-4 and Llama do not think like humans. While they have demonstrated expert-level capabilities in law, medicine, and coding, they suffer from a peculiar lack of robustness. A new research paper titled “Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?” explores a specific and counter-intuitive type of failure: the Hard-to-Easy Inconsistency.

As illustrated above, an LLM might correctly solve a complex integral yet fail a simple multiplication task. This blog post dives deep into this research, explaining how the authors quantified this paradox, the benchmark they created, and what this implies for the future of AI trustworthiness.
Background: The Consistency Problem
Before dissecting the new method, we must understand the landscape of LLM reliability. We know LLMs can be sensitive. Previous research has shown that:
- Semantic Inconsistency: Rephrasing a question slightly can flip the model’s answer.
- Order Sensitivity: Changing the order of options in a multiple-choice question changes the prediction.
- Logical Inconsistency: A model might agree with a statement but disagree with its logical negation.
The authors of this paper argue that there is a more fundamental inconsistency that has been overlooked. It is the violation of the difficulty hierarchy. In a rational system, the set of skills required to solve an easy problem is a subset of the skills required for a harder version of that problem. Therefore, failing the easy problem while passing the hard one is a sign of a fundamental reasoning flaw.
The Core Method: ConsisEval
To study this phenomenon scientifically, the researchers couldn’t rely on random questions. They needed a controlled environment where “Easy” and “Hard” were rigorously defined. They introduced ConsisEval, a benchmark designed specifically to test Hard-to-Easy consistency.
1. Constructing the Dataset
The ConsisEval benchmark covers three critical domains: Code, Mathematics, and Instruction Following.
Unlike traditional benchmarks where questions are independent, ConsisEval uses pairwise data. Each entry contains an Easy Question (\(a\)) and a Hard Question (\(b\)).
- Strict Order of Difficulty: The hard question is derived strictly from the easy one. It often contains the easy question as a sub-step or adds additional constraints. This ensures that mathematically, if you can solve \(b\), you possess the logic to solve \(a\).
The creation process was a hybrid of AI generation and human oversight:
- Seed Data: Easy questions were taken from established datasets (like GSM8K for math or HumanEval for code).
- GPT-4 Synthesis: The researchers prompted GPT-4 to take an easy question and “make it harder” by adding constraints or steps, ensuring the original logic remained a subset of the new problem.
- Human Verification: Annotators rigorously checked the pairs to guarantee the difficulty hierarchy and correctness.
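To make the pairwise format concrete, here is a minimal sketch of what a ConsisEval-style entry could look like in code. The field names and the toy math pair are illustrative assumptions, not the benchmark's actual schema or content:

```python
from dataclasses import dataclass

@dataclass
class ConsisEvalPair:
    """One easy/hard pair; the hard question strictly builds on the easy one."""
    domain: str          # "code", "math", or "instruction-following"
    easy_question: str   # question a
    easy_answer: str
    hard_question: str   # question b: a plus extra constraints or steps
    hard_answer: str

# Illustrative example (not taken from the actual benchmark)
pair = ConsisEvalPair(
    domain="math",
    easy_question="Tom buys 3 pens at $2 each. How much does he spend?",
    easy_answer="6",
    hard_question=("Tom buys 3 pens at $2 each and 2 notebooks at $5 each, "
                   "then gets a 10% discount on the total. How much does he spend?"),
    hard_answer="14.4",
)
```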

The result is a dataset where the relationship between questions is explicit. For example, in the table below, notice how the “Hard” question (blue text) is simply the “Easy” question (green text) with an added layer of complexity.

2. Defining the Consistency Score (CS)
How do we measure if a model is consistent? Accuracy alone isn’t enough. We need to look at the conditional probability.
The researchers define the Consistency Score (CS) as the probability that a model answers the easy question correctly, given that it has already answered the hard question correctly.
Visually, we can imagine this using a Venn diagram. In a consistent model (right side of the image below), the circle representing “Solving Hard Problems” is almost entirely contained within the circle of “Solving Easy Problems.” In an inconsistent model (left), there is a large area where the model solves the hard problem but misses the easy one.

Mathematically, with \(P(a_i)\) and \(P(b_i)\) denoting the model's success probabilities on the easy and hard question of the \(i\)-th pair (estimated by sampling, as covered in section 4 below), the Consistency Score over the \(N\) pairs is:

\[ CS = \frac{\sum_{i=1}^{N} P(a_i)\,P(b_i)}{\sum_{i=1}^{N} P(b_i)} \]

Or, expressed conceptually as a conditional probability:

\[ CS = P(a \mid b) = \frac{P(a, b)}{P(b)} \]

Here, \(P(a \mid b)\) represents the likelihood of success on the easy task (\(a\)) given success on the hard task (\(b\)).
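Here is a minimal sketch of how that dataset-level score can be computed from per-question success probabilities, assuming the easy and hard questions of a pair are answered in independent runs (so the joint probability per pair is the product):

```python
def consistency_score(p_easy, p_hard):
    """Estimate CS = P(easy correct | hard correct) over a list of pairs.

    p_easy[i] and p_hard[i] are the estimated probabilities that the model
    solves the easy/hard question of pair i. Within a pair the two runs are
    treated as independent, so P(both) = p_easy[i] * p_hard[i].
    """
    joint = sum(pa * pb for pa, pb in zip(p_easy, p_hard))
    hard = sum(p_hard)
    return joint / hard if hard > 0 else 0.0

# Toy example: in pair 2 the model often solves the hard question (0.8)
# but rarely the easy one (0.3), which drags the score down.
print(consistency_score([0.9, 0.3, 1.0], [0.2, 0.8, 0.5]))  # ~0.61
```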
3. The Relative Consistency Score (RCS)
There is a catch with the raw Consistency Score: it is entangled with raw capability. A model that answers nearly every easy question correctly will post a high CS almost by default, while a model that almost never solves a hard question leaves too few “hard solved” cases for the conditional probability to mean much. On its own, the raw number does not tell us how rational a model is for its ability level.
To address this, the authors introduce the Relative Consistency Score (RCS). This metric contextualizes a model’s consistency against its raw capability.

The RCS measures where a model sits between a theoretical “Lower Bound” (the worst possible consistency for its accuracy level) and an “Upper Bound” (the best possible consistency).
The formula normalizes the score:

\[ RCS = \frac{CS - CS_{low}}{CS_{upp} - CS_{low}} \]
To calculate this, they derived mathematical bounds from the model's performance on the dataset. The lower bound (\(CS_{low}\)) assumes the model's success on easy and hard questions is independent (random), in which case the conditional probability collapses to the model's plain accuracy on the easy questions:

\[ CS_{low} = \frac{1}{N}\sum_{i=1}^{N} P(a_i) \]

The upper bound (\(CS_{upp}\)) assumes the model is as consistent as theoretically possible given the difficulty gap, i.e. it never succeeds on a hard question while missing the paired easy one:

\[ CS_{upp} = \min\!\left(1,\ \frac{\sum_{i=1}^{N} P(a_i)}{\sum_{i=1}^{N} P(b_i)}\right) \]
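Putting the pieces together, RCS is a min-max normalization of CS between the two bounds. A short sketch, using the same per-pair probabilities and the bound formulas written above (treat those exact bounds as an assumption rather than the paper's verbatim derivation):

```python
def relative_consistency_score(p_easy, p_hard):
    """RCS = (CS - CS_low) / (CS_upp - CS_low), from per-pair success
    probabilities p_easy[i] and p_hard[i]."""
    cs = sum(pa * pb for pa, pb in zip(p_easy, p_hard)) / sum(p_hard)  # raw CS
    cs_low = sum(p_easy) / len(p_easy)             # independence baseline: easy accuracy
    cs_upp = min(1.0, sum(p_easy) / sum(p_hard))   # best case given the accuracy gap
    return (cs - cs_low) / (cs_upp - cs_low)

# Toy example: easy and hard success are positively correlated across pairs,
# so the model lands well above the independence baseline.
print(relative_consistency_score([0.5, 0.9, 1.0], [0.1, 0.7, 0.9]))  # ~0.65
```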
4. Estimating Probabilities
In standard benchmarks, we usually ask an LLM a question once (greedy decoding) and check if it’s right or wrong. However, LLMs are probabilistic engines. To get an accurate Consistency Score, we need the true probability (\(P\)) that a model solves a problem.
The researchers used sampling techniques. For open-source models, they sampled answers 20 times and took the empirical success rate as the estimate:

\[ P \approx \frac{\text{number of correct answers}}{20} \]
For expensive closed-source models (like GPT-4), they used an Early Stopping technique to save costs while retaining statistical validity: sampling stops as soon as a correct answer is found (since high-performance models usually get it right quickly). If the first correct answer appears on the \(k\)-th sample, the probability is estimated as:

\[ P \approx \frac{1}{k} \]
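A sketch of both estimation strategies is below. The `ask_model` and `is_correct` callables are hypothetical placeholders for whatever inference and grading pipeline is in use, and the \(1/k\) early-stopping estimate follows the description above rather than a formula quoted from the paper:

```python
def estimate_prob_sampling(question, reference, ask_model, is_correct, n=20):
    """Open-source models: sample n answers and use the empirical success rate."""
    hits = sum(is_correct(ask_model(question), reference) for _ in range(n))
    return hits / n

def estimate_prob_early_stop(question, reference, ask_model, is_correct, max_n=20):
    """Closed-source models: stop at the first correct answer to save cost.
    If the first success arrives on sample k, estimate the probability as 1/k."""
    for k in range(1, max_n + 1):
        if is_correct(ask_model(question), reference):
            return 1.0 / k
    return 0.0  # no correct answer within the sampling budget
```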
Experiments and Results
The researchers tested a wide array of models, including GPT-4, GPT-3.5, Claude-3 Opus, Llama-2/3, and Qwen. The results provide a fascinating snapshot of the current state of AI reliability.
Main Findings
The table below summarizes the performance across all three domains (Code, Instructions, Math).

Key Takeaways from the Data:
- GPT-4 Turbo is the Consistency King: It achieved the highest average Consistency Score (92.2%). This suggests that stronger models are generally more rational in their problem-solving hierarchy.
- Capability Correlates with Consistency: There is a strong positive relationship between a model’s raw accuracy on hard problems and its consistency score. As models get smarter, they tend to make fewer “stupid” mistakes.

- Exceptions Exist: Interestingly, Claude-3 Opus, despite being a very strong model (sometimes outperforming GPT-4 on specific math tasks), had a slightly lower Consistency Score. This shows that high accuracy does not automatically guarantee high consistency.
Relative Consistency Analysis
When applying the Relative Consistency Score (RCS), we see that even the best models have room for improvement.
In the Code domain (shown below), GPT-4 Turbo has a high raw CS (88.1%), but its RCS is only 34.8%. In other words, relative to what a model with its accuracy profile could achieve, it is still underperforming on consistency. Conversely, some weaker models like Llama-2-70B score a high RCS, meaning they sit close to the ceiling of consistency achievable at their ability level: the problems they can solve in hard form, they reliably also solve in easy form.

Why Do LLMs Fail Easy Problems?
The numbers tell us that they fail, but the qualitative analysis tells us why. The authors conducted case studies to analyze specific instances where GPT-4 solved a hard problem but failed the easy version.
1. Distraction by Redundant Information
LLMs often struggle when easy problems contain “fluff” or extra details. In the example below, the model gets confused by the mention of “Thursday” in the easy prompt, misapplying it to the calculation.

2. Overthinking and Misinterpretation
Sometimes, the model anticipates complexity that isn’t there. In this travel cost example, the model correctly calculates the complex scenario (Hard) but misinterprets the simple request in the Easy scenario, calculating only ticket costs instead of the total.

3. Simple Computational Errors
Paradoxically, the “Hard” questions often force the model into a deeper reasoning mode (like Chain-of-Thought), which acts as a guardrail against errors. Easy questions might trigger a quicker, less careful generation path, leading to basic arithmetic fails.

Implications: How Do We Fix It?
The paper concludes with two significant experiments regarding how to improve consistency.
1. Train on Harder Data
The researchers fine-tuned models on datasets with varying ratios of easy vs. hard data. The results were clear: Hard data enhances consistency.
As shown in Figure 6, as the proportion of hard data in the training set increases (x-axis), the Consistency Score (CS) rises. This suggests that exposing models to difficult reasoning patterns generalizes downwards to easier tasks better than the reverse.
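For illustration, here is how such a sweep over the hard-data proportion might be set up. The data pools, sizes, and ratios are hypothetical, and the actual fine-tuning call is omitted:

```python
import random

def build_training_mix(easy_pool, hard_pool, hard_ratio, size, seed=0):
    """Sample a fine-tuning set whose fraction of hard examples is hard_ratio."""
    rng = random.Random(seed)
    n_hard = round(size * hard_ratio)
    mix = rng.sample(hard_pool, n_hard) + rng.sample(easy_pool, size - n_hard)
    rng.shuffle(mix)
    return mix

# Sweep the proportion of hard data, mirroring the x-axis of Figure 6,
# then fine-tune and measure CS at each point (fine-tuning step not shown):
# for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
#     train_set = build_training_mix(easy_pool, hard_pool, ratio, size=1000)
```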

2. Use Hard Examples in Prompts
For users who can’t retrain models, the solution lies in In-Context Learning (ICL). When providing “few-shot” examples in a prompt, using hard examples yields better consistency than using easy examples.
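In practice this just means picking harder worked examples for the few-shot block of your prompt. A minimal sketch with a made-up demonstration:

```python
HARD_DEMOS = [
    {
        "q": ("A shop sells pens at $2 and notebooks at $5. Tom buys 3 pens and "
              "2 notebooks, then gets a 10% discount on the total. How much does he pay?"),
        "a": "Pens: 3 * 2 = 6. Notebooks: 2 * 5 = 10. Total: 16. After discount: 16 * 0.9 = 14.4. Answer: $14.40",
    },
    # ... more hard, fully worked examples
]

def few_shot_prompt(question, demos=HARD_DEMOS):
    """Prepend hard worked examples before the actual (possibly easy) question."""
    shots = "\n\n".join(f"Q: {d['q']}\nA: {d['a']}" for d in demos)
    return f"{shots}\n\nQ: {question}\nA:"

print(few_shot_prompt("What is 17 times 8?"))
```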

Conclusion
The research presented in “Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?” highlights a critical gap between artificial and human intelligence. While humans build knowledge like a pyramid—where a broad base of simple skills supports a peak of advanced capability—LLMs are more like a Jenga tower. They can reach dizzying heights of performance, but missing blocks near the bottom make them surprisingly unstable.
The introduction of ConsisEval and the Consistency Score gives the AI community a new lens through which to view model evaluation. It forces us to ask not just “How many questions did the model get right?” but “Does the model’s performance make logical sense?”
The findings offer a clear path forward: to build more trustworthy AI, we shouldn’t just focus on solving the hardest riddles. We must ensure that in the pursuit of genius, the models don’t lose their common sense. By training on harder data and rigorously testing against consistency benchmarks, we can move closer to AI that is not just powerful, but reliable.