Introduction
In the rapidly evolving world of Large Language Models (LLMs), we have hit a peculiar wall: the students are becoming smarter than the tests. Benchmarks that were once considered difficult—covering everything from high school chemistry to professional law exams—are now being “saturated.” Models are scoring so high that it is becoming increasingly difficult to distinguish a good model from a great one.
When a benchmark gets saturated, researchers usually have two options. The first is to build a brand new, harder dataset from scratch. This is expensive, time-consuming, and requires expert human annotation. The second option is to take existing benchmarks and try to make them harder. Recent attempts have involved adding more “distractors” (wrong answers) to questions to lower the odds of guessing correctly. However, generating plausible distractors that don’t accidentally confuse the right answer is a massive challenge in itself.
But what if there was a simpler way? What if we could take any existing multiple-choice test and instantly make it significantly harder without writing a single new question?
Enter WiCkeD (Wild-Card Distractor), a new method proposed by researchers from the University of the Basque Country and Reka AI. Their approach is deceptively simple: randomly replace one of the options with “None of the above.”
This blog post will dive deep into the WiCkeD methodology. We will explore why this simple change wreaks havoc on modern LLMs, the algorithmic nuance required to implement it correctly, and what the results tell us about the reasoning capabilities of today’s most popular models.
Background: The Problem with Multiple Choice
To understand why WiCkeD is necessary, we first need to look at how we currently evaluate AI. Multiple Choice Question (MCQ) benchmarks, such as MMLU (Massive Multitask Language Understanding) or CommonsenseQA, are the industry standard. They consist of a question and a set of options (A, B, C, D).
However, these benchmarks suffer from a few critical vulnerabilities when applied to LLMs:
- Process of Elimination: An LLM might not know the correct answer. However, if it can identify that options A, B, and C are definitely wrong, it will select D by default. It gets the point, but it didn’t actually “know” the answer—it just knew what the answer wasn’t.
- Probability Bias: Research has shown that models often have biases toward specific option labels (like preferring ‘C’ over ‘A’) or rely on surface-level patterns rather than deep comprehension.
- Lack of Negative Knowledge: One of the hardest things for an AI (and humans) to do is to recognize absence. Standard MCQs rarely test the ability to say, “I know enough about this topic to conclude that none of these answers are right.”
In educational psychology, “None of the above” is a well-known tool. It prevents students from guessing via elimination. To answer a “None of the above” question correctly, you must verify the falsity of every single distractor and the correctness of the answer. If the correct answer is missing, you must have the confidence to reject all provided options.
The researchers hypothesized that LLMs, which are essentially statistical prediction engines, would struggle significantly with this format.
The WiCkeD Methodology
The core contribution of this paper is a framework that automatically transforms any existing MCQ benchmark into a WiCkeD variant.
The Core Concept
The intuition is straightforward. In a standard MCQ, the model’s task is:
\[ \text{Select } \arg\max\big(P(A), P(B), P(C), P(D)\big) \]

In a WiCkeD MCQ, one option is removed and replaced with “None of the above.”
- Scenario 1: The algorithm removes a distractor (wrong answer). The correct answer remains. The model must recognize the correct answer and realize “None of the above” is wrong.
- Scenario 2: The algorithm removes the correct answer. Now, “None of the above” becomes the correct choice. The model must recognize that all remaining options are incorrect.
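To make the two scenarios concrete, here is a minimal Python sketch of the swap. This is my own illustration of the idea, not the authors’ released code; the function name and data layout are assumptions.

```python
import random

NOTA = "None of the above"

def wicked_transform(question, options, answer_idx, rng=random):
    """Replace one randomly chosen option with 'None of the above'.

    `options` is a list of answer strings and `answer_idx` is the index of
    the correct one. Pass a seeded `random.Random` as `rng` for reproducibility.
    Returns the question, the new options, and the new gold index.
    """
    options = list(options)
    removed_idx = rng.randrange(len(options))
    options.pop(removed_idx)
    options.append(NOTA)  # the wildcard takes the last slot, as in Figure 1

    if removed_idx == answer_idx:
        # Scenario 2: the correct answer was removed, so the wildcard
        # ("None of the above") becomes the gold label.
        new_answer_idx = len(options) - 1
    elif removed_idx < answer_idx:
        # Scenario 1: a distractor before the answer was removed;
        # the correct answer shifts up one position but stays correct.
        new_answer_idx = answer_idx - 1
    else:
        # Scenario 1: a distractor after the answer was removed.
        new_answer_idx = answer_idx

    return question, options, new_answer_idx
```

Whether the wildcard keeps the removed option’s slot or is appended as the final choice is a presentation detail; the sketch appends it last, matching the example in Figure 1.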
Let’s look at a concrete example from the MMLU-Pro dataset to see how this changes the game.

In Figure 1 above, look at the second question regarding the force on a merry-go-round.
- Left (Original): The correct answer is “Centrifugal” (Choice A). The model (Llama-3.1 8B) correctly identifies it with high confidence.
- Right (WiCkeD): The option “Centrifugal” has been removed. Choice C is now “Torsal”, and Choice D is “None of the Above.” The actual correct answer is now Choice D (because Centrifugal is missing). However, the model incorrectly pivots to Choice C (“Torsal”).
This illustrates that while the model knew “Centrifugal” was associated with the question, it lacked the reasoning capability to realize that “Torsal” was wrong and that the true answer was missing.
The Challenge of Coherence: SBA vs. SCA
You might think you can just write a script to randomly swap options in any dataset. However, the researchers identified a critical flaw in that approach. Not all multiple-choice questions are built the same. Broadly, they fall into two categories:
- Single Correct Answer (SCA): There is one factual truth. All other options are false. (e.g., “What is 2+2?” Options: 3, 4, 5. Only 4 is correct).
- Single Best Answer (SBA): There might be multiple options that are technically true or partially true, but one is the most appropriate or specific. (e.g., “What is the best treatment for X?” Options might include two valid treatments, but one is the primary standard of care).
If you blindly apply the WiCkeD transformation to an SBA question, you might break the logic of the question.
Consider the example in Figure 2:

In the top question (original), the user asks for the definition of “media convergence.” Option D is the best answer. Option A is the second best answer.
If the algorithm removes Option D (the best answer) and adds “None of the above,” the logical answer should conceptually be “None of the above” (because the best answer is gone). However, Option A (the second best answer) is still there. In the absence of D, Option A becomes the new “best” answer. If the benchmark marks “None of the above” as correct, it punishes the model for selecting A, which is actually a valid choice among the remaining options. This creates an incoherent dataset.
The Solution: An SBA Classifier
To solve this, the authors built a pipeline to filter out SBA questions.
- They sampled questions from major benchmarks.
- They used GPT-4o-mini to label them as SBA (Single Best Answer) or SCA (Single Correct Answer).
- They trained a BERT-based classifier on these labels to be cost-effective and fast.
The rule for WiCkeD is: If a question is classified as SBA, copy it verbatim. Do not change it.
This ensures that the “None of the above” logic is only applied to factual questions where removing the answer definitely makes all other options false. This quality control step is vital for maintaining the validity of the benchmark.
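Wired together with the transformation sketched earlier, the filtering step might look roughly like this. The checkpoint path, label names, and field names are assumptions; the exact pipeline details live in the paper and its code release.

```python
from transformers import pipeline

# Hypothetical checkpoint: a BERT-based classifier fine-tuned on the
# GPT-4o-mini "SBA" vs. "SCA" labels described above.
sba_classifier = pipeline("text-classification", model="path/to/sba-sca-bert")

def maybe_wicked(example):
    """Apply the WiCkeD transformation only to SCA questions."""
    text = example["question"] + " " + " ".join(example["options"])
    label = sba_classifier(text, truncation=True)[0]["label"]

    if label == "SBA":
        # Single Best Answer: copy the question verbatim, unchanged.
        return example

    question, options, answer_idx = wicked_transform(
        example["question"], example["options"], example["answer_idx"]
    )
    return {"question": question, "options": options, "answer_idx": answer_idx}
```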
Experimental Setup
The researchers applied WiCkeD to six popular benchmarks:
- MMLU & MMLU-Pro: General knowledge and reasoning.
- MMLU-Redux: A cleaner version of MMLU.
- CommonsenseQA: Common sense reasoning.
- TruthfulQA: Truthfulness and susceptibility to common misconceptions.
- ARC-Challenge: Challenging science-exam reasoning.
They evaluated 18 open-weight LLMs, including variants of:
- Qwen-2.5 (7B, 14B, 72B)
- Llama-3.1 (8B, 70B)
- Mistral (7B)
- Gemma-2 (9B, 27B)
- DeepSeek-R1 (Distilled models)
Prompting Strategy
The models were evaluated using standard multiple-choice prompting. The model assigns a probability to each answer \(a\) given the few-shot context \(c\) and the question \(q\):

\[ P(a \mid c, q) \]

The model ultimately selects the answer with the highest probability:

\[ \hat{a} = \arg\max_{a \in \mathcal{A}} P(a \mid c, q) \]
They used 5-shot prompting (providing 5 examples) to ensure the model sees at least one instance where “None of the above” is the correct answer, helping it understand the format.
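Concretely, for each question the evaluator scores every option letter under the 5-shot prompt and keeps the most likely one. Below is a simplified sketch using Hugging Face Transformers; the prompt format and model choice are assumptions, and the paper’s actual harness may differ (e.g. in length normalization).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def predict_letter(few_shot_prefix, question, options):
    """Return the option letter with the highest log P(a | c, q)."""
    letters = "ABCDEFGHIJ"[: len(options)]
    body = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    prompt = f"{few_shot_prefix}\nQuestion: {question}\n{body}\nAnswer:"

    scores = []
    for letter in letters:
        # Assumes " A", " B", ... each tokenize to a single token,
        # which holds for most modern BPE tokenizers.
        ids = tokenizer(prompt + " " + letter, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # logits[:, -2] predicts the final token, i.e. the option letter.
        logprobs = torch.log_softmax(logits[0, -2], dim=-1)
        scores.append(logprobs[ids[0, -1]].item())

    return letters[max(range(len(letters)), key=scores.__getitem__)]
```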
Results and Analysis
The results were stark. Almost every model saw a massive performance drop when switching from the original benchmark to the WiCkeD variant.
The Performance Drop
Let’s examine the main results in Table 1.

The column \(\Delta\) (Delta) represents the drop in accuracy.
- Significant Degradation: On average, models dropped by 12.1 points.
- Qwen-2.5 7B suffered the worst hit, dropping by 19.7 points. This suggests that while Qwen excels on standard benchmarks, it relies heavily on process of elimination or surface patterns that WiCkeD disrupts.
- Robustness of Reasoning Models: The DeepSeek-R1 models (distilled versions) showed the smallest drops (around 7%). DeepSeek-R1 is known for its “reasoning” training (Chain of Thought). This implies that models trained to “think” rather than just predict tokens are better at handling the “None of the above” curveball.
- Shuffling the Leaderboard: WiCkeD changes the rankings. Models that looked equal on MMLU suddenly showed gaps. For example, Qwen2.5-7B originally performed close to Llama-3.1-70B, but on WiCkeD, it lagged behind by 13%.
Does “Chain of Thought” Help?
One might argue that the models failed simply because they weren’t given enough time to “think.” If we use Chain of Thought (CoT) prompting—where the model is asked to explain its reasoning before answering—does the performance gap disappear?
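In practice, the two evaluation modes differ mainly in the prompt template and in how the final answer is read out. A hedged sketch of what that difference can look like follows; the exact wording the authors used may differ.

```python
import re

DIRECT_TEMPLATE = """{few_shot_examples}
Question: {question}
{options_block}
Answer:"""

COT_TEMPLATE = """{few_shot_examples}
Question: {question}
{options_block}
Think step by step: check every option, including "None of the above",
then end with a line of the form "Answer: <letter>".
"""

def parse_cot_answer(generation: str):
    """Pull the final 'Answer: X' letter out of a CoT generation."""
    letters = re.findall(r"Answer:\s*([A-J])", generation)
    return letters[-1] if letters else None
```

With the direct template the letter is scored from log-probabilities as in the earlier sketch; with the CoT template the model generates free-form reasoning first and the letter is parsed from the text.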
The researchers tested this on MMLU, MMLU-Pro, and MMLU-Redux.

As shown in Table 2, using CoT generally improves scores (the absolute numbers in the “CoT WiCkeD” column are higher than “Direct WiCkeD”). However, the degradation (\(\Delta\)) remains.
Even with CoT, the models perform significantly worse on WiCkeD than on the original datasets. This proves that the difficulty of WiCkeD isn’t just a formatting trick; it represents a genuine increase in the reasoning complexity required to solve the problem.
Interestingly, Instruction-Tuned (IT) models handled the transition better than Base models when using CoT.
Analyzing Model Behavior
Why do Instruction-Tuned models perform better with CoT? The researchers analyzed the specific changes in how models answered.

Figure 3 visualizes three categories of behavior:
- Consistent Responses (Blue): The model answers correctly in both Original and WiCkeD.
- New Corrects (Orange): The model was wrong originally but got the WiCkeD version right (rare, but happens).
- Reversed Corrects (Green): The model was right originally but failed WiCkeD.
The chart compares a Base model (left) vs. an Instruction-Tuned model (right). The Instruction-Tuned model has a slightly higher percentage of “New Corrects” and different distributions of errors. The qualitative analysis in the paper suggests that Instruction-Tuned models are less prone to hallucinating an answer when the correct one is missing; they are more willing to select the “None of the above” option when their reasoning leads them there.
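The three categories in Figure 3 can be reproduced from per-question correctness on the two benchmark variants; here is a minimal sketch (the dict-of-booleans format is an assumption):

```python
def behavior_breakdown(original_correct, wicked_correct):
    """Split questions into the Figure 3 categories.

    Both arguments map question id -> bool (answered correctly).
    Returns the share of questions in each category.
    """
    counts = {"consistent": 0, "new_corrects": 0, "reversed_corrects": 0}
    for qid, orig in original_correct.items():
        wick = wicked_correct[qid]
        if orig and wick:
            counts["consistent"] += 1         # right on both versions
        elif wick and not orig:
            counts["new_corrects"] += 1       # wrong originally, right on WiCkeD
        elif orig and not wick:
            counts["reversed_corrects"] += 1  # right originally, wrong on WiCkeD
    total = len(original_correct)
    return {name: n / total for name, n in counts.items()}
```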
Conclusion and Implications
The WiCkeD paper introduces a fascinating paradox: sometimes, the best way to test knowledge is to remove the answer.
By randomly replacing options with “None of the above,” the researchers exposed that many LLMs are “test-wise”—they are good at taking tests, but perhaps less robust in their actual knowledge than we thought. WiCkeD effectively neutralizes the “process of elimination” strategy and forces models to verify information more rigorously.
Key Takeaways:
- Benchmarking is harder than it looks: High scores on MMLU don’t always mean high intelligence.
- Absence is information: Detecting that a correct answer is missing requires higher-order reasoning than picking a correct answer that is present.
- Reasoning models win: Models like DeepSeek-R1, which are optimized for reasoning, are much more resilient to this type of perturbation.
This method offers a cost-effective path forward for the AI community. Instead of spending millions creating new datasets, we can make our current ones “wicked” hard, pushing the next generation of models to be not just good guessers, but true reasoners.