If you have ever taken a multiple-choice exam, you know the drill: you read a question (the stem), identify the correct answer, and ignore the other options. Those incorrect options have a specific name: distractors.
For a student, distractors are merely hurdles. For an educator, however, creating them is a massive design challenge. A good distractor must be plausible enough to test the student’s understanding but clearly incorrect to avoid ambiguity. If the distractors are too easy, the test is useless; if they are confusingly similar to the answer, the test is unfair.
With the rise of Artificial Intelligence, the laborious task of manually writing these options is shifting toward automation. A recent comprehensive survey by Alhazmi et al. explores the field of Distractor Generation (DG). This post breaks down their research, explaining how AI models—from early rule-based systems to modern Large Language Models (LLMs)—are learning the subtle art of being wrong.
The Landscape of Distractor Generation
Distractor Generation is a subtask of Natural Language Generation (NLG). The objective is simple: given a question and a correct answer (and often a supporting passage), generate a set of options that are semantically relevant but factually incorrect.
The researchers categorize DG into two primary domains based on the type of assessment:
- Fill-in-the-Blank (FITB): Also known as Cloze tests. The system must predict incorrect words or phrases to fill a gap in a sentence.
- Multiple-Choice Questions (MCQ): This includes standard Question Answering (QA), Reading Comprehension (RC), and Multi-modal tasks (involving images).
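Whatever the variant, most DG systems consume the same kind of record: some context, a question (or a sentence with a blank), the correct answer, and the gold distractors they are trained to produce. A minimal sketch of such an item, with hypothetical field names rather than any specific dataset's schema:

```python
# Illustrative shape of one DG training instance. Field names are hypothetical;
# real benchmark datasets each define their own schema.
mcq_item = {
    "passage": "The respiratory system brings air into the lungs, where gas exchange occurs...",
    "question": "Which organ is the main site of gas exchange?",
    "answer": "lungs",
    "distractors": ["kidneys", "intestines", "liver"],  # the outputs a DG model must generate
}
```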
To visualize how extensive this field has become, the authors provide a taxonomy of the current research landscape, covering tasks, datasets, and methods.

As shown in Figure 1, the field has moved rapidly from traditional methods toward deep neural networks and pre-trained models. Before we dive into the complex architectures, let’s look at the tasks themselves.
The Tasks: Text and Vision
The most common form of DG deals with text. For example, in a science exam, if the answer is “lungs,” the system needs to understand the context of “respiratory system” to suggest “kidneys” or “intestines” rather than “steering wheel” or “happiness.”
However, the field is evolving beyond text. Multi-modal Question Answering requires the AI to look at an image and generate distractors based on visual cues.

In the example above (Figure 2), the model must recognize the objects in the image. If the question asks “What is white?”, and the answer is “Vanilla ice cream,” the distractors must be other objects present or plausible in the scene (like “Plates” or “Snow”), but incorrect for the specific bounding box being queried.
Similarly, there is Visual Cloze (shown below), where the “blank” is a missing image in a sequence or a recipe.

Here, the AI must understand the temporal or logical sequence of a recipe (e.g., cutting fruit comes before serving it) to generate or select images that look related but are contextually wrong.
The Evolution of Methods
The survey details a fascinating progression in how computers generate these distractors. We can divide this timeline into three distinct eras: Traditional, Deep Neural Networks, and Pre-trained Language Models.
1. Traditional Methods: Rules and Ontologies
In the early days, DG relied on strict rules and static databases.
- Corpus-based: These methods analyze word frequency and grammar. If the correct answer is a past-tense verb like “ran,” the system looks for other past-tense verbs.
- Knowledge-based: These rely on structured databases like WordNet. If the answer is “Dog,” the system looks at the ontology tree to find “siblings” of the concept, such as “Cat” or “Wolf.”
While these methods ensure the distractors are related, they often lack context. A knowledge base might suggest “Bark” as a distractor for “Trunk” because they are both tree parts, but if the question is about an elephant, “Bark” makes no sense.
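The knowledge-based lookup is easy to reproduce with NLTK's WordNet interface. The sketch below gathers co-hyponyms (concepts sharing a hypernym with the answer) as candidate distractors; the function and its filtering are an illustration of the idea, not a system from the survey.

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def sibling_distractors(answer: str, limit: int = 5) -> list[str]:
    """Collect co-hyponyms ("siblings" under a shared hypernym) as candidate distractors."""
    candidates = set()
    for synset in wn.synsets(answer, pos=wn.NOUN):
        for parent in synset.hypernyms():
            for sibling in parent.hyponyms():
                for lemma in sibling.lemma_names():
                    name = lemma.replace("_", " ")
                    if name.lower() != answer.lower():
                        candidates.add(name)
    return sorted(candidates)[:limit]

# Yields candidates such as "fox", "jackal", or "wolf"; exact output depends on the WordNet version.
print(sibling_distractors("dog"))
```

Because the lookup never sees the question or the passage, it reproduces exactly the context failure described above.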
2. Deep Neural Networks (DNNs)
The introduction of Sequence-to-Sequence (Seq2Seq) models changed the game. Instead of looking up words in a database, models began “reading” the passage and “generating” distractors word-by-word.
Key architectures in this era include:
- Hierarchical Encoder-Decoder (HRED): This model processes the passage at two levels—word-level and sentence-level. It uses attention mechanisms to focus on specific parts of the text that are relevant to the question but not the correct answer.
- Static vs. Dynamic Attention: Researchers developed attention variants and training penalties to ensure the model doesn't accidentally generate the correct answer. With "negative answer regularization," the model is penalized if its generated distractor is too similar to the actual answer (a toy sketch of such a penalty follows this list).
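The survey does not pin this down to a single formula, but the intuition behind negative answer regularization can be sketched as a standard generation loss plus a penalty on answer similarity. The margin, weight, and choice of representations below are illustrative assumptions, not the exact objective of any surveyed paper.

```python
import torch
import torch.nn.functional as F

def distractor_objective(dist_logits, dist_targets, dist_repr, answer_repr,
                         margin: float = 0.3, alpha: float = 0.5):
    """Toy objective: token-level generation loss plus a penalty whenever the
    generated distractor's representation is too close to the answer's."""
    # Cross-entropy over the generated distractor tokens (assume 0 is the padding id)
    gen_loss = F.cross_entropy(dist_logits.view(-1, dist_logits.size(-1)),
                               dist_targets.view(-1), ignore_index=0)
    # Penalize cosine similarity above a margin, discouraging "distractors" that restate the answer
    similarity = F.cosine_similarity(dist_repr, answer_repr, dim=-1)
    penalty = torch.clamp(similarity - margin, min=0.0).mean()
    return gen_loss + alpha * penalty
```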
3. Pre-trained Language Models (PLMs)
This is the current state-of-the-art. Models like BERT, T5, and GPT have revolutionized DG because they have read massive amounts of text and understand nuance.
There are two main ways PLMs are used:
- Fine-Tuning: Taking a model like T5 (which is designed to convert text to text) and training it specifically on datasets of exam questions; a sketch of this text-to-text framing follows this list.
- Prompting: Using Large Language Models (LLMs) like GPT-3 or GPT-4. This approach requires no retraining; you simply give the model instructions.
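For the fine-tuning route, the usual trick is to serialize the question, answer, and passage into a single input string and train the model to emit a distractor as its output string. Below is a rough sketch with Hugging Face transformers; the prompt prefix and field order are assumptions, and an off-the-shelf t5-small will not produce useful distractors until it has been fine-tuned on exam data.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

context = "The lungs are the main site of gas exchange in the respiratory system."
question = "Which organ is the main site of gas exchange?"
answer = "lungs"

# Hypothetical serialization; papers differ on the exact prefix and field order.
source = f"generate distractor: question: {question} answer: {answer} context: {context}"
inputs = tokenizer(source, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=True, num_return_sequences=3)
print([tokenizer.decode(o, skip_special_tokens=True) for o in outputs])
```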
The survey highlights different prompting strategies that yield different results.

As illustrated in Figure 3, prompt engineering is critical:
- Template Learning (Single-stage): You mask the answer and ask the model to fill it in.
- Template Learning (Multi-stage): A “Chain of Thought” approach. The model first extracts keywords, generates a question, and then generates the distractor. This mimics human reasoning.
- In-Context Learning (Few-shot): You show the model examples of good questions and distractors (highlighted in red and green in the image) before asking it to generate a new one. This drastically improves the output by setting a pattern for the AI to follow.
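In practice, the few-shot setup is nothing more than careful prompt construction. A minimal sketch, where the instruction wording and the worked examples are my own rather than the prompts used in the survey:

```python
# The worked examples establish the pattern the model is asked to continue.
few_shot_prompt = """You write multiple-choice distractors: options that are plausible but incorrect.

Question: What gas do plants absorb during photosynthesis?
Answer: carbon dioxide
Distractors: oxygen; nitrogen; methane

Question: Which planet is known as the Red Planet?
Answer: Mars
Distractors: Venus; Jupiter; Mercury

Question: Which organ pumps blood through the body?
Answer: heart
Distractors:"""

# Send few_shot_prompt to any instruction-following LLM endpoint; no retraining is involved.
```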
Experiments and Evaluation: Is the AI actually good at this?
How do we judge whether a computer is good at making up wrong answers? The researchers discuss two categories of evaluation: automatic metrics and human (qualitative) judgment.
Automatic Metrics
Researchers use metrics like BLEU and ROUGE, which are standard in machine translation and summarization. They measure n-gram overlap: how many words and short word sequences the AI’s generated distractor shares with a “gold standard” distractor written by a human.
- The Problem with Metrics: These metrics are often flawed for DG. If the human wrote “Cat” and the AI wrote “Kitten,” the BLEU score is near zero (no word overlap), even though “Kitten” is a perfectly valid distractor. The snippet below makes this concrete.
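The mismatch is easy to demonstrate with NLTK's sentence-level BLEU (unigram weights and smoothing are chosen here only to keep the toy example well-defined):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = [["cat"]]      # the human-written "gold" distractor
hypothesis = ["kitten"]    # the model-generated distractor
smooth = SmoothingFunction().method1

score = sentence_bleu(reference, hypothesis, weights=(1.0,), smoothing_function=smooth)
print(f"{score:.3f}")      # close to 0: no word overlap, although "kitten" is a fine distractor
```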
Qualitative Analysis (The Real Test)
The survey provides a critical look at where these models fail. Despite the power of LLMs, they struggle with three main pillars of valid distractors: Plausibility, Reliability, and Diversity.
Reliability Issues: Sometimes, the AI generates a distractor that is actually correct, invalidating the question. Other times, it generates options that are nonsensical.

Table 4 highlights these specific failures:
- Valid Answer Error: In example (1), the model suggests “glucose” as a distractor for an energy source. The problem? Glucose is a main source of energy. The distractor is correct, making the question broken.
- Context Error: In example (2), the answer is “fair.” The BART model suggests “unfair.” While logically opposite, an option that is the direct antonym of the answer is usually too obviously wrong in a multiple-choice context (exactly how the table labels it), making the test too easy.
- Repetition: In example (3), the T5 model gets stuck in a loop, generating “think, think, think.”
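Some of these failures can be screened out mechanically before an item ever reaches a student. The filter below is a heuristic of my own, not a method from the survey: it flags answer duplicates, degenerate repetition, and direct WordNet antonyms, while genuinely "also correct" options like the glucose case still require human or model verification.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def is_suspicious(distractor: str, answer: str) -> bool:
    """Heuristic reliability check for a generated distractor (illustrative only)."""
    d, a = distractor.lower().strip(), answer.lower().strip()
    if d == a:
        return True                                    # duplicates the correct answer
    tokens = [t.strip(".,;") for t in d.split()]
    if len(tokens) > 1 and len(set(tokens)) == 1:
        return True                                    # e.g. "think, think, think"
    antonyms = {ant.name().replace("_", " ")
                for syn in wn.synsets(a)
                for lemma in syn.lemmas()
                for ant in lemma.antonyms()}
    return d in antonyms                               # e.g. "unfair" when the answer is "fair"

print(is_suspicious("think, think, think", "reflect"))  # True
print(is_suspicious("unfair", "fair"))                  # True
```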
Validity in Reading Comprehension: The challenge increases when the AI has to generate full sentences rather than single words.

Table 5 reveals deeper semantic issues:
- Semantic Similarity: In example (1), the model generates “Radiation is harmless” and “Radiation can’t hurt us.” These mean the same thing. If a student sees two options that mean the same thing, they can often eliminate both immediately (since there can only be one correct answer). This is a “test-wiseness” flaw.
- Bias: In example (2), the model relies on societal biases found in its training data (e.g., associating “attractive” with specific physical traits not mentioned in the text), which creates unfair assessment items.
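A related sanity check targets the first flaw: embed all options and flag near-paraphrases before the question is published. The sketch below uses sentence-transformers; the model choice and the 0.8 threshold are arbitrary assumptions, not part of the survey.

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
options = ["Radiation is harmless.", "Radiation can't hurt us.", "Radiation only affects plants."]
embeddings = model.encode(options, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

# Flag option pairs that are likely paraphrases of each other
for i in range(len(options)):
    for j in range(i + 1, len(options)):
        sim = similarities[i][j].item()
        if sim > 0.8:
            print(f"Near-duplicate options: {options[i]!r} / {options[j]!r} (cosine {sim:.2f})")
```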
Conclusion and Future Directions
Alhazmi et al. conclude that while we have moved far beyond simple dictionary lookups, Distractor Generation is not yet a solved problem.
The current generation of AI models excels at fluency—they write grammatically correct sentences. However, they struggle with the logic required for educational assessment. A good distractor requires a “Theory of Mind”—understanding what a student might misunderstand.
Key Future Directions Identified:
- Trustworthy Generation: Reducing hallucinations and ensuring the “wrong” answers aren’t accidentally “right.”
- Educational Deployment: Integrating these models into real learning platforms, but with a “Human-in-the-Loop” to verify quality.
- Multi-modal Expansion: Moving deeper into video and audio distractors, which remain largely unexplored.
For students and developers interested in NLP, this field offers a unique challenge: it is one of the few areas in AI where the goal is to generate false information that is believable, yet verifiable as false. As models get smarter, the line between a “good distractor” and a “hallucination” will become the next frontier of research.