If you have ever taken a multiple-choice math test, you know the feeling: you work through a problem, arrive at an answer, and look at the options. If your answer is there, you circle it. But what if your answer was wrong, yet it was still listed as an option?

These incorrect options are called distractors. In high-quality education, distractors aren’t just random numbers; they are carefully crafted traps designed to catch specific misunderstandings. For example, if the question asks for \(2^3\), a good distractor is \(6\) (catching students who multiplied \(2 \times 3\)) rather than \(5\) (which is just a random error).
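To make this concrete, here is a toy sketch (my own illustration, not code from the paper) of the idea that a misconception is a rule mapping a problem to the specific wrong answer it produces:

```python
# Toy illustration (not from the paper): a misconception is a rule that
# maps a problem to the specific wrong answer it would produce.
def correct_answer(base: int, exp: int) -> int:
    return base ** exp                     # 2^3 = 8

def multiplies_base_by_exponent(base: int, exp: int) -> int:
    # Misconception: treats "2^3" as "2 x 3"
    return base * exp                      # yields 6, a diagnostic distractor

print(correct_answer(2, 3))                # 8
print(multiplies_base_by_exponent(2, 3))   # 6
```

Hand-writing rules like this is essentially what the rule-based systems discussed below do, which is also why they are so hard to scale.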

Creating these “diagnostic” distractors is incredibly difficult and time-consuming for teachers. While Large Language Models (LLMs) like GPT-4 are great at solving math problems, they are surprisingly bad at predicting how a student might get a problem wrong. More importantly, they often struggle to explain the reasoning behind a wrong answer.

In a recent paper, researchers from the University of Massachusetts Amherst and Eedi introduced DiVERT (Distractor Generation with Variational Errors Represented as Text). This new AI framework doesn’t just generate wrong answers; it uncovers the specific “error story” behind them. By teaching a smaller, open-source model to think like a confused student, they outperformed massive proprietary models like GPT-4o in generating useful math distractors.

The Problem: Random Guesses vs. Diagnosable Errors

In educational assessment, the goal isn’t just to grade a student; it is to diagnose their knowledge gaps. If a student selects a specific distractor, a teacher should be able to say, “Ah, you forgot to carry the one,” or “You confused the diameter with the radius.”

Existing automated approaches fall into two buckets:

  1. Rule-based systems: These are rigid and hard to scale beyond simple arithmetic.
  2. LLM prompting: You ask a model like GPT-4 to “generate wrong options.”

The issue with standard LLM prompting is the lack of interpretability. An LLM might give you a plausible wrong number, but it rarely surfaces the precise causal chain that leads to it. Without that causal chain, the wrong answer is useless for diagnosis.

The researchers argue that to generate a good distractor (\(d\)), you first need to understand the latent error (\(e\)) that causes it.

Enter DiVERT: A Variational Approach

The core innovation of DiVERT is how it structures this problem. Instead of mapping a Question directly to a Wrong Answer (\(s \to d\)), it forces the AI to go through an intermediate step: the Error Explanation (\(e\)).

The researchers frame this using a Variational Autoencoder (VAE) framework. Typically, VAEs are used in image generation, where an image is compressed into a numerical vector (the “latent space”) and then reconstructed. DiVERT does something fascinating: it treats text as the latent space.

The system is composed of three specific LLMs working in tandem, as illustrated below:

Figure 1: Overview of DiVERT’s variational pipeline for error explanation and distractor generation in math MCQs.

Let’s break down the three components shown in Figure 1:

  1. The Error Prior (\(p(e|s)\)): This model looks at the math question (\(s\)) and predicts a likely textual explanation of a student error (\(e\)). For example, “The student adds the numerators and denominators instead of finding a common denominator.”
  2. The Distractor Generator (\(p(d|s, e)\)): This model takes the question and the specific error text generated by the previous model to calculate the resulting wrong answer (\(d\)).
  3. The Error Identifier (\(q(e|s, d)\)): This is the “detective” model used during training. It looks at a question and a specific wrong answer, and tries to reverse-engineer the error explanation.

By training these models together, DiVERT learns a structured relationship between questions, misconceptions, and wrong answers.
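To make the architecture concrete, here is a minimal sketch of how the three components could be wired together, assuming a generic text-in, text-out LLM interface. The `generate` stub, prompts, and function names are my own illustration, not the paper’s actual implementation:

```python
def generate(prompt: str) -> str:
    # Stub standing in for a call to a fine-tuned LLM; replace with a real model.
    return "<model output for: " + prompt.splitlines()[0] + ">"

def error_prior(question: str) -> str:
    # p(e|s): propose a plausible textual student error for this question.
    return generate(f"Question: {question}\nDescribe a likely student error:")

def distractor_generator(question: str, error: str) -> str:
    # p(d|s,e): compute the wrong answer that this specific error produces.
    return generate(f"Question: {question}\nError: {error}\nWrong answer:")

def error_identifier(question: str, distractor: str) -> str:
    # q(e|s,d): used during training to reverse-engineer the error behind
    # an observed (question, wrong answer) pair.
    return generate(f"Question: {question}\nWrong answer: {distractor}\n"
                    "Which error leads to this answer?")

def make_distractor(question: str) -> tuple[str, str]:
    # Inference pipeline: question -> error text -> distractor.
    error = error_prior(question)
    return error, distractor_generator(question, error)
```

Note that only the first two models are needed at generation time; the error identifier exists to make training tractable.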

The Mathematics of Mistakes

To make this system work, the model needs to maximize the likelihood that the generated distractor is a “good” one (i.e., one a student would actually pick). The mathematical probability of generating a distractor \(d\) given a question \(s\) is expressed as the sum over all possible errors:

\[
p(d|s) = \sum_{e} p(d|s, e)\, p(e|s)
\]

Equation 1: Probability of a distractor given a question, marginalized over all possible latent errors.

Since summing over every possible English sentence describing an error is computationally intractable, the researchers instead maximize a standard surrogate objective: the Evidence Lower Bound (ELBO). This objective balances two goals:

  1. Reconstruction: Can the model accurately generate the correct distractor given the error?
  2. Regularization: Do the “detective” model’s inferred explanations stay close to the error prior \(p(e|s)\), i.e., to what we generally expect errors to look like?

The training objective is mathematically defined as:

\[
\log p(d|s) \ge \mathbb{E}_{q(e|s,d)}\big[\log p(d|s,e)\big] - D_{\mathrm{KL}}\big(q(e|s,d)\,\|\,p(e|s)\big)
\]

Equation 2: The ELBO training objective, a reconstruction term minus a KL regularization term.
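In training code, a single-sample Monte Carlo estimate of this bound is compact. The sketch below is my paraphrase of the objective, assuming one sampled error text per example; the paper’s exact bookkeeping (batching, KL weighting, the soft-token machinery described next) will differ:

```python
import torch

def neg_elbo(log_p_d_given_s_e: torch.Tensor,
             log_q_e: torch.Tensor,
             log_p_e: torch.Tensor,
             beta: float = 1.0) -> torch.Tensor:
    """Single-sample negative-ELBO estimate (illustrative, not the paper's code).

    log_p_d_given_s_e: log-likelihood of the distractor tokens given (s, e)
    log_q_e: log q(e|s,d) of the sampled error text e
    log_p_e: log p(e|s) of the same sampled error text
    """
    reconstruction = log_p_d_given_s_e     # how well the error e explains d
    kl_estimate = log_q_e - log_p_e        # single-sample estimate of KL(q || p)
    return -(reconstruction - beta * kl_estimate)
```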

The “Soft Token” Trick

There is a major technical hurdle here. In standard machine learning, you need the entire system to be differentiable so you can use backpropagation (the algorithm that updates the AI’s brain). However, text is discrete. You can’t slightly adjust the word “add” to become “multiply” using calculus.

To solve this, DiVERT uses soft tokens. Instead of selecting a hard word (like “numerator”), the model passes a probability distribution (a weighted mix of words) to the next stage. This lets gradient information flow backward through the “text” bottleneck, so the system can learn end-to-end.
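A minimal PyTorch sketch of the idea (the sizes and the exact relaxation here are illustrative, not the paper’s):

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 64
embedding = torch.nn.Embedding(vocab_size, embed_dim)

logits = torch.randn(vocab_size, requires_grad=True)  # next-token scores

# Hard selection would break gradients:  token_id = logits.argmax()
# Soft selection keeps them: a probability-weighted mix of all embeddings.
probs = F.softmax(logits, dim=-1)          # distribution over the vocabulary
soft_token = probs @ embedding.weight      # shape: (embed_dim,)

# Feed `soft_token` downstream in place of a normal token embedding;
# backprop can now flow through the "text" bottleneck.
soft_token.sum().backward()
assert logits.grad is not None
```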

To keep the model from drifting too far into nonsense during this process, they introduce a regularization term. This ensures the model’s error explanations remain grounded in the initial training data provided by teachers.

Equation 3: Regularization term.
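The equation itself is not reproduced in this post, but given the description above, a natural form (my assumption, not the paper’s verbatim term) is a KL penalty anchoring the learned error distribution to its teacher-data starting point:

\[
\mathcal{L}_{\text{reg}} = D_{\mathrm{KL}}\big(q_{\phi}(e|s,d)\,\|\,q_{\phi_0}(e|s,d)\big),
\]

where \(\phi_0\) denotes the error identifier’s parameters right after its initial fine-tuning on teacher-written error explanations.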

Experiments: David vs. Goliath

The researchers tested DiVERT using a real-world dataset from Eedi, a math learning platform. The dataset contained over 1,400 math questions with distractors and error explanations written by expert teachers.

They compared DiVERT (built on Mistral, a 7-billion-parameter open-source model) against GPT-4o, a state-of-the-art proprietary model presumed to be orders of magnitude larger.

Quantitative Results

The results were striking. Despite being a much smaller model, DiVERT outperformed or matched GPT-4o at distractor generation: producing wrong answers that match the distractors expert teachers actually wrote for these questions.

One of the most robust metrics they used was Prop@10 (Proportional Match), which measures what percentage of the ground-truth distractors were found in the AI’s top 10 guesses.
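As a sketch, the metric is simple to compute (the naming and matching rules here, e.g. string normalization, are my assumptions; the paper may define them differently):

```python
def prop_at_k(ground_truth: list[str], generated: list[str], k: int = 10) -> float:
    """Fraction of teacher-authored distractors found in the top-k generations."""
    top_k = set(generated[:k])
    return sum(1 for d in ground_truth if d in top_k) / len(ground_truth)

# Example: 2 of 3 teacher distractors appear among the model's top 10.
teacher = ["15", "6", "1"]
model_outputs = ["15", "30", "5", "6", "45", "60", "10", "3", "2", "20"]
print(prop_at_k(teacher, model_outputs))   # 0.666...
```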

As seen in Table 6 (provided in the supplementary analysis), DiVERT achieved a Prop@10 score of 68.75, beating the best GPT-4o method, which scored 63.89.

Table 6: Single fold performance on distractor generation for additional baselines and reference methods.

Learning with Less Data

One of DiVERT’s biggest strengths is its efficiency. Because it uses a variational approach (learning the structure of errors), it can learn effectively even when it doesn’t have a labeled error explanation for every single question.

Figure 2 below shows what happens when you hide error labels from the model. Even when 80% of the training data has no error explanations provided, DiVERT (the teal line) maintains high performance, whereas standard fine-tuning methods (the yellow and magenta lines) degrade rapidly.

Figure 2: Distractor generation performance with increasing percentages of error labels dropped.

Qualitative Analysis: Does it make sense?

Numbers are great, but in education, the quality of the content matters most. Does DiVERT generate errors that sound like real students?

Table 3 compares the output of DiVERT against GPT-4o for a question about the Lowest Common Multiple (LCM).

Table 3: Examples of errors and corresponding distractors generated by different approaches.

Notice the difference:

  • GPT-4o (Zero-shot) hallucinates complex concepts like “prime factors” that might not be relevant to the specific mistake resulting in ‘5’.
  • DiVERT identifies a very specific, grounded misconception: “Thinks they can just give any multiple of one of the numbers,” leading to the answer ‘15’. This is a highly plausible student error.

What happens when it fails?

DiVERT isn’t perfect. The researchers performed an error analysis (Table 4) and found that the most common failure mode is consistency. Sometimes, the model generates a perfect error description but calculates the wrong number for that error (or vice versa).

Table 4: Qualitative error analyses of generated errors and corresponding distractors.

In the example above, the model correctly identifies the error “divides the denominator,” but the resulting number doesn’t quite match that logic. This disconnect remains a challenge for future work.

Human Evaluation: The Teacher Test

Finally, the researchers asked real math teachers to blindly rate the quality of error explanations generated by humans, DiVERT, and GPT-4o. They rated them on relevance, correctness, and plausibility.

Table 5: Average error quality rated by math teachers.

The results in Table 5 are a significant win for open-source AI. Teachers rated DiVERT’s errors (3.07) as statistically indistinguishable from human-authored errors (3.23). Both significantly outperformed GPT-4o (2.56).

This suggests that generic “smart” models like GPT-4o often overcomplicate student errors or hallucinate reasons that don’t match the reality of a middle-school classroom. DiVERT, trained specifically to mimic student misconceptions, captures the “ground truth” of the classroom much better.

Conclusion: The Future of Personalized Learning

DiVERT demonstrates a crucial shift in how we apply AI to education. It moves beyond simply generating content (questions and answers) to generating pedagogical data (explanations of misconceptions).

By treating “errors” as a text-based latent variable, the model forces itself to be interpretable. It cannot just guess a wrong number; it must articulate why that number is a plausible mistake.

This technology opens the door for highly personalized automated tutoring systems. Instead of a generic “Incorrect, try again,” a system powered by DiVERT could look at a student’s wrong answer, map it to the latent text error, and provide feedback like: “It looks like you multiplied the numerators, but remember—when dividing fractions, we flip the second fraction first.”

For students and teachers alike, that difference is everything.