Introduction
Imagine you are reading a dense legal contract or a complex medical journal. You aren’t an expert, so you turn to an AI assistant—like ChatGPT or a specialized document reader—to help you understand it. You ask a question based on your limited understanding: “What is the penalty if the tenant paints the walls?”
The AI scans the document and replies: “The document does not mention penalties for painting walls.”
Technically, the AI is correct. But it isn’t helpful. Perhaps the document talks about “alterations to the property” or “unauthorized modifications.” You, the user, didn’t know the right terminology, so you hit a dead end. This scenario highlights a significant gap in current Question Answering (QA) systems. While Large Language Models (LLMs) are getting better at telling us when they don’t know the answer, they are still quite poor at helping us ask the right question.
In this post, we will explore a fascinating paper titled “I Could’ve Asked That: Reformulating Unanswerable Questions” by researchers from Cornell University. We will dive into why users ask unanswerable questions, how we can evaluate an AI’s ability to fix them, and why even the most powerful models like GPT-4 struggle with this task.
The Problem: Presupposition Errors
When we seek information from unfamiliar documents, we often come to the table with assumptions. Sometimes, those assumptions are wrong. In the field of Natural Language Processing (NLP), these are often called presupposition errors.
A presupposition error occurs when a question takes something for granted that conflicts with the source text or simply isn’t there. For example, if you ask, “Why did the company fire the CEO in 2020?” but the CEO actually resigned, your question contains a false premise. The document cannot answer “Why they were fired” because the event never happened.
Research shows that around 30% of information-seeking questions contain these types of errors.
Current AI systems handle this by simply detecting the error and refusing to answer. While this prevents hallucinations (the AI making up fake reasons for the firing), it leaves the user stuck. A truly helpful assistant shouldn’t just say “No”; it should say, “I can’t answer that, but here is a relevant question I can answer.”
The Goal: Question Reformulation
The researchers propose a new task: Question Reformulation. The system must first detect that a question is unanswerable, and then rewrite it into a valid question that is grounded in the document and relevant to the user’s original intent.

As shown in Figure 1 above, the user asks about Justice Scalia’s decision to retire. The document, however, never mentions retirement; it discusses his appointment. A standard model might just say “Unanswerable.” The proposed system, however, recognizes the disconnect and suggests: “Why was Scalia appointed to be a Justice?” This pivots the conversation back to verifiable facts without losing the thread of the user’s interest.
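To make the task concrete, here is a minimal sketch of what a detect-then-reformulate pipeline could look like. The `call_llm` helper and the prompt wording are assumptions made for illustration; the paper defines the task itself, not this particular implementation.

```python
# A minimal sketch of a detect-then-reformulate pipeline. `call_llm` is a
# hypothetical helper standing in for whatever chat-completion API you use;
# the prompts are illustrative, not the paper's exact wording.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion endpoint."""
    raise NotImplementedError

def answer_or_reformulate(document: str, question: str) -> dict:
    # Step 1: check whether the question is answerable from the document.
    verdict = call_llm(
        f"Document:\n{document}\n\nQuestion: {question}\n"
        "Can this question be answered using only the document? Reply YES or NO."
    ).strip().upper()

    if verdict.startswith("YES"):
        answer = call_llm(
            f"Document:\n{document}\n\nAnswer the question: {question}"
        )
        return {"status": "answered", "text": answer}

    # Step 2: the question is unanswerable, so propose a related question
    # that *is* grounded in the document and close to the user's intent.
    reformulation = call_llm(
        f"Document:\n{document}\n\nThe question '{question}' cannot be "
        "answered from the document. Rewrite it into a closely related "
        "question that the document can answer. Return only the new question."
    )
    return {"status": "reformulated", "text": reformulation}
```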
Strategies for Fixing Questions
Before we can teach an AI to fix questions, we need to understand how humans do it. The researchers analyzed how humans reformulate bad questions to make them answerable. They identified several distinct strategies.

As detailed in Table 1, humans tend to use three main tactics when faced with an unanswerable premise:
- Correction: If the user implies a falsehood (e.g., “What foods are adulterated with turmeric?”), the human corrects it to match reality (“What substances are commonly adulterated in turmeric?”).
- Generalization: If the specific detail requested isn’t there (e.g., the exact date a law was passed), the human broadens the scope to something verifiable (e.g., “Has the law been proposed?”).
- Nearest Match / Specification: If the user asks a vague or slightly off-target question, the human refines it to the specific topic covered in the text. For example, changing “How many arms does Krishna have?” (unverifiable in the text) to “How is Krishna typically depicted?” (verifiable).
This human analysis serves as a roadmap. If an AI is going to be helpful, it needs to be capable of these kinds of semantic shifts—sometimes correcting a fact, sometimes zooming out, and sometimes zooming in.
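If you wanted to work with this taxonomy programmatically, a natural first step is to encode it as labeled exemplars. The structure below is an illustrative sketch (the "generalization" original is paraphrased from the law example above), not the paper's annotation format.

```python
# The three strategies, encoded as labeled exemplars (adapted from the
# examples above). Such a structure could serve as annotation labels or as
# few-shot demonstrations for the prompting experiments discussed later.
REFORMULATION_STRATEGIES = {
    "correction": {
        "original": "What foods are adulterated with turmeric?",
        "reformulated": "What substances are commonly adulterated in turmeric?",
    },
    "generalization": {
        "original": "When exactly was the law passed?",
        "reformulated": "Has the law been proposed?",
    },
    "specification": {
        "original": "How many arms does Krishna have?",
        "reformulated": "How is Krishna typically depicted?",
    },
}
```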
The COULDASK Benchmark
To measure how well LLMs perform this task, the authors created a new benchmark called COULDASK.
Building a dataset for this is difficult because you need questions that are plausible but wrong. If the questions are total nonsense, the task is too easy. The questions need to be subtle, mirroring the confusion a real user might have.
The benchmark combines existing datasets (like SQuADv2) with newly generated datasets from three distinct domains:
- BBC News (News articles)
- Reddit (Social media stories)
- Yelp (Product/Service reviews)
To create the new data, they used a clever “adversarial” generation pipeline. They had one version of GPT-4 generate questions, and another version check them. They specifically filtered for questions that confused the checking model—questions where the model wasn’t sure if it could answer or not. These are the “hard” cases where reformulation is most needed.
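Here is a hedged sketch of what such an adversarial filtering loop might look like. The `generate_questions` and `answerability_score` helpers are hypothetical stand-ins for the two GPT-4 roles described above, and the uncertainty band (0.3–0.7) is an assumption for illustration, not the paper's exact filter.

```python
# Sketch of an adversarial question-generation loop: one model proposes
# candidate questions about a document, a second model scores how confident
# it is that each question is answerable, and we keep only the candidates
# that leave the checker uncertain. Helper functions are hypothetical.

def generate_questions(document: str, n: int) -> list[str]:
    """Hypothetical: ask a generator LLM for n plausible-but-flawed questions."""
    raise NotImplementedError

def answerability_score(document: str, question: str) -> float:
    """Hypothetical: checker LLM's probability that the question is answerable."""
    raise NotImplementedError

def mine_hard_unanswerables(document: str, n_candidates: int = 20,
                            low: float = 0.3, high: float = 0.7) -> list[str]:
    hard_cases = []
    for question in generate_questions(document, n_candidates):
        score = answerability_score(document, question)
        # Keep questions that leave the checker unsure: not obviously
        # answerable, not obviously nonsense.
        if low <= score <= high:
            hard_cases.append(question)
    return hard_cases
```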
How Do We Grade the AI?
Evaluating text generation is notoriously hard. Evaluating question reformulation is even harder because there isn’t just one “correct” new question.
The researchers devised a two-part metric called the Success Rate:
- Answerability: Can the new question actually be answered by the document? (They trained a Llama-2 classifier to judge this with 95% accuracy).
- Relevance: Is the new question actually related to what the user originally asked?
To measure relevance, they didn’t use standard text similarity (which can be fooled by similar wording). Instead, they looked at Entity Overlap. If the user asks about “Tesla” and “Inventions,” the reformulated question should ideally still involve “Tesla” and “Inventions,” even if the verb changes.
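Putting the two checks together, a Success Rate computation might look roughly like the sketch below. The answerability judge and the entity extractor are stand-ins (the paper fine-tunes a Llama-2 classifier and measures entity overlap), and the overlap threshold is an assumption.

```python
# Rough sketch of the two-part Success Rate: a reformulation counts as a
# success only if it is (1) answerable from the document and (2) still
# about the entities the user originally asked about.

def is_answerable(document: str, question: str) -> bool:
    """Stand-in for the paper's fine-tuned answerability classifier."""
    raise NotImplementedError

def extract_entities(text: str) -> set[str]:
    """Stand-in for a named-entity extractor (e.g., an off-the-shelf NER model)."""
    raise NotImplementedError

def is_success(document: str, original_q: str, new_q: str,
               min_overlap: float = 0.5) -> bool:
    if not is_answerable(document, new_q):
        return False
    orig_entities = extract_entities(original_q)
    new_entities = extract_entities(new_q)
    if not orig_entities:          # nothing to compare against
        return True
    overlap = len(orig_entities & new_entities) / len(orig_entities)
    return overlap >= min_overlap  # still relevant to the original intent

def success_rate(examples: list[dict]) -> float:
    """examples: [{'document': ..., 'original': ..., 'reformulated': ...}, ...]"""
    hits = sum(
        is_success(ex["document"], ex["original"], ex["reformulated"])
        for ex in examples
    )
    return hits / len(examples)
```

Note how answerability and relevance are checked independently: a reformulation that merely parrots the original question tends to fail the first check, while one that wanders off-topic fails the second.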

Table 3 illustrates the nuance of this evaluation. A reformulated question might be answerable but irrelevant, or relevant but still unanswerable. A successful model must hit the “sweet spot” where both conditions are met.

Results: How Smart are Modern LLMs?
The researchers tested several state-of-the-art models, including proprietary models like GPT-4 and GPT-3.5, and open-source models like Llama-2 and Zephyr.
They used various prompting techniques, including the following (sketched in code after the list):
- Zero-Shot (ZS): Just asking the model to fix the question.
- Few-Shot (FS): Showing the model examples of fixed questions.
- Chain-of-Thought (CoT): Asking the model to “think step-by-step” before generating the new question.
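A rough sketch of how these three prompt styles might differ in practice is shown below. The instruction wording is illustrative, not the paper's exact prompt.

```python
# Illustrative prompt templates for the three prompting styles. The exact
# wording used in the paper may differ; these are assumptions for clarity.

INSTRUCTION = (
    "The question below cannot be answered from the document. "
    "Rewrite it into a related question the document can answer."
)

def zero_shot_prompt(document: str, question: str) -> str:
    return f"{INSTRUCTION}\n\nDocument:\n{document}\n\nQuestion: {question}\nNew question:"

def few_shot_prompt(document: str, question: str,
                    demos: list[tuple[str, str]]) -> str:
    # demos: (bad question, reformulated question) pairs, e.g. built from the
    # human strategies discussed earlier.
    demo_text = "\n".join(f"Question: {q}\nNew question: {r}" for q, r in demos)
    return (
        f"{INSTRUCTION}\n\n{demo_text}\n\n"
        f"Document:\n{document}\n\nQuestion: {question}\nNew question:"
    )

def chain_of_thought_prompt(document: str, question: str) -> str:
    return (
        f"{INSTRUCTION}\n\nDocument:\n{document}\n\nQuestion: {question}\n"
        "First, explain step by step which presupposition in the question "
        "conflicts with the document. Then write the new question on its own line."
    )
```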
The Findings
The results were surprising. Despite the hype surrounding LLMs, they are remarkably bad at this specific task.

Table 5 shows the success rates. A few key takeaways:
- Low Success Rates: Even the mighty GPT-4 only achieves an average success rate of roughly 26% (using zero-shot prompting). In other words, roughly 3 out of 4 times it fails to provide a helpful, answerable reformulation.
- GPT-3.5 Struggles: GPT-3.5 is significantly worse, with success rates often in the single digits (average ~7%).
- Open Source vs. Proprietary: The open-source model Zephyr actually performed quite well compared to other open-source models, beating Llama-2 and Mistral, and sometimes rivaling GPT-4 on specific datasets.
- Prompting Doesn’t Always Help: While “Chain-of-Thought” (asking the model to think step-by-step) usually improves reasoning tasks, it didn’t consistently help here. In some cases, it actually hurt performance for reformulation.
Why Do the Models Fail?
To understand why the numbers were so low, the researchers performed a qualitative error analysis. They looked at the specific mistakes the models were making.

Table 6 breaks down the error categories. The results are telling:
- Parroting (62%): The most common error is that the model simply rephrases the original unanswerable question. It changes the grammar but keeps the false presupposition. For example, changing “When was he born?” to “When was Jay Chou born?” doesn’t help if the document never mentions his birth date.
- Irrelevance (17%): Sometimes the model hallucinates a new topic entirely. If the user asks about a band’s “most famous song,” the model might pivot to “famous musicians who like the band.” This is answerable, but not what the user wanted.
- Copy-Paste (14%): The model finds a sentence in the text that looks like the question and just turns that sentence into a question, regardless of whether it makes sense.
The Difficulty of “Global” vs. “Local” Edits
One of the most insightful contributions of this paper is the distinction between Local and Global edits.
- Local Edit (Short Span): The error in the question is small. Maybe just one word is wrong.
- Example: “When was USB-C developed?” vs “When was USB developed?”
- Global Edit (Q Span): The entire premise of the question is flawed. The whole sentence structure needs to change.
- Example: “What medicine is made from Coca?” (if the text only discusses Coca leaves generally).
The researchers hypothesized that “Global” edits are much harder for LLMs because they require compositional reasoning—you have to tear the question down and rebuild it.
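One rough way to operationalize this distinction, if you have a gold reformulation to compare against, is to measure how much of the original question survives the rewrite. The use of `difflib` and the 0.6 threshold below are illustrative choices, not the paper's annotation procedure.

```python
# Heuristic sketch: classify an edit as "short span" (local) or "q span"
# (global) by measuring how much of the original question survives in the
# reformulation. This is an illustration, not the paper's annotation scheme.
from difflib import SequenceMatcher

def edit_type(original_q: str, reformulated_q: str, threshold: float = 0.6) -> str:
    orig_tokens = original_q.lower().split()
    new_tokens = reformulated_q.lower().split()
    matcher = SequenceMatcher(a=orig_tokens, b=new_tokens)
    # Fraction of the original question's tokens that are kept verbatim.
    kept = sum(b.size for b in matcher.get_matching_blocks()) / max(len(orig_tokens), 1)
    return "short span (local edit)" if kept >= threshold else "q span (global edit)"

print(edit_type("When was USB-C developed?", "When was USB developed?"))
# -> short span (local edit): only one token changes
print(edit_type("What medicine is made from Coca?",
                "What substance is derived from the Coca leaf?"))
# -> q span (global edit): most of the question is rebuilt
```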

Figure 3 confirms this hypothesis. The graph plots success (answerability) against the type of edit required.
- “q span” (Left side of the charts) represents questions requiring Global edits.
- “short span” (Right side of the charts) represents questions requiring Local edits.
For almost every model (GPT-4, ChatGPT, Zephyr), performance drops significantly when the model has to fix a “q span” error. This tells us that LLMs are decent at swapping out a noun (Entity A -> Entity B), but they struggle to restructure the logic of a query completely.

Table 7 provides concrete examples of this distinction. The “q span” example requires changing “What medicine is made from Coca?” to “What substance is derived from the Coca leaf?” This is a subtle semantic shift that requires understanding the whole relationship between the entities. The “short span” example is a simple keyword swap.
Conclusion and Implications
The “I Could’ve Asked That” paper sheds light on a critical user experience issue in AI. As we rely more on LLMs to parse vast amounts of information, we cannot assume users will always ask perfect questions.
The current state of technology, where models simply reject unanswerable questions, is insufficient. However, as the COULDASK benchmark reveals, moving to the next level—active reformulation—is a significant hurdle. Even our best models, like GPT-4, often fall into the trap of simply repeating the user’s error or hallucinating irrelevant topics.
The key takeaways for students and future researchers are:
- Detection ≠ Correction: Identifying a bad question is easy; fixing it is hard.
- Structure Matters: Models struggle with “Global” edits that require rethinking the question’s logic.
- Evaluation is Key: We need creative metrics (like Entity Overlap) to judge open-ended generation tasks where there is no single right answer.
By solving the challenge of question reformulation, we can transform AI from a passive tool that says “I don’t know” into an active collaborator that says, “I don’t know that, but I can tell you this.”