Beyond “I Don’t Know”: Teaching LLMs to Explain the Unknown
We have all experienced that moment when interacting with a Large Language Model (LLM): you ask a question, and the model answers with absolute, unwavering confidence. It sounds plausible, the grammar is perfect, and the logic seems sound. But then you realize—it’s completely made up.
This phenomenon, often called “hallucination,” is particularly dangerous when the model is faced with Unknown Questions. These are questions that don’t actually have a definitive answer. They might be based on false premises, ask about the future, or be linguistically ambiguous.
When an LLM faces a question like, “Who won the Super Bowl in 2035?”, a standard model might hallucinate a team. A more safety-tuned model might simply refuse: “I don’t know” or “I cannot answer that.”
But is a simple refusal enough? Imagine a human expert. If you asked a historian, “When did Napoleon use the internet?”, they wouldn’t just say “I don’t know.” They would correct you: “Napoleon couldn’t have used the internet because he died in 1821, well over a century before it was invented.”
This blog post dives deep into a fascinating research paper titled “Don’t Just Say ‘I don’t know’! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations.” We will explore how researchers are using a method called Self-Alignment to teach LLMs not just to refuse unknown questions, but to explain why they are unanswerable.
The Problem: Overconfidence and the “Unknown”
LLMs are trained to predict the next token. They are excellent at continuing a pattern. If you provide a question, the most natural pattern completion is an answer. This leads to overconfidence.
The researchers identify a critical gap in current AI safety measures. Most existing approaches focus on refusal—teaching the model to shut down when it detects uncertainty. However, the authors argue that a trustworthy LLM should exhibit “response-ability.” It should be able to:
- Detect that a question is unknown.
- Classify why it is unknown (e.g., is it a false premise? is it futuristic?).
- Explain the reasoning to the user.
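To make these three capabilities concrete, here is a minimal sketch of what a structured “response-able” output could look like. The schema and field names are my own illustration; the paper itself works with free-text responses.

```python
from dataclasses import dataclass

# Hypothetical schema for a "response-able" answer. The paper uses free-text
# responses; this structure only makes the three capabilities concrete.
@dataclass
class UnknownQuestionResponse:
    is_unknown: bool   # detection: is the question answerable at all?
    category: str      # classification: "incomplete", "futuristic", "incorrect", or "ambiguous"
    explanation: str   # explanation: why the question cannot be answered as asked

example = UnknownQuestionResponse(
    is_unknown=True,
    category="incorrect",
    explanation=("Napoleon died in 1821, long before the internet existed, "
                 "so he could not have used it."),
)
print(f"[{example.category}] {example.explanation}")
```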
Let’s look at a concrete example provided in the paper to illustrate the difference between these responses.

In Figure 1, the user asks: “What animal can be found at the top of the men’s Wimbledon trophy?”
- The First Answer (Hallucination): The model confidently invents a “falcon.” This is factually wrong.
- The Second Answer (Refusal): “The answer is unknown.” This is safe, but unhelpful.
- The Fourth Answer (Self-Aligned): This is the gold standard. The model corrects the premise: the trophy actually features a pineapple, not an animal.
Achieving this level of nuance usually requires massive amounts of human-annotated training data, which is expensive and slow to produce. The core innovation of this paper is a method to achieve this without massive human intervention, using the LLM to teach itself.
The Solution: Self-Alignment
The proposed method is called Self-Align. The intuition is clever: while an LLM might struggle to answer unknown questions correctly out of the box, it often has the latent knowledge required to construct these questions and evaluate them if guided properly.
The process leverages a small amount of “seed data” (just a handful of examples) and a large dataset of standard, known questions (like widely available QA datasets). It then uses the LLM to transform these known questions into unknown ones, generating its own training data.

As shown in Figure 2, the workflow consists of four distinct stages:
- Guided Question Rewriting: Turning known questions into unknown ones.
- Conditioned Response Generation: Answering these new questions with explanations.
- Disparity-driven Self-Curation: Filtering out bad data (a crucial quality control step).
- Supervised Fine-Tuning (SFT): Training the model on this curated data.
Let’s break down each stage of this “Iterative Self-Alignment” process.
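Before walking through the stages one by one, here is a rough end-to-end sketch (in Python) of how a single iteration could be wired together. The `llm` callable stands in for any text-completion API, `fine_tune` for any supervised fine-tuning routine, and the prompt wording, scoring scale, and threshold are all assumptions rather than the paper's exact templates.

```python
from typing import Callable, Dict, List, Tuple

def self_align_iteration(
    llm: Callable[[str], str],                      # placeholder for any text-completion call
    known_qa: List[Tuple[str, str]],                # (known question, known answer) pairs
    seeds: Dict[str, str],                          # category -> few-shot rewriting demonstrations
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    threshold: float = 4.0,                         # keep pairs scoring at least this disparity
) -> List[Tuple[str, str]]:
    curated: List[Tuple[str, str]] = []
    for question, answer in known_qa:
        for category, demo in seeds.items():
            # Stage 1: rewrite the known question into an unknown one of this category.
            unknown_q = llm(f"{demo}\nRewrite as a {category} question: {question}")
            # Stage 2: generate an explanatory response, conditioned on the category.
            response = llm(
                f"The following question is {category}. Explain why it cannot be "
                f"answered definitively.\nQuestion: {unknown_q}\nRelated known question: {question}"
            )
            # Stage 3: self-curation -- ask the model how different the two QA pairs are (1-5).
            raw = llm(
                "On a scale of 1 to 5, how different are these two QA pairs? Reply with a number.\n"
                f"Pair A: {question} -> {answer}\nPair B: {unknown_q} -> {response}"
            )
            try:
                score = float(raw.strip().split()[0])
            except (ValueError, IndexError):
                score = 0.0                         # unparseable score: treat as no disparity
            if score >= threshold:
                curated.append((unknown_q, response))
    # Stage 4: supervised fine-tuning on the curated (unknown question, explanation) pairs.
    fine_tune(curated)
    return curated
```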
Stage 1: Guided Question Rewriting
The process begins with a large set of standard “Known Questions” (e.g., “Who won the 1996 Olympics?”). The goal is to mutate these into “Unknown Questions.”
The researchers identified four main categories of unknown questions:
- Incomplete: Lacking necessary details.
- Futuristic: Asking about events that haven’t happened.
- Incorrect: Containing false assumptions.
- Ambiguous: Linguistically unclear or having multiple interpretations.
The model is given a few human-written examples (seed data) of how to rewrite a question. For instance, to create a Futuristic question, the seed might show how to change a past date to a future date.
The mathematical formulation for this generation process is:
\[
\mathcal{D}_{uq}^{c} = \big\{\, \mathcal{M}\big(z_{qr}^{c},\, \mathcal{D}_{seed}^{c},\, q\big) \;\big|\; q \in \mathcal{D}_{kq} \,\big\}
\]
Here, \(\mathcal{D}_{uq}^c\) represents the generated unknown questions for a specific class \(c\) (like “Futuristic”). The model \(\mathcal{M}\) takes a prompt \(z_{qr}^c\), the seed data \(\mathcal{D}_{seed}^c\), and a known question \(q\) drawn from the pool of known questions \(\mathcal{D}_{kq}\) to produce the new unknown question.
For example, if the known question is “Who was the governor of Texas in 2003?”, the model might rewrite it as “Who will be the governor of Texas in 2033?”
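As a concrete sketch, the few-shot rewriting prompt for the “Futuristic” category could be assembled roughly like this. The seed pairs echo the examples above, but the prompt wording and the `llm` helper are assumptions, not the paper's templates.

```python
from typing import Callable, List, Tuple

# Illustrative seed pairs for the "futuristic" category:
# (known question, rewritten unknown question).
FUTURISTIC_SEEDS: List[Tuple[str, str]] = [
    ("Who won the Super Bowl in 2020?", "Who will win the Super Bowl in 2035?"),
    ("Who was the governor of Texas in 2003?", "Who will be the governor of Texas in 2033?"),
]

def rewrite_to_futuristic(llm: Callable[[str], str], known_question: str) -> str:
    """Build a few-shot prompt from the seed pairs and ask the model to rewrite."""
    demos = "\n".join(f"Known: {k}\nFuturistic: {u}" for k, u in FUTURISTIC_SEEDS)
    prompt = (
        "Rewrite the known question into a futuristic question that cannot be "
        "answered yet, following the examples.\n"
        f"{demos}\nKnown: {known_question}\nFuturistic:"
    )
    return llm(prompt).strip()

# Usage with any stubbed model call:
# rewrite_to_futuristic(my_llm, "Who won the 1996 Olympics 100m final?")
```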
Stage 2: Conditioned Response Generation
Now that the system has generated thousands of “Unknown Questions,” it needs to generate the ideal responses—the “explanations” we desire.
The researchers use “Class-Aware Prompts.” Since the system knows it just generated a Futuristic question, it prompts the LLM specifically to explain why a futuristic question cannot be answered definitively.
\[
r_i = \mathcal{M}\big(z_{rg}^{c},\, p_i,\, q_i\big)
\]
In this equation, the model generates a response \(r_i\) from the class-aware prompt (written here as \(z_{rg}^c\)), the new unknown question (\(p_i\)), and the original known question (\(q_i\)). Providing both questions allows the model to contrast the two.
For an Incorrect question like “What animal is on the Wimbledon trophy?”, the prompt instructs the model: “The following question is incorrect. Please answer the question by pointing out its incorrectness.”
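In code, class-aware prompting might look like the sketch below. Only the “incorrect” instruction is quoted from the example above; the other three templates are hypothetical analogues, and `llm` is again a placeholder completion function.

```python
from typing import Callable

# Class-aware instructions. Only the "incorrect" wording is quoted from the text above;
# the other templates are hypothetical analogues.
CLASS_PROMPTS = {
    "incorrect": "The following question is incorrect. Please answer the question by pointing out its incorrectness.",
    "futuristic": "The following question asks about a future event. Please explain why it cannot be answered definitively yet.",
    "incomplete": "The following question lacks necessary details. Please explain what information is missing.",
    "ambiguous": "The following question is ambiguous. Please explain its possible interpretations and why a single answer is impossible.",
}

def generate_explanation(llm: Callable[[str], str], category: str,
                         unknown_question: str, known_question: str) -> str:
    """Generate an explanatory response conditioned on the question's category."""
    prompt = (
        f"{CLASS_PROMPTS[category]}\n"
        f"Question: {unknown_question}\n"
        # Including the original known question lets the model contrast the two.
        f"Related answerable question: {known_question}"
    )
    return llm(prompt).strip()
```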
Stage 3: Disparity-driven Self-Curation
This is arguably the most innovative part of the paper.
When an LLM generates its own training data, there is a high risk of noise. Sometimes the “rewriting” fails. Maybe the model tries to rewrite “Who is the President?” into a futuristic question but just outputs “Who is the current President?” again. If we train on this, the model will learn to refuse valid questions.
Standard self-alignment methods often ask the model to “score” its own output (e.g., “Rate this response from 1 to 5”). The authors found this doesn’t work well for unknown questions because the model is often confused about whether the question is actually unknown.
Instead, they propose Disparity-driven Self-Curation.
The idea relies on semantic difference. The system presents the model with two pairs:
- The Original Known Pair: (Known Question, Known Answer)
- The Generated Unknown Pair: (Unknown Question, Explanation Response)
It then asks the model to score the disparity (difference) between these two pairs, where \((q_i, a_i)\) is the known question–answer pair and \((p_i, r_i)\) is the generated unknown question and its explanation:
\[
s_i = \mathcal{M}\big(z_{sc},\, (q_i, a_i),\, (p_i, r_i)\big)
\]
Here \(z_{sc}\) denotes the self-curation prompt and \(s_i\) is the resulting disparity score.
If the rewriting was successful, the disparity should be high.
- Known Pair: “Who won in 2000?” -> “Person X.”
- Unknown Pair: “Who will win in 3000?” -> “I cannot answer because it is in the future.”
These two interactions are semantically very different. Therefore, the disparity score is high, and the data is kept.
If the rewriting failed (the question remained answerable), the disparity score would be low, and the data is discarded. This filter acts as a high-quality sieve, ensuring only the best examples make it to the training set.
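Here is a minimal sketch of such a disparity filter, assuming a 1-to-5 scoring prompt and a fixed keep-threshold; the exact scale, wording, and cut-off are my assumptions, not values from the paper.

```python
from typing import Callable, List, Tuple

def disparity_score(llm: Callable[[str], str],
                    known_pair: Tuple[str, str],
                    unknown_pair: Tuple[str, str]) -> float:
    """Ask the model how semantically different two QA pairs are (1 = same, 5 = very different)."""
    kq, ka = known_pair
    uq, ur = unknown_pair
    raw = llm(
        "Rate from 1 to 5 how different the following two question-answer pairs are. "
        "Reply with a single number.\n"
        f"Pair A: Q: {kq} A: {ka}\n"
        f"Pair B: Q: {uq} A: {ur}"
    )
    try:
        return float(raw.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0   # unparseable output is treated as "no disparity" and filtered out

def curate(llm: Callable[[str], str],
           candidates: List[Tuple[Tuple[str, str], Tuple[str, str]]],
           threshold: float = 4.0) -> List[Tuple[str, str]]:
    """Keep only unknown-question pairs that differ clearly from their known counterparts."""
    return [unknown for known, unknown in candidates
            if disparity_score(llm, known, unknown) >= threshold]
```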
Stage 4: Supervised Fine-Tuning & Iteration
Finally, the curated dataset is used to fine-tune the base model.
\[
\max_{\mathcal{M}} \sum_{(p,\, r) \,\in\, \mathcal{D}_{curated}} \log P_{\mathcal{M}}\big(r \mid p\big)
\]
This is a standard supervised learning objective, maximizing the probability of the correct explanation (\(r\)) given the unknown question (\(p\)) over the curated dataset \(\mathcal{D}_{curated}\).
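For intuition, here is a minimal sketch of this objective using Hugging Face transformers: the prompt tokens are masked out of the loss so the model is trained only to produce the explanation. The checkpoint name, the simple prompt-length masking, and the lack of a chat template are simplifications, not the paper's training recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sft_loss(model, tokenizer, question: str, explanation: str) -> torch.Tensor:
    """Mean negative log-likelihood of the explanation tokens, given the unknown question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + explanation, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100    # approximate masking: ignore prompt tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss

# Illustrative usage (loading a 7B model requires substantial memory; shown only for completeness):
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
loss = sft_loss(
    model, tokenizer,
    "Who will be the governor of Texas in 2033?",
    "That cannot be answered yet: the 2033 governorship is a future event.",
)
loss.backward()   # one optimizer step inside a standard training loop would follow
```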
Crucially, this process is Iterative. Once the model is fine-tuned (becoming \(\mathcal{M}^{(1)}\)), it is better at understanding unknown questions. It is then used to generate new data, which is curated and used to train \(\mathcal{M}^{(2)}\), and so on. The model bootstraps its own capabilities, getting smarter with each loop.
Experiments and Results
To validate this approach, the researchers tested the Self-Align method against several strong baselines, including prompt-engineering methods (like Self-Ask) and other fine-tuning strategies.
The Setup
They utilized two main datasets for evaluation: QNotA (a public dataset) and a newly created dataset called KUQP (Known-Unknown Question Pairs). They covered the four categories of unknown questions: Incomplete, Futuristic, Incorrect, and Ambiguous.

They tested the method on two popular open-source LLMs: Vicuna-7B and LLaMA-2-7B.
The evaluation covered three distinct tasks:
Task 1: Unknown Question Detection
Can the model simply identify if a question is unknown? This is a binary classification task (Known vs. Unknown).
The results showed that prompt-based baselines (asking the model “Is this known?”) were inconsistent and sensitive to how the prompt was phrased. However, the Self-Aligned method consistently outperformed the baselines.
The paper notes an interesting observation: standard LLaMA-2 was generally more overconfident than Vicuna, often failing to detect unknown questions. After Self-Alignment, both models saw significant improvements.
Task 2: Unknown Question Classification
Can the model identify why the question is unknown (e.g., classifying it as “Ambiguous” or “Futuristic”)?

As shown in Table 3, the Self-Aligned method achieved the highest F1 scores across the board. For the Vicuna model, the F1 score jumped from a mere 0.076 (Zero-shot) to 0.436 (Self-Aligned) on the QNotA dataset. This is a massive improvement, indicating that the fine-tuning process successfully taught the model the nuances of why information might be missing.
Task 3: Open-ended Response Generation
This is the ultimate test: generating the actual text response. Since there is no single “correct” sentence, the researchers used GPT-4 to act as a judge, comparing the Self-Aligned responses against baseline responses (win-rate analysis), and also conducted human evaluation.

Table 4 shows the win rates. If the number is above 0.500, the Self-Aligned model beat the baseline. You can see that Self-Aligned (K=3) (meaning 3 iterations of training) consistently beats almost every baseline, usually with win rates between 60% and 90%.
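As a rough sketch of how such a pairwise win rate could be computed with an LLM judge (the judging prompt, the ‘A’/‘B’ protocol, and the `judge` callable are assumptions, not the paper's evaluation script):

```python
from typing import Callable, List, Tuple

def win_rate(judge: Callable[[str], str],
             triples: List[Tuple[str, str, str]]) -> float:
    """Fraction of questions where the judge prefers the self-aligned response (A) over the baseline (B)."""
    wins = 0
    for question, self_aligned, baseline in triples:
        verdict = judge(
            "Which response to the question below is more honest, comprehensible, "
            "and helpful? Reply with 'A' or 'B'.\n"
            f"Question: {question}\nA: {self_aligned}\nB: {baseline}"
        )
        wins += verdict.strip().upper().startswith("A")
    return wins / len(triples)   # values above 0.500 mean the self-aligned model wins on average
```

In practice, judge-based comparisons are usually run twice with the response positions swapped to control for position bias; whether the paper applies this mitigation is not covered here.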
The researchers also performed a Human Evaluation (Table 5 in the paper; Table 6 reports ratings for the self-augmented training data specifically). Human annotators rated responses on Honesty, Comprehensibility, and Helpfulness.

The Self-Augmented data (the data generated by the model for training) was rated significantly higher than the Zero-shot attempts, proving that the conditioned generation and disparity curation processes yield high-quality training material.
Why Does It Work? The Power of Disparity
One of the most significant findings of the paper is the validation of the Disparity-driven Self-Curation method.
The researchers compared their method against a “Principle-driven” approach (where the model is just asked “Is this a good training example?”).

In Figure 3, the Blue bars represent the Disparity method. You can see it consistently outperforms the “No Curation” (Yellow) and “Principle-driven” (Pink/Red) methods.
This confirms the hypothesis that LLMs are better at spotting semantic differences (Disparity) than they are at making absolute judgments about quality. It is easier for the model to say “These two sentences mean very different things” than “This sentence is a valid unknown question.”
The Impact of Iteration
Does the model keep getting better if you repeat the process?

Figure 4 shows the performance over 3 iterations (K=0 to K=3).
- Task 1 (Left): Detection accuracy steadily climbs.
- Task 2 (Right): Classification ability improves sharply after the first iteration and then plateaus.
This suggests that the model effectively “bootstraps” itself. The smarter it gets, the better training data it generates for the next round, which makes it even smarter.
Broader Implications and Conclusion
The “Self-Align” paper presents a compelling path forward for AI safety and reliability. By shifting the goalpost from simple refusal (“I don’t know”) to explanation (“Here is why that premise is flawed”), we create AI systems that are more helpful and trustworthy.
The most exciting aspect is the scalability. Because the method relies on the LLM to generate its own training data, it doesn’t require thousands of hours of expensive human labor. A few seeds are enough to grow a forest of high-quality training examples.
Key Takeaways:
- Proactive Explanation: LLMs should explain why they can’t answer, correcting user misconceptions.
- Self-Augmentation: LLMs can generate their own training data for unknown questions by rewriting known ones.
- Disparity Filtering: Comparing “Known” and “Unknown” pairs is a superior way to filter AI-generated training data.
- Trust: An AI that explains its limitations is an AI we can trust more deeply.
As we move toward more autonomous agents, the ability for a model to recognize its boundaries—and explain them clearly—will be a defining feature of advanced intelligence.