Imagine asking a powerful AI, “Who is the President of the United States?”
The answer seems simple, but for an AI processing millions of documents ingested from the internet, it is anything but. One document from 2008 might say Barack Obama. Another from 2024 says Joe Biden. A historical text might discuss the powers of the “POTUS” generally. When an AI encounters this, it usually forces a single answer, potentially hallucinating certainty where none exists.
This phenomenon is known as a Knowledge Conflict. As Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems ingest more data, the probability of encountering contradictory information skyrockets.
In a recent paper, researchers tackle this problem head-on. They propose a new framework for Adaptive Question Answering, teaching models not just to pick a winner, but to identify conflicting valid answers and, crucially, to cite the sources for each.
The Problem: Hallucinating Certainty
Current state-of-the-art LLMs have massive context windows, allowing them to read and reason over long documents. However, when those documents disagree, models tend to fail in one of two ways:
- Selection Bias: They arbitrarily pick one answer and present it as the absolute truth.
- Confusion: They try to merge the answers into a nonsensical hybrid.
Most importantly, existing systems rarely tell you where they got the information. Without citations, a user has no way to verify which document supports which claim.

As shown in Figure 1 above, the goal is to move from a standard “Single Answer” output to a “Multi-Answer with Citation” output. Instead of just saying “Obama,” the model should recognize the conflict and state: “According to Document 1, it is Obama; According to Document 4, it is Biden.”
Background: Types of Ambiguity
To solve this, the researchers first had to categorize the different ways AI gets confused. They identified two main types of conflicts:
- Ambiguous Questions: The confusion comes from the query itself.
- Example: “Who played the Joker?”
- Conflict: This has multiple valid answers (Heath Ledger, Jack Nicholson, Joaquin Phoenix) depending on the movie context.
- Ambiguous Contexts: The confusion comes from the source documents provided to the model.
- Example: Different news reports giving different casualty numbers for the same event, or updated scientific data contradicting old data.
While previous research has looked at these problems in isolation, no prior work has combined ambiguity resolution with source citation, especially in complex scenarios requiring multi-step (multi-hop) reasoning.
The Core Method: A New Framework for QA
The authors introduce a comprehensive framework to train and evaluate models on this task. This involved creating new datasets, defining new metrics, and testing three distinct approaches to solving the problem.
1. The Data: Injecting Citations
Since standard datasets don’t usually require models to cite specific conflicting documents, the researchers had to build their own. They took existing reading comprehension datasets and “augmented” them.
They injected citation metadata—tags like “Document 1” or “Document 2”—directly into the text before each paragraph. This creates a realistic simulation of a RAG system where an AI retrieves multiple snippets from a database.
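To make this concrete, here is a minimal sketch of what such a tagging step could look like. The function name and exact tag format are illustrative, not taken from the paper:

```python
# Illustrative sketch (not the paper's code): prepend a citation tag to each
# retrieved passage so the model can refer back to its source.

def tag_documents(passages):
    """Prefix each passage with a 'Document N' tag, mimicking a RAG retriever's output."""
    tagged = []
    for i, passage in enumerate(passages, start=1):
        tagged.append(f"Document {i}: {passage}")
    return "\n\n".join(tagged)

context = tag_documents([
    "Barack Obama was elected President of the United States in 2008.",
    "Joe Biden was sworn in as the 46th President in January 2021.",
])
print(context)
# Document 1: Barack Obama was elected President of the United States in 2008.
#
# Document 2: Joe Biden was sworn in as the 46th President in January 2021.
```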
They created diverse datasets to test different skills:
- AmbigQA-Cite: Focuses on ambiguous questions (e.g., “Who won the World Cup?” without specifying the year).
- DisentQA-DupliCite: Focuses on ambiguous contexts where two documents are almost identical but have one key difference (e.g., entity substitution).
- Conflicting HotPotQA-Cite: The “boss level.” This dataset requires multi-hop reasoning. To answer the question, the model must combine information from Document A and Document B, while avoiding contradictory information in Document C.
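To make the multi-hop conflict setting concrete, here is a hypothetical record in the spirit of Conflicting HotPotQA-Cite. The field names and content are purely illustrative, not the dataset's actual schema:

```python
# Hypothetical example record; field names and content are illustrative only.
example = {
    "question": "In which country was the director of the film 'X' born?",
    "documents": [
        "Document 1: The film 'X' was directed by Jane Doe.",   # hop 1
        "Document 2: Jane Doe was born in Canada.",             # hop 2, supports answer A
        "Document 3: Jane Doe was born in Australia.",          # conflicting hop 2, supports answer B
    ],
    "answers": [
        {"answer": "Canada", "citations": ["Document 1", "Document 2"]},
        {"answer": "Australia", "citations": ["Document 1", "Document 3"]},
    ],
}
```

The point is that neither answer can be reached from a single document: the model has to chain Document 1 with either Document 2 or Document 3, and report both resulting answers with their supporting chains.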
2. The Approaches
How do we get an LLM to handle this data? The researchers tested three main strategies:
- Zero-Shot: Feeding the conflicting data to the model and asking it to answer naturally (the baseline).
- Prompting: Using “Conflict-Aware” prompts that explicitly instruct the model: “You will get context texts that may conflict. If they do, answer by citing the specific documents.” They also tested Chain-of-Thought (CoT) prompting, asking the model to reason step-by-step (a sketch of such a prompt follows this list).
- Fine-Tuning: Retraining the models (using LoRA, a parameter-efficient technique) specifically on these new conflict-heavy datasets.
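To give a feel for the second strategy, here is a minimal sketch of what a conflict-aware prompt might look like. The wording is an assumption, not the paper's verbatim template:

```python
# Minimal sketch of a conflict-aware prompt; the wording is an assumption,
# not the paper's verbatim prompt template.
CONFLICT_AWARE_PROMPT = """You will be given context documents that may conflict with each other.
If they disagree, give every valid answer and cite the document that supports it,
for example: "According to Document 1 the answer is X. According to Document 2 the answer is Y."

{context}

Question: {question}
Answer:"""

prompt = CONFLICT_AWARE_PROMPT.format(
    context="Document 1: Barack Obama won the 2008 election.\n"
            "Document 2: Joe Biden won the 2020 election.",
    question="Who is the President of the United States?",
)
```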
3. The Models
The experiments were conducted across five diverse LLMs to ensure the results weren’t specific to just one architecture.

As listed in Table 1, the lineup included the Llama-2 family (7B, 13B, 70B), MPT-7B, and Falcon-7B. This mix of sizes helps determine if “smarter” (larger) models are naturally better at resolving conflicts.
Experiments & Results
The researchers introduced two critical metrics to measure success:
- Acc_K: Can the model find all the valid answers? (If there are 3 answers, finding just 1 isn’t good enough).
- Citation Accuracy (A_C): Did the model attribute the answer to the correct document?
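The paper defines these metrics formally; the snippet below is only a simplified sketch of how one might score a single prediction against them, assuming a prediction is a set of (answer, cited document) pairs:

```python
# Simplified sketch of the two metrics; the paper's exact definitions may differ.

def answer_recall_at_k(predicted_answers, gold_answers):
    """A_K-style score: fraction of the gold answers the model recovered."""
    found = sum(any(g.lower() in p.lower() for p in predicted_answers) for g in gold_answers)
    return found / len(gold_answers)

def citation_accuracy(predicted_pairs, gold_pairs):
    """A_C-style score: fraction of predicted (answer, document) pairs whose citation is correct."""
    gold = {(a.lower(), d) for a, d in gold_pairs}
    correct = sum((a.lower(), d) in gold for a, d in predicted_pairs)
    return correct / max(len(predicted_pairs), 1)

# Example: the model found both answers but mis-cited one of them.
print(answer_recall_at_k(["Bill Pertwee", "Martin Savage"],
                         ["Bill Pertwee", "Martin Savage"]))                     # 1.0
print(citation_accuracy([("Bill Pertwee", "Document 1"), ("Martin Savage", "Document 1")],
                        [("Bill Pertwee", "Document 1"), ("Martin Savage", "Document 2")]))  # 0.5
```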
Let’s look at what success looks like in practice.

Table 2 shows the qualitative leap in performance. In the “Zero-shot Answer” column, models often pick one arbitrary answer (e.g., “Bill Pertwee”). In the “Few-shot Answer” column (using the new method), the model successfully breaks it down: “According to Document 1 the answer is Bill Pertwee. According to Document 2 the answer is Martin Savage.”
Scenario 1: Ambiguous Questions
When the question itself is vague (e.g., “Who voices Rocket Raccoon?”), how do models perform?

Table 3 reveals a harsh truth: Out-of-the-box models (Zero-Shot) fail at citations. They score 0.0% on Citation Accuracy (A_C). They simply aren’t built to provide sources naturally.
However, the Conflict-Aware (C.A.) Basic Prompting strategy significantly helps. For Llama-70B, using prompting boosted the ability to find a second answer (A_2) from 4.3% to 35.4%, and citation accuracy rose to nearly 50%.
Scenario 2: Ambiguous Contexts (Single-Hop)
Next, they looked at DisentQA, where the contexts are contradictory (e.g., one doc says a person was born in 1990, another says 1992).

Table 4 shows much higher scores here than in the ambiguous question setting. Because the conflicting text segments are often duplicates with just one word changed (Entity Substitution), it is easier for the model to spot the difference. Here, Prompting is highly effective, with Llama-70B achieving nearly 87% citation accuracy.
But what if the conflicting documents aren’t identical? What if they are paraphrased?

Table 5 shows the results for DisentQA-ParaCite. As expected, when the text is paraphrased (rewritten with different words but same meaning), the task becomes harder. Performance drops compared to the duplicate setting, but Fine-tuning begins to show its strength here, often outperforming prompting strategies.
Scenario 3: The Challenge of Multi-Hop Reasoning
This is the most significant finding of the paper. Conflicting HotPotQA represents complex, real-world research where you must connect dots across documents that disagree with each other.

Table 6 paints a clear picture: Prompting breaks down in complex scenarios. Look at the “C.A. Basic” column for Llama-7B. It struggles to find the second answer (36.1%). Now look at the Fine-tuning column. It jumps to 90.0%.
In complex multi-hop scenarios, the cognitive load required to maintain the logic chain and handle citations via a prompt seems too high. Fine-tuning alters the model’s weights to fundamentally understand the task, leading to massive performance gains.
This trend holds even when “distractor” documents (irrelevant noise) are added, as shown in Table 7 below, though overall performance drops for all methods.
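For readers wondering what parameter-efficient fine-tuning looks like in practice, here is a minimal LoRA sketch using the Hugging Face peft library. The hyperparameters and base checkpoint are illustrative, not the paper's exact configuration:

```python
# Minimal LoRA fine-tuning sketch using Hugging Face peft; hyperparameters
# and the base checkpoint are illustrative, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trained

# The adapted model would then be trained on the conflict-augmented QA data:
# (question + tagged documents) -> multi-answer-with-citation targets.
```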

The Trade-off: Losing Generality?
One concern with fine-tuning is “catastrophic forgetting”—does teaching the model to handle conflicts make it worse at normal questions?

Table 8 investigates this. When the models are tested on data without conflicts:
- Prompting causes a slight performance drop (Llama-7B drops from 84.8% to 84.1%).
- Fine-tuning causes a significant drop (Llama-7B drops to 64.3%).
This suggests a “no free lunch” trade-off. Fine-tuning makes a model a specialist at resolving conflicts, but it may become over-sensitive, trying to find conflicts where none exist.
Conclusion & Implications
This research highlights a critical gap in the current deployment of Large Language Models. As we rely on these systems for decision-making in law, medicine, and news, the ability to say “I found conflicting information” is arguably more important than just answering the question.
The authors demonstrated that:
- Standard LLMs are overconfident: They ignore conflicts and fail to cite sources by default.
- Prompting is a good first step: For simple conflicts, telling the model to “look for disagreements” works well.
- Fine-tuning is necessary for complexity: If your application involves complex reasoning (Multi-hop), prompts aren’t enough. You need to train the model on conflicting data.
This work paves the way for more trustworthy AI—systems that don’t just act as oracles, but as careful researchers, showing their work and letting the user decide what is true.