In the age of Retrieval-Augmented Generation (RAG), we often view Large Language Models (LLMs) as sophisticated search engines that summarize the truth. We ask a question, the system retrieves documents from the web, and the LLM synthesizes an answer. But what happens when the internet disagrees with itself?

Imagine asking, “Is coffee good for you?” One retrieved article cites a study claiming it reduces heart disease risk; another claims it causes hypertension. These are inter-evidence conflicts. They aren’t hallucinations by the model; they are contradictions inherent in the data sources.

A recent research paper, “ECON: On the Detection and Resolution of Evidence Conflicts,” tackles this precise problem. The researchers introduce a robust framework for generating synthetic conflicts to test how well AI models can detect when information clashes and how they resolve those clashes when forced to give an answer.

This blog post dives deep into the ECON paper, exploring how they built this dataset, the limitations of current AI in spotting contradictions, and the biased behaviors LLMs exhibit when caught in the middle of a factual argument.

The Problem: The Messy Reality of RAG

Decision-making systems rarely operate on perfect information. Whether it is Wikipedia edits, news reports, or medical advice, conflicting information is ubiquitous. The rise of AI-generated content adds another layer of complexity: malicious actors can mass-produce convincing misinformation to pollute search results.

Previous research has largely focused on two areas:

  1. Hallucination: When the model says something not found in the source text.
  2. Parametric Conflicts: When the retrieved text contradicts what the model memorized during training (e.g., the model “knows” the sky is blue, but the text says it’s green).

ECON focuses on a third, less explored area: Context vs. Context. When Document A says \(X\) and Document B says \(Y\), can the model realize that \(X\) and \(Y\) cannot both be true?

The Solution: Generating High-Quality Conflicts (The ECON Method)

To evaluate models, we need a benchmark. However, waiting for humans to label thousands of contradictory web pages is slow and expensive. The authors propose a clever automated pipeline to generate Answer Conflicts and Factoid Conflicts.

1. Generating Answer Conflicts

The most direct form of conflict is when two documents support different answers to the same question.

As illustrated in Figure 1 below, the process begins with a question (\(q\)) and its ground-truth answer (\(a_0\)).

  1. Alternative Answer Generation: An LLM generates plausible but incorrect alternative answers (\(a_1, a_2, \dots\)).
  2. Evidence Generation: The model creates supporting evidence (\(e_i\)) for each answer.
  3. Quality Check: A rigorous validation step ensures that the generated evidence (\(e_i\)) logically entails the generated answer (\(a_i\)).

Figure 1: Generating evidence pairs with answer conflicts. For each question and its ground-truth answers, alternative answers are generated (shown in red boxes). Subsequently, a piece of supporting evidence is generated for each answer, which is validated by a checker to ensure quality.

This process results in pairs of evidence \((e_i, e_j)\) that look legitimate but support mutually exclusive conclusions.

The mathematical formulation for generating the answers and evidence is straightforward:

\[ \{ a_i \mid i = 1, 2, \cdots \} = \mathsf{AnswerGen}(q, a_0) \]
\[ e_i = \mathsf{EvidenceGen}(q, a_i) \]
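
To make the pipeline concrete, here is a minimal Python sketch of the three steps. The `call_llm` helper and the prompt wording are illustrative assumptions, not the paper's actual prompts or checker.

```python
# Sketch of the answer-conflict pipeline: AnswerGen -> EvidenceGen -> quality check.
# `call_llm` is a hypothetical helper that sends a prompt to a chat model and
# returns its text reply; swap in your own client.

from typing import List, Tuple


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your API client of choice."""
    raise NotImplementedError


def answer_gen(question: str, gold_answer: str, k: int = 2) -> List[str]:
    """Step 1: generate k plausible but incorrect alternative answers."""
    prompt = (
        f"Question: {question}\nCorrect answer: {gold_answer}\n"
        f"List {k} plausible but incorrect alternative answers, one per line."
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()][:k]


def evidence_gen(question: str, answer: str) -> str:
    """Step 2: write a short passage that supports the given answer."""
    prompt = (
        f"Write a short, convincing evidence paragraph supporting the answer "
        f"'{answer}' to the question '{question}'."
    )
    return call_llm(prompt)


def entails(evidence: str, question: str, answer: str) -> bool:
    """Step 3 (quality check): does the evidence actually entail the answer?"""
    prompt = (
        f"Evidence: {evidence}\nQuestion: {question}\nAnswer: {answer}\n"
        f"Does the evidence support this answer? Reply Yes or No."
    )
    return call_llm(prompt).strip().lower().startswith("yes")


def build_conflict_pairs(question: str, gold_answer: str) -> List[Tuple[str, str]]:
    """Return evidence pairs (e_i, e_j) that back mutually exclusive answers."""
    answers = [gold_answer] + answer_gen(question, gold_answer)
    evidences = [evidence_gen(question, a) for a in answers]
    kept = [e for e, a in zip(evidences, answers) if entails(e, question, a)]
    return [(kept[i], kept[j]) for i in range(len(kept)) for j in range(i + 1, len(kept))]
```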

2. Generating Factoid Conflicts

Real-world contradictions are often more subtle than just getting the final answer wrong. They might disagree on dates, entities, or specific details within a larger narrative. To simulate this, the authors introduce Factoid Conflicts.

Here, evidence is treated as a set of atomic facts, or “factoids” (\(S\)). The system creates conflict by semantically perturbing specific factoids.

Figure 3: Generating evidence pairs with factoid conflicts.

As shown in Figure 3, the system takes a factoid (e.g., “Shrimp contains microplastics”) and perturbs it to its opposite (e.g., “Shrimp is devoid of microplastics”).

\[ s_k^{p} = \mathsf{Perturb}(s_k) \]

The generator then writes a full paragraph of evidence based on these perturbed factoids.

\[ e^{i} = \mathsf{EvidenceGen}(q, \{ s_1^{p_1^i}, s_2^{p_2^i}, \cdots \}) \]

Crucially, this allows the researchers to control the intensity of the conflict. By counting how many factoid positions are perturbed in one piece of evidence but not the other (an element-wise XOR over the perturbation flags, normalized by the number of factoids \(n\)), they can assign a conflict score \(\hat{f}\).

\[ \hat{f}(e^i, e^j) = \frac{\operatorname{Sum}(p^i \oplus p^j)}{n} \]
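
As a small worked example, the score is just the fraction of factoid positions whose perturbation flags differ between the two pieces of evidence. The sketch below assumes each evidence item is represented by a boolean vector `p` marking which of its \(n\) factoids were perturbed.

```python
from typing import Sequence


def conflict_score(p_i: Sequence[bool], p_j: Sequence[bool]) -> float:
    """Fraction of factoid positions perturbed in one evidence but not the other
    (an element-wise XOR averaged over the n factoids)."""
    assert len(p_i) == len(p_j), "both evidences must cover the same n factoids"
    return sum(a != b for a, b in zip(p_i, p_j)) / len(p_i)


# e^i keeps all four factoids intact; e^j flips the second and fourth ones.
print(conflict_score([False, False, False, False], [False, True, False, True]))  # 0.5
```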

This creates a rich dataset containing various types of conflicts, such as Temporal (dates), Numerical, or Negation (did/did not).

Table 1: Example conflicting evidence pairs. Spans in brown colour highlight the conflicting parts.

Experiment 1: Can AI Detect Conflicts?

The first research question (RQ1) is: How well can existing methods detect evidence conflicts?

The researchers evaluated three types of detectors:

  1. NLI Models: Natural Language Inference models (like DeBERTa) trained specifically to spot entailment or contradiction.
  2. Factual Consistency (FC) Models: Models like AlignScore designed to check if a summary matches a source.
  3. LLMs: GPT-4, Llama-3, Claude 3, etc., prompted to answer “Yes/No” regarding whether two texts conflict (a prompt sketch follows this list).
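
To illustrate the third detector type, here is a minimal sketch of an LLM-based conflict checker. The prompt wording and the `call_llm` helper are assumptions for illustration, not the paper's exact prompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical chat-model call; replace with your own client."""
    raise NotImplementedError


def detects_conflict(evidence_a: str, evidence_b: str) -> bool:
    """Ask the model whether the two passages contradict each other."""
    prompt = (
        "Do the following two passages contain conflicting information?\n\n"
        f"Passage A: {evidence_a}\n\nPassage B: {evidence_b}\n\n"
        "Answer with exactly one word: Yes or No."
    )
    return call_llm(prompt).strip().lower().startswith("yes")
```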

Findings on Answer Conflicts

The results, shown in Table 2, reveal a distinct pattern.

Table 2: Answer conflict detection results (%). Precision (P), Recall (R), and F1-score (F1) are reported. We present mean performance on the two source datasets. “Short” and “Long” denote evidence of sentence-level and paragraph-level length. More results are in Appendix A.3.

Key Takeaways:

  • High Precision, Low Recall: LLMs and NLI models are generally conservative. They are very accurate when they do flag a conflict (High Precision), but they miss a lot of them (Low Recall), particularly weaker models like Mixtral or GPT-3.5.
  • Context Length Matters: NLI models, often trained on single sentences, struggle when the evidence becomes paragraph-length (“Long”). LLMs are more robust to length.
  • GPT-4 Dominance: Stronger models like GPT-4 and Llama-3-70B offer the best balance of precision and recall.

The “Pollution” Attack

The authors also tested a “Pollution” scenario. This mimics misinformation attacks where a bad actor takes an existing article and edits just enough text to flip the conclusion while keeping the rest of the text identical.

This is difficult because the two pieces of evidence share high lexical overlap (similar words).

Table 4: Conflict detection accuracy (%) on each type of evidence pair under the answer pollution attack (“polluted”) or not (“direct”). The type with the highest accuracy for each model is underlined.

As seen in Table 4, Factual Consistency (FC) models (like AlignScore) fail badly here: because the wording is nearly identical, they score the texts as consistent. NLI models and LLMs, by contrast, actually perform better on these polluted pairs, likely because the contradiction is confined to a short, easily compared span.
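
To see why surface similarity is misleading here, the toy example below measures token-level Jaccard overlap between an original passage and a "polluted" copy in which a single word flips the conclusion; the overlap stays high even though the claims now contradict each other. The passages are made up for illustration.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Token-level Jaccard overlap between two passages."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)


original = ("A 2021 cohort study of 50,000 adults found that moderate coffee "
            "intake reduces the risk of heart disease.")
polluted = ("A 2021 cohort study of 50,000 adults found that moderate coffee "
            "intake increases the risk of heart disease.")

print(round(jaccard(original, polluted), 2))  # high overlap, opposite conclusions
```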

Nuance and Intensity

Are all conflicts created equal? The researchers used the Factoid Conflict dataset to vary the intensity of the clash (i.e., how many facts disagree).

Table 5: Detection accuracy (%) with varying intensity of conflict or corroboration between evidence pairs. The standard deviation (σ) for the categories “Low”, “Medium”, and “High” is reported following the accuracy columns, with values greater than 10 bolded.

Table 5 shows a clear trend: It is harder to detect subtle conflicts.

  • When conflict intensity is “Low” (only 1 out of 4 facts disagree), detection accuracy drops for everyone.
  • However, state-of-the-art models (GPT-4, Llama-3-70B) remain robust, acting as better “needle in a haystack” finders than their smaller counterparts.

Experiment 2: How Does AI Resolve Conflicts?

Detecting a conflict is only half the battle. In a RAG system, the user eventually wants an answer. The second research question (RQ2) asks: What are the typical behaviors in answering questions with conflicting evidence?

The researchers categorized LLM responses into several types (a small classifier sketch follows the list):

  • Refrain: The model refuses to answer because of the conflict (ideal for safety).
  • Integration: The model tries to merge the two sides (sometimes valid, sometimes hallucinated).
  • Internal Knowledge: The model ignores the text and answers based on what it learned during training.
  • Chance/Bias: The model picks one side arbitrarily without justification.
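
One way to assign these labels at scale is to use another LLM as a judge. The sketch below is an illustrative assumption about how such a classifier could be prompted, not the paper's exact annotation protocol; `call_llm` is a hypothetical helper.

```python
BEHAVIORS = ["Refrain", "Integration", "Internal Knowledge", "Chance/Bias"]


def call_llm(prompt: str) -> str:
    """Hypothetical chat-model call; replace with your own client."""
    raise NotImplementedError


def classify_behavior(question: str, evidence_a: str, evidence_b: str, response: str) -> str:
    """Label how the model's answer handled the conflicting evidence."""
    prompt = (
        f"Question: {question}\n"
        f"Evidence A: {evidence_a}\nEvidence B: {evidence_b}\n"
        f"Model response: {response}\n\n"
        "The two pieces of evidence conflict. Which behavior best describes the response?\n"
        + "\n".join(f"- {b}" for b in BEHAVIORS)
        + "\nReply with the behavior name only."
    )
    label = call_llm(prompt).strip()
    return label if label in BEHAVIORS else "Chance/Bias"
```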

The Behavior Distribution

Figure 4: Distribution of conflict resolution behaviors.

Figure 4 compares Claude 3 Haiku vs. Sonnet.

  • Bias is common: A shocking percentage of the time (38.1% for Haiku), models resolve conflicts “by chance.” They pick a winner without explaining why.
  • Improvements in stronger models: The stronger model (Sonnet) is more likely to Refrain from answering (36.4%) compared to the weaker Haiku (22.0%). This suggests that as models get smarter, they become more cautious about contradictions.

Intensity Influences Bravery

The intensity of the conflict also dictates the model’s strategy.

Figure 5: Proportions of factoid conflict resolution behaviors, stratified by annotated intensity of conflicts.

Figure 5 shows that as the conflict shifts from “Weakly conflicting” to “Strongly conflicting,” the Refrain rate (dark teal bar) skyrockets. If the contradiction is total, the model stops. If the contradiction is minor, the model attempts Integration (medium teal)—trying to smooth over the differences.

The Role of Internal Beliefs

Perhaps the most fascinating finding regarding resolution is the role of Parametric Knowledge (the model’s internal beliefs).

If an LLM already “knows” the answer (e.g., it knows the capital of France is Paris), and it sees conflicting evidence (one doc says Paris, one says Lyon), what does it do?

Figure 16: Impact of models’ internal belief on conflict resolution behaviors.

Figure 16 compares model behavior when it has a prior belief (“w/ belief”) vs. when it doesn’t (“w/o belief”).

  • Confirmation Bias: When the model has a belief (purple bars), the “Resolve by chance” and “Resolve by internal knowledge” rates go up.
  • Objectivity Loss: When the model has a belief, it is less likely to Refrain from answering. It becomes overconfident, ignoring the external evidence conflict in favor of its own memory.
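
One plausible way to set up the "w/ belief" vs. "w/o belief" comparison is to first ask the model the question closed-book, with no evidence, and record whether it already commits to an answer. The sketch below assumes that protocol and a hypothetical `call_llm` helper; the paper's exact setup may differ.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical chat-model call; replace with your own client."""
    raise NotImplementedError


def probe_belief(question: str) -> str | None:
    """Closed-book probe: does the model already commit to an answer?"""
    answer = call_llm(
        "Answer the question from memory only. If you are unsure, say 'unsure'.\n"
        f"Question: {question}"
    ).strip()
    return None if "unsure" in answer.lower() else answer


def answer_with_conflict(question: str, evidence_a: str, evidence_b: str) -> str:
    """Open-book query with two conflicting passages in the context."""
    return call_llm(
        "Use only the evidence below to answer.\n"
        f"Evidence A: {evidence_a}\nEvidence B: {evidence_b}\n"
        f"Question: {question}"
    )


# Bucket each question by whether probe_belief() returned an answer ("w/ belief")
# or None ("w/o belief"), then compare the distribution of resolution behaviors.
```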

Conclusion

The “ECON” paper sheds light on a critical vulnerability in modern AI systems. As we increasingly rely on LLMs to synthesize information from the web, the ability to handle conflicting evidence becomes paramount.

The authors have provided three major contributions:

  1. A Method for Data: A pipeline to generate high-quality, labeled evidence conflicts (Answer and Factoid levels).
  2. Detection Insights: While GPT-4 and strong NLI models are good at spotting conflicts, standard Factual Consistency metrics are easily fooled by lexical similarity.
  3. Resolution Hazards: When forced to resolve conflicts, LLMs often act biased, picking winners arbitrarily or leaning on their pre-training memory rather than objectively analyzing the conflicting documents.

For students and developers working on RAG systems, the takeaway is clear: Do not assume your retriever provides consistent truth. Implementing a conflict detection step—potentially using an NLI model or a strong LLM checker—is essential before letting a model generate a final answer. Without it, your AI might just be flipping a coin.
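
As a closing illustration, here is a minimal sketch of such a gate: check retrieved passages pairwise for conflicts before generation, and surface the disagreement instead of answering when one is found. The `detects_conflict` and `generate_answer` helpers are assumptions standing in for whatever detector and generator your stack uses.

```python
from itertools import combinations
from typing import List


def detects_conflict(evidence_a: str, evidence_b: str) -> bool:
    """Plug in an NLI model or an LLM Yes/No checker here."""
    raise NotImplementedError


def generate_answer(question: str, passages: List[str]) -> str:
    """Plug in your usual RAG generation call here."""
    raise NotImplementedError


def answer_with_conflict_gate(question: str, passages: List[str]) -> str:
    """Refuse (or at least flag) when retrieved passages contradict each other."""
    conflicts = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(passages), 2)
        if detects_conflict(a, b)
    ]
    if conflicts:
        i, j = conflicts[0]
        return (
            f"The retrieved sources disagree (e.g., passages {i} and {j}); "
            "please review them before trusting an answer."
        )
    return generate_answer(question, passages)
```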