Fixing the Trust Gap: How Coarse-Grained Decomposition Improves AI Citations

In the rapidly evolving world of Generative AI, trust is the new currency. We all marvel at the fluency of Large Language Models (LLMs) like GPT-4 or Claude, but a persistent shadow hangs over their output: hallucinations. When an AI answers a complex question based on a long document, how do we know it isn’t making things up?

The industry standard solution is attribution—citing sources. Just like a student writing a thesis, an AI should point to the exact sentence in a source document that supports its claims. However, this is easier said than done, especially when the AI generates a long, complex answer that synthesizes multiple facts.

Today, we are diving deep into a fascinating research paper titled “Enhancing Post-Hoc Attributions in Long Document Comprehension via Coarse Grained Answer Decomposition” by researchers from Adobe Research. This paper tackles a specific, thorny problem: Granularity. When an AI generates a complex sentence, how do we break it down to find the right evidence? If we break it down too much, we lose context. If we don’t break it down enough, we can’t find specific proof.

The researchers propose a novel method called Coarse Grained Answer Decomposition (CoG), which uses the context of the user’s question to intelligently split answers into verifiable “information units.”

Let’s unpack how this works, why it matters, and how it outperforms existing methods.


1. The Problem: The Challenge of “Post-Hoc” Attribution

First, let’s establish the setting. We are dealing with Post-Hoc Attribution.

Imagine you have a Question Answering (QA) system. You feed it a 50-page PDF and ask a question. The system generates an answer. Post-hoc attribution is the process that happens after the answer is generated. It acts like a fact-checker, looking back at the answer and the original PDF to verify which sentences in the PDF support the specific claims in the answer.

The Granularity Dilemma

The core challenge the authors address is: what exactly should be attributed?

  • Scenario A (Too Broad): If the AI generates a paragraph, pointing to a whole page in the source document is unhelpful.
  • Scenario B (Too Specific): If we try to verify every single noun and verb (fine-grained), we might find “evidence” for the word “cat” but fail to prove the cat was actually “sitting on the mat.”
  • Scenario C (The Complexity Trap): Answers are often complex. Consider the sentence: “To paint cast iron, you should first coat it with oil-based primer to create a smooth surface.”

This single sentence contains two distinct facts:

  1. You must use an oil-based primer.
  2. The primer creates a smooth surface.

Finding a single sentence in the source document that proves both simultaneously might be impossible. The source document might mention the primer on page 1 and the smoothing effect on page 2.

Existing methods often treat answers as singular blocks or break them down into “atomic facts” without considering the context of the question. This leads to under-attribution (missing citations) or over-attribution (citing irrelevant text).


2. The Solution: Coarse Grained Answer Decomposition (CoG)

The researchers introduce a pipeline designed to solve the granularity issue. The core idea is simple but powerful: Don’t just break the answer down; break it down based on the question.

The Pipeline

As shown in the figure below, the process involves taking the generated answer and passing it through a “CoG Decomposition” step before trying to find evidence.

Figure 2: Pipeline for attribution: answers are decomposed and sent to the attributor to identify evidence.

The pipeline has two main stages:

  1. Decomposition: Breaking the complex answer into smaller “Information Units” (IUs).
  2. Attribution: Using a Retriever or an LLM to find evidence for those specific units.
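
To make the two stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. The `decompose_with_cog` prompt wording, the `llm.complete` call, and the `retriever.top_k` method are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Attribution:
    information_unit: str     # one decomposed piece of the answer
    evidence: list[str]       # supporting sentences from the source document

def decompose_with_cog(question: str, answer_sentence: str, llm) -> list[str]:
    """Hypothetical wrapper: ask an LLM to split one answer sentence into
    question-aware information units (may return [] for conversational filler)."""
    prompt = (
        "Break the answer sentence into coarse-grained information units "
        "that are relevant to the question.\n"
        f"Question: {question}\nAnswer sentence: {answer_sentence}\nUnits:"
    )
    return llm.complete(prompt)   # assumed to parse the output into a list of strings

def attribute(question: str, answer_sentences: list[str],
              document_sentences: list[str], llm, retriever) -> list[Attribution]:
    results = []
    for sentence in answer_sentences:
        # Stage 1: coarse-grained, question-aware decomposition
        units = decompose_with_cog(question, sentence, llm)
        # Stage 2: find evidence for each information unit
        for unit in units:
            evidence = retriever.top_k(unit, document_sentences, k=1)  # hypothetical retriever API
            results.append(Attribution(unit, evidence))
    return results
```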

Why “Coarse Grained”?

Previous state-of-the-art methods, like FActScore, focus on “atomic facts”—the smallest possible pieces of information. While rigorous, atomic facts often strip away the semantic glue that holds an idea together.

The Adobe researchers propose Coarse Grained (CoG) decomposition. They want “chunks” of information that are:

  • Relevant: They actually matter to the user’s question.
  • Meaningful: They stand on their own as statements.
  • Contextual: Pronouns and references in them are resolved against the question (co-reference).

The Role of the Question

This is the paper’s “secret sauce.” Most decomposition methods look only at the Answer. CoG looks at the Question + Answer.

If the question is “How do I paint cast iron?”, the decomposition model knows to focus on the steps and materials. If the answer contains fluff like “That is a great question, here is what I found…”, CoG knows to ignore it because it’s irrelevant to the question.

Below is an example from the Verifiability dataset used in the paper. Notice the text in red. The system needs to be smart enough not to try and attribute the polite conversational filler.

Figure 1: An example from the Verifiability dataset. The input to the post-hoc attribution system is the question, document, and answer. The output is evidence sentences from the document. Text marked in red does not require attribution.

Negative Sampling: Teaching the AI What Not To Do

To get the LLM to decompose answers correctly, the researchers used In-Context Learning (ICL) with a twist. Usually, prompts provide examples of “good” outputs. The authors found that LLMs struggle with negation (i.e., instructions like “don’t do X”).

Instead of just telling the model “don’t create duplicate facts,” they provided examples of Bad Information Units in the prompt (Negative Sampling).

  • Good Unit: “The music video was filmed at Oheka Castle.”
  • Bad Unit (Redundant): “The song is ‘For You’.” (If the answer is already about the song ‘For You’, stating this fact again is useless noise).
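
As a rough illustration of what such a prompt might look like, here is a sketch with one positive and one negative in-context example. The example question and the exact wording are invented for illustration; the paper's actual prompts are not reproduced here.

```python
DECOMPOSITION_PROMPT = """Split the answer sentence into coarse-grained information \
units that are relevant to the question.

Question: Where was the music video for 'For You' filmed?
Answer sentence: The music video for the song 'For You' was filmed at Oheka Castle.
Good units:
- The music video was filmed at Oheka Castle.
Bad units (do NOT produce units like these):
- The song is 'For You'. (redundant: the question is already about 'For You')

Question: {question}
Answer sentence: {sentence}
Units:"""

def build_decomposition_prompt(question: str, sentence: str) -> str:
    # In-context learning with both positive and negative (bad) example units
    return DECOMPOSITION_PROMPT.format(question=question, sentence=sentence)
```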

3. The Methodology: A Deep Dive

Let’s look under the hood at the specific components that make this system work.

3.1 The Rule-Based Classifier

Decomposing every single sentence using a powerful LLM is expensive and slow. The authors realized that simple sentences don’t need to be broken down.

For example: “Alex is an engineer.” This is a single clause. Decomposing it is a waste of time.

To solve this, they implemented a classifier based on linguistic rules (Part-of-Speech tagging).

\[
\text{simple}(S) \iff S \subseteq \{N, P, V, A\} \;\wedge\; \bigl|S \cap \{N, P, A\}\bigr| \geq |S| - 1
\]

Don’t let the notation scare you. Here is the translation:

  • \(S\): The set of tags in the sentence.
  • \(N, P, V, A\): Noun, Pronoun, Verb, Article.
  • The logic checks whether the sentence consists only of nouns, pronouns, and articles, plus at most one verb (i.e., at least \(|S| - 1\) of the tags are non-verb tags).

If a sentence meets these criteria (a simple independent clause), it skips the decomposition step and goes straight to the attributor. This simple filter improves efficiency without sacrificing accuracy.
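
Here is a minimal sketch of such a rule using NLTK's POS tagger. The tag sets and the exact decision rule are assumptions that encode the idea described above (only nouns, pronouns, and articles, plus at most one verb), not the paper's precise implementation.

```python
import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

NOUN_PRONOUN_ARTICLE = {"NN", "NNS", "NNP", "NNPS",   # nouns
                        "PRP", "PRP$",                # pronouns
                        "DT"}                         # articles / determiners
VERBS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def is_simple_sentence(sentence: str) -> bool:
    """Return True if the sentence looks like a single independent clause:
    only nouns, pronouns, and articles, plus at most one verb."""
    tags = [tag for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence))
            if token.isalnum()]                       # ignore punctuation tokens
    verb_count = sum(1 for t in tags if t in VERBS)
    allowed = all(t in NOUN_PRONOUN_ARTICLE or t in VERBS for t in tags)
    return allowed and verb_count <= 1

# is_simple_sentence("Alex is an engineer.")  -> True  (skips decomposition)
# is_simple_sentence("To paint cast iron, coat it with primer to create a smooth surface.")  -> False
```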

3.2 The Attributors

Once the answer is broken down into Information Units (IUs), the system needs to find the evidence. The paper explores two ways to do this:

  1. Retrievers: Traditional sparse search algorithms (like BM25) or neural retrievers and rerankers (like GTR, MonoT5). These rank sentences in the document based on how well they match the IU.
  2. LLMs: Asking a model like GPT-4 or LLaMA-2 to look at the document and select the evidence.
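
For the retriever route, a bare-bones sentence-level attributor can be sketched with the rank_bm25 package. This is an assumption for illustration; the paper also evaluates dense and reranking models, which are not shown here.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def bm25_attribute(information_unit: str, document_sentences: list[str], k: int = 1) -> list[str]:
    """Rank the document's sentences against one information unit and
    return the top-k candidates as evidence."""
    tokenized_corpus = [s.lower().split() for s in document_sentences]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(information_unit.lower().split())
    ranked = sorted(range(len(document_sentences)), key=lambda i: scores[i], reverse=True)
    return [document_sentences[i] for i in ranked[:k]]
```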

3.3 Greedy Merging

When using retrievers, we might get multiple “hits” for different parts of the answer. The researchers developed a Greedy Merging Algorithm to organize this.

Algorithm 1: Merging of Evidence for an Answer Part

The algorithm ensures that for every piece of information in the answer, we grab the highest-scoring evidence sentence. If multiple information units point to the same evidence, we don’t list it twice. The goal is a clean, sorted list of evidence that covers the whole answer without redundancy.
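
The paper's Algorithm 1 isn't reproduced verbatim here, but its gist can be sketched as follows: keep the best-scoring evidence sentence for each information unit, skip duplicates, and return the evidence in document order. The exact tie-breaking and ordering are assumptions.

```python
def greedy_merge(unit_hits: dict[str, list[tuple[int, float]]]) -> list[int]:
    """unit_hits maps each information unit to (sentence_index, score) pairs
    returned by the retriever. Returns a deduplicated, document-ordered list
    of evidence sentence indices covering every unit."""
    selected: set[int] = set()
    for unit, hits in unit_hits.items():
        if not hits:
            continue                                         # no evidence found for this unit
        best_index, _ = max(hits, key=lambda pair: pair[1])  # highest-scoring evidence
        selected.add(best_index)                             # a set avoids listing evidence twice
    return sorted(selected)                                  # present evidence in document order
```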


4. Experiments and Results

The researchers tested their method on two datasets:

  1. Citation Verifiability Dataset: Questions from search engines (Bing/Google) where answers have inline citations.
  2. QASPER: A dataset of questions based on NLP research papers, requiring deep technical understanding.

They compared three scenarios:

  • NIL: No decomposition (using the whole answer sentence as the query).
  • FActScore: Fine-grained atomic fact decomposition (the previous baseline).
  • CoG: The proposed coarse-grained, question-aware decomposition.
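
The tables that follow report precision (P), recall (R), and F1 over the retrieved evidence sentences versus the human-annotated ground truth. A minimal sketch of that scoring, assuming sentence-level exact matching of evidence indices (the paper's exact protocol may differ):

```python
def evidence_prf(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Sentence-level precision/recall/F1 between predicted and
    ground-truth evidence sentence indices for a single question."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```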

Result 1: Retrievers Perform Better with CoG

Table 1 below shows the performance when using standard retrieval methods (BM25, GTR, MonoT5).

Table 1: Retrieval-based attributor results

Key Takeaway: Look at the Precision (P) and F1 scores for CoG. Across the board, CoG outperforms FActScore.

  • On the Verifiability dataset, CoG+MonoT5 achieves an F1 of 0.62, compared to 0.55 for FActScore.
  • Interestingly, NIL (no decomposition) performs quite well on QASPER. The authors note that QASPER often contains extractive answers (copy-pasted from the paper), so breaking them down didn’t help as much. However, for abstractive answers (Verifiability), decomposition is king.

Result 2: LLMs become State-of-the-Art Attributors

The results get even more interesting when using LLMs (like GPT-4) to find the evidence.

Table 2: LLM-based attribution results

Key Takeaway: Using CoG with GPT-4 creates a powerhouse system.

  • On the Verifiability dataset, the Recall (R) jumps to 0.79 with CoG, compared to 0.69 with FActScore.
  • This suggests that when we feed an LLM cleaner, more contextualized “chunks” of information, it becomes much better at spotting the supporting evidence in the text.
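
As a sketch of the LLM-as-attributor setup, the call below asks a chat model to pick supporting sentences for one information unit. It assumes the OpenAI Python SDK (v1-style) and GPT-4; the actual prompt and output parsing used in the paper are not reproduced here.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_attribute(information_unit: str, numbered_sentences: list[str]) -> str:
    """Ask the model which numbered document sentences support the unit."""
    document = "\n".join(f"[{i}] {s}" for i, s in enumerate(numbered_sentences))
    prompt = (
        "Document sentences:\n" + document + "\n\n"
        f"Claim: {information_unit}\n"
        "List the indices of the sentences that support the claim, or 'none'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```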

Why did FActScore fail?

The authors provide a compelling analysis of why the “atomic fact” approach (FActScore) underperformed.

Figure 3: Average number of decompositions per sentence for each method.

As shown in the chart above, FActScore (Blue bar) generates significantly more decompositions per sentence (over 3.3) compared to CoG (around 1.6).

While more decompositions might sound more “detailed,” in retrieval, more often means noise. By slicing the sentence into tiny atoms, FActScore diluted the semantic meaning, making it harder for the retriever to match the query to the document. CoG strikes the “Goldilocks” balance—not too big, not too small.

Qualitative Analysis

Let’s look at a real example to see the difference in action.

Question: “paint cast iron”
Answer: “To paint cast iron, you should first coat it with oil-based primer…”

Table 7: Qualitative example of how decomposition affects the retrieval-based attributor and the LLM-based attributor. GT refers to ground truth. Each row depicts an answer part and the respective decompositions and attributions for each method.

In the table above, look at the CoG Decompositions column. It breaks the sentence into two logical steps:

  1. Coat with oil-based primer.
  2. Priming creates a smooth surface.

Now look at the NIL-MonoT5 column (using the whole sentence). It retrieves a sentence about “wiping it down with a damp rag,” which is incorrect context. However, CoG-MonoT5 correctly retrieves the sentence about “coating it with oil-based primer.”

By splitting the sentence logically, CoG allowed the retriever to find the exact match for the specific instruction, rather than getting confused by a complex compound sentence.


5. Ablation Studies: What really matters?

The paper performs several “ablation studies” (testing the system by removing parts of it) to prove their hypotheses.

The Effect of the “Question”

One major claim was that including the Question in the decomposition prompt is vital. Table 4 compares CoG - Question (decomposition without seeing the question) vs. CoG (with the question).

Table 4: Example from the Citation Verifiability dataset. In the answer, portions highlighted in red do not need attribution. Lists show the decomposition outputs for each answer part. CoG - Question denotes coarse-grained decomposition without the question in context.

In the Answer, the first sentence is “Are you looking for information on how to paint cast iron?”

  • FActScore tries to decompose this into facts: “The person is looking for information.” (Useless fact).
  • CoG - Question also struggles, creating a unit for the introductory text.
  • CoG (Standard) returns [] (empty list) for that sentence. It knows that, in the context of the question “paint cast iron,” the polite intro sentence contains no attributable facts.

This selective attention is critical for reducing hallucinations and processing time.

The Impact on Retriever Scores

The researchers also highlighted how decomposition fixes “score dilution.”

Table 5: Example of the retriever score being affected when the whole answer part is used as the IU vs. a decomposed IU.

In this example, the original answer sentence contained multiple citations (BIBREF9, BIBREF10).

  • NIL (No decomposition): The retriever gets confused by the mix of topics and gives a low relevance score (-0.043).
  • CoG: By isolating the specific fact about “WNUT16,” the retriever focuses purely on that topic, resulting in a better relevance score (-0.021). (Note: In these models, scores are often negative distances, so closer to 0 is better).

6. Conclusion and Implications

The paper “Enhancing Post-Hoc Attributions in Long Document Comprehension via Coarse Grained Answer Decomposition” offers a significant step forward in making AI trustworthy.

Here are the key takeaways for students and practitioners:

  1. Context is King: Blindly decomposing text into “atoms” is counter-productive. By including the Question in the loop, we generate decompositions that are semantically relevant.
  2. Granularity Matters: There is a sweet spot between the whole sentence and the atomic fact. “Coarse-grained” units preserve enough meaning to be searchable but are specific enough to be verifiable.
  3. Efficiency via Rules: We don’t need AI for everything. Simple rule-based classifiers (like the POS tagger used here) can filter out simple sentences, saving computational resources.
  4. Better Inputs = Better Retrieval: Whether you use a standard search engine (BM25) or a fancy LLM (GPT-4), feeding it clean, decomposed queries yields significantly higher precision in finding evidence.

As we move toward AI agents that read entire books or legal repositories to answer our questions, methods like CoG will be the backbone of verification systems, ensuring that when an AI tells us something, it can back it up with the right receipts.