We live in an era where Large Language Models (LLMs) can summarize a book or analyze a legal contract in seconds. However, for anyone using these tools for serious research or work, a nagging question remains: Can I trust this?
LLMs are notorious for hallucinations—generating plausible-sounding but completely incorrect information. When you are using an LLM as a “long document assistant”—for example, asking it to extract specific clauses from a 50-page PDF—accuracy is non-negotiable. To build trust, we need the model to do one of two things: Attribute (provide evidence for its claims) or Abstain (admit when the answer isn’t there).
In this post, we dive deep into the paper Attribute or Abstain: Large Language Models as Long Document Assistants. We will explore how researchers evaluated different strategies for making LLMs more accountable, shifting the focus from simple question-answering to verifiable, evidence-based reasoning.
The Problem: Hallucination vs. Verifiability
Imagine asking a research assistant, “What is the size of the dataset used in this paper?” If the assistant answers “567 reviews,” you have to take their word for it. But if they answer, “567 reviews, as mentioned in Table 2 on page 4,” you can verify it immediately.
This is the core of the problem. Current LLMs are often like the first assistant—confident but opaque. The goal is to transform them into the second type.

As shown in Figure 1, an ideal system has three potential outcomes:
- Attribute: Answer the question and point to the text snippet (evidence) that supports it.
- Abstain: Recognize that the document does not contain the answer and explicitly state “Not answerable.”
- Fail: Hallucinate an answer or provide irrelevant evidence (the outcome we want to avoid).
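To make these outcomes concrete, here is a minimal sketch of what a structured response from such a system might look like. The schema and field names are illustrative assumptions on my part, not something defined in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class AssistantResponse:
    """Illustrative output schema for an attribute-or-abstain assistant."""
    answer: str                                             # the generated answer, or "Not answerable"
    evidence_ids: list[int] = field(default_factory=list)   # indices of cited document segments
    abstained: bool = False                                  # True when the model declares the question unanswerable

# Attribute: an answer plus pointers into the document
attributed = AssistantResponse(answer="567 reviews", evidence_ids=[42])

# Abstain: an explicit refusal instead of a hallucinated guess
refused = AssistantResponse(answer="Not answerable", abstained=True)
```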
While “Retrieval Augmented Generation” (RAG) is a popular method for connecting LLMs to external data (like Wikipedia), this paper focuses on a different, specific scenario: Long Document tasks. Here, the “database” is a single, lengthy document that fits (or nearly fits) into the model’s context window. The challenge isn’t finding a needle in a haystack of millions of documents; it’s reasoning accurately over one dense text without losing focus.
The LAB Benchmark
To study this, the researchers introduced LAB (Long-document Attribution Benchmark). They needed a diverse set of testing grounds to ensure their findings weren’t specific to just one type of text.
They compiled 6 datasets spanning science, law, government, and general knowledge.

As listed in Table 1, the tasks vary significantly:
- QASPER: Questions about scientific papers.
- ContractNLI: Checking legal contracts for specific clauses (entailment).
- GovReport: Summarizing government reports.
- Evidence Inference: Determining medical outcomes from clinical trial texts.
This diversity is crucial because a method that works for summarizing a government report might fail when checking a strict legal definition.
Core Method: How to Force an LLM to Cite Sources
The heart of this research lies in how we ask the model to perform the task. Do we let it read the whole document, or do we force it to search first? The authors experimented with several distinct architectures for attribution.

Figure 2 illustrates the five approaches compared in the study. Let’s break down the three most important ones:
1. Post-Hoc (Generate, then Verify)
In this approach, the model acts like a student taking a test who writes down the answer first, then goes back to the textbook to find a quote that backs it up.
- Step 1: The LLM generates a response \(R\) based on the document \(D\).
- Step 2: A search mechanism uses that response to find the best evidence segments \(E\) in the document.
- Pros: This separates the difficulty of writing from the difficulty of searching.
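Here is a minimal sketch of the post-hoc recipe, assuming a generic `llm()` callable and an off-the-shelf sentence embedder for the evidence search; the function names and the choice of `sentence-transformers` are my assumptions, not the paper’s implementation.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def post_hoc_attribute(llm, question: str, segments: list[str], top_k: int = 2):
    """Step 1: answer from the full document. Step 2: search for supporting evidence."""
    document = "\n".join(segments)
    response = llm(f"Document:\n{document}\n\nQuestion: {question}\nAnswer:")

    # Rank document segments by similarity to the generated response
    seg_emb = embedder.encode(segments, convert_to_tensor=True)
    resp_emb = embedder.encode(response, convert_to_tensor=True)
    scores = util.cos_sim(resp_emb, seg_emb)[0]
    evidence_ids = scores.topk(min(top_k, len(segments))).indices.tolist()
    return response, evidence_ids
```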
2. Retrieve-then-Read (Standard RAG)
This is the classic search-engine approach.
- Step 1: Retrieve relevant chunks of text \(E\) from the document based on the question.
- Step 2: Feed only those chunks to the LLM to generate the answer.
- Pros: Less data for the LLM to process.
- Cons: If the retriever misses the relevant paragraph in step 1, the LLM has zero chance of answering correctly.
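A similar sketch for retrieve-then-read, again with a placeholder `llm()` and the same kind of embedder; the segment granularity and `top_k` value are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_then_read(llm, question: str, segments: list[str], top_k: int = 5):
    """Step 1: retrieve the segments most similar to the question.
    Step 2: generate the answer from only those segments."""
    seg_emb = embedder.encode(segments, convert_to_tensor=True)
    q_emb = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, seg_emb)[0]
    evidence_ids = scores.topk(min(top_k, len(segments))).indices.tolist()

    context = "\n".join(segments[i] for i in evidence_ids)
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    # If the retriever missed the key segment, the LLM has no way to recover it here.
    return answer, evidence_ids
```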
3. Citation (The “Power User” Method)
Here, the LLM is prompted to perform both tasks simultaneously. It reads the document and generates the answer with embedded citations (e.g., “The dataset contains 500 images [1]…”).
- Mechanism: The model produces the response and the evidence pointers in a single pass.
- Pros: The model has the full context of the document and can weave evidence naturally into its reasoning.
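A minimal sketch of citation-style prompting: the document segments are numbered in the prompt, the model is asked to emit `[i]` markers, and the citations are parsed out afterwards. The prompt wording and the regex are assumptions for illustration.

```python
import re

def citation_answer(llm, question: str, segments: list[str]):
    """Single pass: the model answers and cites numbered segments inline."""
    numbered = "\n".join(f"[{i}] {seg}" for i, seg in enumerate(segments))
    prompt = (
        f"Document segments:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer the question and cite the supporting segments as [i]. "
        "If the document does not contain the answer, reply 'Not answerable'."
    )
    response = llm(prompt)
    # Collect every cited segment index that actually exists in the document
    evidence_ids = sorted({int(m) for m in re.findall(r"\[(\d+)\]", response)
                           if int(m) < len(segments)})
    return response, evidence_ids
```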
The researchers also tested “Reduced” versions (Reduced-Post-Hoc and Reduced-Citation), where they used a retriever to shrink the document down to the top 10 most relevant segments before giving it to the LLM, attempting to save on computational costs and context window usage.
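The reduction step can be sketched as a question-based pre-filter that feeds either of the earlier methods; `top_k=10` mirrors the top-10 setting described above, while the function name is hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def reduce_document(question: str, segments: list[str], top_k: int = 10) -> list[str]:
    """Keep only the segments most relevant to the question, in their original order."""
    seg_emb = embedder.encode(segments, convert_to_tensor=True)
    q_emb = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, seg_emb)[0]
    keep = sorted(scores.topk(min(top_k, len(segments))).indices.tolist())
    return [segments[i] for i in keep]

# Reduced-Citation: shrink first, then run the citation prompt on the smaller document.
# reduced_segments = reduce_document(question, segments)
# response, evidence_ids = citation_answer(llm, question, reduced_segments)
```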
Experiments & Results
The researchers tested these methods on five LLMs of varying sizes:
- Large/Proprietary: GPT-3.5 and GPT-4.
- Open Source/Smaller: LongChat (7B), Mistral (7B), and a fine-tuned Flan-T5.
They measured success using metrics for Response Quality (is the answer right?) and Evidence Quality (is the citation accurate?).
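Response quality is scored with task-specific metrics (e.g., F1 or ROUGE), while evidence quality compares the cited segments against the gold evidence. Below is a minimal sketch of a segment-level evidence F1; it captures the general form of such a metric rather than the paper’s exact definition.

```python
def evidence_f1(predicted_ids: set[int], gold_ids: set[int]) -> float:
    """Segment-level F1 between cited evidence and gold evidence."""
    if not predicted_ids and not gold_ids:
        return 1.0  # both empty, e.g. a correctly abstained example
    if not predicted_ids or not gold_ids:
        return 0.0
    tp = len(predicted_ids & gold_ids)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_ids)
    recall = tp / len(gold_ids)
    return 2 * precision * recall / (precision + recall)

print(evidence_f1({3, 7}, {3, 8}))  # 0.5: one of two citations matches the gold evidence
```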
RQ1: What is the Best Approach?
The results revealed a fascinating split depending on the “intelligence” or size of the model.

Table 3 provides the comprehensive scoreboard. Here are the key takeaways:
- Smart Models Should Cite: For highly capable models like GPT-4 and the fine-tuned Flan-T5, the Citation method (generating answer and evidence together) generally performed best. These models are capable enough to hold the answer and the source in their “head” simultaneously.
- Smaller Models Need Help: Smaller models like Longchat and Mistral struggled with the Citation method. They performed better with the Post-Hoc approach. This suggests that smaller models lack the instruction-following capability to multitask; they need to decompose the problem into “answer first, find proof later.”
- Retrieval Can Hurt: Interestingly, the Retrieve-then-Read approach often performed worse than simply letting the model read the whole document (Citation). In the context of a single long document, pre-filtering the text risks cutting out crucial context or dispersed information that the model needs to synthesize an answer.
RQ2: The “Lost in the Middle” Phenomenon
A known issue in LLMs is the “Lost in the Middle” effect—where models are great at remembering information at the start or end of a prompt, but forget information buried in the middle. The researchers investigated if this bias applied to attribution.

Figure 3 (Top) compares where the models found evidence versus where the evidence actually was (Gold Evidence).
- Findings: Surprisingly, they did not find a strong “Lost in the Middle” effect for attribution. The models (colored bars) generally matched the distribution of the ground truth (striped bars). They were able to find evidence regardless of where it was located in the text.
However, looking at Figure 3 (Bottom), which plots Response Quality, we see a different story. The downward trend in the lines indicates that while models can find the evidence anywhere, their ability to form a correct answer decreases when the relevant info is located towards the end of the document.
RQ3: Can Evidence Predict Accuracy?
If an LLM provides a high-quality citation, does that mean the answer is correct? If so, we could use “Attributability” (the quality of the evidence) as a proxy for confidence. If the model can’t cite its work, we should program it to abstain.

Table 4 shows the results of “Selective Prediction”—checking if filtering out responses with bad evidence improves the overall accuracy score.
- The Good News: For datasets involving single facts (like Natural Questions or Evidence Inference), high-quality evidence strongly correlated with high-quality answers. If the model cited a source well, it was usually right.
- The Bad News: For complex tasks like GovReport (summarization) or QASPER (multi-hop reasoning), this correlation broke down.
- Why? The researchers found that models often gave the correct answer but failed to cite all the necessary evidence for complex claims. This mismatch makes it dangerous to rely solely on citation quality as a filter for complex tasks, as you might throw away correct answers simply because the model was lazy with its footnotes.
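A hedged sketch of this selective-prediction setup: keep only the responses whose evidence score clears a threshold, then measure accuracy on what remains. The data layout and threshold are illustrative assumptions.

```python
def selective_accuracy(examples, threshold: float = 0.5):
    """examples: list of (evidence_score, answer_is_correct) pairs.
    Filter by evidence quality, then score the remaining answers."""
    kept = [correct for score, correct in examples if score >= threshold]
    coverage = len(kept) / len(examples)
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy

# If evidence quality is a good confidence proxy, accuracy should rise as coverage falls.
cov, acc = selective_accuracy([(0.9, True), (0.8, True), (0.2, False), (0.1, True)])
print(f"coverage={cov:.2f}, accuracy={acc:.2f}")  # coverage=0.50, accuracy=1.00
```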
Conclusion and Implications
The paper “Attribute or Abstain” provides a roadmap for building more reliable AI assistants for professional workflows.
The most immediate takeaway for students and developers is that architecture matters. You cannot simply throw a long PDF at an LLM and hope for the best.
- If you have the budget for GPT-4, prompt it to use inline citations. It effectively utilizes the full context.
- If you are building with smaller, open-source models (like 7B parameter models), you should build a pipeline: let the model answer freely, and then run a separate process to verify that answer against the text (Post-Hoc).
Furthermore, the “Abstain” capability remains a frontier. While models are getting better at citing what they know, they still struggle to reliably admit what they don’t know, especially when the answer requires synthesizing information scattered across a long text. As we move forward, the ability of an AI to say “I checked the document, and it’s not there” will be just as valuable as the ability to generate an answer.