Large Language Models (LLMs) are extraordinary tools—able to answer complex questions, generate code, and summarize documents almost instantaneously. Yet they share a persistent flaw: hallucination. Ask an LLM for information, and it may confidently produce a fluent, detailed, and completely inaccurate answer.
For casual use, this is merely amusing. But for serious applications—in research, journalism, or medicine—hallucinations are catastrophic. How can we trust AI systems if we cannot verify their claims?
One of the most promising solutions is attribution. Instead of merely answering a question, an attributed model cites evidence directly in its response. For instance: “Freshwater is water that is not salty or brackish [1], and may be unsuitable for drinking without treatment [2].” With these citations, users can trace the information back to its sources and confirm its validity.
The challenge? Teaching an LLM to do that well. Current approaches require vast amounts of costly, human-labeled data or rely on distillation from proprietary models like GPT-4—neither scalable nor practical in the long run.
But what if an LLM could teach itself this skill?
That is exactly what the paper Advancing Large Language Model Attribution through Self-Improving proposes. The authors introduce START (Self-Taught AttRibuTion), a framework that enables LLMs to bootstrap their own attribution abilities without human supervision or teacher models. START makes a model learn from its own outputs—both the good and the flawed—to iteratively refine how it cites evidence and synthesizes information.
In this article, we’ll unpack how START works: how it tackles the “cold-start” problem of self-learning, how it rewards factually grounded responses, and how it even learns from mistakes.
The Challenge of Self-Improvement in Attribution
The idea of self-improvement in AI is straightforward: a model generates candidate solutions, evaluates them, keeps the high-quality ones for training, and repeats the process. This approach works well for tasks with clear correctness—like math problems—where outputs are either right or wrong.
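In pseudocode, that loop fits in a few lines. The sketch below is generic self-improvement, not START itself: it assumes a `model.generate(query)` interface, and `evaluate_fn` / `fine_tune_fn` are placeholders for whatever automatic scorer and trainer you plug in.

```python
def self_improve(model, queries, evaluate_fn, fine_tune_fn,
                 n_candidates=8, n_iterations=3, threshold=1.0):
    """Generic self-improvement: sample, filter, retrain, repeat."""
    for _ in range(n_iterations):
        kept = []
        for query in queries:
            # 1. Sample several candidate solutions per query.
            candidates = [model.generate(query) for _ in range(n_candidates)]
            # 2. Score each candidate with an automatic evaluator
            #    (exact answer match, unit tests, a verifier model, ...).
            # 3. Keep only the candidates judged good enough.
            kept += [(query, c) for c in candidates if evaluate_fn(query, c) >= threshold]
        # 4. Fine-tune the model on its own best outputs, then iterate.
        model = fine_tune_fn(model, kept)
    return model
```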
Attribution is far more complex. A good cited response must satisfy three criteria:
- Attributability: Every claim must be supported by its cited evidence.
- Comprehensiveness: The response should aggregate relevant information across multiple documents.
- Robustness: The model must resist distractions from irrelevant texts.
Generic LLMs are poor at all three. If they try self-learning directly, their generated samples are low-quality, and training only on those risks model stagnation—a feedback loop of mediocrity. Moreover, even correct samples convey limited supervision: they teach citation format, not reasoning about evidence quality.
START resolves these twin challenges with a carefully designed two-stage process—an initial Synthetic Data Warm-Up followed by Iterative Self-Improvement.
Figure 2. START combines a synthetic data warm-up with iterative self-improvement. The model first learns from carefully generated synthetic examples, then progressively refines itself through rejection sampling and preference optimization.
Stage 1: Warming Up with Synthetic Data
To prevent the model from stagnating early, START first gives it “perfect” examples to learn from—without any human labeling. The researchers achieve this using reverse attribution: instead of generating answers from given documents, the model starts with an answer and generates documents that justify it.
The five-step data synthesis pipeline (shown in Figure 1) works as follows.
Figure 1. The synthetic data pipeline: starting from a query, the model generates a response, decomposes it into claims, creates supporting documents for those claims, and then relabels citations to produce a perfectly attributable answer.
Step 1 – Response Generation:
The LLM receives a query—for instance, “What is the difference between fresh water and potable water?”—and produces a detailed answer using its internal knowledge, without citations.
Step 2 – Claim Decomposition:
The model breaks this response into atomic claims, such as:
- Freshwater refers to water that is not salty or brackish.
- Freshwater may be unsuitable for drinking without treatment.
- Potable water is safe and suitable for human consumption.
Step 3 – Claim Combination:
To simulate a multi-source setting, atomic claims are randomly grouped, representing different mixtures of viewpoints and sources.
Step 4 – Document Generation:
For each claim set, the model generates short synthetic documents that cover those claims—creating evidence tailor-made to support them. To enhance robustness, a few irrelevant documents from other queries are added.
Step 5 – Attribution Relabeling:
Finally, the model revisits its original response and inserts citations linking each claim to the supporting document set.
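To make these five steps concrete, here is a minimal sketch of the reverse-attribution pipeline. It is an illustration rather than the paper's code: `llm` stands for any text-generation call, and the prompt wording, grouping sizes, and noise-document counts are invented for the example.

```python
import random

def synthesize_example(llm, query, other_queries, claims_per_doc=2, n_noise_docs=2):
    """Reverse attribution: answer first, then generate the evidence that supports it."""
    # Step 1: answer the query from parametric knowledge, with no citations.
    response = llm(f"Answer in detail: {query}")

    # Step 2: decompose the response into atomic claims (one per line).
    claims = [c for c in llm(f"List the atomic factual claims in:\n{response}").splitlines() if c.strip()]

    # Step 3: randomly group claims to mimic multiple sources and viewpoints.
    random.shuffle(claims)
    claim_sets = [claims[i:i + claims_per_doc] for i in range(0, len(claims), claims_per_doc)]

    # Step 4: generate one short document per claim set, plus distractor documents
    # borrowed from other queries to encourage robustness.
    documents = [llm("Write a short passage covering:\n" + "\n".join(cs)) for cs in claim_sets]
    documents += [llm(f"Write a short passage about: {q}") for q in random.sample(other_queries, n_noise_docs)]
    random.shuffle(documents)

    # Step 5: relabel the original response with citations into the supporting documents.
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    cited = llm("Insert citations [i] so every claim points to a supporting document.\n"
                f"Documents:\n{numbered}\nAnswer:\n{response}")
    return {"query": query, "documents": documents, "response": cited}
```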
The outcome is a perfectly matched synthetic dataset, with every citation guaranteed to be valid. This data is used for supervised fine-tuning under the Maximum Likelihood Estimation (MLE) objective:
\[ \mathcal{L} = -\sum_{i=1}^{N} \log P(y_i \mid q_i, \mathcal{D}_i; \theta) \]
Figure 3. Standard MLE objective used for warming up the model on the synthetic dataset.
This warm-up stage is vital—it gives the model a foundation for attribution and ensures it can produce usable outputs for further refinement.
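For readers who want the objective in code: the warm-up is ordinary supervised fine-tuning, i.e., next-token cross-entropy over the cited response given the query and documents. A minimal PyTorch sketch for a Hugging Face-style causal LM (masking the prompt tokens with -100, the usual convention):

```python
import torch

def warmup_loss(model, tokenizer, prompt, cited_response):
    """Negative log-likelihood of the cited response, conditioned on query + documents."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(cited_response, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    # Mask the prompt positions so the loss covers only the response tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100

    outputs = model(input_ids=input_ids, labels=labels)  # causal LMs shift labels internally
    return outputs.loss
```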
Stage 2: Iterative Self-Improvement
After warm-up, the model begins an iterative loop to teach itself increasingly fine-grained attribution skills. Each iteration has two phases:
Phase A – Rejection Sampling Fine-Tuning
Phase B – Fine-Grained Preference Optimization
Phase A: Rejection Sampling—Finding High-Quality Examples
For each query, the warmed-up model generates multiple candidate responses (e.g., 16). Each candidate is scored on three dimensions:
- Attributability: Whether each statement is fully supported by its cited documents. Measured by an NLI model determining entailment:
Figure 4. The Attributability score quantifies factual grounding between statements and cited documents.
- Robustness: How resistant the response is to irrelevant contexts.
\[ \text{RobustScore} = \frac{P_M(y \mid q\oplus d_r)}{P_M(y \mid q\oplus D)} \]
Figure 5. The Robustness score compares probabilities under relevant vs. full (noisy) document sets.
- Comprehensiveness: Whether the response covers all essential claims from the synthetic “gold” answer:
\[ \mathrm{CompreScore} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{Entail}(\mathrm{claim}_i, y) \]
Figure 6. The Comprehensiveness score measures coverage relative to all fundamental claims.
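Both the Attributability and Comprehensiveness scores rest on entailment checks. The sketch below shows one way such a check can be implemented with an off-the-shelf NLI model; the specific checkpoint (`roberta-large-mnli`) is an illustrative choice, not necessarily the verifier used in the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative NLI checkpoint; the paper's exact verifier may differ.
nli_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """True if the premise (e.g., cited documents) entails the hypothesis (a statement or claim)."""
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    # roberta-large-mnli label order: 0 = contradiction, 1 = neutral, 2 = entailment.
    return logits.argmax(dim=-1).item() == 2
```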
These metrics combine into a holistic reward:
\[ \text{Reward} = \mathbb{I}(\text{AttrScore}) \times \frac{\text{CompreScore}}{\text{RobustScore}} \]
Figure 7. The holistic reward enforces strict factuality while balancing coverage and robustness.
Only candidates with perfect Attributability (score = 1.0) qualify—others receive zero reward. The highest-rewarded response becomes a new example for supervised fine-tuning, effectively expanding the training data through rejection sampling.
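Putting the pieces together, Phase A reduces to sample, score, select. The sketch below assumes three helpers, `attr_score`, `compre_score`, and `robust_score`, that implement the formulas above (for instance, using an entailment check like `entails` for the first two and the policy's own sequence probabilities for the third); the helper names are illustrative, not from the paper.

```python
def holistic_reward(candidate, query, documents, gold_claims):
    """Reward = 1[AttrScore == 1.0] * CompreScore / RobustScore."""
    if attr_score(candidate, documents) < 1.0:            # strict factuality gate
        return 0.0
    compre = compre_score(candidate, gold_claims)          # coverage of essential claims
    robust = robust_score(candidate, query, documents)     # probability ratio from Figure 5
    return compre / robust

def rejection_sample(model, query, documents, gold_claims, n=16):
    """Keep the highest-reward candidate as a new fine-tuning example, if any passes the gate."""
    candidates = [model.generate(query, documents) for _ in range(n)]
    scored = [(holistic_reward(c, query, documents, gold_claims), c) for c in candidates]
    best_reward, best = max(scored, key=lambda pair: pair[0])
    return (query, documents, best) if best_reward > 0 else None
```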
Phase B: Learning from Mistakes—Fine-Grained Preference Optimization
Instead of discarding low-ranking samples, START recognizes their pedagogical value. The team constructs preference pairs—a “winner” (high-reward sample \(y^+\)) and a “loser” (low-reward sample \(y^-\))—and trains the model to prefer the better one using Direct Preference Optimization (DPO).
\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\!\left(\hat{r}_\theta(x,y^+) - \hat{r}_\theta(x,y^-)\right)\right] \]
\[ \hat{r}_\theta(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \]
Figure 8. Preference optimization refines the model by comparing responses with different attribution and coverage quality.
These pairs target specific weaknesses—such as high attributability but low comprehensiveness—to teach nuanced trade-offs that single-sample fine-tuning cannot. Each iteration of this two-phase loop yields a more capable and judicious model.
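In code, the DPO objective above is a logistic loss over the gap between implicit rewards. A minimal PyTorch sketch, assuming each `*_logp` argument is a tensor holding the summed log-probability of a response under either the trainable policy or the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for (winner, loser) pairs of attributed responses; inputs are log-prob tensors."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Push the policy to rank the preferred response above the dispreferred one.
    return -F.logsigmoid(reward_w - reward_l).mean()
```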
Does It Actually Work? The Results
The START framework was tested on three demanding question-answering datasets: ASQA, ELI5, and StrategyQA. The results speak for themselves.
Table 1. START achieves state-of-the-art citation quality across all datasets, outperforming systems trained on costly human or distilled data.
START surpasses baselines across all benchmarks. Compared to prior approaches—including in-context learning, post-hoc citation matching, and models trained on human or GPT-4 data—START improves average citation quality by 25.13%.
Perhaps most striking is how effectively it learns over time: its citation F1 on ASQA jumps from 23.5 after warm-up to 72.0 after a single self-improvement iteration, then keeps climbing through subsequent rounds. The model genuinely improves itself.
Why Both Stages Matter
Ablation studies show the warm-up and preference optimization stages are indispensable.
Figure 9. Removing either the initial warm-up or fine-grained preference optimization leads to significant performance drops.
Without warm-up (w/o warm-up), the model fails to reach early proficiency and stagnates. Its first-iteration pass rate for fully attributable outputs collapses to 3.24%, compared to 42.5% with warm-up.
Table 3. Warm-up dramatically increases pass rate, enabling richer supervision in early self-learning.
Removing preference optimization (w/o preference) similarly degrades performance, showing that learning from low-quality samples is critical to mastering the task.
Moreover, training longer on synthetic data alone yields limited benefit, as shown below: self-improvement provides stronger supervision than repetitive fine-tuning.
Figure 10. One iteration of self-improvement outperforms many epochs of training on static synthetic data.
Human evaluation reinforces these quantitative results: START produced fully supported citations in 76.2% of cases—exceeding even ChatGPT—and attained the top comprehensiveness score among all systems.
Broader Implications and Conclusion
LLMs are increasingly used in environments where accuracy and verifiability are paramount. START offers a robust, scalable path forward, enabling models to learn trustworthy attribution without external assistance.
Its success hinges on two elegant ideas:
- Solve cold-start through perfect synthetic data: Reverse attribution provides flawless initial supervision.
- Iteratively refine through self-generated signals: Rejection sampling and preference optimization transform the model’s own judgments into powerful teaching moments.
Beyond just citation generation, START exemplifies how LLMs can bootstrap complex reasoning skills. As AI systems become central to information retrieval and decision-making, frameworks like START will be indispensable for ensuring that what machines tell us is not just fluent—but verifiable.