Imagine you are shopping online for a new laptop. You scroll down to the reviews to gauge public opinion. There are 50 reviews: 25 praise the battery life, and 25 complain about the screen resolution. You don’t have time to read them all, so you ask an AI assistant to summarize them.
The AI returns a summary: “Users report that the screen resolution is disappointing and grainy.”
Technically, the AI didn’t lie—people did say that. However, by omitting the 25 positive reviews about the battery, the summary is fundamentally unfair. It misrepresents the collective opinion of the document set. When Large Language Models (LLMs) perform Multi-Document Summarization (MDS), this type of bias can significantly impact decision-making in e-commerce, political analysis, and media monitoring.
In this post, we delve into a recent research paper, “Improving Fairness of Large Language Models in Multi-document Summarization,” which proposes a novel training framework called FairPO. We will explore how this method uses preference optimization to force LLMs to pay attention to underrepresented viewpoints, ensuring that summaries reflect the true distribution of opinions.
The Problem: Fairness in Summarization
Multi-document summarization (MDS) aims to condense information from multiple sources into a single coherent text. These sources often contain a social attribute—a label representing the perspective, such as “positive sentiment” in reviews, or “left-wing ideology” in political tweets.
Fairness in this context is defined at two distinct levels:
- Summary-level Fairness: Does a specific summary accurately reflect the distribution of opinions in the specific input documents? (e.g., If the input is 50% positive and 50% negative, the summary should reflect that balance).
- Corpus-level Fairness: Does the summarization model systematically favor one group over another across all inputs? (e.g., Does the model always ignore “right-wing” stances regardless of the input?).
Recent studies have shown that modern LLMs struggle with both. While prompting strategies exist (e.g., telling the model “be fair”), they rely on the user knowing that a fairness issue exists beforehand. We need a method that bakes fairness directly into the model’s weights.
Background: Measuring Fairness
Before we can improve fairness, we must measure it. The researchers utilize two key metrics derived from the concept of coverage.
1. Estimating Coverage
First, we need to know if a specific document \(d_i\) is represented in a summary sentence \(s_j\). The researchers use an entailment model to calculate the probability \(p(d_i, s_j)\) that the document is “covered” by the sentence.

The metric takes the maximum entailment score between chunks of the document and the summary sentence, and the total coverage probability for a document \(d_i\) by the whole summary \(S\) is the average of these scores across all sentences in the summary.
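Reconstructed from that description (the paper's exact notation may differ slightly), these two quantities can be written as:

\[
p(d_i, s_j) = \max_{c \,\in\, \mathrm{chunks}(d_i)} \mathrm{entail}(c, s_j),
\qquad
p(d_i, S) = \frac{1}{|S|} \sum_{s_j \in S} p(d_i, s_j).
\]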

2. Equal Coverage (Summary-Level Fairness)
Equal Coverage (EC) measures fairness for a single document set. It compares how well the summary covers each specific document against the average coverage across all documents in the set.
Ideally, the coverage of documents with attribute \(k\) (e.g., positive reviews) should not deviate significantly from the average. A lower EC value means the summary is fairer.

3. Coverage Parity (Corpus-Level Fairness)
Coverage Parity (CP) looks at the “big picture.” It aggregates the coverage differences across the entire dataset. If a model consistently underrepresents a specific group (e.g., negative reviews are always ignored across thousands of products), the CP score will be high. A lower CP indicates a fairer system.

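To make the distinction concrete, here is a minimal Python sketch of how metrics in this spirit could be computed from per-document coverage scores. The function names and the exact aggregation (mean absolute deviation of group coverage from the overall mean) are simplifications for exposition, not the paper's precise definitions.

```python
from collections import defaultdict
from statistics import mean

def equal_coverage(coverages, attributes):
    """Summary-level sketch: how far each attribute group's average coverage
    deviates from the overall average for ONE document set (lower = fairer).
    coverages[i] is p(d_i, S); attributes[i] is the social attribute of d_i."""
    overall = mean(coverages)
    by_attr = defaultdict(list)
    for cov, attr in zip(coverages, attributes):
        by_attr[attr].append(cov)
    return mean(abs(mean(group) - overall) for group in by_attr.values())

def coverage_parity(per_set_deviations):
    """Corpus-level sketch: accumulate signed deviations per attribute across
    many document sets, then measure how lopsided the totals are (lower = fairer).
    per_set_deviations is a list of dicts {attribute: group_mean - overall_mean}."""
    totals = defaultdict(float)
    for deviations in per_set_deviations:
        for attr, dev in deviations.items():
            totals[attr] += dev
    return mean(abs(total) for total in totals.values())
```

The structural difference is the point: EC takes absolute deviations within a single document set, while CP lets deviations cancel within a set but accumulate across the corpus, so only systematic, one-sided bias survives the aggregation.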
Core Method: FairPO
The core contribution of this paper is FairPO (Fair Preference Optimization). It is a preference tuning method based on Direct Preference Optimization (DPO). Standard DPO trains models to prefer “better” summaries (usually based on human preference for fluency or safety). FairPO adapts this to train models to prefer “fairer” summaries.
The method consists of two phases: Perturbation-based Preference Pair Generation (to fix summary-level fairness) and Fairness-aware Preference Tuning (to fix corpus-level fairness).
Phase 1: Generating Preference Pairs via Perturbation
To train a model using preference optimization, you need pairs of data: a Chosen summary (\(S_c\)) and a Rejected summary (\(S_r\)). The chosen summary should be the one that handles diversity well, while the rejected one fails to do so.
Ideally, these summaries should differ significantly in how they represent social attributes. The authors achieve this by perturbing the input document sets.
- Identify Imbalance: For a document set \(D\), the system generates an initial summary and identifies which social attribute is overrepresented (\(k^+\)) and which is underrepresented (\(k^-\)).
- Perturb: It creates two modified document sets:
  - One where a small percentage (\(\alpha\%\)) of the overrepresented documents are removed.
  - One where a small percentage of the underrepresented documents are removed.
- Generate & Select: New summaries are generated for these perturbed sets.
  - The summary with the lowest Equal Coverage (EC) value (the fairest one) becomes the Chosen Summary (\(S_c\)).
  - The summary with the highest EC value (the most unfair one) becomes the Rejected Summary (\(S_r\)).
This process automatically creates training data where the model can clearly see the difference between a fair summary and a biased one.
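A simplified sketch of this pipeline is shown below. The helper callables `summarize` and `equal_coverage`, the choice to score fairness against the original (unperturbed) document set, and the function name itself are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

def build_preference_pair(
    documents: List[str],
    attributes: List[str],
    summarize: Callable[[List[str]], str],
    equal_coverage: Callable[[str, List[str], List[str]], float],
    over_attr: str,
    under_attr: str,
    alpha: float = 0.1,
) -> Tuple[str, str]:
    """Perturbation-based preference pair generation (sketch).
    over_attr / under_attr are the attributes found to be over- and
    underrepresented in the initial summary of `documents`.
    Returns (chosen_summary, rejected_summary)."""
    candidates = []
    for target in (over_attr, under_attr):
        # Drop a small fraction (alpha) of the documents carrying the target attribute.
        target_idx = [i for i, a in enumerate(attributes) if a == target]
        dropped = set(random.sample(target_idx, max(1, int(alpha * len(target_idx)))))
        perturbed = [d for i, d in enumerate(documents) if i not in dropped]

        summary = summarize(perturbed)
        # Assumption: fairness is scored against the ORIGINAL document set,
        # so both candidate summaries are judged on the same footing.
        candidates.append((equal_coverage(summary, documents, attributes), summary))

    candidates.sort(key=lambda pair: pair[0])  # lower EC = fairer
    chosen, rejected = candidates[0][1], candidates[-1][1]
    return chosen, rejected
```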
Phase 2: Fairness-Aware Preference Tuning
Standard DPO treats every preference pair equally. However, in fairness tasks, some errors are worse than others. If the model already tends to ignore negative reviews (Corpus-level bias), then a training example where the rejected summary ignores negative reviews is more important to correct than one where it ignores positive reviews.
FairPO modifies the DPO objective to include dynamic weights (\(w_c\) and \(w_r\)) that adjust based on the model’s history of bias.
The FairPO objective function is:

Here, \(\pi_\theta\) is the model being trained and \(\pi_{ref}\) is the reference model (the original LLM). The term \(m\) is the standard reward margin from DPO, defined as:

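In the usual DPO notation (with the scaling factor \(\beta\) folded into the margin), this is:

\[
m = \beta \left( \log \frac{\pi_\theta(S_c \mid D)}{\pi_{\mathrm{ref}}(S_c \mid D)} - \log \frac{\pi_\theta(S_r \mid D)}{\pi_{\mathrm{ref}}(S_r \mid D)} \right).
\]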
The key innovation here is the introduction of \(w_c\) and \(w_r\).
Calculating the Dynamic Weights
How does the model know which weights to assign? It tracks its own performance during training.
- Tracking Bias: At every step, the system calculates whether specific attributes (like “positive sentiment”) are being overrepresented or underrepresented across the batch. It maintains a score for Overrepresentation \(O(k)\) and Underrepresentation \(U(k)\) for each attribute \(k\).

- Assigning Weights:
  - If a summary helps balance the corpus (e.g., it overrepresents an attribute that the model usually ignores), it gets a higher weight.
  - If a summary worsens the imbalance, it is penalized more heavily.
The weight for the chosen summary (\(w_{c,k}\)) is calculated as:

Similarly, the weight for the rejected summary (\(w_{r,k}\)) is:

Notice the ratio \(O(k)/U(k)\). If attribute \(k\) is heavily overrepresented (\(O > U\)), the weights adjust to de-prioritize summaries that keep overrepresenting it and to prioritize summaries that give voice to the underrepresented attributes.
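To make the bookkeeping concrete, here is a small sketch of what tracking \(O(k)\), \(U(k)\) and turning their ratio into weights could look like. The accumulation rule and the specific weight formulas below are illustrative guesses consistent with the behavior described above, not the paper's equations.

```python
from collections import defaultdict

class BiasTracker:
    """Illustrative bookkeeping of corpus-level over-/underrepresentation."""

    def __init__(self):
        self.over = defaultdict(lambda: 1e-6)   # O(k): accumulated over-coverage
        self.under = defaultdict(lambda: 1e-6)  # U(k): accumulated under-coverage

    def update(self, attribute: str, coverage_deviation: float) -> None:
        # Positive deviation: the attribute was covered more than average in this batch.
        if coverage_deviation > 0:
            self.over[attribute] += coverage_deviation
        else:
            self.under[attribute] += -coverage_deviation

    def chosen_weight(self, boosted_attr: str) -> float:
        """Higher when the chosen summary boosts an attribute the model
        usually under-covers (U large relative to O)."""
        o, u = self.over[boosted_attr], self.under[boosted_attr]
        return u / (o + u)

    def rejected_weight(self, boosted_attr: str) -> float:
        """Higher when the rejected summary boosts an attribute the model
        already over-covers, so that mistake is penalized more heavily."""
        o, u = self.over[boosted_attr], self.under[boosted_attr]
        return o / (o + u)
```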
The Mathematical Intuition
Why design the objective this way? The authors provide a derivation showing that FairPO's gradient preserves the stability of DPO while injecting fairness constraints.
The gradient of standard DPO looks like this:

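Up to the expectation over training pairs, and in the notation above, it can be written as:

\[
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \sigma(-m)\, \Big[ \nabla_\theta \log \pi_\theta(S_c \mid D) - \nabla_\theta \log \pi_\theta(S_r \mid D) \Big],
\]

where \(\sigma\) is the sigmoid function.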
The gradient of FairPO looks like this:

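Reconstructed from the paper's description, it keeps the same structure but scales the two log-probability terms by the dynamic weights:

\[
\nabla_\theta \mathcal{L}_{\mathrm{FairPO}} = -\beta\, \sigma(-m)\, \Big[ w_c\, \nabla_\theta \log \pi_\theta(S_c \mid D) - w_r\, \nabla_\theta \log \pi_\theta(S_r \mid D) \Big].
\]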
Crucially, FairPO preserves the scaling factor \(\sigma(-m)\), which helps the model focus on “difficult” examples (where the model is unsure which summary is better). If one were to simply stick weights into the log probability terms of the standard DPO equation (a naive approach), it would distort the reward margin \(m\), making the training less stable.
Naive approach (Weighted DPO) equation:

This naive approach results in a weighted margin \(m'\):

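Reconstructed from that description, the naive variant puts the weights directly inside the log-ratio terms, giving a loss of the form \(-\log \sigma(m')\) with

\[
m' = \beta \left( w_c \log \frac{\pi_\theta(S_c \mid D)}{\pi_{\mathrm{ref}}(S_c \mid D)} - w_r \log \frac{\pi_\theta(S_r \mid D)}{\pi_{\mathrm{ref}}(S_r \mid D)} \right).
\]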
Because the terms inside the margin are weighted differently, \(m'\) stops being a clean measure of the model’s ability to distinguish chosen vs. rejected summaries. FairPO avoids this by applying weights outside the log-ratio structure in the gradient, preserving the effectiveness of the preference optimization.
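One way to obtain a gradient of this shape in code is to detach the \(\sigma(-m)\) scaling factor and weight the two log-ratio terms separately. This is a minimal PyTorch-style sketch under the assumptions above, not the authors' implementation:

```python
import torch

def fairpo_style_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      w_c, w_r, beta=0.1):
    """Fairness-weighted preference loss whose gradient is
    -beta * sigmoid(-m) * [w_c * grad logpi(S_c) - w_r * grad logpi(S_r)].

    logp_* are summed token log-probabilities under the policy; ref_logp_*
    come from the frozen reference model (no gradients flow through them).
    """
    # Standard (unweighted) DPO reward margin m.
    log_ratio_chosen = logp_chosen - ref_logp_chosen
    log_ratio_rejected = logp_rejected - ref_logp_rejected
    margin = beta * (log_ratio_chosen - log_ratio_rejected)

    # Detach the sigma(-m) factor so it acts as a per-example coefficient,
    # exactly as in the DPO gradient, while w_c and w_r rescale the chosen /
    # rejected terms outside the margin.
    scale = torch.sigmoid(-margin).detach()
    weighted = w_c * beta * log_ratio_chosen - w_r * beta * log_ratio_rejected
    return -(scale * weighted).mean()
```

Because the scaling factor is detached, the margin \(m\) itself stays unweighted, so the "difficulty" signal \(\sigma(-m)\) is exactly the one from standard DPO.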
Experiments and Results
The researchers tested FairPO on three diverse datasets:
- Amazon: Product reviews (Attributes: Negative, Neutral, Positive sentiment).
- MITweet: Tweets about various topics (Attributes: Left, Center, Right ideology).
- SemEval: Tweets regarding stance on targets like “Climate Change” (Attributes: Support, Against).
They applied FairPO to three popular LLMs: Llama3.1, Mistral, and Gemma2.
Comparison with Baselines
They compared FairPO against:
- Original Models: The base LLMs without tuning.
- DPO: Standard preference optimization without fairness weights.
- OPTune: Another preference tuning method.
- Prompting: Explicitly asking the model to be fair.
- Policy Gradient: A Reinforcement Learning approach.
The results, shown in Table 2, are compelling.

Key Takeaways from the Results:
- FairPO dominates: It achieves the lowest (best) scores for both Equal Coverage (EC) and Coverage Parity (CP) across almost all models and datasets.
- Corpus-level improvement: Look at the CP scores for Llama3.1 on the Amazon dataset. The base model has a score of 1.89. FairPO reduces this to 0.42. This indicates that FairPO drastically reduces systematic bias across the entire dataset.
- DPO is not enough: While standard DPO improves fairness slightly over the base model, FairPO significantly outperforms it, proving that the specific fairness-aware weighting mechanism is necessary.
Ablation Study
Is the complex weighting really necessary? Or is the perturbation doing all the work? The authors ran ablation studies to find out.

- w/o pert. (Without Perturbation): Using random summaries instead of perturbed ones hurts performance. The perturbation is essential for creating high-quality “fair vs. unfair” pairs.
- w/o fair. (Without Fairness Tuning): Using standard DPO weights increases the CP (Coverage Parity) error. This confirms that the dynamic weighting (\(w_c, w_r\)) is critical for fixing systematic bias.
Does Fairness Hurt Quality?
A common fear in AI ethics is the “alignment tax”—the idea that making a model safer or fairer makes it less capable. The authors evaluated the summaries for Fluency, Relevance, and Factuality using Prometheus 2 (an evaluator LLM).

The results (Table 4) show pairwise comparisons. A positive number means the tuned model beat the original model.
- FairPO maintains or even improves quality across the board.
- Prompting, by contrast, devastated the quality (negative scores), likely because the long, complex prompts confused the models or forced them into unnatural phrasing.
Qualitative Analysis
Finally, let’s look at an actual example of the summaries produced.

In the Mistral example (middle column), the standard DPO summary mentions “generally positive reviews” and lists pros, with a small caveat about battery life.
The FairPO summary, however, explicitly leads with “This Toshiba tablet receives mixed reviews.” It gives equal weight to the praise (lightweight, fast) and the concerns (battery life, damaged products). This nuance—acknowledging the conflict rather than smoothing it over—is exactly what makes the summary fair.
Conclusion
Multi-document summarization is a powerful tool, but it carries the risk of silencing minority opinions or amplifying majority biases. This paper demonstrates that we don’t have to accept this trade-off.
FairPO introduces a robust framework for aligning LLMs with fairness objectives. By combining perturbation-based data generation (to teach the model what fairness looks like) with fairness-aware preference tuning (to teach the model which fairness errors matter most for corpus-level balance), FairPO significantly reduces bias. Importantly, it achieves this without degrading the coherence or factuality of the summaries.
For students and researchers in NLP, FairPO illustrates a broader lesson: standard optimization objectives (like DPO) can be mathematically adapted to solve specific ethical constraints, provided we can formulate those constraints (like Coverage Parity) into differentiable or weight-adjustable signals.
This blog post explains the research presented in “Improving Fairness of Large Language Models in Multi-document Summarization” by Haoyuan Li, Rui Zhang, and Snigdha Chaturvedi.