Imagine you are shopping online for a new laptop. You scroll down to the reviews to gauge public opinion. There are 50 reviews: 25 praise the battery life, and 25 complain about the screen resolution. You don’t have time to read them all, so you ask an AI assistant to summarize them.

The AI returns a summary: “Users report that the screen resolution is disappointing and grainy.”

Technically, the AI didn’t lie—people did say that. However, by omitting the 25 positive reviews about the battery, the summary is fundamentally unfair. It misrepresents the collective opinion of the document set. When Large Language Models (LLMs) perform Multi-Document Summarization (MDS), this type of bias can significantly impact decision-making in e-commerce, political analysis, and media monitoring.

In this post, we delve into a recent research paper, “Improving Fairness of Large Language Models in Multi-document Summarization,” which proposes a novel training framework called FairPO. We will explore how this method uses preference optimization to force LLMs to pay attention to underrepresented viewpoints, ensuring that summaries reflect the true distribution of opinions.

The Problem: Fairness in Summarization

Multi-document summarization (MDS) aims to condense information from multiple sources into a single coherent text. These sources often contain a social attribute—a label representing the perspective, such as “positive sentiment” in reviews, or “left-wing ideology” in political tweets.

Fairness in this context is defined at two distinct levels:

  1. Summary-level Fairness: Does a specific summary accurately reflect the distribution of opinions in the specific input documents? (e.g., If the input is 50% positive and 50% negative, the summary should reflect that balance).
  2. Corpus-level Fairness: Does the summarization model systematically favor one group over another across all inputs? (e.g., Does the model always ignore “right-wing” stances regardless of the input?).

Recent studies have shown that modern LLMs struggle with both. While prompting strategies exist (e.g., telling the model “be fair”), they rely on the user knowing that a fairness issue exists beforehand. We need a method that bakes fairness directly into the model’s weights.

Background: Measuring Fairness

Before we can improve fairness, we must measure it. The researchers utilize two key metrics derived from the concept of coverage.

1. Estimating Coverage

First, we need to know if a specific document \(d_i\) is represented in a summary sentence \(s_j\). The researchers use an entailment model to calculate the probability \(p(d_i, s_j)\) that the document is “covered” by the sentence.

Equation 1: Calculating the maximum entailment probability between document chunks and summary sentences.

Here, the metric takes the maximum entailment score between chunks of the document and the summary sentence. The total coverage probability for a document \(d_i\) by the whole summary \(S\) is the average across all sentences in the summary:

Equation 2: Averaging coverage probability across all summary sentences.
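To make these two steps concrete, here is a minimal sketch in Python. The `entail_prob` helper is a stand-in for whatever entailment model is used; its name and signature are assumptions for illustration, not details from the paper:

```python
from typing import Callable, List

def coverage_of_document(
    doc_chunks: List[str],                      # chunks of document d_i
    summary_sentences: List[str],               # sentences s_1..s_n of summary S
    entail_prob: Callable[[str, str], float],   # assumed helper: P(premise entails hypothesis)
) -> float:
    """Coverage of one document by the whole summary (illustrative sketch).

    For each summary sentence, take the maximum entailment probability over
    the document's chunks (Equation 1), then average over all summary
    sentences (Equation 2).
    """
    per_sentence = [
        max(entail_prob(chunk, sentence) for chunk in doc_chunks)
        for sentence in summary_sentences
    ]
    return sum(per_sentence) / len(per_sentence)
```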

2. Equal Coverage (Summary-Level Fairness)

Equal Coverage (EC) measures fairness for a single document set. It compares how well the summary covers each individual document with the average coverage across all documents in the set.

Ideally, the coverage of documents with attribute \(k\) (e.g., positive reviews) should not deviate significantly from the average. A lower EC value means the summary is fairer.

Equation 3: Equal Coverage formula, calculating the average absolute coverage probability difference for each attribute.
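As a rough sketch, Equal Coverage for a single document set could then be computed as follows. This assumes EC averages the absolute deviations per attribute and then averages over attributes; the exact aggregation in the paper's Equation 3 may differ:

```python
from collections import defaultdict
from typing import List

def equal_coverage(coverages: List[float], attributes: List[str]) -> float:
    """Summary-level Equal Coverage for one document set (illustrative sketch).

    coverages[i]  -- coverage probability of document i by the summary (Equation 2)
    attributes[i] -- social attribute of document i (e.g., "positive")

    For each attribute, average the absolute deviation of its documents'
    coverage from the mean coverage of all documents, then average over
    attributes. Lower means fairer.
    """
    mean_cov = sum(coverages) / len(coverages)
    per_attr = defaultdict(list)
    for cov, attr in zip(coverages, attributes):
        per_attr[attr].append(abs(cov - mean_cov))
    attr_scores = [sum(devs) / len(devs) for devs in per_attr.values()]
    return sum(attr_scores) / len(attr_scores)
```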

3. Coverage Parity (Corpus-Level Fairness)

Coverage Parity (CP) looks at the “big picture.” It aggregates the coverage differences across the entire dataset. If a model consistently underrepresents a specific group (e.g., negative reviews are always ignored across thousands of products), the CP score will be high. A lower CP indicates a fairer system.

Equation 4: Coverage Parity formula, measuring systematic over or underrepresentation across the corpus.
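A corresponding corpus-level sketch is below. It pools signed deviations per attribute across all document sets, so a systematic skew toward or against one group does not average out; the exact aggregation in the paper's Equation 4 may differ:

```python
from collections import defaultdict
from typing import List, Tuple

def coverage_parity(corpus: List[Tuple[List[float], List[str]]]) -> float:
    """Corpus-level Coverage Parity (illustrative sketch).

    corpus -- one entry per document set: (coverages, attributes), where
              coverages[i] is the coverage of document i by that set's summary
              and attributes[i] is its social attribute.

    Signed deviations from each set's mean coverage are pooled per attribute
    over the whole corpus, so consistent over- or underrepresentation of an
    attribute shows up as a large magnitude. Lower is fairer.
    """
    signed = defaultdict(list)
    for coverages, attributes in corpus:
        mean_cov = sum(coverages) / len(coverages)
        for cov, attr in zip(coverages, attributes):
            signed[attr].append(cov - mean_cov)
    return sum(abs(sum(devs) / len(devs)) for devs in signed.values())
```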

Core Method: FairPO

The core contribution of this paper is FairPO (Fair Preference Optimization). It is a preference tuning method based on Direct Preference Optimization (DPO). Standard DPO trains models to prefer “better” summaries (usually based on human preference for fluency or safety). FairPO adapts this to train models to prefer “fairer” summaries.

The method consists of two phases: Perturbation-based Preference Pair Generation (to fix summary-level fairness) and Fairness-aware Preference Tuning (to fix corpus-level fairness).

Phase 1: Generating Preference Pairs via Perturbation

To train a model using preference optimization, you need pairs of data: a Chosen summary (\(S_c\)) and a Rejected summary (\(S_r\)). The chosen summary should be the one that handles diversity well, while the rejected one fails to do so.

Ideally, these summaries should differ significantly in how they represent social attributes. The authors achieve this by perturbing the input document sets.

  1. Identify Imbalance: For a document set \(D\), the system generates an initial summary and identifies which social attribute is overrepresented (\(k^+\)) and which is underrepresented (\(k^-\)).
  2. Perturb: It creates two modified document sets:
  • One where a small percentage (\(\alpha\%\)) of the overrepresented documents are removed.
  • One where a small percentage of the underrepresented documents are removed.
  3. Generate & Select: New summaries are generated for these perturbed sets.
  • The summary with the lowest Equal Coverage (EC) value (the fairest one) becomes the Chosen Summary (\(S_c\)).
  • The summary with the highest EC value (the most unfair one) becomes the Rejected Summary (\(S_r\)).

This process automatically creates training data where the model can clearly see the difference between a fair summary and a biased one.
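A compact sketch of this pipeline is shown below. The `generate_summary` and `ec_score` helpers are hypothetical stand-ins for the summarization model and the Equal Coverage metric; how documents are sampled for removal and which set the EC scores are computed against are simplifications, not details taken from the paper:

```python
import random
from typing import Callable, List, Tuple

def make_preference_pair(
    docs: List[str],
    attrs: List[str],
    overrep: str,                 # overrepresented attribute k+
    underrep: str,                # underrepresented attribute k-
    alpha: float,                 # fraction of documents to remove, e.g. 0.1
    generate_summary: Callable[[List[str]], str],            # assumed LLM wrapper
    ec_score: Callable[[str, List[str], List[str]], float],  # assumed EC metric
) -> Tuple[str, str]:
    """Build a (chosen, rejected) summary pair by perturbing the input set."""

    def drop_some(target_attr: str) -> List[str]:
        idx = [i for i, a in enumerate(attrs) if a == target_attr]
        n_drop = max(1, int(alpha * len(idx))) if idx else 0
        dropped = set(random.sample(idx, n_drop))
        return [d for i, d in enumerate(docs) if i not in dropped]

    candidates = [
        generate_summary(drop_some(overrep)),    # fewer overrepresented docs
        generate_summary(drop_some(underrep)),   # fewer underrepresented docs
    ]
    # Rank candidates by their Equal Coverage on the original document set.
    ranked = sorted(candidates, key=lambda s: ec_score(s, docs, attrs))
    return ranked[0], ranked[-1]   # chosen = lowest EC, rejected = highest EC
```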

Phase 2: Fairness-Aware Preference Tuning

Standard DPO treats every preference pair equally. However, in fairness tasks, some errors are worse than others. If the model already tends to ignore negative reviews (Corpus-level bias), then a training example where the rejected summary ignores negative reviews is more important to correct than one where it ignores positive reviews.

FairPO modifies the DPO objective to include dynamic weights (\(w_c\) and \(w_r\)) that adjust based on the model’s history of bias.

The FairPO objective function is:

Equation 5: The FairPO objective function with separate weights for chosen and rejected summaries.

Here, \(\pi_\theta\) is the model being trained and \(\pi_{ref}\) is the reference model (the original LLM). The term \(m\) is the standard reward margin from DPO, defined as:

Equation 6: The standard DPO reward margin calculation.
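Since this margin is the standard DPO quantity, it can be written out explicitly in the notation above, with \(\beta\) the usual DPO temperature hyperparameter and \(D\) the input document set:

\[
m = \beta \left( \log \frac{\pi_\theta(S_c \mid D)}{\pi_{ref}(S_c \mid D)} - \log \frac{\pi_\theta(S_r \mid D)}{\pi_{ref}(S_r \mid D)} \right)
\]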

The key innovation here is the introduction of \(w_c\) and \(w_r\).

Calculating the Dynamic Weights

How does the model know which weights to assign? It tracks its own performance during training.

  1. Tracking Bias: At every step, the system calculates whether specific attributes (like “positive sentiment”) are being overrepresented or underrepresented across the batch. It maintains a score for Overrepresentation \(O(k)\) and Underrepresentation \(U(k)\) for each attribute \(k\).

Equation 8: Formula for estimating the overrepresentation of a social attribute k.

  2. Assigning Weights:
  • If a summary helps balance the corpus (e.g., it gives more coverage to an attribute the model usually ignores), it gets a higher weight.
  • If a summary worsens the imbalance, it is penalized more heavily.

The weight for the chosen summary (\(w_{c,k}\)) is calculated as:

Equation 9: Weight calculation for the chosen summary based on over/underrepresentation ratios.

Similarly, the weight for the rejected summary (\(w_{r,k}\)) is:

Equation 10: Weight calculation for the rejected summary.

Notice the ratio \(O(k)/U(k)\). If attribute \(k\) is heavily overrepresented (\(O > U\)), the weight adjusts to de-prioritize summaries that continue to overrepresent it and to prioritize summaries that give voice to the underrepresented attributes.
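As a toy illustration of this ratio-based logic (not the paper's exact Equations 9 and 10), the weights for an attribute \(k\) could be derived from the tracked scores roughly like this:

```python
def fairness_weights(o_k: float, u_k: float, eps: float = 1e-8):
    """Toy dynamic weights for attribute k (not the paper's exact formulas).

    o_k -- running overrepresentation score O(k) tracked during training
    u_k -- running underrepresentation score U(k)

    If k has been overrepresented so far (O(k) > U(k)), the ratio r > 1:
    chosen summaries that keep favoring k count less, and rejected summaries
    that favor k are penalized more heavily.
    """
    r = (o_k + eps) / (u_k + eps)
    w_c = 1.0 / r   # de-prioritize continued overrepresentation of k
    w_r = r         # punish harder when the rejected summary overrepresents k
    return w_c, w_r
```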

The Mathematical Intuition

Why design the objective this way? The authors provide a derivation showing that FairPO’s derivative maintains the stability of DPO while injecting fairness constraints.

The derivative of standard DPO looks like this:

Equation 14: Derivative of the standard DPO objective.
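Written out in the notation above, the standard DPO gradient (as derived in the original DPO paper) is:

\[
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta \, \mathbb{E}\left[ \sigma(-m) \left( \nabla_\theta \log \pi_\theta(S_c \mid D) - \nabla_\theta \log \pi_\theta(S_r \mid D) \right) \right]
\]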

The derivative of FairPO looks like this:

Equation 16: Derivative of the FairPO objective showing the weighted gradients.
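Going by the description in the next paragraph (per-summary weights applied to the gradient terms while the \(\sigma(-m)\) factor is kept intact), the FairPO gradient plausibly takes a form like the following; this is a reconstruction from the prose, not necessarily the paper's exact Equation 16:

\[
\nabla_\theta \mathcal{L}_{\mathrm{FairPO}} = -\beta \, \mathbb{E}\left[ \sigma(-m) \left( w_c \, \nabla_\theta \log \pi_\theta(S_c \mid D) - w_r \, \nabla_\theta \log \pi_\theta(S_r \mid D) \right) \right]
\]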

Crucially, FairPO preserves the scaling factor \(\sigma(-m)\), which helps the model focus on “difficult” examples (where the model is unsure which summary is better). If one were to simply stick weights into the log probability terms of the standard DPO equation (a naive approach), it would distort the reward margin \(m\), making the training less stable.

The naive approach (Weighted DPO) looks like this:

Equation 17: A naive weighted DPO objective that distorts the margin.

This naive approach results in a weighted margin \(m'\):

Equation 19: The distorted weighted margin \(m'\) resulting from the naive approach.
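One plausible form of this weighted margin, reconstructed from the description rather than copied from the paper, is:

\[
m' = \beta \left( w_c \log \frac{\pi_\theta(S_c \mid D)}{\pi_{ref}(S_c \mid D)} - w_r \log \frac{\pi_\theta(S_r \mid D)}{\pi_{ref}(S_r \mid D)} \right)
\]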

Because the terms inside the margin are weighted differently, \(m'\) stops being a clean measure of the model’s ability to distinguish chosen vs. rejected summaries. FairPO avoids this by applying weights outside the log-ratio structure in the gradient, preserving the effectiveness of the preference optimization.

Experiments and Results

The researchers tested FairPO on three diverse datasets:

  1. Amazon: Product reviews (Attributes: Negative, Neutral, Positive sentiment).
  2. MITweet: Tweets about various topics (Attributes: Left, Center, Right ideology).
  3. SemEval: Tweets regarding stance on targets like “Climate Change” (Attributes: Support, Against).

They applied FairPO to three popular LLMs: Llama3.1, Mistral, and Gemma2.

Comparison with Baselines

They compared FairPO against:

  • Original Models: The base LLMs without tuning.
  • DPO: Standard preference optimization without fairness weights.
  • OPTune: Another preference tuning method.
  • Prompting: Explicitly asking the model to be fair.
  • Policy Gradient: A Reinforcement Learning approach.

The results, shown in Table 2, are compelling.

Table 2: Experimental results comparing FairPO to baselines across datasets and models.

Key Takeaways from the Results:

  1. FairPO dominates: It achieves the lowest (best) scores for both Equal Coverage (EC) and Coverage Parity (CP) across almost all models and datasets.
  2. Corpus-level improvement: Look at the CP scores for Llama3.1 on the Amazon dataset. The base model has a score of 1.89. FairPO reduces this to 0.42. This indicates that FairPO drastically reduces systematic bias across the entire dataset.
  3. DPO is not enough: While standard DPO improves fairness slightly over the base model, FairPO significantly outperforms it, proving that the specific fairness-aware weighting mechanism is necessary.

Ablation Study

Is the complex weighting really necessary? Or is the perturbation doing all the work? The authors ran ablation studies to find out.

Table 3: Ablation study results showing the impact of removing perturbation or fairness-aware weighting.

  • w/o pert. (Without Perturbation): Using random summaries instead of perturbed ones hurts performance. The perturbation is essential for creating high-quality “fair vs. unfair” pairs.
  • w/o fair. (Without Fairness Tuning): Using standard DPO weights increases the CP (Coverage Parity) error. This confirms that the dynamic weighting (\(w_c, w_r\)) is critical for fixing systematic bias.

Does Fairness Hurt Quality?

A common fear in AI ethics is the “alignment tax”—the idea that making a model safer or fairer makes it less capable. The authors evaluated the summaries for Fluency, Relevance, and Factuality using Prometheus 2 (an evaluator LLM).

Table 4: Summary quality comparison showing FairPO maintains or improves quality compared to baselines.

The results (Table 4) show pairwise comparisons. A positive number means the tuned model beat the original model.

  • FairPO maintains or even improves quality across the board.
  • Prompting, by contrast, sharply degraded quality (negative scores), likely because the long, complex prompts confused the models or forced them into unnatural phrasing.

Qualitative Analysis

Finally, let’s look at an actual example of the summaries produced.

Figure 3: Sample summaries generated by DPO and FairPO on Amazon reviews.

In the Mistral example (middle column), the standard DPO summary mentions “generally positive reviews” and lists pros, with a small caveat about battery life.

The FairPO summary, however, explicitly leads with “This Toshiba tablet receives mixed reviews.” It gives equal weight to the praise (lightweight, fast) and the concerns (battery life, damaged products). This nuance—acknowledging the conflict rather than smoothing it over—is exactly what makes the summary fair.

Conclusion

Multi-document summarization is a powerful tool, but it carries the risk of silencing minority opinions or amplifying majority biases. This paper demonstrates that we don’t have to accept this trade-off.

FairPO introduces a robust framework for aligning LLMs with fairness objectives. By combining perturbation-based data generation (to show the model what a fair summary looks like) with fairness-aware preference tuning (to weight training examples by how much they correct corpus-level imbalance), FairPO significantly reduces bias. Importantly, it achieves this without degrading the coherence or factuality of the summaries.

For students and researchers in NLP, FairPO illustrates a broader lesson: standard optimization objectives (like DPO) can be mathematically adapted to solve specific ethical constraints, provided we can formulate those constraints (like Coverage Parity) into differentiable or weight-adjustable signals.

This blog post explains the research presented in “Improving Fairness of Large Language Models in Multi-document Summarization” by Haoyuan Li, Rui Zhang, and Snigdha Chaturvedi.