Introduction

Imagine you are handed a 50-page transcript of a corporate meeting and asked a single question: “Why did the marketing team disagree with the engineering team about the budget?”

To answer this, you wouldn’t summarize the entire meeting. You wouldn’t care about the opening pleasantries, the coffee break discussions, or the unrelated IT updates. You would scan the document, identify the specific segments where marketing and engineering discussed finances, rank them by importance, and then synthesize an answer.

This is the core challenge of Query-Focused Summarization (QFS). Unlike generic summarization, which aims to capture the “gist” of a document, QFS requires the model to act like a search engine and a writer simultaneously. It must retrieve relevant information and weave it into a coherent narrative.

However, current state-of-the-art models often struggle with the “retrieval” part of this equation, especially when dealing with long documents. They might process the text, but they treat all input segments as roughly equal, failing to explicitly prioritize the most critical sections.

In this post, we will dive deep into a research paper titled “Learning to Rank Salient Content for Query-focused Summarization” by Sajad Sotudeh and Nazli Goharian. The researchers propose a novel architecture that teaches a summarization model not just to write, but to rank. By integrating a “Learning-to-Rank” (LTR) mechanism directly into the summarization architecture, they achieve state-of-the-art results.

We will unpack how they adapted the Transformer architecture, how they trained the model to judge “importance,” and what this means for the future of processing long documents.


Background: The Challenge of Long Inputs

Before understanding the solution, we must understand the specific constraints of the problem.

The Context Window Problem

Modern Natural Language Processing (NLP) is dominated by Transformer models (like BERT, BART, and GPT). While powerful, these models have a limitation: the context window. A standard model can only process a certain number of tokens (words or sub-words) at once—often 512 or 1024 tokens.

Real-world documents, such as meeting transcripts, books, or legal files, are much longer than this. The QMSum dataset used in this paper, for example, has an average length of 9,000 tokens. You cannot simply feed this into a standard summarizer; it won’t fit.

The “Segment Encoding” Solution

To handle long documents, researchers typically use a “divide and conquer” strategy known as Segment Encoding.

  1. Segmentation: The long document is chopped into smaller, fixed-length chunks (e.g., 512 tokens).
  2. Encoding: Each chunk is passed through the encoder separately.
  3. Decoding: The decoder (the part of the model that writes the summary) looks at the encoded representations of all these chunks simultaneously to generate the output.

This architecture, often referred to as SegEnc, is the backbone of many current systems. However, SegEnc has a weakness: it is implicit. It hopes the decoder figures out which chunks are important via attention mechanisms, but it doesn’t explicitly teach the model to distinguish between a “critical” chunk and a “useless” chunk.
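To make this divide-and-conquer pipeline concrete, here is a minimal sketch of the segmentation step in Python. It is an illustration only: the tokenizer is faked with whitespace splitting, and the function names are made up rather than taken from any SegEnc implementation.

```python
from typing import List

def segment_document(tokens: List[str], segment_len: int = 512) -> List[List[str]]:
    """Split a long token sequence into fixed-length segments.

    A 9,000-token transcript with segment_len=512 yields ceil(9000/512) = 18
    segments, each short enough to fit a standard encoder's context window.
    """
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

# Stand-in for a real subword tokenizer: whitespace splitting.
transcript = ("Project manager: let's review the remote control budget. " * 400).split()
segments = segment_document(transcript, segment_len=512)
print(len(transcript), "tokens ->", len(segments), "segments")
```

Each of these segments is then encoded independently, and the decoder attends over all of the encoded segments when writing the summary.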

This is where the authors of this paper step in. They argue that we need to explicitly force the model to learn the relative importance of these segments.


Core Method: LTR-Assisted Summarization

The researchers propose a system called LTRSum (Learning-to-Rank Summarization). The intuition is simple: if the model knows which segments are the most important (Ranking), it will do a better job of attending to them when writing the summary (Generation).

The Architecture

The beauty of LTRSum lies in its efficiency. Rather than building two separate massive models—one for ranking and one for summarizing—the authors use a Multi-Task Learning approach with shared parameters.

Figure 1: Overview of the LTRSum system showing the shared decoder architecture.


As shown in Figure 1 above, the architecture processes the input as follows:

  1. Input: The document is split into segments. The query is prepended to each segment (so the model always knows what it’s looking for).
  2. Encoder: A standard Transformer encoder processes these segments.
  3. Shared Decoder: This is the innovation. The system uses a single decoder to perform two distinct tasks:
  • Task A (Summarization): The decoder generates the text of the summary token by token.
  • Task B (Learning-to-Rank): The decoder outputs a score for each segment indicating its relevance to the query.

In the diagram, you see two “Decoder” blocks, but the dashed line labeled “Shared” indicates that these are actually the same neural network weights. The model performs two forward passes: one to generate text and one to calculate rankings.
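Here is a rough PyTorch sketch of that parameter-sharing idea. It is not the authors' implementation: the layer sizes are arbitrary, and feeding the query tokens into the decoder for the ranking pass is one plausible wiring rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SharedDecoderSketch(nn.Module):
    """Toy model: one decoder, two heads (summary generation + segment ranking)."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # A single decoder object: both tasks below reuse exactly these weights.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # Task A: next-token logits
        self.rank_head = nn.Linear(d_model, 1)         # Task B: one relevance score

    def encode(self, segment_ids):                     # (n_segments, seg_len)
        return self.encoder(self.embed(segment_ids))   # (n_segments, seg_len, d_model)

    def summarize_step(self, memory, prefix_ids):
        # Generation pass: decode the summary prefix against the encoded input.
        out = self.decoder(self.embed(prefix_ids), memory)
        return self.lm_head(out)

    def rank_segments(self, memory, query_ids):
        # Ranking pass: reuse the same decoder, then pool and score each segment.
        out = self.decoder(self.embed(query_ids), memory)
        return self.rank_head(out.mean(dim=1)).squeeze(-1)  # (n_segments,)

model = SharedDecoderSketch()
segments = torch.randint(0, 1000, (3, 32))   # 3 query-prepended segments, 32 tokens each
memory = model.encode(segments)
# Generation: the decoder attends over ALL encoded segments at once.
flat_memory = memory.reshape(1, -1, memory.size(-1))
gen_logits = model.summarize_step(flat_memory, torch.randint(0, 1000, (1, 5)))
# Ranking: the very same decoder scores each segment separately.
scores = model.rank_segments(memory, torch.randint(0, 1000, (3, 4)))
print(gen_logits.shape, scores.shape)
```

The key point the sketch captures is that `self.decoder` appears only once: both forward passes flow through the same weights, and only the small output heads differ.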

The Ranking Mechanism

How exactly does a text generator produce a ranking score?

The researchers adapted a technique used in information retrieval. For the ranking task, they take the encoder’s representation of a segment and pass it through the shared decoder. They then apply a specialized “Head” (a small Feed-Forward Neural Network) to the output.

The ranking score generation is defined by this equation:

Equation for generating ranking scores using a Feed-Forward Neural Network.

Here:

  • \(Enc(S_i)\) is the encoded segment.
  • \(Dec_{LTR}\) is the shared decoder operating in “ranking mode.”
  • \(\hat{y}_i\) is the predicted relevance score for segment \(i\).

The model essentially produces a single number (logit) for each segment representing “how good is this segment for answering the query?”
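Putting the symbols above together, a plausible rendering of that equation (reconstructed from the description, so the paper's exact notation may differ) is:

\[
\hat{y}_i \;=\; \mathrm{FFNN}\big(\mathrm{Dec}_{LTR}\big(\mathrm{Enc}(S_i)\big)\big)
\]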

Training the Ranker: Where do labels come from?

To train a model to rank segments, you need “ground truth” data. You need to tell the model, “Segment 1 is bad, Segment 2 is excellent.” However, summarization datasets (like QMSum) only provide the source document and the final summary. They don’t tell you which sentences in the source document were used to write that summary.

To solve this, the authors created pseudo-labels. They used a probabilistic heuristic to measure the alignment between a source segment and the ground-truth summary.

Equation for scoring segment relevance based on span probability.

This formula calculates a score based on spans of text that appear in both the segment and the summary.

  • \(|\text{span}_j|\) is the length of a matching text span.
  • \(p_j\) is the probability that this span is actually relevant (derived from an external alignment tool called SUPERPAL).

Intuitively, if a segment contains many long phrases that also appear in the human-written summary, it gets a high score. These scores are then used to sort the segments, creating a “Gold Standard” ranking list for training.
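A tiny sketch of how such probability-weighted span scores could be turned into a ranking looks like the following; the spans are hard-coded here rather than produced by the SUPERPAL aligner, and the exact weighting in the paper may differ.

```python
def segment_score(spans):
    """spans: list of (span_length, relevance_probability) pairs aligned
    between this segment and the reference summary."""
    return sum(length * prob for length, prob in spans)

# Hypothetical alignments for three segments of a meeting transcript.
segments = {
    "seg_1": [(12, 0.91), (7, 0.66)],   # long, confident matches -> high score
    "seg_2": [(3, 0.40)],               # short, uncertain match -> low score
    "seg_3": [],                        # no overlap with the summary
}
ranking = sorted(segments, key=lambda s: segment_score(segments[s]), reverse=True)
print(ranking)   # ['seg_1', 'seg_2', 'seg_3'] -- the pseudo "gold" ordering
```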

The Joint Loss Function

Finally, to train the model, the authors combine two different objectives.

1. The Ranking Loss: They use a Listwise Softmax Cross-Entropy Loss. This is a standard loss function in Learning-to-Rank that looks at the entire list of segments and penalizes the model if the predicted order is wrong.

Softmax loss equation for the ranking task.
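One common way to write a listwise softmax cross-entropy loss, using the pseudo-label scores \(y_i\) and the predicted scores \(\hat{y}_i\) over \(N\) segments (the paper's normalization may differ slightly), is:

\[
\mathcal{L}_{LTR} \;=\; -\sum_{i=1}^{N} \frac{y_i}{\sum_{j=1}^{N} y_j}\,
\log \frac{\exp(\hat{y}_i)}{\sum_{j=1}^{N} \exp(\hat{y}_j)}
\]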

2. The Total Loss: The final training objective is a weighted sum of the generation loss (how well did it write the summary?) and the ranking loss (how well did it prioritize segments?).

Total loss equation combining generation and ranking.

The parameter \(\lambda\) (lambda) is a tuning knob that balances the two tasks. In their experiments, setting \(\lambda = 1\) worked best, meaning both tasks were treated as equally important.
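Written out, the combined objective described above takes the familiar multi-task form (a reconstruction from the prose, not a quotation of the paper's equation):

\[
\mathcal{L} \;=\; \mathcal{L}_{gen} + \lambda\,\mathcal{L}_{LTR}
\]

with \(\lambda = 1\) in the best-performing configuration.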


Experiments & Results

The researchers tested LTRSum on two challenging datasets:

  1. QMSum: Query-based meeting summarization (highly conversational, very long).
  2. SQuALITY: Question-focused summarization for stories (narrative text).

They compared their model against several strong baselines, including the standard SegEnc model and other recent systems such as QontSum, “Ranker-Generator,” and SOCRATIC.

Automatic Metrics (ROUGE and BERTScore)

The primary metrics used were ROUGE (which measures word overlap between the generated summary and the reference) and BERTScore (which measures semantic similarity).
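If you want to see what ROUGE actually measures, the open-source rouge-score package (pip install rouge-score) computes it in a few lines; the strings below are toy examples, not outputs from the paper.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The marketing team objected to the budget because of ad spending cuts."
candidate = "Marketing disagreed with the budget due to cuts in ad spending."

# score(target, prediction) returns precision/recall/F1 for each ROUGE variant.
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))
```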

Table 1: ROUGE and BERTScore comparisons.

Table 1 (above) reveals several key findings:

  • Dominance on QMSum: On the meeting dataset (Table 1a), LTRSum (the last row) outperforms all baselines across all metrics. It beats the contrastive learning model (QontSum) and the question-driven pre-training model (SOCRATIC).
  • Competitive on SQuALITY: On the story dataset (Table 1b), LTRSum performs comparably to the state-of-the-art. Notably, it achieves a significantly higher ROUGE-L score (+5.4% improvement over standard SegEnc). ROUGE-L measures the longest common subsequence, which is a good proxy for fluency and sentence structure.

Why the ROUGE-L boost? The authors noted that LTRSum tends to generate more concise summaries. By correctly ranking the important segments, the model avoids “rambling” or including irrelevant details that dilute the summary.

Table 2: Average summary lengths.

As seen in Table 2, LTRSum generates summaries that are generally shorter and tighter than the baseline SegEnc model, closer to the optimal length required to answer the queries.

Human Evaluation

Automatic metrics are useful, but they don’t tell the whole story. The researchers also conducted a human study where experts rated the summaries on three criteria:

  1. Fluency: Is the grammar and flow good?
  2. Relevance: Does it actually answer the specific query?
  3. Faithfulness: Is the information factually true to the source?

Table 3: Human evaluation results.

Table 3 shows that human judges preferred LTRSum, particularly for Relevance (4.36 vs 4.15 for QontSum) and Faithfulness. This confirms the hypothesis: because the model explicitly learns to rank segments, it is less likely to hallucinate or drift off-topic.

Does the Ranking Actually Work?

To verify whether the “Learning-to-Rank” component was actually learning to rank, the researchers measured NDCG (Normalized Discounted Cumulative Gain), a standard metric for ranking quality; a higher score means the relevant items appear closer to the top of the list.

The equations for calculating this are:

Equations for DCG and nDCG.

Basically, this math checks if the model put the most useful paragraphs at the top of its internal list.
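For reference, one standard formulation of these metrics (gain definitions vary slightly across papers) is:

\[
DCG@k \;=\; \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
nDCG@k \;=\; \frac{DCG@k}{IDCG@k}
\]

where \(rel_i\) is the relevance label of the item ranked at position \(i\), and \(IDCG@k\) is the DCG of the ideal ordering, so a perfect ranking scores 1.0.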

Figure 2: Segment retrieval performance (NDCG).

Figure 2 demonstrates that LTRSum (the rightmost group of bars) achieves higher NDCG scores than the baselines on the QMSum dataset (blue bars) and is competitive on SQuALITY. This indicates that the shared decoder did learn to discriminate between important and unimportant segments.


Deeper Analysis: Broad vs. Specific Queries

One of the most interesting findings in the paper was how the model behaves with different types of questions.

The researchers categorized queries into:

  • Broad Queries: Questions that require synthesizing information from many different parts of the document (e.g., “Summarize the discussion about price issues”).
  • Specific Queries: Questions targeting a single detail or a specific moment (e.g., “Why did the Marketing team disagree with Design?”).

Table 4: Performance comparison on broad vs. specific queries.

Table 4 shows the “Win/Tie/Lose” percentage of LTRSum against other models.

  • Broad Queries (The Strength): LTRSum dominates on broad queries. This makes sense—when an answer is scattered across the document, ranking all segments correctly is crucial to gathering the full picture.
  • Specific Queries (The Weakness): The model struggles slightly more with very specific queries compared to baselines like SOCRATIC. If the answer relies on a single sentence buried in a segment, and the ranking model gives that segment a slightly lower score, the information might be lost.

Error Analysis: Why does it fail?

No model is perfect. The authors performed a qualitative error analysis to understand where LTRSum falls short. They identified two main failure modes:

  1. Imbalanced Labels: In some cases, there are very few “gold” segments and many “noise” segments. This imbalance makes it hard for the ranker to learn, causing it to select segments that are partially relevant but miss the main point.
  2. Segment Summarizer Deficiency: Sometimes, the ranker works perfectly—it finds the correct segment—but the summarizer fails to extract the right sentence within that segment.

Table 5: Comparison of human vs. model summaries illustrating the two error types.

Table 5 illustrates these errors.

  • Left Example (Imbalance/Hallucination): The model identifies the right general area (budget concerns) but hallucinates details about durability that weren’t the main point of the query.
  • Right Example (Summarizer Deficiency): The model correctly retrieves Segment 9 (the gold segment). However, while the human summarizer extracted the nuance about “user-friendliness,” the model focused on the physical description of the buttons. It found the haystack but missed the needle.

Conclusion and Implications

The paper “Learning to Rank Salient Content for Query-focused Summarization” provides a compelling blueprint for handling long documents. By explicitly teaching a model to prioritize information via a secondary ranking objective, we can generate summaries that are not only more accurate but also more concise and faithful to the source.

Key Takeaways:

  1. Explicit Prioritization: Relying on implicit attention mechanisms isn’t enough for long documents. We need to force models to learn relevance explicitly.
  2. Efficiency: You don’t need a separate ranking model. A shared decoder can learn to rank and summarize simultaneously, saving resources.
  3. Relevance/Faithfulness: Better ranking leads to summaries that stick to the question and avoid hallucination.

This research paves the way for better search engines, smarter meeting assistants, and more reliable automated analysis tools. As we continue to feed AI larger and larger contexts, the ability to discern what matters becomes just as important as the ability to understand it.