In the vast ocean of digital information, finding exactly what you need often hinges on a very small set of words: keyphrases.
Keyphrase Generation (KPG) is the task of automatically reading a document and producing a concise list of phrases that capture its core concepts. Ideally, these keyphrases act as an index, helping with information retrieval, text summarization, and categorization.
However, KPG is deceptively difficult. It requires two competing capabilities:
- Recall: The ability to find every important concept (casting a wide net).
- Precision: The ability to ensure that the chosen concepts are accurate and not redundant (filtering the catch).
For years, researchers have struggled to build a single model that excels at both. If you optimize for recall, you often get a lot of noise. If you optimize for precision, you miss important details.
In this post, we will dive deep into a paper titled “ONE2SET + Large Language Model: Best Partners for Keyphrase Generation.” This research proposes a clever “Generate-then-Select” framework. It argues that instead of forcing one model to do everything, we should split the job: use a specialized Generator to maximize recall, and a Large Language Model (LLM) acting as a Selector to ensure precision.
We will break down how they used Optimal Transport theory to train a better generator and how they turned an LLM into a ruthless editor to filter the results.
Background: The Struggle for Balance
Before we get into the solution, we need to understand the nuances of the problem.
Present vs. Absent Keyphrases
Keyphrases come in two flavors:
- Present Keyphrases: These act like highlighters. They appear verbatim in the document text (e.g., finding the phrase “neural networks” inside a paper about AI).
- Absent Keyphrases: These require semantic understanding. They are concepts relevant to the paper that do not appear verbatim in the text. For example, a paper might discuss “logit adjustments” and “softmax,” and the correct keyphrase might be “probability distribution,” even though that exact phrase never appears in the text.
Generating absent keyphrases is significantly harder because the model cannot simply “copy and paste.”
The Paradigm Shift: One2Seq vs. One2Set
Traditionally, models treated KPG as a sequence generation task (One2Seq). Imagine asking a model to write a sentence where the words are the keyphrases. The issue here is order. Keyphrases are a set—it doesn’t matter if you say “AI, Data, Tech” or “Tech, AI, Data.” Forcing a sequential order on a set confuses the model during training.
The One2Set paradigm solved this by using “Control Codes.” Imagine the model has \(N\) slots (control codes). It tries to fill these slots with keyphrases in parallel. This approach is excellent for Recall because the model generates many options simultaneously.
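To make the paradigm difference concrete, here is a toy illustration of how the training targets differ (the token names and \(N=20\) are our own example, not the paper's):

```python
# One2Seq: keyphrases are forced into one ordered sequence, so the
# model is penalized for producing a different but equally valid order.
one2seq_target = "neural networks ; data mining ; AI <eos>"

# One2Set: N parallel slots, one per control code. Each slot holds
# either a keyphrase or a null token, and no ordering is imposed.
N = 20
keyphrases = ["neural networks", "data mining", "AI"]
one2set_targets = keyphrases + ["<null>"] * (N - len(keyphrases))
```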
Enter the LLM
Recently, Large Language Models (LLMs) like GPT-4 or LLaMA have entered the scene. They are brilliant at understanding context. However, they are computationally heavy, and when asked to generate keyphrases from scratch, they can be unpredictable—sometimes hallucinating or being overly verbose.
The researchers behind this paper asked a simple question: What if we stop trying to make one model do it all?
Preliminary Study: Why Split the Task?
The researchers started by benchmarking existing state-of-the-art models to see where they failed. They compared traditional models (like SetTrans) against LLMs (like LLaMA-2-7B).

As shown in Figure 1, there is a clear trade-off.
- Plot (a) - Present Keyphrase: Each model sits at a different point on the trade-off curve: as Precision (x-axis) goes up, Recall (y-axis) tends to go down.
- Plot (b) - Absent Keyphrase: This is even more stark. The numbers are lower overall, but the trade-off persists.
The researchers found that SetTrans (a One2Set model) naturally gravitates toward high recall. It throws out a lot of ideas. LLaMA, on the other hand, has a better semantic grasp but struggles to generate a comprehensive list from scratch without getting bogged down.
To confirm this, they ran a specific experiment comparing the recall capabilities of LLaMA-2-7B against SetTrans when both are forced to generate the same number of phrases.

Figure 2 tells a compelling story. The blue bars (SetTrans) are consistently higher than the pink bars (LLaMA-2-7B) for Recall (@M). This confirms that for the specific job of finding candidates (Recall), the specialized One2Set architecture is superior to the general-purpose LLM.
However, SetTrans generates a lot of “trash” along with the “treasure.” This leads to the proposed Generate-then-Select Framework:
- The Generator (SetTrans improved): Focus purely on Recall. Generate a large pool of candidates.
- The Selector (LLM): Focus purely on Precision. Filter the pool and keep only the best.
Part 1: The Generator (Improving One2Set with Optimal Transport)
The team chose the One2Set paradigm for the generator, but they identified a flaw in how these models are usually trained.
The Assignment Problem
In a One2Set model, you have \(N\) control codes (think of them as empty buckets) and a list of ground-truth keyphrases (the balls you want to put in the buckets). During training, you have to tell the model which bucket should have predicted which keyphrase.
Traditionally, this is done using Bipartite Matching: each bucket is paired with at most one keyphrase, in a strict 1-to-1 fashion.
- The Problem: You usually have more buckets (\(N=20\)) than actual keyphrases (\(M=5\)). This leaves 15 buckets assigned to “Null” (\(\emptyset\)). If a bucket gets assigned “Null” too often, it stops trying to learn anything useful. Furthermore, maybe a specific keyphrase could have been predicted by multiple buckets, but the rigid matching forces it into only one.
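For reference, the rigid matching is typically computed with the Hungarian algorithm. Here is a minimal sketch using SciPy, with a random cost matrix purely for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

M, N = 5, 20                 # 5 ground truths, 20 control codes
cost = np.random.rand(M, N)  # placeholder matching costs

# Hungarian matching: each truth is forced onto exactly one code.
truth_idx, code_idx = linear_sum_assignment(cost)

# The remaining N - M codes are all trained toward the null target.
null_codes = set(range(N)) - set(code_idx)
```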
The Solution: Optimal Transport (OT)
The researchers propose using Optimal Transport theory to assign these targets. Instead of a rigid “Marriage” matching, think of it as a supply chain problem.
- Suppliers (Ground Truths): Have a certain amount of “truth” to distribute.
- Demanders (Control Codes): Want to learn a keyphrase.
- Cost: How “expensive” (inaccurate) it is to assign a specific truth to a specific code based on what the code currently predicts.

Figure 3 visualizes this flow. We have predictions on top and ground truths on the bottom. We calculate a “Matching Score” matrix \(\mu_{ij}\) based on how similar the prediction is to the truth.
\[ \mu_{ij} = \frac{\mathcal{C}_{\mathrm{match}}(y_i, \hat{y}_j)^{\frac{1}{\tau}}}{\sum_{j'=1}^{N} \mathcal{C}_{\mathrm{match}}(y_i, \hat{y}_{j'})^{\frac{1}{\tau}}}, \]

Using this score, the algorithm calculates the “Cost” of assignment. If the model’s prediction is very close to the truth, the cost is low (a negative score), making it a desirable assignment.
\[ c_{ij} = \begin{cases} -\mu_{ij}, & \text{if } y_i \neq \emptyset, \\ 0, & \text{otherwise}. \end{cases} \]

Why does this matter?
The magic happens in how they define the “Supply” (\(s_i\)). In standard matching, every keyphrase has a supply of 1. It can go to one bucket. In this OT approach, they use a dynamic supply:
\[ s_i = \begin{cases} \left[ \sum \mathrm{topK}\left( \{\mu_{ij}\}_{j=1}^{N},\, k \right) \right], & \text{if } y_i \neq \emptyset, \\ N - \sum_{y_{i'} \neq \emptyset} s_{i'}, & \text{otherwise}. \end{cases} \]

Basically, if a ground-truth keyphrase matches well with multiple predictions, the algorithm increases its supply. One truth can now teach multiple control codes. This prevents control codes from starving (learning nothing) and drastically improves the model’s ability to recall keyphrases later.
Finally, they solve for the optimal plan \(\pi^*\) that minimizes total cost:
\[ \begin{aligned} \pi^* = \arg\min_{\pi} \; & \sum_{i=1}^{M} \sum_{j=1}^{N} c_{ij} \pi_{ij}, \quad \pi \in \mathbb{R}^{M \times N}, \\ \text{s.t.} \; & \sum_{i=1}^{M} \pi_{ij} = d_j, \quad \sum_{j=1}^{N} \pi_{ij} = s_i, \\ & \sum_{i=1}^{M} s_i = \sum_{j=1}^{N} d_j, \quad \pi_{ij} \geq 0. \end{aligned} \]

By solving this optimization, the generator learns more robustly, producing a wider variety of correct candidates.
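Putting the pieces together, here is a minimal NumPy sketch of the assignment using the POT (Python Optimal Transport) library. This is our own illustration, not the authors' code; it assumes a positive matching-score matrix and that the total keyphrase supply does not exceed the number of control codes:

```python
import numpy as np
import ot  # POT: pip install pot

def ot_assignment(match_scores, tau=0.5, k=3):
    """Assign ground truths to control codes via optimal transport.

    match_scores: (M, N) array of C_match(y_i, y_hat_j) > 0 for the
    M non-null ground truths against N control-code predictions.
    """
    M, N = match_scores.shape

    # Matching score mu_ij: sharpen with exponent 1/tau, row-normalize.
    sharp = match_scores ** (1.0 / tau)
    mu = sharp / sharp.sum(axis=1, keepdims=True)

    # Cost of assigning truth i to code j; the extra null row costs 0.
    cost = np.vstack([-mu, np.zeros((1, N))])

    # Dynamic supply: rounded sum of the top-k scores (at least 1),
    # so a well-matched truth can supervise several codes.
    topk_sum = np.sort(mu, axis=1)[:, -k:].sum(axis=1)
    s = np.maximum(np.round(topk_sum), 1.0)

    # The null target absorbs the leftover supply (assumes
    # s.sum() <= N, which holds when M * k is well below N).
    supplies = np.concatenate([s, [N - s.sum()]])
    demands = np.ones(N)  # every control code demands one target

    # Solve for the plan pi* that minimizes the total transport cost.
    pi = ot.emd(supplies, demands, cost)
    return pi  # pi[i, j] > 0 => truth i supervises code j
```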
Part 2: The Selector (LLM as a Sequence Labeler)
Now that we have a Generator producing a massive list of potential keyphrases (both good and bad), we need to filter them. The team employs LLaMA-2-7B for this.
The Problem with Reranking
The standard approach to filtering is “Reranking.” You give the LLM a list of candidates and ask it to score each one individually (e.g., “Score: 0.9”). You then keep the top 10.
The fatal flaw here is Semantic Repetition.
- Candidate A: “Safe hazard” (Score 0.8)
- Candidate B: “Safe problem” (Score 0.79)
These two might mean the exact same thing in context. A reranker looks at them individually and thinks both are good, so it keeps both. This hurts diversity and precision.
The Solution: Sequence Labeling
The researchers propose a Sequence Labeling approach. They feed the entire list of candidates to the LLM at once and ask it to output a sequence of decisions: True (T) or False (F).

As seen in Figure 4:
- (a) Conventional Reranking: Scores items individually. It keeps “safe problem” and “safe hazard” because both have high scores.
- (b) Sequence Labeling: The model outputs a sequence “T F T T”. Crucially, because the LLM is autoregressive (it generates one token after another), when it decides the label for the second candidate, it is aware that it just marked the first one as True.
If the model sees it just selected “Safe problem” (T), and the next item is the synonym “Safe hazard,” it recognizes the redundancy and marks it False (F). This dynamic context awareness is a massive advantage.
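In practice, this amounts to building one prompt that contains all candidates, letting the LLM emit one T/F token per candidate, and keeping the candidates labeled T. A hedged sketch (the prompt wording and helper names are our own, not the paper's exact template):

```python
def build_selector_prompt(document: str, candidates: list[str]) -> str:
    """One prompt covering the whole candidate list (sequence labeling)."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Document:\n{document}\n\n"
        f"Candidate keyphrases:\n{numbered}\n\n"
        "For each candidate in order, output T (keep) or F (discard), "
        "separated by spaces:\n"
    )

def parse_selection(llm_output: str, candidates: list[str]) -> list[str]:
    """Keep only the candidates whose emitted label is T."""
    labels = llm_output.strip().split()
    return [c for c, lab in zip(candidates, labels) if lab == "T"]
```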
Handling Imbalanced Data
Since the generator produces many more bad candidates than good ones, the training data for the Selector is imbalanced (lots of “F”s, few “T”s). To prevent the LLM from becoming lazy and just guessing “F” every time, they use a balanced loss function:
\[ \mathcal{L}(\phi) = -\frac{1}{N_T} \sum_{t=1}^{|Y|} \mathbb{I}_{\{Y_t = T\}} \log p_{\phi}(Y_t \mid X, Y_{<t}) - \frac{1}{N_F} \sum_{t=1}^{|Y|} \mathbb{I}_{\{Y_t = F\}} \log p_{\phi}(Y_t \mid X, Y_{<t}). \]

This equation averages the loss for “True” labels and “False” labels separately (note the class-wise normalizers \(N_T\) and \(N_F\)), ensuring that correct selections are weighted just as heavily as rejections, regardless of how few there are.
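In PyTorch-style code, the balanced loss might look like the sketch below. We only score the positions that carry a T/F label; the function and variable names are ours:

```python
import torch
import torch.nn.functional as F

def balanced_label_loss(logits, labels, t_id, f_id):
    """Class-balanced NLL over the T/F label positions (a sketch).

    logits: (L, V) LLM output logits at the L label positions.
    labels: (L,) target token ids, each equal to t_id or f_id.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs[torch.arange(labels.size(0)), labels]

    t_mask = labels == t_id
    f_mask = labels == f_id
    n_t = t_mask.sum().clamp(min=1)  # avoid division by zero
    n_f = f_mask.sum().clamp(min=1)

    # Average within each class separately, so the rare "T" labels
    # carry as much total weight as the abundant "F" labels.
    return -(token_ll[t_mask].sum() / n_t + token_ll[f_mask].sum() / n_f)
```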
R-tuning S-infer Strategy
One final clever trick: The order of candidates matters.
- Training (R-tuning): They feed candidates in Random order. This forces the LLM to actually read and understand the text rather than relying on the position (e.g., “the first ones are usually the best”).
- Inference (S-infer): When actually using the model, they Sort the candidates by the generator’s confidence score. This puts the most likely candidates first, helping the LLM make high-quality decisions early in the sequence.
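The asymmetry is easy to express in code; a minimal sketch (function and argument names are ours):

```python
import random

def order_candidates(candidates, scores, training):
    """R-tuning: shuffle during training. S-infer: sort at inference."""
    paired = list(zip(candidates, scores))
    if training:
        random.shuffle(paired)            # break positional shortcuts
    else:
        paired.sort(key=lambda p: -p[1])  # highest generator confidence first
    return [c for c, _ in paired]
```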
Experiments and Results
Does this complex setup actually work? The team tested the framework on five benchmark datasets (Inspec, Krapivin, NUS, SemEval, KP20k).
Main Performance
The results were overwhelming.

Looking at Table 2, the proposed framework (“Our generator + Our selector”) achieves the best performance (bold numbers) in almost every category.
- Absent Keyphrases (The hard part): Look at the “Abs” columns. The improvement is massive compared to previous state-of-the-art models like SimCKP. This confirms that the generator’s recall combined with the LLM’s understanding allows the system to find conceptual keyphrases that aren’t written in the text.
Visualizing the Improvement
To visualize exactly how much the new Generator helps, the researchers plotted the Recall/Precision curve again.

Figure 5 compares the new Generator (blue circles) against the standard SetTrans (gray triangles).
- Plot (b) Absent Keyphrase: The blue line is significantly higher than the others. This gap represents the “Optimal Transport” advantage—the model learned to predict absent keyphrases much more effectively because the training assignment wasn’t so rigid.
Reducing Redundancy
Finally, did the Sequence Labeling (T/F) strategy actually fix the redundancy issue?

Table 4 measures diversity. Lower numbers for emb_sim (embedding similarity) and Dup_token_ratio mean less repetition.
- Our Selector: Scores 0.222 on token duplication.
- SLM-Scorer (Traditional Reranker): Scores 0.330.
This proves that asking the LLM to output a sequence of True/False decisions significantly reduces the number of synonymous or repetitive keyphrases compared to traditional scoring.
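As a rough intuition for the duplication metric, a token-level duplication ratio can be computed along these lines (a plausible reading on our part; the paper's exact definition may differ):

```python
def dup_token_ratio(keyphrases):
    """Fraction of predicted tokens that are repeats of earlier tokens."""
    tokens = [tok for kp in keyphrases for tok in kp.lower().split()]
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)
```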
Conclusion
The paper “ONE2SET + Large Language Model” teaches us an important lesson about modern AI system design: Specialization wins.
Instead of hoping a single LLM can do everything, or sticking to older specialized architectures, this paper combines them.
- Specialized Generator: Uses Optimal Transport to maximize the “net” (Recall), ensuring no potential keyphrase is left behind.
- Generalist Selector: Uses an LLM’s semantic reasoning to “sort the catch” (Precision), removing noise and redundancy via context-aware sequence labeling.
The result is a framework that sets a new standard for Keyphrase Generation, particularly for those elusive “absent” keyphrases that represent the deeper meaning of a text. For students and researchers in NLP, this highlights the potential of Hybrid Systems—where the efficiency of smaller, specialized models meets the reasoning power of Large Language Models.