Bridging the Gap: How Domain Generalization Helps LLMs Master Keyphrases in New Fields
In the vast ocean of digital information, Keyphrase Generation (KPG) acts as a critical lighthouse. It condenses lengthy documents into a few punchy, representative phrases that summarize the core content. This technology powers search engines, document clustering, and recommendation systems.
Traditionally, training these models required massive datasets of documents paired with human-annotated keyphrases. This works perfectly fine for academic papers, where datasets like KP20k are abundant. But what happens when you need to generate keyphrases for a completely different domain—say, biomedical reports or news articles—where you have zero labeled data?
This is the challenge of Unsupervised Cross-Domain Keyphrase Generation.
Large Language Models (LLMs) like GPT-4 have shown incredible promise here. They are “few-shot learners,” meaning if you show them a few examples (demonstrations) in the prompt, they can perform a task reasonably well. However, a major problem remains: Which examples do you show? If you are processing a medical document but only have labeled computer science papers as examples, the LLM might get confused by the stylistic differences. This phenomenon is known as distribution shift.
In this post, we will dive deep into a recent research paper that proposes a novel solution: Seeking Rational Demonstrations (SRD). We will explore how the authors use advanced Domain Generalization theory to teach a retrieval model how to find the perfect examples, bridging the gap between different domains without needing a single label in the target field.
The Core Problem: Distribution Shift
To understand the solution, we must first quantify the problem. When we train a model on one type of data (Source Domain, e.g., academic papers) and test it on another (Target Domain, e.g., technical Q&A forums), the statistical properties of the text change. This is the distribution shift.
The authors of the paper visualized this shift using a metric called Maximum Mean Discrepancy (MMD). MMD measures the distance between two probability distributions. A higher MMD value means the datasets are more dissimilar.
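To build intuition, here is a minimal sketch of how a squared MMD estimate between two batches of document embeddings could be computed with an RBF kernel. The kernel choice, bandwidth, and embedding dimensions are illustrative assumptions, not the paper's exact setup.

```python
import torch

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between two embedding batches,
    using an RBF kernel (kernel and bandwidth are illustrative)."""
    def rbf(a, b):
        # Pairwise squared Euclidean distances turned into an RBF kernel matrix
        dists = torch.cdist(a, b) ** 2
        return torch.exp(-dists / (2 * bandwidth ** 2))

    k_xx = rbf(x, x).mean()  # average within-source similarity
    k_yy = rbf(y, y).mean()  # average within-target similarity
    k_xy = rbf(x, y).mean()  # average cross-domain similarity
    return k_xx + k_yy - 2 * k_xy

# Compare placeholder embeddings standing in for source (KP20k) vs. target documents
source_emb = torch.randn(128, 768)
target_emb = torch.randn(128, 768)
print(mmd_squared(source_emb, target_emb).item())
```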

As shown in Figure 1, the authors treated the KP20k dataset (Computer Science papers) as the “Source Domain.” They then calculated the distance to various other datasets. You can see that academic datasets like Inspec or NUS are somewhat close. However, datasets like KPBiomed (Medicine) or StackExchange (Technical websites) are far away.
The further the distance, the harder it is for a model trained on the source to generalize to the target. In the context of LLMs, if we simply pick random examples from the source domain to prompt the LLM for a target task, the mismatch in style and vocabulary can degrade performance. We need a way to find “rational” demonstrations—source examples that are semantically useful for the target input, despite the domain gap.
The Solution: Seeking Rational Demonstrations (SRD)
The authors propose a framework called SRD. The intuition is simple yet powerful: instead of randomly picking examples, let’s use a Retrieval Model to find the most relevant source examples for a given target input.
However, a standard retrieval model might also suffer from distribution shift. To fix this, the authors integrate Domain Generalization techniques directly into the training of the retriever. They force the retriever to learn a “common language” (feature space) where the source and target domains align.
Here is the high-level architecture of the SRD approach:

As illustrated in Figure 2, the process is split into two stages:
- Training Stage (Left): A dual-encoder retrieval model is trained. It minimizes a loss function that combines contrastive learning (matching queries to candidates) with MMD regularization. This aligns the feature distributions of the source and target domains.
- Inference Stage (Right): When a new, unlabeled target document arrives, the trained encoders retrieve the best labeled examples from the source dataset. These examples are then fed into the LLM as a “few-shot” prompt to generate the final keyphrases.
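To make the inference stage concrete, here is a minimal sketch of how retrieved demonstrations could be assembled into a few-shot prompt. The encoder interfaces, prompt template, and variable names are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def build_fewshot_prompt(target_doc, source_docs, source_keyphrases,
                         query_encoder, candidate_encoder, k=3):
    """Retrieve the top-k source examples for a target document and
    assemble a few-shot prompt (template wording is illustrative)."""
    q = query_encoder(target_doc)                               # query embedding, shape (d,)
    C = np.stack([candidate_encoder(d) for d in source_docs])   # candidate matrix, shape (n, d)
    scores = C @ q                                              # inner-product relevance scores
    top = np.argsort(-scores)[:k]                               # indices of the k best matches

    demos = "\n\n".join(
        f"Document: {source_docs[i]}\nKeyphrases: {', '.join(source_keyphrases[i])}"
        for i in top
    )
    return f"{demos}\n\nDocument: {target_doc}\nKeyphrases:"

# Example usage with stand-in encoders that return random vectors.
rng = np.random.default_rng(0)
fake_encoder = lambda text: rng.random(64)
prompt = build_fewshot_prompt(
    "A biomedical abstract about protein folding ...",
    ["source doc A ...", "source doc B ...", "source doc C ..."],
    [["neural network"], ["graph mining"], ["topic modeling"]],
    fake_encoder, fake_encoder, k=2,
)
```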
Let’s break down the mathematics and mechanics of how this works.
1. The Retrieval Mechanism
The goal is to find source samples (\(S\)) that maximize the probability of generating the correct keyphrases (\(y\)) for a target input (\(x_t\)).
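Written out, the selection objective can be sketched as follows (the notation below is our simplification rather than the paper's exact formulation):

\[
S^{*} \;=\; \arg\max_{S \subseteq \mathcal{D}_s}\; p_{\mathrm{LLM}}\!\left(y \mid S, x_t\right),
\]

where \(\mathcal{D}_s\) is the labeled source dataset and \(S\) is the small set of demonstrations placed in the prompt.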

The system uses a Dual-Encoder architecture (similar to DPR - Dense Passage Retrieval). One encoder processes the query (target document), and the other processes the candidates (source documents).
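A minimal sketch of such a dual encoder, trained with an in-batch contrastive (InfoNCE-style) objective, might look like the following. The linear layers are stand-ins for the transformer encoders actually used, and the temperature-free loss is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Minimal dual encoder: separate query and candidate encoders that
    score relevance by inner product (stand-ins for DPR-style transformers)."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.query_enc = nn.Linear(dim, hidden)  # encodes target (query) documents
        self.cand_enc = nn.Linear(dim, hidden)   # encodes source (candidate) documents

    def forward(self, queries, candidates):
        q = F.normalize(self.query_enc(queries), dim=-1)
        c = F.normalize(self.cand_enc(candidates), dim=-1)
        return q @ c.T                           # similarity matrix, shape (batch_q, batch_c)

# In-batch contrastive training: candidate i is the positive example for query i.
model = DualEncoder()
scores = model(torch.randn(16, 768), torch.randn(16, 768))
contrastive_loss = F.cross_entropy(scores, torch.arange(16))
```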
Since there are no labels in the target domain, the model initially needs to learn what “relevance” looks like using only the source domain. The authors construct positive and negative pairs from the source dataset by comparing keyphrases. They calculate a Relevance Score based on both semantic embedding similarity and Jaccard similarity (word overlap).

If this score is above a certain threshold, the pair is considered a match (positive sample); otherwise, it is a negative sample. This allows the model to learn basic retrieval logic.
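A rough sketch of how such a relevance score could be computed is shown below. The blending weight, the threshold, and the use of cosine similarity are illustrative assumptions; the paper defines its own values.

```python
import numpy as np

def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two keyphrase sets."""
    a, b = set(a), set(b)
    return len(a & b) / max(len(a | b), 1)

def relevance_score(kp_i, kp_j, emb_i, emb_j, alpha=0.5):
    """Blend of embedding cosine similarity and keyphrase Jaccard overlap
    between two source documents (alpha is an illustrative weight)."""
    cos = float(np.dot(emb_i, emb_j) /
                (np.linalg.norm(emb_i) * np.linalg.norm(emb_j) + 1e-8))
    return alpha * cos + (1 - alpha) * jaccard(kp_i, kp_j)

# A source pair counts as a positive training example if its score clears a threshold.
THRESHOLD = 0.5  # assumed value
is_positive = relevance_score(
    ["graph neural network", "link prediction"],
    ["graph neural network", "node classification"],
    np.random.rand(768), np.random.rand(768),
) > THRESHOLD
```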
2. Conquering Distribution Shift with MMD
Training on source pairs is not enough. If the model only sees source data, it will overfit to the source style. When it later encounters a target document, it may map that document into a region of the feature space it has never learned to handle, producing unreliable similarity scores.
To prevent this, the authors introduce a Domain Projection Loss based on MMD.
The theoretical basis comes from the concept of \(\mathcal{H}\)-divergence, which bounds the error risk when moving between domains.

The theory suggests that the error on the target domain is bounded by the error on the source domain plus the divergence (distance) between the two domains.
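One standard statement of this result, in the spirit of Ben-David et al.'s domain adaptation theory (the paper's exact notation may differ), is:

\[
\epsilon_t(h) \;\le\; \epsilon_s(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}\!\left(\mathcal{D}_s, \mathcal{D}_t\right) \;+\; \lambda,
\]

where \(\epsilon_s\) and \(\epsilon_t\) are the source and target risks of a hypothesis \(h\), the middle term is the divergence between the two domains, and \(\lambda\) is the error of the best joint hypothesis on both domains.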

To minimize the error on the target domain (\(\epsilon_t\)), we must minimize the divergence between the domains. The authors achieve this by minimizing the squared MMD between the source distribution (\(\mathcal{D}_s\)) and the target distribution (\(\mathcal{D}_t\)) in the feature space.

In simpler terms, this equation forces the encoder to map source documents and target documents to the same area in the vector space. It acts like a magnet, pulling the two distinct “clouds” of data points together until they overlap.
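A minimal sketch of this alignment term, using the simple linear-kernel form of MMD (the squared distance between the two batches' mean feature vectors), could look like this. The encoder here is a toy linear layer; the paper uses its dual-encoder features and possibly a different kernel.

```python
import torch
import torch.nn as nn

def mmd_loss(source_feats, target_feats):
    """Squared MMD with a linear kernel: the distance between the mean
    feature vectors of the two batches (a common simplification)."""
    return (source_feats.mean(dim=0) - target_feats.mean(dim=0)).pow(2).sum()

# Toy illustration: a shared feature space for both domains.
encoder = nn.Linear(768, 256)                 # stand-in for the dual-encoder backbone
source_feats = encoder(torch.randn(32, 768))  # labeled source batch (e.g. KP20k)
target_feats = encoder(torch.randn(32, 768))  # unlabeled target batch
alignment_penalty = mmd_loss(source_feats, target_feats)
```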
3. Preserving Domain Characteristics
However, simply smashing the two distributions together can be dangerous. If you align them perfectly, you risk washing away the unique characteristics that make specific documents distinct, the very signals needed for accurate retrieval.
To solve this, the authors introduce a Domain Characteristic Loss (or Orthogonality Loss).
They compute the mean (\(\mu\)) and variance (\(\sigma\)) vectors for the source and target batches, and then push these statistics toward being orthogonal (perpendicular) to each other.

By minimizing the dot product between the source and target statistics, the model is encouraged to keep the domain-specific “style” information separate from the shared semantic content. This ensures the representation remains rich and diverse, preventing the “feature collapse” that can happen with aggressive MMD alignment.
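A sketch of this idea, penalizing the squared dot products between the batch-level statistics of the two domains, is shown below. Squaring drives the dot products toward zero (orthogonality); the paper's exact formulation may differ.

```python
import torch

def domain_characteristic_loss(source_feats, target_feats):
    """Push the mean and variance vectors of the source and target batches
    toward orthogonality (an illustrative formulation)."""
    mu_s, var_s = source_feats.mean(dim=0), source_feats.var(dim=0)
    mu_t, var_t = target_feats.mean(dim=0), target_feats.var(dim=0)
    return torch.dot(mu_s, mu_t).pow(2) + torch.dot(var_s, var_t).pow(2)

# Toy check with random features standing in for encoder outputs.
loss = domain_characteristic_loss(torch.randn(32, 256), torch.randn(32, 256))
```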
4. The Unified Objective
Finally, the training objective combines three components:
- Contrastive Loss (\(\mathcal{L}_{contrastive}\)): Standard retrieval training (make positive pairs close, negative pairs far).
- MMD Loss: Align the domains so the retriever works for the target domain.
- Domain Loss (\(\mathcal{L}_{domain}\)): Preserve specific characteristics via orthogonality.

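As a weighted sum (the \(\lambda\) coefficients below are placeholders for whatever balance the authors choose), the full objective can be written as:

\[
\mathcal{L} \;=\; \mathcal{L}_{contrastive} \;+\; \lambda_{1}\,\mathcal{L}_{MMD} \;+\; \lambda_{2}\,\mathcal{L}_{domain}.
\]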
This holistic approach trains a retriever that is robust, domain-agnostic, yet sensitive to nuances.
Experimental Setup and Results
Does this complex mathematical framework actually translate to better keyphrases? The researchers tested their approach against several baselines on five diverse datasets.
The Datasets
The breakdown of the test datasets is shown below. Note the variety: StackExchange (Tech), DUC-2001 (News), and KPBiomed (Medicine). All models were trained/sourced from KP20k (Academic CS papers).

Performance Comparison
The results, measured in F1 score (the balance of precision and recall) and Recall (the fraction of gold keyphrases recovered), are highly impressive.

Key takeaways from Table 2:
- Baselines Struggle: Previous unsupervised methods like AutoKeyGen and UOKG achieve average F1@5 scores around 13-14%.
- LLMs Need Help: A raw Llama3.3-70b or ChatGPT-3.5 without optimized demonstrations performs poorly (e.g., Llama gets 4.10% average on absent keyphrases).
- SRD Excellence: When the LLMs are prompted with demonstrations retrieved by the SRD method (Ours), performance skyrockets.
  - Ours(GPT4o) achieves a massive 26.50% average F1 score for present keyphrases.
  - Even the smaller Ours(DeepSeek) model is highly competitive, scoring 25.26%.
- Absent Keyphrases: Generating keyphrases that don’t appear in the text (Absent) is notoriously hard. SRD improves the Recall@10 significantly compared to zero-shot approaches.
Robustness of Sampling
One might wonder: how much data do we need to construct the query set for training the retriever? The authors analyzed the performance based on the ratio of samples used.

Figure 3 shows that performance generally peaks when using about 20% to 30% of the data. Interestingly, using too much data (40%) can sometimes hurt performance, likely because the query set starts to accumulate redundant or noisy samples. This suggests the method is efficient and doesn’t require the entire dataset to be effective.
Conclusion
The “Seeking Rational Demonstrations” (SRD) approach represents a significant step forward in applying Large Language Models to specialized domains.
By acknowledging the distribution shift between training data (like academic papers) and real-world applications (like medical reports), the authors crafted a solution that doesn’t just hope for the best. Instead, they mathematically forced the retrieval model to “bridge the gap” using MMD and Orthogonality losses.
The result is a system where an LLM can be dropped into a completely new environment—without any labeled training data—and still generate accurate, high-quality keyphrases, simply because it is being fed the most “rational” and relevant examples from the labeled source data.
For students and practitioners in NLP, this paper serves as a perfect example of how classical machine learning theory (Domain Generalization) can be combined with modern Generative AI to solve the persistent problem of data scarcity.