Introduction
Consider the phrase: “There is no greater love than to lay down one’s life for one’s friends.”
This is a biblical passage from John 15:13. In a traditional religious context, it might be used to discuss spiritual sacrifice. However, in the modern landscape of social media, the same text can be mobilized for vastly different purposes. It might be used in a tweet celebrating Pride Month to advocate for unconditional acceptance. Simultaneously, it could be used by a political leader to justify a military operation.
While the text remains identical, the context—and therefore the implied topic—shifts dramatically.
In the field of Natural Language Processing (NLP), we have become very good at Text Reuse Detection (TRD). We can easily identify that the Pride Month tweet and the political speech both contain the same quote. However, standard detection algorithms largely ignore recontextualization—the way the meaning or topic of a reused text transforms based on its new environment.
This blog post explores a framework proposed by researchers Francesco Periti, Pierluigi Cassotti, Stefano Montanelli, Nina Tahmasebi, and Dominik Schlechtweg. Their work, titled TROTR (Topic Relatedness of Text Reuse), introduces a novel method to evaluate how reused text changes topics over time and across documents. By shifting focus from simple “detection” to “topic relatedness,” they offer a path toward understanding complex linguistic phenomena like irony, political dog-whistles, and historical reception.
The Problem: Detection vs. Understanding
Traditional Text Reuse Detection assumes that if Text A appears in Document B, they are topically related. This assumption works well for plagiarism detection or tracking the spread of a news wire story. It fails, however, when analyzing how text is repurposed in dynamic discourse, such as on social media.
The researchers identified a gap in current NLP capabilities. While models can measure Semantic Textual Similarity (how much two texts mean the same thing conceptually), they struggle with Topic Relatedness.
To understand the distinction, look at the table below. It compares how standard NLP metrics view different pairings of text.

As shown in Table 1:
- Paraphrase: The first pair uses different words to say the exact same thing (semantic similarity).
- Topic Relatedness: The second pair discusses “Pride Month” and “LGBTQIA+ rights.” They are not paraphrases, but they share a high degree of topic relatedness.
- The Trap: The third pair creates a problem. Both texts contain the exact same biblical quote (John 15:13). A standard text-matching algorithm sees 100% overlap in the quote and assumes they are related. However, one is about Pride Month and the other is about a military operation in Ukraine. In terms of topic, they are unrelated, despite sharing the reused text.
The TROTR framework was designed to solve this specific “trap” by teaching models to look past the reused text and analyze the surrounding context.
The TROTR Framework
The authors propose a framework that treats text reuse not as a binary “match/no-match” task, but as a graded evaluation of context. The framework consists of two distinct NLP tasks and a robust annotation process.

As illustrated in Figure 1, the framework is divided into two main tasks: TRiC and TRaC.
1. Text Reuse in-Context (TRiC)
This is a context-pair level task. The model is presented with a specific instance of reused text (Target \(t\)) appearing in two different contexts (\(c_1\) and \(c_2\)).
For example:
- Target (\(t\)): “Love your neighbor”
- Context 1 (\(c_1\)): [A sermon about charity] + “Love your neighbor” + [Call to donate].
- Context 2 (\(c_2\)): [A political debate on immigration] + “Love your neighbor” + [Argument for policy change].
TRiC asks two questions:
- Binary Classification: Do \(c_1\) and \(c_2\) share roughly the same topic? (Yes/No)
- Ranking: On a continuous scale, how related are the topics of \(c_1\) and \(c_2\)?
This task structure is similar to the “Word-in-Context” (WiC) tasks used to distinguish word senses (e.g., distinguishing the “bank” of a river from a financial “bank”). However, TRiC applies this logic to whole text sequences rather than single words.
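For intuition, here is a minimal sketch of how a TRiC-style score could be computed with an off-the-shelf bi-encoder. The library (`sentence-transformers`), the model checkpoint, and the decision threshold are illustrative assumptions, not the authors' exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative bi-encoder checkpoint; the paper evaluates many SBERT variants.
model = SentenceTransformer("all-MiniLM-L6-v2")

c1 = "A sermon about charity. Love your neighbor. Please donate to the food bank."
c2 = "In the immigration debate, remember: love your neighbor. We need a humane policy."

# Encode each context independently, then compare with cosine similarity.
emb1, emb2 = model.encode([c1, c2], convert_to_tensor=True)
score = util.cos_sim(emb1, emb2).item()   # graded relatedness (TRiC ranking subtask)

THRESHOLD = 0.5                           # illustrative cut-off, not the paper's value
same_topic = score >= THRESHOLD           # binary decision (TRiC classification subtask)
print(f"relatedness={score:.3f}, same_topic={same_topic}")
```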
2. Topic Variation Ranking across Corpus (TRaC)
This is a corpus-level task. Instead of looking at a single pair, TRaC looks at the target text’s behavior across a whole dataset.
If a specific quote is always used in the exact same way (e.g., a technical definition), it has low topic variation (score closer to 0). If a quote is used in many different ways (e.g., a biblical verse used for grief, celebration, politics, and humor), it has high topic variation (score closer to 1).
The goal of TRaC is to rank different reused texts based on how “versatile” or “variable” their usage is across the corpus. This is crucial for diachronic studies—understanding how the usage of a phrase evolves over history.
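One straightforward way to operationalize such a variation score is to average the pairwise topic distances among all contexts in which a target appears. The sketch below reuses the illustrative bi-encoder from above; it is an assumed formulation for intuition, not necessarily the exact metric used in the paper.

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def topic_variation(contexts: list[str]) -> float:
    """Illustrative TRaC-style score: mean pairwise topic distance
    (closer to 0 = stable usage, closer to 1 = highly variable usage)."""
    embeddings = model.encode(contexts, convert_to_tensor=True)
    distances = [
        1.0 - util.cos_sim(embeddings[i], embeddings[j]).item()
        for i, j in combinations(range(len(contexts)), 2)
    ]
    return sum(distances) / len(distances)

# Rank target passages by how variably they are used across a (toy) corpus.
corpus = {
    "John 15:13": ["Pride month tweet ...", "Military speech ...", "Funeral eulogy ..."],
    "John 3:16":  ["Sunday sermon ...", "Another sermon ...", "Bible study note ..."],
}
ranking = sorted(corpus, key=lambda target: topic_variation(corpus[target]), reverse=True)
print(ranking)
```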
The Benchmark Data: Bible Tweets
To train and test this framework, the researchers needed a dataset where text is reused frequently and in highly variable ways. They chose biblical passages on Twitter (now X).
Why the Bible?
- High Reuse: Biblical verses are among the most quoted texts in history.
- Explicit Citations: Verses usually come with chapter-and-verse references (e.g., “John 3:16”), so they can be collected without complex detection algorithms.
- High Context Variety: As noted in the introduction, religious texts are applied to personal struggles, political movements, sports victories, and memes.
The authors collected tweets containing 42 different target passages. They then employed human annotators to judge pairs of these tweets on a 4-point scale:
- 4 (Identical): The topics are the same.
- 3 (Closely Related): The topics are very similar.
- 2 (Distantly Related): There is a slight connection.
- 1 (Unrelated): The topics have nothing in common.
The resulting dataset contains over 6,300 annotated pairs. Interestingly, the inter-annotator agreement was quite high (correlation of .811), suggesting that humans generally agree on when a quote is being “recontextualized” versus when it is being used in its original sense.
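To make the annotation scheme concrete, here is a rough sketch of how per-pair judgments could be collapsed into the graded and binary gold labels that TRiC needs. The aggregation choices (median, binarized at “closely related” or above) are illustrative assumptions, not necessarily the benchmark’s exact procedure.

```python
from statistics import median

def aggregate(judgments: list[int]) -> tuple[float, int]:
    """Collapse per-pair annotator judgments (1-4) into gold labels.

    Median aggregation and the >= 3 binarization threshold are illustrative
    conventions, not a description of how the TROTR benchmark was actually built.
    """
    graded = median(judgments)   # graded gold score for the ranking subtask
    binary = int(graded >= 3)    # 1 = same/closely related topic, 0 = otherwise
    return graded, binary

print(aggregate([4, 3, 4]))  # -> (4, 1): annotators agree the topics match
print(aggregate([1, 2, 1]))  # -> (1, 0): annotators agree the topics differ
```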
Experimental Setup: SBERT Models
To evaluate the difficulty of the TROTR tasks, the researchers tested 36 different Sentence-BERT (SBERT) models. SBERT is a modification of the standard BERT network that uses “siamese networks” to derive semantically meaningful sentence embeddings.
They tested two main architectures:
- Bi-Encoders: These models process the two contexts independently. They create a vector embedding for Context A and a separate vector for Context B, then calculate the cosine similarity between them. This is fast and efficient.
- Cross-Encoders: These models process the two contexts simultaneously as a single input pair. The model’s “attention” mechanism can look at words in Context A and Context B at the same time. This is usually more accurate but computationally expensive.
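The practical difference between the two architectures is easy to see in code. The sketch below uses `sentence-transformers` with illustrative checkpoints; these specific models are assumptions, not the 36 models evaluated in the paper.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

c1 = "Pride month is here! Greater love has no one than this ..."
c2 = "Putin quoted: greater love has no one than this ..."

# Bi-encoder: encode each context independently, then compare the two vectors.
bi = SentenceTransformer("all-mpnet-base-v2")             # illustrative checkpoint
e1, e2 = bi.encode([c1, c2], convert_to_tensor=True)
bi_score = util.cos_sim(e1, e2).item()

# Cross-encoder: feed both contexts as one input pair and score them jointly.
cross = CrossEncoder("cross-encoder/stsb-roberta-base")   # illustrative checkpoint
cross_score = cross.predict([(c1, c2)])[0]

print(f"bi-encoder: {bi_score:.3f}, cross-encoder: {cross_score:.3f}")
```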
The “Masking” Strategy
The researchers hypothesized that models might cheat. Because the reused text (the quote) is identical in both contexts, a model might see the identical words and assume “High Similarity,” ignoring the surrounding context.
To test this, they introduced a +MASK setting. In this setting, the reused text itself is replaced with a placeholder (e.g., a dash “-”).
- Original: “Putin quoted: Greater love has no one…” vs “Pride month is here! Greater love has no one…”
- Masked: “Putin quoted: -…” vs “Pride month is here! -…”
By masking the quote, the model is forced to rely entirely on the surrounding words (“Putin,” “military,” vs. “Pride,” “month”) to determine if the topics are related.
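A masking step like this is simple to implement. The helper below is a hypothetical sketch of the idea; the paper's exact placeholder handling may differ.

```python
import re

QUOTE = "Greater love has no one than this, to lay down one's life for one's friends"

def mask_reuse(context: str, target: str, placeholder: str = "-") -> str:
    """Replace the reused passage with a placeholder so only the surrounding context remains."""
    return re.sub(re.escape(target), placeholder, context, flags=re.IGNORECASE)

c1 = f"Pride month is here! {QUOTE}. Love is love."
c2 = f"Putin quoted: {QUOTE}, praising the troops."

print(mask_reuse(c1, QUOTE))  # "Pride month is here! -. Love is love."
print(mask_reuse(c2, QUOTE))  # "Putin quoted: -, praising the troops."
```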
Results and Analysis
The evaluation yielded several counter-intuitive and revealing results regarding how current AI models handle context.
1. Bi-Encoders Outperform Cross-Encoders
In many NLP tasks, Cross-Encoders are superior because they can compare inputs word-by-word. However, in the TRiC task, Bi-Encoders consistently performed better. This suggests that generating an independent representation of each context is more effective for topic relatedness than processing the pair jointly, where the Cross-Encoder's attention may over-focus on the verbatim overlap of the quote.
2. The Failure of Pre-Trained Models
Pre-trained SBERT models (those not fine-tuned on the TROTR dataset) struggled. As seen in the table below, they had high Precision for “Label 0” (identifying different topics) but very low Recall: they were heavily biased toward assuming texts were related because of the shared quote.

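To make the “high precision, low recall for Label 0” pattern concrete, here is a toy example with made-up labels showing what happens when a classifier almost always predicts “related”:

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and predictions from a model that, because of the
# shared quote, almost always predicts "related" (label 1).
gold = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Label 0 precision is perfect (the single "0" it predicts is correct),
# but recall is poor (it misses most genuinely unrelated pairs).
print(classification_report(gold, pred, digits=2))
```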
3. The “Similarity Bias” and the Power of Masking
The most significant finding is evident in the +MASK rows of Table 2.
When the researchers hid the quote (Masking), the models’ performance improved drastically.
- Standard performance (F1 Score): ~0.60 to 0.70
- Masked performance (F1 Score): ~0.80 to 0.90
This confirms the “Similarity Bias.” When the quote is visible, the model gets distracted by the identical words and fails to notice that the conversation around the words has changed. By removing the quote, the model is forced to look at the context, where the true topic signal lies.
4. Results on Corpus-Level Variation (TRaC)
The TRaC task results mirrored the TRiC findings. The goal here was to rank quotes by how much their usage varies.

As shown in Table 3, the correlation with human judgments (Spearman) jumped significantly when masking was applied (e.g., from .72 to .84 for the ADR model). This suggests that to truly understand how a text’s usage changes across a corpus, we must—paradoxically—ignore the text itself and look exclusively at its environment.
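For reference, the Spearman correlation between model and human variation rankings can be computed in a couple of lines; the scores below are made up purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical per-target variation scores (one value per reused passage).
human_scores = [0.12, 0.48, 0.91, 0.33, 0.75]
model_scores = [0.20, 0.40, 0.85, 0.30, 0.60]

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```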
Conclusion and Implications
The TROTR framework highlights a critical limitation in current NLP: our models are often too focused on lexical overlap (matching words) and not enough on recontextualization (matching meaning).
The study shows that while standard SBERT models struggle to distinguish between a quote used in a sermon and a quote used in a political attack, a simple intervention like masking can force the models to pay attention to the context.
Why Does This Matter?
- Misinformation and Propaganda: Political actors often reuse trusted text (scientific studies, religious texts, historical quotes) to support unrelated or misleading narratives. TROTR provides a way to detect when a source is being “hijacked” for a different topic.
- Digital Humanities: For historians and linguists, this offers a computational way to study “Reception Theory”—tracking how the perception of a literary work (like Hamlet or the Bible) changes over centuries.
- Dog Whistles: This framework could eventually help identify “dog whistles”—phrases that seem innocent in general context (high similarity) but carry a specific, different meaning to a subgroup (low topic relatedness to the mainstream usage).
By formalizing the study of recontextualization, the researchers have opened a new door for NLP to move beyond identifying what was said, to understanding how it is being used.