Language is a moving target. Words like “plane” or “mouse” mean something very different today than they did two hundred years ago. To teach computers how to understand these shifts—a field known as Lexical Semantic Change Detection (LSCD)—researchers need high-quality data. They need a way to map how a word is used in thousands of different contexts.

Enter the Word Usage Graph (WUG). This innovative approach moves away from rigid dictionary definitions and instead relies on how words relate to one another in actual sentences.

In this post, we dive deep into the paper “More DWUGs,” where researchers took a critical look at the largest existing WUG datasets. They didn’t just analyze them; they expanded them with thousands of new human judgments and rigorously tested their reliability. The results offer a roadmap for building better semantic datasets: it turns out that having more connections is often better than having more examples.

The Problem with Dictionaries

Traditionally, if you wanted to teach a computer the senses of a word, you would rely on a Use-Sense annotation model.

In this model, a human annotator looks at a sentence:

“She opened a vein in her little arm.”

And chooses the best definition from a pre-defined list:

  1. A human limb
  2. A weapon system

This seems straightforward, but it has major drawbacks. It requires a pre-existing sense inventory (a dictionary), which might be outdated, incomplete, or biased. It essentially forces new data into old boxes.

The Alternative: Use-Use Annotation

The WUG paradigm flips this on its head. Instead of matching a sentence to a definition, annotators simply compare two sentences (uses) and judge their semantic proximity.

Use 1: “…taking a knife, she opened a vein in her arm.”

Use 2: “He stood overlooking an arm of the sea.”

The annotator rates these on a scale: Are they identical? Related? Or completely unrelated?

When you do this for hundreds of sentences, you create a network. Sentences with similar meanings cluster together, while unrelated meanings drift apart. This forms a Word Usage Graph (WUG).

Figure 1: WUGs from DWUG V1 (SemEval clustering) and DiscoWUG V1: English plane (left), Swedish färg (middle), and German anpflanzen (right). Isolates were removed.

As shown in Figure 1 above, these graphs visually represent the “semantic space” of a word.

  • Nodes (dots): Individual instances of a word in a sentence.
  • Edges (lines): The semantic relationship between them.
  • Clusters (colors): Distinct senses of the word (e.g., the “airplane” sense of plane vs. the “geometric surface” sense).

However, the researchers identified a problem with the existing datasets (specifically from SemEval-2020). Because annotating every possible pair of sentences is impossible (the number of pairs grows quadratically), the graphs were sparse. There weren’t enough edges connecting the nodes. This paper sets out to fix that by adding rounds of annotation and testing if “denser” graphs tell a truer story.
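To put numbers on that quadratic growth: a graph over \(n\) uses has \(n(n-1)/2\) possible pairs, so for example

\[
\binom{180}{2} = 16{,}110 \quad \text{pairs for 180 uses,} \qquad \binom{50}{2} = 1{,}225 \quad \text{pairs for 50 uses.}
\]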

Core Method: Constructing the Graph

To understand the paper’s contribution, we first need to understand the machinery behind a Word Usage Graph.

1. The Graph Structure

Formally, a WUG is defined as a graph \(G = (U, E, W)\).

  • \(U\): The set of word uses (sentences).
  • \(E\): The edges connecting them.
  • \(W\): The weights of those edges, representing semantic proximity.

The researchers used the DURel relatedness scale, where humans judge pairs on a scale of 1 (Unrelated) to 4 (Identical).
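For a concrete picture of the data structure, here is a minimal sketch of how such a graph could be assembled with networkx; the use IDs, the toy judgments, and the median aggregation over annotators are illustrative assumptions, not the authors' exact pipeline.

```python
import statistics
import networkx as nx

# Toy pairwise judgments on the DURel scale (1 = unrelated, 4 = identical).
# Each node is one use (sentence) of the target word; IDs are made up here.
judgments = [
    ("arm_use_1", "arm_use_2", [4, 4, 3]),  # two "body part" uses
    ("arm_use_1", "arm_use_3", [1, 2, 1]),  # body part vs. "arm of the sea"
    ("arm_use_2", "arm_use_3", [1, 1, 2]),
]

# Build the WUG G = (U, E, W): uses are nodes, each annotated pair becomes
# an edge weighted by the aggregated judgment (here: the median).
wug = nx.Graph()
for use_a, use_b, ratings in judgments:
    wug.add_edge(use_a, use_b, weight=statistics.median(ratings))

print(wug.edges(data=True))
```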

2. Clustering: Finding the Meaning

Once the graph is built, how do we find the senses? We can’t just look at it; we need an algorithm to identify the clusters. The authors use Correlation Clustering.

The goal of this algorithm is to group nodes such that:

  1. Nodes within the same cluster are connected by high (positive) weights.
  2. Nodes in different clusters are connected by low (negative) weights.

To do this, the algorithm transforms the 1–4 human ratings. They set a threshold (\(h=2.5\)).

  • Ratings \(> 2.5\) become positive weights (attraction).
  • Ratings \(< 2.5\) become negative weights (repulsion).

The clustering algorithm tries to minimize the Sum of Weighted Disagreements (SWD). A “disagreement” happens if two positively connected nodes are put in different clusters, or if two negatively connected nodes are put in the same cluster.

\[
\mathit{SWD}(C) = \sum_{e \in \phi_{E,C}} W'(e) \;+\; \sum_{e \in \psi_{E,C}} \lvert W'(e) \rvert
\]

In this equation:

  • The first sum represents the penalty for separating nodes that should be together (positive edges across clusters).
  • The second sum represents the penalty for grouping nodes that should be apart (negative edges within clusters).

By minimizing this value, the algorithm automatically detects how many senses (clusters) a word has, without needing a dictionary.
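To make the objective concrete, below is a minimal sketch that scores a candidate clustering under this criterion. The toy ratings and the shift by \(h = 2.5\) are assumptions for illustration; the actual optimization searches over many candidate clusterings rather than scoring a single one.

```python
# Median DURel ratings for each annotated pair (toy data).
ratings = {
    ("u1", "u2"): 4.0,  # clearly the same sense
    ("u1", "u3"): 1.0,  # clearly different senses
    ("u2", "u3"): 2.0,
    ("u3", "u4"): 3.5,
}

H = 2.5  # threshold: ratings above attract, ratings below repel

def swd(clustering):
    """Sum of Weighted Disagreements for one candidate clustering
    (a dict mapping each use to a cluster ID)."""
    penalty = 0.0
    for (a, b), rating in ratings.items():
        w = rating - H  # shifted weight W'(e)
        same_cluster = clustering[a] == clustering[b]
        if w > 0 and not same_cluster:
            penalty += w        # positive edge split across clusters
        elif w < 0 and same_cluster:
            penalty += abs(w)   # negative edge kept inside one cluster
    return penalty

# Separating u3/u4 from u1/u2 incurs no penalty; merging everything does.
print(swd({"u1": 0, "u2": 0, "u3": 1, "u4": 1}))  # 0.0
print(swd({"u1": 0, "u2": 0, "u3": 0, "u4": 0}))  # 2.0
```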

3. Measuring Semantic Change

Once the clusters are defined, the researchers can measure how a word changes over time. They look at the Cluster Frequency Distribution (\(D\)).

\[
D = \big(f(L_1), f(L_2), \dots, f(L_i)\big)
\]

This distribution simply counts how often the word appears in each cluster (sense) during a specific time period.

To quantify the change between two time periods (say, 1800s vs. 1990s), they compare the distributions (\(P\) and \(Q\)) using the Jensen-Shannon Distance (JSD):

\[
\mathit{JSD}(P, Q) = \sqrt{\frac{D(P \,\|\, M) + D(Q \,\|\, M)}{2}}
\]

where \(M\) is the average distribution \(\tfrac{1}{2}(P + Q)\) and \(D\) is the Kullback-Leibler divergence.

A high JSD score means the distributions are very different—indicating that the word has undergone significant semantic change.
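As a sketch of how this score can be computed with off-the-shelf tools: scipy's jensenshannon already returns the distance (the square root above), and the cluster counts below are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy cluster frequency distributions for one word in two time periods:
# how many uses fall into each sense cluster.
counts_1800s = np.array([95, 5, 0])    # almost only sense 1
counts_1990s = np.array([40, 10, 50])  # a new sense 3 has appeared

# Normalize the counts into the probability distributions P and Q.
P = counts_1800s / counts_1800s.sum()
Q = counts_1990s / counts_1990s.sum()

# Jensen-Shannon distance; base 2 keeps the score in [0, 1].
print(jensenshannon(P, Q, base=2))  # high value -> strong semantic change
```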

Extending the Dataset

The original SemEval dataset (DWUG V1) was created in four rounds of annotation. The researchers in this paper extended this effort significantly to create DWUG V2 and V3.

They added Round 5 and Round 6, employing specific sampling strategies:

  • The Unconnected Heuristic: They specifically targeted clusters that had not yet been compared to one another. This helps bridge gaps in the graph.
  • The Random Heuristic: Sampling edges completely at random to avoid bias.

In total, they added thousands of judgments across English (EN), German (DE), and Swedish (SV).
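A rough sketch of what these two sampling strategies might look like on a networkx graph is given below; the helper names, the cluster mapping, and the sampling details are assumptions for illustration, not the authors' exact procedure.

```python
import random
from itertools import combinations
import networkx as nx

def random_heuristic(wug: nx.Graph, k: int):
    """Sample k not-yet-annotated pairs of uses uniformly at random."""
    candidates = [p for p in combinations(wug.nodes, 2) if not wug.has_edge(*p)]
    return random.sample(candidates, min(k, len(candidates)))

def unconnected_heuristic(wug: nx.Graph, clusters: dict, k: int):
    """Prefer pairs of uses whose clusters have no edge between them yet."""
    linked = {tuple(sorted((clusters[a], clusters[b]))) for a, b in wug.edges}
    candidates = [
        (a, b)
        for a, b in combinations(wug.nodes, 2)
        if not wug.has_edge(a, b)
        and clusters[a] != clusters[b]
        and tuple(sorted((clusters[a], clusters[b]))) not in linked
    ]
    return random.sample(candidates, min(k, len(candidates)))
```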

The “Resampled” Dataset

To test reproducibility, they also created a completely new dataset from scratch. They selected 15 words per language, sampled new sentences (uses) from the source corpora, and annotated them. Crucially, this “Resampled” dataset used fewer nodes (50 uses instead of ~180) but annotated them much more densely. This allows for a comparison: Is it better to have many uses with few connections, or fewer uses with many connections?

Experiments & Results

The paper asks three critical questions about the WUG paradigm: Validity, Robustness, and Replicability.

1. Validity: Do More Rounds Improve Quality?

Does adding more annotation rounds actually make the clusters “truer”? To test this, the researchers compared their automatically derived clusters against a “Gold Standard”—traditional Use-Sense annotations where experts manually assigned dictionary definitions to the sentences.

They used the Adjusted Rand Index (ARI) to measure the agreement between the WUG clusters and the human gold standard.

\[
\mathit{ARI} = \frac{\mathit{RI} - \mathit{Expected}_{\mathit{RI}}}{\max(\mathit{RI}) - \mathit{Expected}_{\mathit{RI}}}
\]
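In practice this comparison is a single call in scikit-learn; the labels below are toy values standing in for gold sense tags and WUG cluster IDs.

```python
from sklearn.metrics import adjusted_rand_score

# Gold sense labels (use-sense annotation) and WUG cluster IDs for the
# same ten uses of a word; all values are toy data.
gold_senses  = ["limb", "limb", "limb", "sea", "sea", "limb", "sea", "limb", "sea", "sea"]
wug_clusters = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]  # last use ends up in the wrong cluster

print(adjusted_rand_score(gold_senses, wug_clusters))  # ~0.6: good but imperfect agreement
```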

The results were clear:

Figure 2: Left: ARI of DWUG DE clusters after each round vs. DWUG DE sense annotation. Spreads indicate variation over lemmas (N=19); only lemmas appearing in all WUG datasets and the sense annotation dataset are included. Right: ARI of DWUG DE/EN/SV clusters vs. V3. Spreads indicate variation over lemmas. Only lemmas that were annotated in each round are included. Uses that were assigned to the noise cluster in round 6 were excluded from the ARI computation.

In the plot above (Left), we see the German dataset (DE). The ARI score increases steadily from Round 1 to Round 5. This indicates that the earlier, sparser graphs were not capturing the full picture. The clustering improves significantly as more edges are added.

The plot on the Right compares previous rounds against the final Version 3 (V3) dataset. Across all languages (DE, EN, SV), the earlier rounds show lower agreement with the final result, confirming that the additional annotation effort was necessary to reach a stable state.

2. Robustness: Handling Noise

Real-world data is messy. Annotators make mistakes. The researchers tested the robustness of their graphs by injecting random “noise” (incorrect annotations) into the data and seeing if the clustering algorithm fell apart.
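Below is a minimal sketch of one way such a perturbation test could be set up; replacing a sampled edge's rating with a uniformly random value on the 1–4 scale is an assumption about the noise model, and re-clustering plus ARI comparison would follow as in the earlier steps.

```python
import random
import networkx as nx

def perturb_edges(wug: nx.Graph, fraction: float, seed: int = 0) -> nx.Graph:
    """Return a copy of the WUG in which a given fraction of edge ratings
    is replaced by random values on the 1-4 DURel scale (simulated noise)."""
    rng = random.Random(seed)
    noisy = wug.copy()
    edges = list(noisy.edges)
    for a, b in rng.sample(edges, int(round(fraction * len(edges)))):
        noisy[a][b]["weight"] = rng.choice([1, 2, 3, 4])
    return noisy

# e.g. re-cluster perturb_edges(wug, 0.4) and compare against the clean
# clustering with ARI, as in Figure 3.
```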

Figure 3: Robustness: ARI scores computed with respect to increasing percentages of noisy edges. The right y-axis (in red) shows the raw number of noisy edges; the x-axis shows the percentage of perturbed edges.

The results (Figure 3) reveal a crucial insight about graph density.

  • V1 (Blue lines): The original, sparse datasets are very fragile. Even a small percentage of noise causes the ARI (accuracy) to drop sharply.
  • Resampled (Bottom row): Look at the stability here. The “Resampled” datasets have fewer nodes but are much more densely connected. Even when 40% of the edges are noise, the clustering structure holds up reasonably well (especially in German and Swedish).

Takeaway: A smaller, densely connected graph is more robust than a large, sparsely connected one.

3. Replicability and Convergence

Finally, the authors analyzed how quickly the semantic change scores converged. If you are analyzing a word, how much data do you actually need before you know if the word has changed meaning?

The researchers found that this depends heavily on the “entropy” of the word—essentially, whether it is monosemous (one meaning) or polysemous (multiple meanings).
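Here, entropy is just the entropy of a word's cluster frequency distribution; the sketch below uses invented distributions for a monosemous and a polysemous word to show the contrast.

```python
import numpy as np
from scipy.stats import entropy

# Cluster frequency distributions (toy data): nearly all uses in one cluster
# (monosemous) vs. uses spread across several clusters (polysemous).
monosemous = np.array([98, 1, 1]) / 100
polysemous = np.array([40, 35, 25]) / 100

print(entropy(monosemous, base=2))  # ~0.16 bits: low entropy
print(entropy(polysemous, base=2))  # ~1.56 bits: high entropy
```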

Figure 7: Approximation of the final semantic change score in the resampled datasets considering an increasing percentage of edges. The y-axis shows the absolute difference in the change score computed at each step.

Figure 7 shows the error rate (y-axis) as more edges are added (x-axis).

  • Blue Line (Low Entropy/Monosemous): The error drops to near zero very quickly. You don’t need much data to confirm a word hasn’t changed.
  • Orange Line (High Entropy/Polysemous): These words are harder. The error stays higher for longer. You need a lot more annotation data to accurately map the complex shifts in meaning for words with multiple senses.

Conclusion and Implications

The “More DWUGs” paper is a significant step forward for the field of computational linguistics. It transitions the Word Usage Graph method from an experimental novelty to a rigorously validated scientific tool.

Key Takeaways:

  1. More is Better: Extending the datasets with rounds 5 and 6 significantly improved cluster validity compared to human gold standards.
  2. Density Over Size: The experiments suggest that future annotation efforts should sacrifice the number of uses in favor of the number of edges. A dense graph of 50 sentences is more scientifically useful than a sparse graph of 200.
  3. Reliability: The updated datasets (V3) provide a reliable ground truth for training and evaluating AI models on semantic change.

For students and researchers, this paper highlights the importance of looking under the hood of dataset creation. It shows that “big data” isn’t always enough; the structure of that data—how well it is connected—determines whether we are seeing a true signal or just noise.