Introduction
In the world of Natural Language Processing (NLP), word embeddings are the bedrock of modern semantic understanding. From the early days of Word2Vec to the transformer revolution, the core idea remains the same: we map words to dense vectors of real numbers. If the vector for “king” minus “man” plus “woman” lands closest to “queen,” we celebrate the model’s ability to capture meaning.
However, there is a fundamental flaw in how we traditionally use these embeddings in computational social science and downstream tasks. We treat them as point estimates. When a model tells us that the vector for “politics” is at coordinates \([0.23, -0.91, \dots]\), we accept this precise location as absolute truth.
But intuitively, we know this cannot be the full story. If a model learns the word “the” from millions of examples, it should be very confident in that vector’s location. Conversely, if it learns a rare word like “defenestration” from only a handful of contexts, the resulting vector should come with a massive grain of salt. Standard embedding models like GloVe do not provide this “grain of salt.” They give us the position, but not the uncertainty.
This lack of statistical rigor becomes dangerous when we use embeddings to make scientific claims about bias, historical linguistic shifts, or semantic similarity. Are two words actually similar, or is it just noise in the training data?
In this post, we will take a deep dive into GloVe-V, a method introduced by Vallebueno et al. (2024) that brings statistical uncertainty to the popular GloVe embedding model. We will explore how they reformulated the mathematics of GloVe to derive variance estimates, and why this matters for everything from finding synonyms to detecting gender bias.
Background: The Deterministic Nature of GloVe
To understand GloVe-V, we first need to revisit how the original GloVe (Global Vectors for Word Representation) model works.
GloVe is a count-based model. It relies on a global co-occurrence matrix \(\mathbf{X}\), where an entry \(X_{ij}\) represents how many times word \(j\) appears in the context of word \(i\). The goal of GloVe is to learn word vectors whose dot product, together with the bias terms, approximates the logarithm of the co-occurrence count.
The standard objective function for GloVe is to minimize the following weighted least squares cost:
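\[
J = \sum_{i,j:\,X_{ij} > 0} f(X_{ij}) \left( \mathbf{w}_i^\top \mathbf{v}_j + b_i + c_j - \log X_{ij} \right)^2
\]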
Here is a breakdown of the components:
- \(\mathbf{w}_i\) and \(\mathbf{v}_j\): The “center” and “context” word vectors we want to learn.
- \(b_i\) and \(c_j\): Scalar bias terms for the respective words.
- \(X_{ij}\): The raw count of co-occurrences.
- \(f(X_{ij})\): A weighting function that caps the influence of very frequent co-occurrences and down-weights rare, noisy ones.
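The components above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the values suggested in the original GloVe paper rather than anything specified in this post.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, capped at 1 above."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, V, b, c):
    """Weighted least squares cost over the nonzero entries of X.

    X: (n, n) co-occurrence counts; W, V: (n, D) center/context vectors;
    b, c: (n,) center/context biases.
    """
    i, j = np.nonzero(X)                              # observed co-occurrences only
    pred = np.sum(W[i] * V[j], axis=1) + b[i] + c[j]  # w_i . v_j + b_i + c_j
    resid = pred - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * resid ** 2)
```

Restricting the sum to nonzero counts avoids taking the logarithm of zero; those entries would receive zero weight anyway, since \(f(0) = 0\).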
Typically, this optimization is solved using stochastic gradient descent. Once the training is done, you get a single vector \(\mathbf{w}_i\) for every word. If you ran the training again on slightly different data, or even with a different random seed, you might get a different vector. Yet, in application, we act as if that single \(\mathbf{w}_i\) is the only possible representation of the word.
This is the gap GloVe-V aims to fill. The researchers propose a way to estimate the variance of these vectors from the model’s reconstruction error, effectively turning a point estimate into a probability distribution.
The Core Method: Deriving GloVe-V
The brilliance of GloVe-V lies in how it reinterprets the existing GloVe machinery rather than inventing a completely new architecture. The authors realized that if you hold certain parts of the model fixed, the math transforms into a statistically tractable form.
1. Reformulating the Optimization
The authors start by rewriting the GloVe optimization problem in matrix form. The original problem is a weighted low-rank approximation. They propose solving it in a specific two-step conceptual manner to expose the statistical properties.
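\[
\min_{\mathbf{V},\,\mathbf{b},\,\mathbf{c}} \;\; \min_{\mathbf{W}} \;\; \sum_{i} \sum_{j:\,X_{ij} > 0} f(X_{ij}) \left( \mathbf{w}_i^\top \mathbf{v}_j + b_i + c_j - \log X_{ij} \right)^2
\]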
This equation looks intimidating, but it describes a “block coordinate descent” approach.
- Outer Minimization: Find the optimal bias terms (\(\mathbf{b}, \mathbf{c}\)) and context vectors (\(\mathbf{V}\)).
- Inner Minimization: Holding those variables fixed at their optimal values, find the optimal center word vectors (\(\mathbf{W}\)).
By assuming the context vectors (\(\mathbf{V}^*\)) and bias terms are fixed at their globally optimal values, the problem of finding a specific word vector \(\mathbf{w}_i\) decouples from the rest of the vocabulary. It becomes a series of independent weighted least squares problems.
The solution for a single optimal word vector \(\mathbf{w}_i^*\) can then be written analytically:
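\[
\mathbf{w}_i^* = \left( \mathbf{V}_{\mathcal{K}}^{*\top} \mathbf{D}_{\mathcal{K}} \mathbf{V}_{\mathcal{K}}^{*} \right)^{-1} \mathbf{V}_{\mathcal{K}}^{*\top} \mathbf{D}_{\mathcal{K}} \left( \log \mathbf{x}_i - b_i^* \mathbf{1} - \mathbf{c}_{\mathcal{K}}^* \right)
\]
Here \(\mathcal{K}\) is the set of context words that co-occur with word \(i\), \(\mathbf{V}_{\mathcal{K}}^*\) stacks their optimal context vectors as rows, \(\mathbf{c}_{\mathcal{K}}^*\) collects their biases, \(\mathbf{D}_{\mathcal{K}}\) is the diagonal matrix of weights \(f(X_{ij})\), and \(\log \mathbf{x}_i\) is the vector of log co-occurrence counts.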
This equation tells us that the word vector is essentially a projection of the log co-occurrences onto the subspace defined by the context vectors.
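To make the weighted projection concrete, here is a minimal sketch that solves the weighted least squares problem for a single word via the normal equations. The function name and argument layout are illustrative assumptions, not the authors’ code; it takes the context vectors and biases as given, as the text above describes.

```python
import numpy as np

def solve_word_vector(V_K, d_K, log_x, b_i, c_K):
    """Weighted least squares solution for one center word vector.

    V_K   : (K, D) context vectors of the K words co-occurring with word i
    d_K   : (K,)   weights f(X_ij), i.e. the diagonal of D_K
    log_x : (K,)   log co-occurrence counts for word i
    b_i   : float  bias of word i
    c_K   : (K,)   biases of the context words
    """
    y = log_x - b_i - c_K                  # bias-adjusted targets
    A = V_K.T @ (d_K[:, None] * V_K)       # V^T D V
    rhs = V_K.T @ (d_K * y)                # V^T D y
    return np.linalg.solve(A, rhs)         # solve A w = rhs without forming A^{-1}
```

Because each word only involves its own row of the co-occurrence matrix, this solve can be run independently for every word in the vocabulary.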
2. The Probabilistic Model
This is the crucial pivot point. Once the authors established the analytical form of the word vector (above), they could work backward to define a probabilistic model that justifies it.
They posit that the log co-occurrences (\(\log \mathbf{x}_i\)) are generated from a Multivariate Normal Distribution.
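\[
\log \mathbf{x}_i = b_i^* \mathbf{1} + \mathbf{c}_{\mathcal{K}}^* + \mathbf{V}_{\mathcal{K}}^* \mathbf{w}_i + \mathbf{e}_i, \qquad \mathbf{e}_i \sim \mathcal{N}\left( \mathbf{0},\; \sigma_i^2 \mathbf{D}_{\mathcal{K}}^{-1} \right)
\]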
In this model:
- The “mean” is determined by the dot product of the word and context vectors (plus biases).
- The “noise” or error term \(\mathbf{e}_i\) follows a normal distribution centered at zero.
- The variance of this noise is scaled by the inverse of the weights \(\mathbf{D}_{\mathcal{K}}^{-1}\) and a word-specific reconstruction error \(\sigma_i^2\).
This effectively treats the observed co-occurrence counts as “noisy” observations of a true underlying semantic relationship.
3. Estimating the Variance
If the word vector \(\mathbf{w}_i\) is the result of estimation on this noisy data, then \(\mathbf{w}_i\) itself is a random variable with its own covariance matrix. Using standard statistical theory for weighted least squares, the authors derive the covariance matrix \(\boldsymbol{\Sigma}_i\) for a word \(i\):
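\[
\boldsymbol{\Sigma}_i = \sigma_i^2 \left( \mathbf{V}_{\mathcal{K}}^{*\top} \mathbf{D}_{\mathcal{K}} \mathbf{V}_{\mathcal{K}}^{*} \right)^{-1}
\]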
This matrix \(\boldsymbol{\Sigma}_i\) defines the shape of the uncertainty around the word vector. It creates an ellipsoid in the vector space.
- If the reconstruction error is low and the word is well-supported by context, the “cloud” of uncertainty is small.
- If the data is sparse or noisy, the cloud is large.
To actually calculate this, we need the scalar reconstruction error variance \(\sigma_i^2\). This can be estimated empirically from the data using a “plug-in” estimator:
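\[
\hat{\sigma}_i^2 = \frac{1}{|\mathcal{K}| - D} \sum_{j \in \mathcal{K}} f(X_{ij}) \left( \log X_{ij} - b_i^* - c_j^* - \mathbf{w}_i^{*\top} \mathbf{v}_j^* \right)^2
\]
This formula effectively sums up the squared errors (the difference between the model’s prediction and the actual log count), weighted by \(f(X_{ij})\), and normalized by the number of observations (\(|\mathcal{K}|\)) minus the vector dimensions (\(D\)).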
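The variance calculation can be sketched as follows, assuming the standard weighted least squares result that the covariance is the reconstruction variance times the inverse of \(\mathbf{V}^\top \mathbf{D} \mathbf{V}\). The function name is hypothetical and this is an illustration of the idea rather than the authors’ implementation.

```python
import numpy as np

def glove_v_covariance(V_K, d_K, log_x, b_i, c_K):
    """Return (w_i, Sigma_i, sigma2_i) for one word under the WLS model.

    V_K: (K, D) context vectors; d_K: (K,) weights f(X_ij);
    log_x: (K,) log counts; b_i: word bias; c_K: (K,) context biases.
    """
    K, D = V_K.shape
    y = log_x - b_i - c_K
    A = V_K.T @ (d_K[:, None] * V_K)             # V^T D V
    w_i = np.linalg.solve(A, V_K.T @ (d_K * y))  # WLS point estimate
    resid = y - V_K @ w_i                        # reconstruction errors
    sigma2 = np.sum(d_K * resid ** 2) / (K - D)  # plug-in variance estimate
    Sigma = sigma2 * np.linalg.inv(A)            # (D, D) covariance of w_i
    return w_i, Sigma, sigma2
```

Note that the estimator requires more co-occurring contexts than dimensions (\(|\mathcal{K}| > D\)); otherwise the denominator is zero or negative and the estimate is undefined.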
Visualizing the Method
To tie all these mathematical steps together, the authors provide a conceptual diagram that contrasts the original GloVe approach with GloVe-V.

The diagram contrasts the two approaches:
- GloVe (Top): Focuses on finding the vector that minimizes the difference between the dot product and the log count.
- GloVe-V (Bottom): Uses the reconstruction error from that minimization to construct a Normal distribution around the vector.
This method assumes that the rows of the co-occurrence matrix are conditionally independent given the optimal context vectors. While this is a simplifying assumption, it is what allows GloVe-V to be computationally scalable to large vocabularies—a massive advantage over computationally expensive methods like bootstrapping.
Experiments and Analysis
So, we have a mathematical way to assign a “cloud of uncertainty” to every word vector. What does this actually look like in practice? The authors trained GloVe-V on the Corpus of Historical American English (COHA) to demonstrate the utility of their method.
1. Visualizing Word Uncertainty
The most immediate result is that we can now “see” the uncertainty. By projecting the 300-dimensional vectors and their covariance matrices into 2D, we can plot the confidence ellipses for different words.
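As a rough sketch of how such an ellipse can be computed, suppose the 2D view is a linear projection with a \(2 \times D\) matrix `P` (for example, two principal component directions). That projection choice is an assumption made here for illustration; the post does not say which projection the authors used. Under a linear map, the covariance transforms as \(\mathbf{P} \boldsymbol{\Sigma} \mathbf{P}^\top\).

```python
import numpy as np

def confidence_ellipse_2d(w, Sigma, P, chi2_q=5.991):
    """Project (w, Sigma) through a 2 x D linear map P and describe the ellipse.

    chi2_q = 5.991 is the 95% quantile of a chi-square with 2 degrees of freedom.
    Returns the 2D center, the semi-axis lengths (ascending), and a 2 x 2 matrix
    whose columns are the corresponding axis directions.
    """
    center = P @ w
    Sigma_2d = P @ Sigma @ P.T                  # covariance of the projected vector
    evals, evecs = np.linalg.eigh(Sigma_2d)     # eigenvalues in ascending order
    semi_axes = np.sqrt(chi2_q * evals)
    return center, semi_axes, evecs
```

The returned center, semi-axes, and directions are enough to draw the ellipse with any plotting library.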

In Figure 2, notice the contrast between words like “she” and “large” versus “rigs” and “illumination.”
- “she”: A high-frequency word. The model has seen it in thousands of contexts. The ellipse is tiny, effectively a point. We are very certain about its location.
- “illumination”: A lower-frequency word. The ellipse is massive. The model knows roughly where it is, but the statistical uncertainty implies it could be anywhere within that region.
This relationship is not random. There is a strong, predictable correlation between word frequency and variance.

Figure 3 confirms that as word frequency increases (x-axis), the variance (y-axis) drops significantly. This validates the intuition that data sparsity drives uncertainty.
2. GloVe-V vs. Document Bootstrap
A common critique might be: “Why not just bootstrap?” Bootstrapping involves resampling the documents in your corpus 100 times, training 100 different models, and measuring the variance.
The authors argue that document bootstrapping measures something different: sampling variability (uncertainty due to which documents were included). GloVe-V measures reconstruction uncertainty (uncertainty due to sparsity in word co-occurrences).

As Figure 4 shows, the two methods yield different standard errors for cosine similarity. GloVe-V (red line) tends to report higher uncertainty, particularly for lower frequency words.
- Efficiency: Calculating GloVe-V requires training one model and running one analytical pass. Bootstrapping requires training \(N\) models (e.g., 100x the compute).
- Granularity: GloVe-V captures the specific noise of a word’s co-occurrence profile, whereas document bootstrapping can sometimes underestimate variance if a rare word happens to appear consistently in the few documents it inhabits.
3. Application: Nearest Neighbors
One of the most common tasks in NLP is finding the “nearest neighbors” of a word—usually to find synonyms or related concepts. We calculate the cosine similarity between “doctor” and a list of candidates and rank them.
But is the #1 match actually statistically different from the #2 match?

Figure 5 illustrates this perfectly. The points represent the standard cosine similarity estimates. “Surgeon” is the closest neighbor to “doctor,” followed by “dentist.” However, look at the error bars provided by GloVe-V. The 95% confidence intervals for “surgeon,” “dentist,” and “psychiatrist” overlap significantly.
- Conclusion: We cannot reject the null hypothesis that “surgeon” and “dentist” are equally similar to “doctor,” so the ordering between them is not statistically established.
- Significance: This suggests that some “rankings” reported from embeddings may be spurious. Without variance, we risk reading too much into noise.
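One standard way to attach error bars like these to a cosine similarity is the delta method: linearize the cosine around the point estimates and push each word’s covariance through the gradient. The sketch below assumes the two word vectors are independent, in line with the row-independence assumption discussed earlier, and the function name is hypothetical. It illustrates the general technique and is not necessarily the exact procedure used in the paper.

```python
import numpy as np

def cosine_with_se(a, Sigma_a, b, Sigma_b):
    """Cosine similarity of a and b with a delta-method standard error.

    Sigma_a, Sigma_b are the (D, D) covariance matrices of the two vectors,
    which are treated as independent of each other.
    """
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = a @ b / (na * nb)
    g_a = b / (na * nb) - cos * a / na ** 2   # gradient of cos w.r.t. a
    g_b = a / (na * nb) - cos * b / nb ** 2   # gradient of cos w.r.t. b
    var = g_a @ Sigma_a @ g_a + g_b @ Sigma_b @ g_b
    return cos, np.sqrt(var)
```

An approximate 95% interval is then `cos ± 1.96 * se`. Two neighbors whose intervals overlap heavily should not be treated as confidently ranked.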
4. Application: Bias Detection
Perhaps the most impactful application of GloVe-V is in the study of societal biases in text. Researchers often use word embeddings to measure gender or ethnic bias by comparing the distance between demographic words (e.g., surnames) and attribute words (e.g., “career” vs “family”).
A major issue in this field is relying on specific lists of words (lexicons). If a researcher includes rare surnames to be more “inclusive,” they might introduce massive noise.

Figure 7 (left panel) shows a measurement of anti-Asian bias using surnames.
- The points show bias scores for different subsets of surnames based on frequency.
- Rare surnames (Q1) show a very different mean bias than frequent surnames (Q4).
- If a researcher only used the most frequent names (like “Gandhi” or “Mao,” which appear often in history corpora), they would calculate a high bias score (above the zero line).
- However, the shaded gray area shows the GloVe-V uncertainty for the entire list. It spans across zero.
- Takeaway: When accounting for uncertainty, the evidence for bias in this specific corpus/timeframe is much weaker than a point estimate using famous names would suggest. GloVe-V allows researchers to integrate all names, weighted by their certainty, rather than arbitrarily cutting lists.
Figure 7 (right panel) compares different types of gender bias (Science vs. Arts, etc.). While the point estimates differ, the error bars allow us to see which differences are statistically significant (p-values) and which are not.
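For more complicated statistics such as a bias score, a simple alternative to an analytic derivation is Monte Carlo: draw every word vector from its normal distribution, recompute the score, and read off percentiles. The sketch below uses a generic bias score (mean cosine similarity to one attribute list minus the other) chosen for illustration. The score definition, function names, and dictionary layout are all assumptions and may differ from what the authors compute.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def bias_score(vecs, targets, attr_a, attr_b):
    """Mean over targets of (mean cosine to attr_a) - (mean cosine to attr_b)."""
    scores = []
    for t in targets:
        sim_a = np.mean([cosine(vecs[t], vecs[w]) for w in attr_a])
        sim_b = np.mean([cosine(vecs[t], vecs[w]) for w in attr_b])
        scores.append(sim_a - sim_b)
    return float(np.mean(scores))

def bias_interval(means, covs, targets, attr_a, attr_b, n_draws=1000, seed=0):
    """Point estimate and 95% percentile interval for the bias score.

    means: dict word -> (D,) vector; covs: dict word -> (D, D) covariance.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    for k in range(n_draws):
        # Resample every word vector independently from N(mean, covariance).
        sample = {w: rng.multivariate_normal(means[w], covs[w]) for w in means}
        draws[k] = bias_score(sample, targets, attr_a, attr_b)
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return bias_score(means, targets, attr_a, attr_b), float(lo), float(hi)
```

If the resulting interval spans zero, the sign of the bias is not established at the 95% level, however large the point estimate looks.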
Conclusion and Implications
The introduction of GloVe-V marks a significant step toward “scientific” NLP. For too long, the field has operated on the assumption that vector representations are exact. As we have seen, this assumption fails exactly where it matters most: with rare words and subtle semantic distinctions.
By reformulating the GloVe objective to expose its probabilistic nature, the authors have provided a tool that is:
- Scalable: It does not require retraining the model hundreds of times.
- Principled: It is derived directly from the reconstruction error of the matrix factorization.
- Actionable: It allows for hypothesis testing (\(p\)-values) on downstream tasks like similarity ranking and bias auditing.
For students and researchers, the lesson is clear: Look at the variance. If you are building a system that relies on the distance between word vectors, you must ask how reliable those vectors are. With GloVe-V, you no longer have to guess.
Future work may extend these probabilistic derivations to other architectures, such as Transformer-based contextual embeddings (BERT, GPT), which currently suffer from the same “point estimate” blind spot. Until then, GloVe-V offers a robust template for how to think about uncertainty in vector space.