If you have ever caught yourself finishing someone else’s sentence, you intuitively understand that language processing is predictive. When we read or listen, we don’t just passively receive words; our brains actively anticipate what comes next based on the context.
In the field of psycholinguistics, this phenomenon is formalized as Surprisal Theory. The core tenet is simple yet powerful: the processing effort required for a word (often measured by how long our eyes linger on it) is proportional to its “surprisal”—or how unexpected it is given the preceding context. A highly predictable word is processed quickly; a surprising word causes a stutter in our cognitive flow, leading to longer reading times.
For years, this theory has been the bedrock of reading time prediction, supported by studies using everything from N-gram models to modern Transformers. But a new paper by Opedal et al., titled “On the Role of Context in Reading Time Prediction,” challenges us to take a closer look.
The researchers identify a critical confounding factor: Frequency. Rare words tend to be surprising, and common words tend to be predictable. If we don’t carefully disentangle the inherent frequency of a word from its contextual predictability, are we overestimating the brain’s reliance on context?
In this deep dive, we will explore the mathematical framework proposed by the authors to separate these variables. We will look at how they use Hilbert space projections to create “orthogonalized” predictors and what their empirical results tell us about how humans actually process language.
Part 1: The Usual Suspects of Reading Difficulty
Before we dismantle the current models, we need to understand the variables at play. When a researcher tries to predict how long a reader will stare at the word “apple” in a sentence, they typically rely on two main probabilistic concepts derived from Language Models (LMs).
1. Frequency (Unigram Surprisal)
The most fundamental property of a word is how often it appears in the language, regardless of context. “The” is frequent; “defenestration” is not.
In information theory, we often express frequency as unigram surprisal. If \(q_H\) is a unigram language model (a model that treats words as independent events), the unigram surprisal is the negative log probability of the word:
\[ v_H(\bar{u}) \triangleq -\log q_H(\bar{u}) \]
In this paper, the authors refer to this quantity simply as frequency. It is well-established that humans read high-frequency words faster than low-frequency words. This is a “context-free” effect.
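To make this concrete, here is a minimal sketch of unigram surprisal computed from raw counts. A toy corpus stands in for a real one; a serious estimate would need a large corpus and smoothing.

```python
import math
from collections import Counter

# A minimal sketch of unigram surprisal from raw counts. A toy corpus
# stands in for a real one; a serious estimate needs scale and smoothing.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_surprisal(word):
    """Frequency expressed in information-theoretic terms: -log q(word)."""
    return -math.log(counts[word] / total)

print(unigram_surprisal("the"))  # frequent word, low surprisal
print(unigram_surprisal("rug"))  # rare word, higher surprisal
```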
2. Contextual Surprisal
This is the star of the show. Contextual surprisal measures how unexpected a word is given the history of words that came before it. If \(p_H\) is a human-like language model, the surprisal of a unit \(\bar{u}\) given context \(\mathbf{c}\) is:
\[ \iota_H(\bar{u} \mid \mathbf{c}) \triangleq -\log p_H(\bar{u} \mid \mathbf{c}) \]
Surprisal theory posits that reading time is an affine function (a linear relationship with an intercept) of this value.
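As an illustration, here is a sketch of contextual surprisal using GPT-2 via the Hugging Face transformers library as a stand-in for \(p_H\); the paper's argument applies to any autoregressive language model.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# A sketch of contextual surprisal, using GPT-2 as a stand-in for the
# human-like model p_H. Any autoregressive LM would work the same way.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def contextual_surprisal(context: str, word: str) -> float:
    """-log p(word | context), summed over the word's subword tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(word_ids.shape[1]):
        pos = ctx_ids.shape[1] + i - 1   # logits at pos predict token pos+1
        tok = ids[0, pos + 1]
        total -= log_probs[0, pos, tok].item()
    return total

print(contextual_surprisal("I take my coffee with milk and", "sugar"))   # low
print(contextual_surprisal("I take my coffee with milk and", "gravel"))  # high
```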
The Problem: Collinearity
Here lies the problem. Frequency and Contextual Surprisal are not independent.
- Correlation: Rare words (high unigram surprisal) are often hard to predict in context (high contextual surprisal).
- Collinearity: In linear regression models, when two predictor variables are highly correlated, it becomes difficult to determine which variable is actually driving the effect.
If you build a model that says “Contextual Surprisal predicts reading time,” but Contextual Surprisal is largely Frequency in disguise, have you proven that context matters, or just re-proven that frequency matters?
Part 2: A New Perspective—Pointwise Mutual Information
To tackle this, the authors introduce a third character to the story: Pointwise Mutual Information (PMI).
PMI measures the association between a word and its context. It asks: “How much more likely is this word in this specific context compared to its general frequency?”
Mathematically, PMI is defined as:
\[ \mu_H(\bar{u}; \mathbf{c}) \triangleq \log \frac{p_H(\bar{u} \mid \mathbf{c})}{q_H(\bar{u})} \]
This looks complex, but it simplifies beautifully. PMI is actually just the difference between Frequency and Surprisal:
\[ \mu_H(\bar{u}; \mathbf{c}) = v_H(\bar{u}) - \iota_H(\bar{u} \mid \mathbf{c}) \]
This identity (Equation 10) is crucial. It reveals that Frequency (\(v_H\)), Surprisal (\(\iota_H\)), and PMI (\(\mu_H\)) are linearly dependent. You can construct any one of them if you have the other two.
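Because the identity is exact, you can verify it numerically with any pair of probabilities. A quick sanity check with hypothetical numbers:

```python
import math

# Sanity check of Equation 10 with arbitrary (hypothetical) probabilities:
# PMI equals frequency minus surprisal, exactly.
p_w_given_c = 0.20            # contextual probability p_H(w | c)
q_w = 0.01                    # unigram probability q_H(w)

frequency = -math.log(q_w)                # v_H(w)
surprisal = -math.log(p_w_given_c)        # iota_H(w | c)
pmi = math.log(p_w_given_c / q_w)         # mu_H(w; c)

assert math.isclose(pmi, frequency - surprisal)
print(frequency, surprisal, pmi)  # ~4.61, ~1.61, ~3.00
```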
The Linear Model Equivalence
Why does this matter for research? Many studies use linear regression to validate surprisal theory. They set up a model where:
\[ \text{Reading Time} \approx \text{Frequency} + \text{Surprisal} \]
The authors prove that, because of the linear relationship shown above, this model is mathematically equivalent to:
\[ \text{Reading Time} \approx \text{Frequency} + \text{PMI} \]
This means that any empirical evidence supporting Surprisal theory is also evidence supporting a “PMI theory” (an association-based view of processing). The coefficients change, but the predictive power remains exactly the same. We cannot distinguish between “prediction” (Surprisal) and “association” (PMI) using standard linear models that control for frequency.
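We can see this equivalence directly on made-up data: swapping Surprisal for PMI is just a linear change of variables, so the fitted \(R^2\) comes out identical. A minimal sketch with synthetic numbers, not the paper's data:

```python
import numpy as np

# The (freq, surp) regression and the (freq, pmi) regression span the
# same column space, so their R^2 must match.
rng = np.random.default_rng(0)
n = 1000
freq = rng.normal(8, 2, n)                    # unigram surprisal
surp = 0.6 * freq + rng.normal(0, 1, n)       # correlated with frequency
pmi = freq - surp                             # the identity from Equation 10
rt = 200 + 10 * freq + 15 * surp + rng.normal(0, 20, n)  # simulated RTs

def r_squared(cols, y):
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = ((y - X @ beta) ** 2).sum()
    return 1 - sse / ((y - y.mean()) ** 2).sum()

print(r_squared([freq, surp], rt))  # identical up to floating-point noise
print(r_squared([freq, pmi], rt))
```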
Part 3: Disentangling Context via Orthogonalization
Since standard regression cannot separate these effects, the authors propose a geometric solution. They treat these probabilistic measures (Frequency, Surprisal, PMI) as random variables living in a Hilbert Space.
Without getting too bogged down in functional analysis, think of a Hilbert Space as a vector space that allows us to measure angles and lengths. In this space:
- Vectors: Our predictors (Frequency, Surprisal) are vectors.
- Inner Product: The correlation between variables acts like the angle between vectors.
Because Surprisal and Frequency are correlated, their vectors point in somewhat similar directions. The authors want to isolate the “pure” context part of Surprisal—the part that has absolutely nothing to do with Frequency.
The Projection
To do this, they perform a projection. They project the Surprisal vector onto the “orthogonal complement” of the Frequency vector.
In simple terms: they subtract from the Surprisal vector its projection onto the Frequency vector, leaving only the component of Surprisal that is perpendicular to (i.e., uncorrelated with) Frequency.
The formula for this Orthogonalized Surprisal predictor is:
\[ \mathbf{I}_H^{\perp} \triangleq \mathbf{I}_H - \frac{\langle \mathbf{I}_H, \mathbf{Y}_H \rangle}{\langle \mathbf{Y}_H, \mathbf{Y}_H \rangle}\, \mathbf{Y}_H \]
Here:
- \(\mathbf{I}_H\) is the Surprisal vector.
- \(\mathbf{Y}_H\) is the Frequency vector.
- The fraction is their inner product (the covariance, i.e. shared information) divided by the squared length (variance) of Frequency: exactly the slope you would get from regressing Surprisal on Frequency.
The result is a new predictor that represents context alone. By forcing the correlation with frequency to be zero, the authors can now ask: “How much does context actually explain when it can no longer borrow explanatory power from frequency?”
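In code, the projection is one line of arithmetic: residualize surprisal against frequency. A sketch of the sample version (the authors work with population quantities, but the computation has the same shape):

```python
import numpy as np

# Subtract from surprisal its projection onto (mean-centered) frequency.
# The result has zero sample correlation with frequency by construction.
def orthogonalize(target, against):
    t = target - target.mean()
    a = against - against.mean()
    coef = (t @ a) / (a @ a)   # Cov(target, against) / Var(against)
    return t - coef * a

rng = np.random.default_rng(1)
freq = rng.normal(8, 2, 1000)
surp = 0.6 * freq + rng.normal(0, 1, 1000)

osurp = orthogonalize(surp, freq)
print(np.corrcoef(osurp, freq)[0, 1])  # ~0 (up to floating-point noise)
```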
Part 4: The Experiments
To test this, the researchers used the Multilingual Eye-movement Corpus (MECO). This dataset tracks the eye movements of participants reading Wikipedia-style articles in 13 different languages.
They focused on Gaze Duration—the total time a reader’s eyes fixate on a word during the first pass.
They set up three competing linear models to predict gaze duration:
- Standard Model: Frequency + Standard Surprisal + Word Length.
- PMI Model: Frequency + PMI + Word Length.
- Orthogonal Model: Frequency + Orthogonalized Surprisal + Orthogonalized Word Length.
Note: They also included “Word Length” as a predictor because longer words naturally take longer to read. In the orthogonal model, length was orthogonalized as well, so that every predictor besides frequency is uncorrelated with frequency (see the sketch below).
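Here is what that setup might look like with statsmodels' formula interface, on synthetic stand-in data; the column names and data are illustrative, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; column names are illustrative, not MECO's.
rng = np.random.default_rng(2)
n = 1000
freq = rng.normal(8, 2, n)
surp = 0.6 * freq + rng.normal(0, 1, n)
length = rng.integers(2, 12, n).astype(float)

def resid(x, against):
    """Orthogonalize x against another predictor (both mean-centered)."""
    x, a = x - x.mean(), against - against.mean()
    return x - (x @ a) / (a @ a) * a

df = pd.DataFrame({
    "gaze_dur": 180 + 12 * freq + 8 * surp + 6 * length + rng.normal(0, 25, n),
    "freq": freq, "surp": surp, "pmi": freq - surp,
    "osurp": resid(surp, freq), "olength": resid(length, freq),
})

standard = smf.ols("gaze_dur ~ freq + surp + length", data=df).fit()
pmi_model = smf.ols("gaze_dur ~ freq + pmi + length", data=df).fit()
orthogonal = smf.ols("gaze_dur ~ freq + osurp + olength", data=df).fit()
print(standard.rsquared, pmi_model.rsquared, orthogonal.rsquared)  # all equal
```

Notice that all three regressions span the same linear space, so their total \(R^2\) is identical; what changes is how that variance is divided among the predictors. That is exactly why the authors need a variance-decomposition method rather than raw \(R^2\).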
The Metric: Explained Variance (LMG)
Rather than just looking at overall model fit, they used a technique called LMG (Lindeman, Merenda, and Gold). LMG decomposes the \(R^2\) (total variance explained) to show exactly how much “credit” each predictor deserves.
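For a small number of predictors, LMG can be computed by brute force: average each predictor's marginal \(R^2\) contribution over every possible order of entry. A compact sketch:

```python
import numpy as np
from itertools import permutations

def r2(cols, y):
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def lmg(predictors, y):
    """Average marginal R^2 gain of each predictor over all entry orders."""
    names = list(predictors)
    shares = dict.fromkeys(names, 0.0)
    orders = list(permutations(names))
    for order in orders:
        cols, prev = [], 0.0
        for name in order:
            cols.append(predictors[name])
            cur = r2(cols, y)
            shares[name] += cur - prev
            prev = cur
    return {n: s / len(orders) for n, s in shares.items()}

# Toy demo: correlated frequency and surprisal both driving reading time.
rng = np.random.default_rng(3)
freq = rng.normal(size=500)
surp = 0.6 * freq + rng.normal(size=500)
rt = 10 * freq + 5 * surp + rng.normal(size=500)
print(lmg({"freq": freq, "surp": surp}, rt))
```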
The Results
The results, visualized in Figure 1, were striking.
(Figure 1: LMG decomposition of explained variance per predictor, for each of the 13 MECO languages, under the Surprisal, PMI, and Orthogonalized models.)
Let’s break down what this chart shows:
- The Columns: Each group represents a language (Dutch, English, Finnish, etc.).
- The Colors:
- Red: Frequency (Unigram Surprisal).
- Blue: Word Length.
- Green: Context (The variable we are testing).
- The Rows:
- Top Row (Surprisal): This uses the standard measure. Notice the Green bars are visible, suggesting context matters.
- Middle Row (PMI): Uses PMI. The Green bars shrink slightly.
- Bottom Row (Orthogonalized): This is the crucial test, using the disentangled, orthogonalized surprisal.
The Observation: Look at the Green bars in the bottom row. They are tiny.
When the shared variance between frequency and context is assigned to frequency (which is logical, as frequency is the simpler, context-independent property), the remaining effect of context is very small.
In almost every language (except Korean), Frequency (Red) and Length (Blue) dominate the prediction. The “pure” contextual information explains a very small proportion of the variance in reading times.
Robustness Check: Non-Linear Models
One might argue that the relationship isn’t linear. Maybe the brain processes context in a complex, curvy way?
To address this, the authors ran Generalized Additive Models (GAMs), which allow for non-linear relationships. They calculated the “Delta Log Likelihood”—essentially a measure of how much better the model gets when you add a specific predictor.
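A sketch of that comparison, assuming the pygam package and synthetic stand-in data (the authors' exact modeling choices may differ): fit GAMs with and without the predictor of interest, then compare held-out per-word log-likelihood.

```python
import numpy as np
from pygam import LinearGAM, s

# Synthetic data standing in for MECO reading times.
rng = np.random.default_rng(4)
n = 2000
freq = rng.normal(8, 2, n)
surp = 0.6 * freq + rng.normal(0, 1, n)
rt = 200 + 12 * freq + 6 * surp + rng.normal(0, 25, n)

half = n // 2
X = np.column_stack([freq, surp])

base = LinearGAM(s(0)).fit(X[:half, :1], rt[:half])      # frequency only
full = LinearGAM(s(0) + s(1)).fit(X[:half], rt[:half])   # frequency + surprisal

def mean_ll(model, X_test, y_test, X_train, y_train):
    """Per-word Gaussian log-likelihood, sigma from training residuals."""
    sigma = np.std(y_train - model.predict(X_train))
    mu = model.predict(X_test)
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (y_test - mu) ** 2 / (2 * sigma**2))

delta = (mean_ll(full, X[half:], rt[half:], X[:half], rt[:half])
         - mean_ll(base, X[half:, :1], rt[half:], X[:half, :1], rt[:half]))
print(delta)  # average per-word gain from adding surprisal
```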

(Figure 4: delta log-likelihood per predictor in the GAM analysis, for single predictors on the left and predictor combinations on the right.)
Figure 4 confirms the linear findings.
- Look at the single-predictor columns on the left. Frequency (Freq) is consistently the strongest individual predictor of reading time.
- Surprisal (Surp) and PMI are decent, but Orthogonalized Surprisal (Osurp) is the weakest.
- On the right side, when you combine predictors (e.g., “Surp & Freq”), there is almost no difference between the models.
This reinforces the idea that “Standard Surprisal” was performing well largely because it contained Frequency information. Once you strip that away (Osurp), it becomes a weak predictor.
Part 5: Implications and Conclusion
This paper provides a sobering “sanity check” for psycholinguistics and NLP.
1. Context is overrated (statistically)
This doesn’t mean context is irrelevant. Obviously, if you see the word “bank” after “river,” your brain processes it differently than after “money.” However, in terms of raw processing time for general reading, the inherent properties of the word (how long it is, how rare it is) do the heavy lifting. The subtle acceleration provided by context is a much smaller slice of the pie than previously thought.
2. Validating the “Good Enough” Approach
The results lend support to “shallow processing” or “good enough” theories of language comprehension. We might rely heavily on heuristics (like frequency) and only engage deep, predictive contextual processing when necessary.
3. The LLM Disconnect
Interestingly, this helps explain a recent puzzle in AI research. Larger, more powerful Large Language Models (LLMs) often provide worse fits for human reading times than smaller models.
Why? Large LLMs are masters of context. They can predict rare words with uncanny accuracy based on long histories. But humans are not that good. By being too perfect at contextual prediction, large LLMs minimize the surprisal of rare words, diverging from the human experience where a rare word is still a hurdle, regardless of context.
Summary
Opedal et al. have given us a new mathematical lens—Hilbert space projection—to view an old problem. By rigorously separating the influence of frequency from context, they demonstrated that the “predictive mind” might be more of a “frequentist mind” than we realized.
For students of cognitive science and NLP, the lesson is clear: correlation is not causation, and in the soup of language statistics, variables are rarely independent. To truly understand how we read, we have to pull them apart, even if the result is a little less surprising than we hoped.