Language is rarely a flat line. When we tell a story, write an essay, or engage in a long conversation, we naturally vary the complexity and unpredictability of our words. Sometimes we are concise and dense; other times we are repetitive and explanatory.
For decades, a dominant theory in psycholinguistics called the Uniform Information Density (UID) hypothesis has suggested that speakers unconsciously strive for the opposite: a perfectly even distribution of information. The idea is that to communicate efficiently, we shouldn’t overwhelm the listener with too much information at once, nor should we bore them with too little.
But if you look at a graph of information content over the course of a novel or a news article, it looks nothing like a flat line. It looks like a mountain range—full of peaks and valleys.
In a fascinating paper titled “Surprise! Uniform Information Density Isn’t the Whole Story,” researchers from ETH Zürich propose a new way to understand these fluctuations. They introduce the Structured Context Hypothesis, suggesting that the “rhythm” of information flow isn’t random noise. Instead, it is dictated by the hierarchical structure of the discourse itself—from the paragraph level down to the deep rhetorical relationships between clauses.
In this post, we will break down this research, explore how “surprisal” is measured, and discover how the hidden architecture of our arguments shapes the flow of information.
The Baseline: Surprisal and Uniformity
To understand why the researchers are proposing a new hypothesis, we first need to understand the metric they are analyzing: Surprisal.
In information theory, the "information content" of a word is often measured by how unexpected it is in a given context. If you read the phrase "The cat sat on the…", the word "mat" has very low surprisal (and low information) because you expect it. The word "photosynthesis," however, would have extremely high surprisal. Mathematically, Shannon surprisal is defined as the negative log probability of a unit \(u_t\) given its preceding context \(u_{<t}\):

\[
s(u_t) = -\log p(u_t \mid u_{<t})
\]
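To make this concrete, here is a minimal sketch of how a surprisal contour can be computed with an off-the-shelf causal language model via Hugging Face's transformers library. It uses GPT-2 to stay lightweight; the paper relies on larger models such as Llama-2 and Mistral, so treat this as an illustration rather than the authors' pipeline.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any autoregressive LM works here; GPT-2 keeps the example lightweight.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str):
    """Return (token, surprisal in bits) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits          # shape: (1, seq_len, vocab)
    # Position t-1 predicts token t, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nats = -log_probs[torch.arange(targets.size(0)), targets]
    bits = nats / math.log(2)               # convert -ln p to -log2 p
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return list(zip(tokens, bits.tolist()))

# Predictable continuations ("mat") should score far lower than odd ones.
for tok, s in token_surprisals("The cat sat on the mat."):
    print(f"{tok!r:>12}  {s:6.2f} bits")
```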
This metric posits that the difficulty of processing a word is directly related to its unpredictability.

The UID Hypothesis

The Uniform Information Density (UID) hypothesis treats language communication like a digital signal moving through a noisy channel (like a wire). To maximize efficiency without causing errors (misunderstanding), you want to send data at a constant rate close to the channel's capacity. If UID were a hard rule, every word you speak would add roughly the same amount of new information: speakers would use longer words or more filler when a concept is complex, and shorter, denser words when it is simple, effectively smoothing out the "information contour."

The Problem: Reality is Bumpy

While UID holds up well at the level of individual sentences and word choices, it fails to explain the macroscopic view of long-form text. When we analyze a whole document, the information rate fluctuates significantly.

Take a look at the graph below. This is an "information contour" of a document from the Wall Street Journal. The blue line represents the document surprisal (information content) calculated by a language model. It is jagged and volatile. Even when we smooth it out using a rolling average (the green line), distinct waves and trends emerge.

The authors of this paper argue that these fluctuations are not just "theoretical noise." They are a feature, not a bug. They propose that we modulate information rate based on where we are within the hierarchical structure of the discourse.

The Structured Context Hypothesis

The core contribution of this paper is the Structured Context Hypothesis. The researchers posit that language production is subject to functional pressures beyond keeping information constant. Speakers and writers organize their thoughts into hierarchies: arguments inside paragraphs, paragraphs inside sections, and clauses inside sentences. The hypothesis states: the information contour of a discourse is (partially) determined by the hierarchical structure of its constituent discourse units.

To test this, the researchers compared two ways of viewing text structure: familiar prose structure (sentence and paragraph boundaries) and the deeper discourse trees of Rhetorical Structure Theory (RST).

Understanding Rhetorical Structure Theory (RST)

While prose structure is familiar to us (we learn about paragraphs in elementary school), RST requires a bit more explanation. RST views a text as a tree of relationships. In an RST tree, text is broken down into Elementary Discourse Units (EDUs), which are usually clauses. These units are connected by rhetorical relations like Elaboration, Contrast, or Attribution. Crucially, RST distinguishes between nuclei, the central units that carry the core of the message, and satellites, the supporting units that qualify or expand on a nucleus.

Here is what an RST tree looks like for a single sentence. In this example, the phrase "That is in part because of the effect" serves as the central anchor (the nucleus). The clause "she said" is an Attribution (telling us who said it), and the phrase "of having to average…" is an Elaboration (giving us more detail).

The researchers hypothesized that this deep, recursive structure is a better predictor of information flow than simple paragraph breaks.

Methodology: How to Predict Surprisal

To show that structure dictates information flow, the authors set up a regression analysis. Their goal was to see whether adding structural data allows a model to predict the surprisal of the next word better than a baseline model. They calculated the "ground truth" surprisal of texts from the Wall Street Journal (English) and a specialized corpus of Spanish texts using large language models (LLMs) such as Llama-2 and Mistral.
1. Measuring Information (The Dependent Variables)

They looked at several variations of surprisal:

- Global (Document) Surprisal: the surprisal of a word given the entire preceding document.
- Rolling Averages: to smooth out local noise (like the difference between "the" and "cat"), they averaged surprisal over windows of 3, 5, and 7 tokens.
- Pointwise Mutual Information (PMI): this measures how much the context actually helps. It is the difference between how surprising a word is on its own and how surprising it is when you know the context (sketched in the code below).
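As a rough sketch of how these derived metrics relate to raw per-token surprisal (the window sizes 3/5/7 come from the paper; the helper names and toy numbers are mine):

```python
import numpy as np

def rolling_average(surprisals, window=5):
    """Smooth a surprisal contour with a rolling mean over `window` tokens."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(surprisals, dtype=float), kernel, mode="valid")

def pmi(conditional, unconditional):
    """PMI = s(word alone) - s(word | context): the bits the context saves us."""
    return np.asarray(unconditional) - np.asarray(conditional)

contour = [7.1, 2.3, 5.8, 1.2, 9.4, 3.3, 4.0]   # per-token surprisal (bits)
print(rolling_average(contour, window=3))        # smoothed contour
print(pmi(conditional=[2.3, 1.2], unconditional=[8.0, 6.5]))  # context helped a lot
```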
2. The Predictors (The Independent Variables)

This is the creative heart of the paper. How do you turn a tree structure into numbers that a regression model can use? The authors devised several clever features (a toy implementation appears after the variable summary below):

- Relative Position: where the word sits within its paragraph or sentence (e.g., "50% of the way through the sentence").
- Nearest Boundary: how close the word is to the start or end of a discourse unit.
- Hierarchical Position: how deeply the word's unit is nested within the overall document tree.
- Transition Predictors (Parsing Moves): the most complex predictor. The authors simulated a parser traversing the discourse tree (Top-Down, Bottom-Up, or Left-Corner) and counted the "Push" and "Pop" operations required to reach each word. The image below illustrates these parsing strategies. The number of pops (moving up the tree) or pushes (moving down) acts as a proxy for the structural complexity leading up to a specific word.

Summary of Variables

The table below summarizes the full list of variables used in the experiment. The "Baseline" predictors are simply the length of the word and the surprisal of the previous word, factors we already know influence predictability. The question is: do the structural variables add anything new?
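To ground these predictors, here is a toy sketch that reads positional features and top-down push/pop counts off a miniature discourse tree. The `Node` class, the feature names, and the convention of charging pops to the last EDU of a constituent are my own illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                            # relation name, or EDU text at a leaf
    children: list["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children

def positional_features(root: Node):
    """Relative position, nearest-boundary distance, and depth for each EDU."""
    leaves = []

    def walk(node: Node, depth: int):
        if node.is_leaf:
            leaves.append((node.label, depth))
        for child in node.children:
            walk(child, depth + 1)

    walk(root, 0)
    n = len(leaves)
    return [
        {
            "edu": text,
            "relative_position": i / (n - 1) if n > 1 else 0.0,  # 0 = start, 1 = end
            "nearest_boundary": min(i, n - 1 - i),               # distance in EDUs
            "depth": depth,                                      # nesting in the tree
        }
        for i, (text, depth) in enumerate(leaves)
    ]

def topdown_transitions(root: Node):
    """Push/pop counts per EDU for a simple top-down traversal of the tree."""
    counts, pushes = [], 0

    def walk(node: Node):
        nonlocal pushes
        if node.is_leaf:
            counts.append({"edu": node.label, "pushes": pushes, "pops": 0})
            pushes = 0
        else:
            pushes += 1                  # enter (push) a new constituent
            for child in node.children:
                walk(child)
            if counts:
                counts[-1]["pops"] += 1  # the constituent closes (pops) here

    walk(root)
    return counts

# A toy version of the RST example above: a nucleus with two satellites.
tree = Node("Attribution", [
    Node("Elaboration", [
        Node("That is in part because of the effect"),
        Node("of having to average..."),
    ]),
    Node("she said"),
])

for row in positional_features(tree):
    print(row)
print(topdown_transitions(tree))
```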
Experiments and Results

The researchers ran Bayesian linear regressions to see which features best predicted the actual information contours of the texts. They measured success by \(\Delta\)MSE, the reduction in mean squared error relative to the baseline. A negative bar in the charts below means the model improved (lower error). Here is what they found.

Finding 1: Position Matters

The first major finding is that knowing where a word sits within a discourse unit significantly helps predict its information content. In the chart below (Figure 4), look at the orange bars ("Relative position"). In both English and Spanish, relative position provides a large boost in predictive power across almost all metrics. This confirms that information isn't uniform; it evolves systematically as we move through a sentence or paragraph.

Finding 2: Hierarchy is Key

The "Hierarchical position" predictors (green bars in Figure 4) also performed exceptionally well, particularly for Spanish document surprisal. This suggests that the depth of a sentence in the overall argument structure influences how information-dense it is likely to be. Interestingly, the "Parsing transitions" (red bars), the complex measure of tree-traversal steps, were generally the weakest predictors. While they still helped more than the baseline, simple positional metrics were more effective.

Finding 3: RST Outperforms Prose

The ultimate showdown was between the two types of structure: RST (deep discourse) versus prose (paragraphs and sentences). Does the complex linguistic tree of RST actually explain more than standard writing conventions? Yes. The chart below (Figure 5) compares RST predictors against prose structure (PS) predictors for English. Notice that for document surprisal (the top cluster), the RST predictors (orange, green, red) generally show stronger negative values (better performance) than their prose-structure counterparts (the lighter bars). The "RST all" model (darkest orange) is the clear winner. This indicates that the hidden rhetorical structure of a text, whether a clause is an explanation, a contrast, or a summary, modulates information flow more tightly than simply starting a new paragraph.

Finding 4: Consistency Across Languages

The researchers validated their findings on a Spanish corpus as well. As shown in Figure 6, the trends are remarkably similar: RST structure continues to outperform prose structure, suggesting that this "structured context" phenomenon isn't just an English quirk; it likely reflects a more general property of how writers organize information, at least in the Western written languages studied. (Additional results for other metrics such as PMI, reported in the supplementary data, show the same trend: structure aids prediction.)

Why Does This Happen?

The data shows that information density fluctuates in rhythm with discourse structure. But why? If UID says efficiency is king, why do we tolerate these peaks and valleys? The authors offer several compelling theoretical reasons that go beyond simple channel efficiency.

Conclusion: Toward a 3D View of Text

The Uniform Information Density hypothesis has been a guiding light in linguistics, explaining why we shorten common words and expand rare ones. However, this research shows that UID is a baseline, not the whole picture. By introducing the Structured Context Hypothesis, the paper demonstrates that text is not a flat sequence of words. It is a hierarchical structure, a tree of logic and narrative, and the contours of surprisal (the ups and downs of information flow) map onto this tree.

For students of NLP and linguistics, the takeaway is clear: when modeling long-form text, we cannot ignore discourse structure. We are not just predicting the next token; we are navigating a landscape of nested arguments, where our position in the hierarchy shapes the flow of information.

The next time you read a novel and feel the pacing accelerate during a climax or slow down during a descriptive passage, you aren't just imagining it. You are experiencing the surprisal contour, carefully modulated by the hidden structure of the story.