Can We Teach AI to Read Like a Human? Reverse-Engineering the Cognitive Reader

How does your brain process the sentence you are reading right now? You likely don’t read every letter with equal attention. You skip over function words like “the” or “of,” and you linger slightly longer on complex or unexpected words. This varying “processing effort”—measurable by how long your eyes gaze at a specific word—is a window into the human mind.

For years, researchers have used language models (LMs) like GPT-2 as proxies to study this phenomenon. The prevailing theory is that LMs and humans share a fundamental mechanism: prediction. If an LM finds a word unexpected (high surprisal), humans usually find it harder to process (longer reading time).

So far, however, this research has been a one-way street. We take a pre-trained model, optimized to predict the next word on massive amounts of internet text, and ask: “Does this look like a human?”

A fascinating research paper, “Reverse-Engineering the Reader,” flips this question on its head. Instead of asking if LMs happen to resemble humans, the authors ask: Can we directly optimize a Language Model to be a useful cognitive model?

Can we take an AI and fine-tune it not to write better essays or code, but to read exactly like a human does? And if we do, what happens to the model’s intelligence? The results offer a surprising glimpse into the difference between statistical probability and biological cognition.

The Foundation: Surprisal Theory

To understand how we can “reverse-engineer” a reader, we first need to understand the bridge between Artificial Intelligence and Psychology: Surprisal Theory.

Surprisal theory posits that the cognitive effort required to process a word is proportional to how unexpected that word is given its context. If you read the sentence “The cat sat on the…”, your brain expects “mat.” If the next word is “mat,” the surprisal is low, and your eyes move quickly. If the next word is “stratosphere,” the surprisal is high, and your eyes linger to process the anomaly.

In mathematical terms, for a linguistic unit \(u\) (like a word) in a context \(c\), the surprisal is the negative log probability:

\[
s(u \mid c) \;=\; -\log p_H(u \mid c)
\]

Here, \(p_H\) represents the “human language model”—the theoretical probability distribution inside our heads. Since we cannot open up a brain and read the probabilities directly, researchers approximate this using a computational language model (\(p_\theta\)):

\[
s(u \mid c) \;\approx\; -\log p_\theta(u \mid c)
\]
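
To make this concrete, here is a minimal sketch of how one might compute per-token surprisal with an off-the-shelf GPT-2 via the Hugging Face transformers library. The helper function and the example sentence are illustrative, not part of the paper’s pipeline:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Off-the-shelf GPT-2 (the same family the paper fine-tunes).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text):
    """Return (token, surprisal in nats) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Surprisal of token t is -log p(token_t | tokens_<t).
    surprisal = -log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, surprisal.tolist()))

for tok, s in token_surprisals("The cat sat on the stratosphere."):
    print(f"{tok:>15s}  {s:6.2f}")
```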

The standard methodology in psycholinguistics is to assume a linear relationship: the reading time (\(\psi\)) is modeled as the model’s surprisal value multiplied by some coefficient, plus contributions from baseline factors such as word length and frequency.
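
Written out, the regression that is typically fit looks something like the following (a schematic form; the exact set of baseline predictors varies by study, but word length and frequency are the usual ones):

\[
\psi(u) \;\approx\; \beta_0 \;+\; \beta_1\,\mathrm{length}(u) \;+\; \beta_2\,\log\mathrm{freq}(u) \;+\; \beta_3\,\big(-\log p_\theta(u \mid c)\big)
\]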

The Standard Evaluation: Delta Log-Likelihood

How do we know if a model is a good predictor of human reading? We use a metric called Delta Log-Likelihood (\(\Delta_{\text{llh}}\)).

Imagine we have two predictors trying to guess how long you look at a word:

  1. A Baseline Predictor: This knows the word’s length and how common the word is globally (frequency), but it ignores the specific sentence context.
  2. The Target Predictor: This knows everything the baseline knows, plus the contextual surprisal provided by our Language Model.

We compare how likely the actual human reading times are under these two models. If the Target Predictor (using the LM’s surprisal) explains the data much better than the Baseline, we get a high positive \(\Delta_{\text{llh}}\).

\[
\Delta_{\text{llh}} \;=\; \frac{1}{N}\sum_{n=1}^{N}\Big(\log p_{\text{target}}(\psi_n) \;-\; \log p_{\text{baseline}}(\psi_n)\Big)
\]
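
As an illustration, here is a small NumPy sketch of that comparison, assuming a Gaussian likelihood for reading times and using synthetic arrays in place of real eye-tracking data; the variable names and the simple train/test split are illustrative, not the paper’s protocol:

```python
import numpy as np

def gaussian_llh(y, y_hat):
    """Per-item log-likelihood under a Gaussian with the residual variance."""
    sigma2 = np.mean((y - y_hat) ** 2)
    return -0.5 * (np.log(2 * np.pi * sigma2) + (y - y_hat) ** 2 / sigma2)

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept column."""
    X_train = np.column_stack([np.ones(len(X_train)), X_train])
    X_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ beta

# Synthetic per-word predictors and reading times (milliseconds).
rng = np.random.default_rng(0)
length, freq, surprisal = rng.normal(size=(3, 200))
reading_time = 200 + 15 * length - 10 * freq + 25 * surprisal + rng.normal(0, 20, 200)

train, test = slice(0, 150), slice(150, 200)
baseline = np.column_stack([length, freq])             # length + frequency only
target = np.column_stack([length, freq, surprisal])    # ... plus LM surprisal

llh_base = gaussian_llh(reading_time[test],
                        fit_predict(baseline[train], reading_time[train], baseline[test]))
llh_targ = gaussian_llh(reading_time[test],
                        fit_predict(target[train], reading_time[train], target[test]))

delta_llh = np.mean(llh_targ - llh_base)  # positive: surprisal explains reading times better
print(f"Delta log-likelihood per word: {delta_llh:.3f}")
```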

Typically, researchers just measure this metric on existing models. The authors of this paper, however, decided to treat \(\Delta_{\text{llh}}\) not as a scoreboard, but as an objective function. They wanted to train the model to maximize this score.

The Method: Psychometric Alignment

This approach is conceptually similar to “Alignment” in modern AI—like Reinforcement Learning from Human Feedback (RLHF), which aligns models to human preferences (e.g., “don’t be toxic,” “be helpful”). However, instead of aligning to binary preferences (A is better than B), the researchers align the model to real-valued psychometric data (reading times).

This presents a difficult mathematical challenge. We aren’t just trying to make the model output the “correct” next token. We are trying to adjust the model’s internal probability distribution (\(p_\theta\)) so that its surprisal values, when fed into a linear regression model, accurately predict milliseconds of gaze duration.

The Objective Function

The core innovation here is a technique to fine-tune the LM to implicitly optimize the parameters of a linear regressor.

Let’s break down the reward function the researchers designed. They define the reward \(r(\theta)\) as the negative of the mean squared error (MSE) between the human reading times (\(\psi\)) and the predicted times (\(\hat{\psi}\)).

\[
r(\theta) \;=\; -\min_{\beta}\;\mathbb{E}\Big[\big(\psi - \hat{\psi}(\beta, \theta)\big)^2\Big]
\]

Notice the “min” inside the equation. For every step of training the Language Model, the system effectively finds the best possible linear regression fit (\(\beta_\theta\)) for the current state of the model, and then calculates the error.

To actually make this computable during training, they approximate the reward using a batch of data. The optimal coefficients for the linear regression (the slope and intercept that relate surprisal to time) can be calculated using a closed-form solution known as Ridge Regression.

\[
\beta^{*} \;=\; \big(X_\theta^{\top} X_\theta + \gamma I\big)^{-1} X_\theta^{\top}\,\psi
\]

Here, \(X_\theta\) is the design matrix containing the surprisal values generated by the model, \(\psi\) is the vector of observed reading times, and \(\gamma\) is the ridge penalty. By plugging this optimal \(\beta^*\) back into the error formula, the researchers created a differentiable pipeline: they can update the weights of the GPT-2 model to minimize the error in predicting reading times.
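
Below is a schematic PyTorch version of that closed-form step, under the assumption that `surprisal` is computed from the model’s logits so gradients can flow back into \(\theta\); the function name and the choice of \(\gamma\) are illustrative, not the authors’ exact implementation:

```python
import torch

def ridge_reward(surprisal, baseline_feats, reading_times, gamma=1.0):
    """Negative MSE of the best ridge fit from (baseline features + surprisal) to reading times.

    surprisal:      [N] tensor, differentiable w.r.t. the LM parameters theta
    baseline_feats: [N, k] tensor of length / frequency predictors (constants)
    reading_times:  [N] tensor of observed gaze durations (constants)
    gamma:          ridge penalty strength
    """
    ones = torch.ones(surprisal.shape[0], 1)
    X = torch.cat([ones, baseline_feats, surprisal.unsqueeze(-1)], dim=-1)  # design matrix X_theta
    # Closed-form ridge solution: beta* = (X^T X + gamma I)^(-1) X^T psi
    A = X.T @ X + gamma * torch.eye(X.shape[1])
    beta_star = torch.linalg.solve(A, X.T @ reading_times)
    mse = torch.mean((reading_times - X @ beta_star) ** 2)
    return -mse  # maximizing the reward == minimizing the prediction error

# In a training loop, -ridge_reward(...) contributes to the loss, and
# loss.backward() pushes gradients through the surprisal values into GPT-2.
```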

Regularization: Keeping the Model Sane

If we only trained the model to predict reading times, it might destroy its knowledge of the English language. It might assign bizarre probabilities just to fit the regression line.

To prevent this, the researchers use Kullback–Leibler (KL) Regularization. This adds a penalty if the fine-tuned model (\(p_\theta\)) diverges too far from the original, pre-trained reference model (\(p_{\text{ref}}\)).

\[
\mathrm{KL}\big(p_\theta \,\|\, p_{\text{ref}}\big) \;=\; \mathbb{E}_{u \sim p_\theta}\!\left[\log \frac{p_\theta(u \mid c)}{p_{\text{ref}}(u \mid c)}\right]
\]

The final training objective combines the psychometric reward and the KL penalty:

\[
\mathcal{J}(\theta) \;=\; r(\theta) \;-\; \lambda\,\mathrm{KL}\big(p_\theta \,\|\, p_{\text{ref}}\big)
\]

This forces the model to balance two goals: “Stay a good language model” (via the KL term) and “Become a good cognitive model” (via the reward term).
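
Putting the two pieces together, a rough sketch of a per-batch loss might look like the following (reusing the `ridge_reward` sketch above). Estimating the KL term from the observed tokens, as done here, is only a crude approximation of the true divergence, and the paper’s exact estimator may differ:

```python
import torch

def training_loss(log_p_theta, log_p_ref, surprisal, baseline_feats, reading_times, lam=100.0):
    """Combine the psychometric reward with a KL penalty toward the reference model.

    log_p_theta, log_p_ref: [T] log-probabilities of the observed tokens under the
                            fine-tuned model and the frozen reference model.
    lam:                    the KL coefficient (lambda) trading off the two goals.
    """
    reward = ridge_reward(surprisal, baseline_feats, reading_times)
    # Crude KL(p_theta || p_ref) estimate from the tokens in the batch.
    kl_term = torch.mean(log_p_theta - log_p_ref)
    return -(reward - lam * kl_term)  # minimize the negative of the objective
```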

Experimental Setup

The researchers tested this method using the GPT-2 family of models (Small, Medium, and Large).

For the human data, they utilized three famous eye-tracking corpora:

  1. Dundee Corpus: Newspaper text read by 10 participants.
  2. Provo Corpus: Fiction and non-fiction paragraphs read by 84 participants.
  3. ZuCo Corpus: Movie reviews and Wikipedia articles.

The primary metric was Gaze Duration—the total time a reader’s eyes fixate on a word before moving away.

The experiment was set up as a cross-evaluation. For example, they might fine-tune the model on the Dundee corpus (teaching it how Dundee participants read) and then test how well it predicts reading times in the Provo corpus. This ensures the model is learning general principles of human reading, not just memorizing specific participants’ quirks.

Did It Work? The Results

The short answer is: Yes.

The fine-tuning technique successfully “reverse-engineered” the reader. Across almost all model sizes and datasets, the fine-tuned models became significantly better predictors of human reading times than the original, off-the-shelf GPT-2 models.

Look at the learning curves below. The top row shows the Mean Squared Error (MSE) dropping, and the bottom row shows the \(\Delta_{\text{llh}}\) rising over training steps.

Figure 1: Learning curves showing decreasing MSE and increasing Delta Log-Likelihood over fine-tuning steps.

The purple lines (GPT-2 Small) and blue lines (GPT-2 Large) both show clear improvement. This confirms that it is mathematically possible to adjust an LM’s internal probabilities to better align with biological processing data.

What Did the Model Learn?

When the researchers analyzed the regression coefficients, they found something interesting. Remember, the linear model predicts reading time based on several factors, including word length and surprisal.

As fine-tuning progressed, the coefficient for surprisal consistently increased (see the top-left panel below).

Figure 3: Coefficients over fine-tuning. Surprisal coefficient increases, length coefficient decreases.

This implies that to better mimic human data, the model learned to rely more heavily on predictability (surprisal) and less on surface-level features like word length. The fine-tuned model became more “sensitive” to context, just as human readers are.

The Great Trade-off: Psychometric Fit vs. Language Modeling

Here lies the most profound finding of the paper.

In the NLP world, “better” usually means lower perplexity. Perplexity measures how “confused” a model is by a text: a good model assigns high probability to the actual next word, which corresponds to low perplexity. We generally assume that the smarter the model (lower perplexity), the more “human-like” it is.
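
For reference, perplexity over a text of \(N\) words is just the exponential of the average surprisal, which is why it sits in such direct tension with the psychometric objective:

\[
\mathrm{PPL}(\theta) \;=\; \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} -\log p_\theta(u_i \mid c_i)\right)
\]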

However, recent studies have hinted at a divergence: super-large models (like GPT-4) are actually worse predictors of human reading times than smaller, dumber models. The theory is that these models are “super-human”—they predict upcoming words so easily that they don’t reflect the struggle humans face when processing complex sentences.

This paper provides causal evidence for this theory. By forcing the model to align with human reading times (increasing \(\Delta_{\text{llh}}\)), the researchers observed that perplexity got worse.

Figure 2: Plot showing the inverse relationship between Perplexity and Delta Log-Likelihood.

In the graph above, the y-axis is psychometric fit (\(\Delta_{\text{llh}}\)) and the x-axis is Log Perplexity. As the curves move up (better reading prediction), they move to the right (worse language modeling).

This suggests a fundamental tension. To model the human mind accurately, we have to make our AI models worse at predicting text. We have to introduce the same kinds of uncertainties and inefficiencies that biological brains possess.

The Role of Regularization

The researchers found that the KL regularization term (controlled by the parameter \(\lambda\)) was crucial for managing this trade-off.

Figure 4: Trajectories of metrics for different KL coefficient values.

  • \(\lambda = 0\) (No regularization): The model becomes a great predictor of reading times (high \(\Delta_{\text{llh}}\)), but its perplexity explodes. It essentially “breaks” as a language generator.
  • \(\lambda = 500\) (Strong regularization): The model retains its language abilities (low perplexity change) but gains less improvement in predicting reading times.

This confirms that the “Human Reading” distribution (\(p_H\)) is mathematically distinct from the “Optimal Text Prediction” distribution (\(p_{\text{ref}}\)). You cannot maximize both simultaneously.

Does Thinking Like a Human Make You Smarter?

If we train a model to process information like a human, does it become better at understanding language tasks?

To test this, the researchers evaluated the fine-tuned models on benchmarks like BLiMP (Grammaticality judgment) and LAMBADA (Narrative prediction).

The results were sobering.

Figure 5: Results on BLiMP showing decreased accuracy after fine-tuning.

The hatched bars represent the original models; the solid bars are the fine-tuned ones. In almost every case, fine-tuning on human reading data decreased performance on downstream NLP tasks.

This reinforces the “trade-off” hypothesis. Human cognition is noisy, limited by working memory, and prone to error. An AI optimized to mimic these biological constraints is naturally going to be less “capable” on tasks that require perfect statistical inference.

Conclusion: Why This Matters

“Reverse-Engineering the Reader” is a landmark step in computational psycholinguistics. It moves beyond passive observation—checking if LMs happen to look like humans—to active experimentation.

The key takeaways for students and researchers are:

  1. Alignment is possible: We can use psychometric data (like eye-tracking) directly in the loss function of neural networks. We don’t need to rely solely on text prediction or human preference labeling.
  2. The Perplexity Gap: There is a causal, inverse relationship between a model’s statistical quality and its fit to human processing data. To simulate a human, you might have to “lobotomize” a state-of-the-art AI.
  3. New Tools for Science: This method gives psychologists a new tool. Instead of debating which pre-trained model fits their theory best, they can generate custom models optimized for their specific hypothesis, potentially uncovering the hidden parameters of human cognition.

As we continue to build AIs that interact with us, understanding the divergence between “artificial” probability and “human” expectation becomes critical. This paper suggests that if we want AI to truly understand how we read, we might have to teach it to struggle just a little bit more with the text—just like we do.