Introduction
Reading feels like a continuous, fluid process. Your eyes glide across this sentence, absorbing meaning instantly—or at least, that is how it feels. In reality, human reading is a jerky, erratic ballet. Your eyes make rapid movements called saccades, stopping briefly at specific points called fixations. You might skip a common word like “the,” dwell longer on a complex word like “spatiotemporal,” or even jump backward (regress) to re-read a confusing phrase.
This sequence of eye movements—the spatial coordinates and the time spent at each spot—is known as a scanpath.
For computer scientists and cognitive psychologists, predicting these scanpaths is the “Holy Grail” of reading comprehension modeling. If an Artificial Intelligence (AI) can accurately predict how a human will visually process a text, it opens the door to revolutionary applications. Imagine educational software that detects reading difficulties in real-time, or Natural Language Processing (NLP) models that “read” documents with the same nuance as a human expert.
However, there is a major bottleneck: data scarcity. Collecting high-quality eye-tracking data requires expensive equipment and human participants. While Large Language Models (LLMs) like GPT-4 are trained on trillions of words, eye-tracking datasets often contain only a few thousand sentences.
In this post, we will take a deep dive into ScanEZ, a new framework presented by researchers from the University of Colorado Boulder, University of Marburg, and HK3Lab. This paper introduces a clever solution to the data shortage: instead of waiting for more human data, the researchers taught an AI to “hallucinate” reading patterns using cognitive science, and then refined it with real human behavior.
The Challenge: Spatiotemporal Modeling with Limited Data
To understand why ScanEZ is significant, we first need to understand the complexity of the problem. A scanpath isn’t just a list of words; it is a spatiotemporal trajectory.
- Spatial (\(x, y\)): Where do the eyes land? The eyes don’t always land in the center of a word. They land on specific characters, often influenced by the length of the upcoming word or the linguistic complexity of the current one.
- Temporal (\(t\)): How long do the eyes stay there? Fixation duration is a direct proxy for cognitive processing. A longer pause usually indicates the brain is working harder to process syntax or meaning.
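To make this concrete, here is a minimal sketch (mine, not from the paper) of how a scanpath can be represented in code: an ordered list of fixations, each carrying its spatial coordinates and duration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Fixation:
    x: float  # horizontal landing position (e.g., in pixels or character units)
    y: float  # vertical landing position (which line of text)
    t: float  # fixation duration in milliseconds

# A scanpath is simply an ordered sequence of fixations.
# Note the regression: the reader jumps back from x=420 to x=150 to re-read.
scanpath: List[Fixation] = [
    Fixation(x=35.0,  y=120.0, t=210.0),   # fixate an early word
    Fixation(x=150.0, y=120.0, t=180.0),   # move forward
    Fixation(x=420.0, y=120.0, t=310.0),   # long pause on a hard word
    Fixation(x=150.0, y=120.0, t=150.0),   # regression to re-read
]
```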
Most previous approaches treated this purely as a sequence problem or focused only on the spatial aspect (the order of words). They often ignored the duration (\(t\)), missing half the story. Furthermore, deep learning models are data-hungry: without massive datasets, they struggle to generalize.
The ScanEZ framework tackles this by combining Self-Supervised Learning (SSL) with Cognitive Models.
The Solution: The ScanEZ Framework
The core philosophy of ScanEZ is to bridge two worlds: the data-driven world of modern Deep Learning and the theory-driven world of Cognitive Science.
The framework operates in a two-stage pipeline: Pre-training and Fine-tuning.

As shown in Figure 1 above, the process begins with synthetic data and ends with real human data. Let’s break down each component of this architecture.
1. Generating Synthetic Data with E-Z Reader
Since real eye-tracking data is scarce, the researchers asked: Can we generate realistic fake data?
To do this, they utilized E-Z Reader (Reichle et al., 2003), a well-established computational cognitive model. E-Z Reader is not a neural network; it is a set of mathematical rules derived from decades of psychology research. It simulates how the brain processes words and commands the eyes to move. It accounts for factors like:
- Word Frequency: How common is the word?
- Predictability: How likely is this word to appear in this context?
- Visual acuity: How clear is the text in peripheral vision?
The researchers took the massive CNN & Daily Mail corpus (a collection of news articles) and fed it into the E-Z Reader model. The result was a dataset of millions of synthetic scanpaths. While these simulations aren’t perfectly human, they provide a strong “inductive bias”—a baseline understanding of how reading generally works.
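The authors run the actual E-Z Reader implementation to produce these simulations; the sketch below only illustrates the shape of that pipeline. The `simulate_ez_reader` function is a hypothetical stand-in that mimics two qualitative effects of the real model (frequent words are skipped more often and fixated more briefly), not the cognitive model itself.

```python
import random

def simulate_ez_reader(words, word_freq):
    """Hypothetical stand-in for E-Z Reader: returns (word_index, duration_ms) fixations.

    It only mimics two qualitative effects the real model captures:
    frequent words are skipped more often and receive shorter fixations.
    """
    scanpath = []
    for i, word in enumerate(words):
        freq = word_freq.get(word.lower(), 1e-6)         # relative corpus frequency
        skip_prob = min(0.5, 50 * freq)                  # very common words are often skipped
        if random.random() < skip_prob:
            continue
        duration = 150 + 20 * len(word) - 1000 * freq    # longer / rarer words -> longer fixations
        scanpath.append((i, max(80.0, duration)))
    return scanpath

# Feed every sentence of a large text corpus through the simulator to build
# a synthetic pre-training set (the paper uses CNN & Daily Mail articles).
corpus = ["The committee postponed the vote until Thursday .".split()]
word_freq = {"the": 0.05, "committee": 0.0001, "postponed": 0.00002,
             "vote": 0.0003, "until": 0.002, "thursday": 0.0004, ".": 0.06}

synthetic_scanpaths = [simulate_ez_reader(sentence, word_freq) for sentence in corpus]
print(synthetic_scanpaths[0])   # e.g. [(1, 329.9), (2, 330.0), (4, 229.7), ...]
```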
The difference in scale between this synthetic data and available human data is staggering:

As Table 2 shows, the synthetic pre-training data (CNN + Daily Mail) contains over 10 million simulated sentences. In contrast, the real human datasets (CELER, ZuCo, EML) contain only a few hundred to a few thousand sentences. This massive synthetic dataset is what allows ScanEZ to learn robust representations before it ever sees a real human eye movement.
2. The Model Architecture
At the heart of ScanEZ is a BERT-style Transformer. If you are familiar with NLP, you know that Transformers are excellent at handling sequences.
The input to the model is a sequence of fixations. Each fixation is represented by three numbers:
- \(x\)-coordinate
- \(y\)-coordinate
- \(t\) (duration)
These inputs are normalized and passed through an embedding layer to project them into a dense vector space. The model adds sinusoidal positional encodings to these embeddings. This is crucial because, unlike a Recurrent Neural Network (RNN) that processes data strictly in order, a Transformer processes the whole sequence at once. The positional encodings tell the model the order of the fixations (i.e., “this fixation happened 1st, this one happened 2nd”).
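Here is a minimal PyTorch sketch of that input pipeline, with illustrative hyperparameters rather than the paper's: each (x, y, t) fixation is projected into a dense vector, sinusoidal positional encodings are added, and a Transformer encoder processes the whole sequence.

```python
import math
import torch
import torch.nn as nn

class FixationEncoder(nn.Module):
    """BERT-style encoder over sequences of (x, y, t) fixations (illustrative sizes)."""

    def __init__(self, d_model: int = 128, n_layers: int = 4, n_heads: int = 4, max_len: int = 512):
        super().__init__()
        self.input_proj = nn.Linear(3, d_model)          # embed (x, y, t) -> dense vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)                # predict (x, y, t) at each position

        # Standard sinusoidal positional encodings, precomputed once.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, fixations: torch.Tensor) -> torch.Tensor:
        # fixations: (batch, seq_len, 3) with normalized x, y, t
        h = self.input_proj(fixations) + self.pe[: fixations.size(1)]
        h = self.encoder(h)
        return self.head(h)                              # (batch, seq_len, 3) reconstructions
```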
3. Masked Gaze Modeling
How does the model actually learn? The researchers adapted a technique called Masked Language Modeling (MLM), famously used by BERT.
In NLP, MLM works by hiding a word in a sentence (e.g., “The cat sat on the [MASK]”) and forcing the AI to guess the missing word based on context. ScanEZ does the exact same thing, but for eye movements.
The researchers randomly mask a percentage of the fixation points in the trajectory. The model sees the surrounding context—where the eyes were before and after the missing point—and must predict the spatial coordinates (\(x, y\)) and the duration (\(t\)) of the masked fixation.
This forces the model to learn the deep, underlying dependencies of reading behavior. It learns, for example, that a fixation on a difficult word is often followed by a regression, or that a short function word is often skipped.
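A minimal sketch of that masking step is shown below; the 15% mask ratio and the zero sentinel value are placeholders of mine, not necessarily the paper's settings.

```python
import torch

def mask_fixations(fixations: torch.Tensor, mask_ratio: float = 0.15, mask_value: float = 0.0):
    """Randomly hide a fraction of fixations for masked gaze modeling.

    fixations: (batch, seq_len, 3) tensor of normalized (x, y, t) values.
    Returns the corrupted input, the boolean mask of hidden positions, and the targets.
    """
    batch, seq_len, _ = fixations.shape
    mask = torch.rand(batch, seq_len) < mask_ratio        # True where the model must guess
    corrupted = fixations.clone()
    corrupted[mask] = mask_value                          # wipe out (x, y, t) at masked slots
    return corrupted, mask, fixations

corrupted, mask, targets = mask_fixations(torch.rand(8, 32, 3))
```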
4. The Loss Function
To train the model, the researchers use a loss function that measures how “wrong” the model’s prediction is compared to the ground truth. Specifically, they use the Mean Squared Error (MSE) for the masked indices.

\[
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left( X_i - \hat{X}_i \right)^2
\]
In this equation:
- \(\mathcal{M}\) represents the set of masked items (the ones the model has to guess).
- \(X_i\) is the actual ground truth (the real coordinate and time).
- \(\hat{X}_i\) is the model’s prediction.
The model tries to minimize this difference, sharpening its ability to predict exactly where and for how long a reader will look.
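Continuing the sketches above (and assuming the illustrative `FixationEncoder` and `mask_fixations` helpers from earlier, which are not the authors' code), the masked-MSE objective takes only a few lines:

```python
model = FixationEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

predictions = model(corrupted)                            # (batch, seq_len, 3)
# MSE computed only over the masked indices, as in the loss described above.
loss = ((predictions[mask] - targets[mask]) ** 2).mean()
loss.backward()
optimizer.step()
```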
Experimental Setup
To prove ScanEZ works, the researchers tested it against the current state-of-the-art model, Eyettention (Deng et al., 2023).
They used three diverse human datasets for fine-tuning and evaluation:
- CELER L1: Native English speakers reading sentences from the Wall Street Journal.
- ZuCo 1.0: Sentences from Wikipedia and movie reviews.
- EML: A complex dataset involving educational texts (biology, history), which is much harder to process than standard sentences.
Evaluation Metrics
Evaluating scanpaths is tricky. You can’t just ask “did it get the right word?” because fixations happen in continuous space and time. The researchers used several metrics:
- NLL (Negative Log-Likelihood): A statistical measure of how well the predicted probability distribution fits the real data. Lower is better.
- NLD (Normalized Levenshtein Distance): Measures the “edit distance” between the predicted sequence of fixations and the real one. Lower is better.
Crucially, they also introduced accuracy metrics for the specific dimensions of reading:
Fixation Duration Accuracy (FDA): This metric measures how close the predicted time (\(T_{pred}\)) is to the actual time (\(T_{ground}\)). A score of 1 means a perfect prediction.

Fixation Location Accuracy (FLA): This measures how close the predicted spatial coordinates (\(X, Y\)) are to the actual landing position.
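Of these metrics, NLD is the easiest to make concrete. The sketch below assumes fixations have already been mapped to word indices, which is a common way to discretize scanpaths; it illustrates the metric itself, not the authors' evaluation code.

```python
def normalized_levenshtein(pred: list, truth: list) -> float:
    """Edit distance between two word-index scanpaths, normalized to [0, 1]."""
    m, n = len(pred), len(truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / max(m, n, 1)

# Example: the prediction contains one extra regression the reader never made.
print(normalized_levenshtein([0, 1, 3, 2, 4], [0, 1, 3, 4]))  # 0.2
```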

Results: A New State-of-the-Art
The results were compelling. ScanEZ outperformed the baseline (Eyettention) across almost all experimental settings.
Beat-by-Beat Comparison
Let’s look at the performance on the CELER L1 dataset. The researchers tested the models using a “Participant + Text Split” (P.T.), which is the hardest setting. In this scenario, the model is tested on new participants reading new texts that it has never seen during training.

Referring to Table 1 (top section):
- NLL (Lower is better): ScanEZ achieved a score of 1.524, significantly lower than Eyettention’s 2.297. This represents a massive improvement in how well the model fits the data.
- NLD (Lower is better): ScanEZ scored 0.421, compared to Eyettention’s 0.568. This means ScanEZ’s predicted scanpaths were structurally much closer to human behavior.
The Power of “Learning to Read” Before Reading
One of the most interesting parts of this paper is the Ablation Study—a set of experiments where researchers remove parts of the system to see what breaks.
Look at the rows in Table 1 labeled w/o Fine-tuning and w/o Pre-training:
- Without Fine-tuning: If you only train on the synthetic E-Z Reader data and never show the model real human eye movements, the performance is poor (NLL 3.035). This confirms that while cognitive models are good, they aren’t human enough on their own.
- Without Pre-training: If you skip the synthetic data and train only on the small human dataset, the model performs decently (NLL 1.772), but still worse than the full ScanEZ system.
The Conclusion: You need both. The synthetic data acts as a “kick-start,” teaching the model the general physics of reading. The human data then refines this knowledge, correcting the mechanical rigidity of the simulation with the messy reality of human behavior.
Detailed Breakdown
For a more granular view, we can look at how ScanEZ performed across different splitting strategies.

Table 3 confirms that whether you split the data by Text (new sentences) or Participant (new readers), ScanEZ consistently holds the lead. It also provides the \(\mathrm{NLL}_t\) score (Negative Log-Likelihood for time), which Eyettention cannot provide because it doesn’t model time explicitly.
Cross-Dataset Generalization
A common failure mode in AI is “overfitting”—memorizing one specific dataset but failing on another. To test this, the researchers trained ScanEZ on CELER L1 and tested it on ZuCo 1.0 (and vice versa).

As shown in Table 4, ScanEZ demonstrates superior transferability. When trained on CELER L1 and tested on ZuCo, it achieved an NLL of 0.548, whereas Eyettention skyrocketed to 2.613. This suggests that ScanEZ isn’t just memorizing dataset quirks; it is learning universal principles of reading behavior.
Why This Matters
ScanEZ represents a shift in how we approach problems with limited data. Rather than viewing the scarcity of eye-tracking data as a dead end, the authors utilized the wealth of knowledge embedded in decades of cognitive science.
By translating a psychological theory (E-Z Reader) into a dataset, they effectively “downloaded” human knowledge into a format a Transformer could understand.
Key Takeaways:
- Hybrid Approach: Combining cognitive models (simulations) with deep learning (transformers) yields better results than either approach alone.
- Spatiotemporal is Essential: Reading is about where and when. Modeling fixation durations improves the overall accuracy of the trajectory.
- Self-Supervised Learning: The masked modeling objective allows the AI to learn complex dependencies in eye movements without requiring labeled data for every possible scenario.
This research paves the way for more “human-aware” AI. In the future, your e-reader might notice you struggling with a paragraph and offer a definition, or a search engine might detect that you are skimming and offer a summary instead. By understanding the scanpath, AI gets a glimpse into the mind of the reader.