Introduction
The ability to look inside the human mind and see what a person is imagining has long been the realm of science fiction. From Inception to Black Mirror, the concept of a “dream recorder” captures our collective imagination. However, in the field of computational neuroscience, this is not fiction—it is an active, rapidly evolving area of research known as fMRI-to-image reconstruction.
In recent years, we have seen an explosion in the capabilities of AI to reconstruct images that a person is viewing based solely on their brain activity. Models can now take functional Magnetic Resonance Imaging (fMRI) scans of a person looking at a surfer and generate a recognizable image of a surfer. But a much harder challenge remains: can we reconstruct what a person is merely imagining?
When you close your eyes and picture a red apple, your brain is active, but the signals are fainter and “noisier” than when you actually see an apple. Until recently, we lacked the data to properly train and test models on this internal “mind’s eye.”
This blog post explores a pivotal research paper, “NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery.” The researchers have released a groundbreaking dataset and performed an extensive analysis of state-of-the-art AI models. Their findings reveal that while we are getting closer to reading mental imagery, the best models for vision are not necessarily the best models for imagination.

Background: Vision vs. Imagery
To understand the magnitude of this contribution, we first need to distinguish between visual perception and mental imagery.
Visual perception occurs when light hits your retina, sending powerful, structured signals through your optic nerve to the visual cortex. It is a “bottom-up” process driven by external stimuli. Because the signal is strong and spatially organized, mapping fMRI data to the image seen (vision decoding) has achieved high fidelity.
Mental imagery, on the other hand, is a “top-down” process. It originates from frontal brain regions and memory systems, sending feedback signals to the visual cortex to simulate an image. These signals have a much lower Signal-to-Noise Ratio (SNR). They are fuzzier, less detailed, and harder to isolate from the brain’s background activity.
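To make the SNR point concrete, here is a toy simulation (synthetic numbers, not the paper's data) of the standard repeat-based estimate: the variance driven by the stimulus divided by the trial-to-trial variance across repeats of the same stimulus. The `vision` and `imagery` arrays below are purely hypothetical.

```python
import numpy as np

# Toy simulation: one voxel's responses to 100 stimuli, each repeated 3 times.
rng = np.random.default_rng(0)
true_signal = rng.normal(size=(100, 1))                          # stimulus-driven component
vision  = 1.0 * true_signal + 0.5 * rng.normal(size=(100, 3))    # strong, clean responses
imagery = 0.4 * true_signal + 1.0 * rng.normal(size=(100, 3))    # weak, noisy responses

def repeat_based_snr(responses):
    """Stimulus-driven variance divided by trial-to-trial noise variance."""
    n_repeats = responses.shape[1]
    noise_var = responses.var(axis=1, ddof=1).mean()             # within-stimulus variability
    signal_var = max(responses.mean(axis=1).var(ddof=1) - noise_var / n_repeats, 0.0)
    return signal_var / noise_var

print(f"vision SNR:  {repeat_based_snr(vision):.2f}")    # well above 1
print(f"imagery SNR: {repeat_based_snr(imagery):.2f}")   # well below 1
```

In this toy setup the imagery responses carry the same underlying content as the vision responses, just attenuated and buried in noise, which is exactly the situation a decoder trained on vision data has to cope with.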
The Data Gap
As shown in Figure 1 above, previous datasets for mental imagery were limited. Some used simple geometric shapes (blobs or letters), while others had very few subjects or trials. This limited the ability of modern Deep Learning models, which hunger for massive amounts of data, to learn the complex mapping between “thought” and “image.”
The Natural Scenes Dataset (NSD) changed the game for vision decoding by providing tens of thousands of trials of people viewing complex photos. The paper we are discussing today introduces NSD-Imagery, an extension of that massive dataset where the same subjects were asked to engage in mental imagery tasks. This allows researchers to perform cross-decoding: training an AI on the high-quality “vision” data and testing if it can generalize to the difficult “imagery” data.
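Conceptually, cross-decoding boils down to fitting a decoder on vision trials and then simply applying it to imagery trials. The sketch below uses placeholder arrays and a plain ridge regression onto generic image features; the real pipelines decode much richer targets (CLIP embeddings, diffusion latents), but the train-on-vision, test-on-imagery logic is the same.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder arrays standing in for NSD betas and image features.
rng = np.random.default_rng(0)
n_vision, n_imagery, n_voxels, n_feat = 2000, 24, 2000, 64
W = rng.normal(size=(n_voxels, n_feat)) / np.sqrt(n_voxels)   # unknown voxel-to-feature map

def make_trials(n, signal_gain, noise_sd):
    """Simulate trials: latent content -> target features and noisy voxel patterns."""
    latent = rng.normal(size=(n, n_voxels))
    features = latent @ W                                      # what we want to decode
    voxels = signal_gain * latent + noise_sd * rng.normal(size=(n, n_voxels))
    return voxels, features

X_train,  Y_train  = make_trials(n_vision, signal_gain=1.0, noise_sd=0.3)    # vision: strong signal
X_test_v, Y_test_v = make_trials(200,      signal_gain=1.0, noise_sd=0.3)    # held-out vision
X_test_i, Y_test_i = make_trials(n_imagery, signal_gain=0.4, noise_sd=1.0)   # imagery: weak, noisy

decoder = Ridge(alpha=1e3).fit(X_train, Y_train)               # trained on vision only

def mean_trial_corr(X, Y):
    preds = decoder.predict(X)
    return np.mean([np.corrcoef(p, y)[0, 1] for p, y in zip(preds, Y)])

print(f"held-out vision trials: r = {mean_trial_corr(X_test_v, Y_test_v):.2f}")
print(f"imagery trials:         r = {mean_trial_corr(X_test_i, Y_test_i):.2f}")
```

Even in this simplified setting, the decoder transfers to the "imagery" trials with noticeably lower, but still above-chance, accuracy, which is the pattern the paper investigates with real brains and real models.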
The Core Method: Structuring the “Mind’s Eye”
The researchers designed an experiment to rigorously test how well models can decode imagination. They gathered data from 8 subjects who had previously participated in the original NSD experiments.
The Experimental Tasks
The subjects participated in two primary types of runs:
- Vision Task: The subject sees an image (e.g., a donut) and a cue letter. This establishes the “ground truth” of how their brain reacts to seeing the object.
- Imagery Task: The subject sees only the cue letter and is instructed to vividly imagine the corresponding image for 3 seconds.

The Stimuli
To test the limits of the models, the researchers didn’t just use random photos. They selected three specific categories of stimuli:
- Simple Stimuli: Geometric shapes like oriented bars, crosses, and “X” shapes. These test if the model can capture basic structural information.
- Complex Stimuli: Natural photographs (e.g., a landscape, a surfer, a crowd). These test the model’s ability to reconstruct semantic content and details.
- Conceptual Stimuli: Single words like “mammal” or “stripes.” Here, the subject imagines a generic version of the concept.
The Models
The study evaluated five state-of-the-art vision decoding models, all trained on the original NSD (vision) data:
- MindEye1 & MindEye2: Recent high-performers that use contrastive learning and massive diffusion models (like those behind Stable Diffusion).
- Brain Diffuser: A model that combines simple linear decoding with generative diffusion.
- iCNN: An earlier deep-learning method that reconstructs images by iteratively optimizing them so their CNN features match features decoded from brain activity.
- Takagi et al.: A latent diffusion approach that maps brain activity into Stable Diffusion's image-latent and text-conditioning spaces with simple linear models.
The critical question was: Can models trained only on seen images reconstruct imagined ones?
Experiments & Results
The results of the study are both promising and surprising. They offer a nuanced view of where current AI stands in the quest to decode thought.
1. Qualitative Success: We Can See Your Thoughts
First, the good news. The best models can reconstruct mental images. When subjects imagined complex scenes—like a surfer or a plate of food—the models often generated images that matched the semantic category and composition of the thought.

In Figure 4, look at the “Complex Stimuli” rows (the natural photos).
- Row 3 (Surfer): Notice how Brain Diffuser and MindEye1 successfully generate an image of a person on a wave. It’s not a pixel-perfect copy, but the content is correct.
- Row 5 (Donuts): Several models successfully generate circular, food-like objects.
However, look at the “Simple Stimuli” (the lines and crosses at the top). The models struggle significantly here. Instead of clean black lines, they often generate weird, textured landscapes that vaguely follow the orientation of the line. This is likely because the generative models (like Stable Diffusion) have strong priors—they are trained on internet photos, so they try to force every brain signal to look like a photograph, even when the subject is imagining a simple line.
2. The Vision Baseline
For comparison, let’s look at how these models perform when the subject is actually seeing the image (Vision task).

As shown in Figure 3, the vision reconstructions are much sharper. MindEye1 and MindEye2 produce results that are strikingly similar to the ground truth. This confirms that the models work exceptionally well on the data they were trained for (perception). The drop in quality we see in Figure 4 (Imagery) is the cost of the “noise” in mental imagery.
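Side-by-side figures like these are usually backed by simple quantitative scores. As an illustrative sketch (made-up arrays, not the paper's evaluation code), even a basic pixel-wise correlation between a reconstruction and its ground-truth image exposes the kind of vision-to-imagery quality drop described here.

```python
import numpy as np

def pixel_correlation(recon, truth):
    """Pearson correlation between flattened pixel values of two images."""
    return float(np.corrcoef(recon.astype(float).ravel(), truth.astype(float).ravel())[0, 1])

# Hypothetical 64x64 grayscale images standing in for a reconstruction and its target.
rng = np.random.default_rng(0)
truth = rng.random((64, 64))
good_recon = 0.8 * truth + 0.2 * rng.random((64, 64))   # vision-like: close to the target
poor_recon = 0.3 * truth + 0.7 * rng.random((64, 64))   # imagery-like: much noisier

print(f"vision-like reconstruction:  r = {pixel_correlation(good_recon, truth):.2f}")
print(f"imagery-like reconstruction: r = {pixel_correlation(poor_recon, truth):.2f}")
```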
3. The “Decoupling” Surprise
Here is the most critical finding of the paper: A better vision model does not equal a better imagery model.
In the world of vision decoding, MindEye2 is currently considered the state-of-the-art (SOTA). It captures incredible detail from seen images. However, when applied to mental imagery, its performance collapses.
Take a look at the correlation analysis below:

In Figure 6, the X-axis represents how well a model reconstructs seen images, and the Y-axis represents how well it reconstructs imagined images.
- Brain Diffuser (Green squares): Performs reliably well on both.
- MindEye2 (Purple triangles): Shows a weak correlation. Despite being amazing at vision, it fails to generalize to imagination.
Why? The researchers hypothesize that complex architectures like MindEye2 essentially “overfit” to the high-fidelity signal of the visual cortex. They rely on fine-grained details that simply aren’t present in mental imagery. Simpler models like Brain Diffuser, which use robust ridge regression and multimodal features (images + text), are more forgiving and better at picking up the “gist” of the thought, which is exactly what mental imagery provides.
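To get a feel for the Brain Diffuser recipe described above, here is a minimal sketch of its decoding half: heavily regularized ridge regressions mapping voxels into separate image and text embedding spaces, which a pretrained generative model would then render into a picture. All arrays and dimensions are placeholders, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder data; in practice X would be vision-trial fMRI betas, and the targets
# would be CLIP image embeddings of the stimuli and CLIP text embeddings of captions.
rng = np.random.default_rng(1)
n_trials, n_voxels, emb_dim = 1000, 2000, 512
X = rng.normal(size=(n_trials, n_voxels))            # fMRI responses (vision trials)
clip_image = rng.normal(size=(n_trials, emb_dim))    # image-space targets
clip_text  = rng.normal(size=(n_trials, emb_dim))    # text-space targets

# One strongly regularized linear map per feature space; cross-validated alphas keep
# the decoder conservative, which helps it survive the lower SNR of imagery data.
alphas = np.logspace(2, 6, 9)
img_decoder = RidgeCV(alphas=alphas).fit(X, clip_image)
txt_decoder = RidgeCV(alphas=alphas).fit(X, clip_text)

def decode_conditioning(voxels):
    """Predict image- and text-space conditioning vectors for one new fMRI pattern.
    A pretrained generative model would then turn these into an image."""
    v = np.atleast_2d(voxels)
    return img_decoder.predict(v), txt_decoder.predict(v)

pred_img, pred_txt = decode_conditioning(rng.normal(size=n_voxels))
print(pred_img.shape, pred_txt.shape)   # (1, 512) (1, 512)
```

The design choice worth noticing is that everything learnable here is linear: all of the visual detail comes from the frozen generator, so the brain signal only needs to supply a coarse multimodal "gist".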
4. Human Evaluations: The Complexity Paradox
The researchers didn’t just rely on computer metrics; they asked 500 human raters to identify the reconstructed images.

Figure 5 reveals a fascinating “Complexity Paradox.” You might assume that imagining a simple white line is easier than imagining a complex landscape. But the results show the opposite:
- Complex scenes (Orange line) were reconstructed more accurately from imagination than Simple shapes (Green line) were reconstructed from vision.
This is likely due to the “priors” mentioned earlier. The AI models “know” what a landscape looks like. When they detect a faint “outdoor” signal from the brain, they can fill in the gaps with a realistic image. But when they receive a signal for “vertical line,” the AI—trained on natural photos—doesn’t know how to interpret it, often hallucinating a fence or a tree trunk instead.
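The human evaluation described above is essentially a forced-choice identification: given a reconstruction, pick the true stimulus out of several candidates. A tiny automated analogue of that judgment (with made-up embeddings, purely for illustration, not the paper's evaluation code) looks like this:

```python
import numpy as np

def identification_correct(recon_emb, target_emb, distractor_embs):
    """Forced-choice trial: is the reconstruction closer (cosine similarity)
    to its true target than to every distractor?"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    target_score = cos(recon_emb, target_emb)
    return all(target_score > cos(recon_emb, d) for d in distractor_embs)

# Made-up 8-dimensional "embeddings" for a single 5-way trial.
rng = np.random.default_rng(2)
target = rng.normal(size=8)
recon = target + 0.5 * rng.normal(size=8)          # noisy reconstruction of the target
distractors = [rng.normal(size=8) for _ in range(4)]
print("5-way identification correct:", identification_correct(recon, target, distractors))
```

Averaging this kind of binary outcome over many trials and raters is what produces the accuracy curves summarized in Figure 5.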
Conclusion & Implications
The release of NSD-Imagery marks a significant step forward in neuroscience and AI. It provides the benchmark needed to move from decoding what we see to decoding what we think.
Key Takeaways
- Generalization is Possible: We can use models trained on vision to decode imagination, avoiding the need to collect thousands of hours of difficult imagery data.
- Architecture Matters: Newer, bigger models aren’t always better. For the noisy, low-resolution signal of human thought, simpler and more robust architectures (like Brain Diffuser) currently win out over complex, specialized ones (like MindEye2).
- The Prior Problem: Current AI models are biased toward natural images. To decode abstract thoughts or simple shapes, we may need to retrain the generative backbones of these systems.
Broader Impact
Why does this matter? Beyond the cool factor, this technology has profound medical potential.
- Communication: For patients who are “locked-in” (conscious but unable to move or speak), reliable imagery decoding could provide a communication channel. If a patient imagines a specific scene (e.g., “beach” for yes, “forest” for no), these models could translate that into a message; a toy sketch of such a yes/no decoder follows this list.
- Diagnosis: It could help diagnose patients with disorders of consciousness, determining if an unresponsive patient is actually covertly conscious.
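To make the communication idea concrete, here is a toy yes/no decoder (hypothetical feature vectors, not a clinical system): calibrate on a few trials of each imagined scene, then assign new trials to the nearest class centroid.

```python
import numpy as np

# Hypothetical decoded feature vectors (e.g., embeddings predicted from fMRI)
# for calibration trials in which the patient imagined each scene several times.
rng = np.random.default_rng(3)
beach_proto, forest_proto = rng.normal(size=512), rng.normal(size=512)
calib_beach  = beach_proto  + 0.8 * rng.normal(size=(10, 512))
calib_forest = forest_proto + 0.8 * rng.normal(size=(10, 512))

centroids = {
    "yes (beach)": calib_beach.mean(axis=0),
    "no (forest)": calib_forest.mean(axis=0),
}

def answer_from_decoded_features(feat):
    """Nearest-centroid rule: map one decoded imagery trial to a yes/no answer."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cos(feat, centroids[label]))

new_trial = beach_proto + 0.8 * rng.normal(size=512)   # patient imagines the beach
print(answer_from_decoded_features(new_trial))          # expected: "yes (beach)"
```

Any real system would of course need far more careful calibration and validation, but the point stands: a binary choice demands much less signal than a full image reconstruction.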
The path to a true “dream recorder” is still long and full of noise, but papers like this show us that the signal is there—we just need to build the right antenna to catch it.