There is a central question currently dominating the field of Natural Language Processing (NLP): Are Large Language Models (LLMs) simply “stochastic parrots” mimicking patterns, or do they possess cognitive mechanisms similar to humans?
Much of the current evaluation of LLMs focuses on the end result. If a model answers a question correctly or writes a coherent story, we assume it “understands.” However, cognitive plausibility isn’t just about the output; it is about the process. To truly test if an LLM is cognitively plausible, we need to see if it makes the same distinct mental moves that humans do when processing language.
A fascinating new study, “Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility,” approaches this by examining a subtle but robust quirk in human psychology: the difference between how we produce language (speaking/writing) and how we interpret it (listening/reading).
In this deep dive, we will explore whether LLMs replicate the “asymmetry” humans show between these two modes of communication. The results offer a nuanced look at model architecture, the importance of prompting strategies, and the gap that still exists between artificial and biological intelligence.
The Cognitive Gap: Production vs. Interpretation
To understand the experiment, we first need to understand the human brain. Historically, linguists and neuroscientists treated language production (speaking) and interpretation (understanding) as distinct processes. While modern theories suggest they are tightly linked, humans still exhibit different biases depending on which mode they are in.
Consider the fundamental unit of an LLM: the probability of the next token given the context, or \(P(\text{token}|\text{context})\). For an LLM, writing a sentence and understanding a sentence are mathematically very similar—they both involve predicting the next word. But for humans, these tasks trigger different biases.
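To make that concrete, here is a minimal sketch of how one reads off \(P(\text{token}|\text{context})\) from a causal language model. The model name and helper below are illustrative choices (not from the paper); the point is that the same next-token computation serves both "writing" and "understanding."

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper tested LLaMA, Qwen, and GPT-4o.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def next_token_probs(context: str, candidates: list[str]) -> dict[str, float]:
    """Return P(candidate | context) for each candidate continuation."""
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the next position
    probs = torch.softmax(logits, dim=-1)
    out = {}
    for cand in candidates:
        ids = tokenizer.encode(" " + cand, add_special_tokens=False)
        out[cand] = probs[ids[0]].item()          # first sub-token as a rough proxy
    return out

# The same call covers both modes: the model only ever predicts what comes next.
print(next_token_probs("John infuriated Bill. He", ["kept", "apologized"]))
```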
The Test Case: Implicit Causality
The researchers utilized a linguistic phenomenon known as Implicit Causality (IC) to test this. IC refers to how certain verbs influence our expectations about who will be mentioned next in a sentence.
Consider these two sentences:
- John infuriated Bill.
- John praised Bill.
In sentence (1), the verb “infuriated” implies that John did something to cause the anger. John is the cause. We call this a Subject-Biased (IC1) verb. If you ask a human to continue this story (“John infuriated Bill…”), they are statistically more likely to keep talking about John.
In sentence (2), the verb “praised” implies Bill did something praiseworthy. Bill is the cause. We call this an Object-Biased (IC2) verb. Humans are more likely to continue by talking about Bill.
The Asymmetry
Here is where it gets interesting. We can test this bias in two ways (a minimal sketch of both framings follows the list):
- The Production Task (Next-Mention Bias): We give the participant the sentence “John infuriated Bill…” and ask them to write what happens next. They might write, “He kept yelling.”
- The Interpretation Task (Pronoun Resolution): We give the participant the sentence “John infuriated Bill. He…” and ask them who “He” refers to.
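To make the two framings concrete, here is a tiny illustration. The wording is a paraphrase of the task descriptions above, not the paper’s exact stimuli.

```python
# The same item, framed two ways (wording is illustrative, not the paper's exact stimuli).
subject, verb, obj = "John", "infuriated", "Bill"

# Production (next-mention bias): participants continue freely;
# we record whether the continuation is about the subject or the object.
production_prompt = f"{subject} {verb} {obj}."

# Interpretation (pronoun resolution): the pronoun is supplied;
# we record which referent participants assign it to.
interpretation_prompt = f"{subject} {verb} {obj}. He"

print(production_prompt)      # -> John infuriated Bill.
print(interpretation_prompt)  # -> John infuriated Bill. He
```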
Logically, you might expect the probability of choosing “John” to be the same in both tasks: if John is the most likely person to be mentioned next, he should also be the most likely referent for the pronoun.
However, psycholinguistic research shows a robust asymmetry. Humans have a “Subject Bias” bonus in interpretation. When we encounter a pronoun like “he,” we are more likely to resolve it to the subject (John) than we are to simply generate a sentence about John in a free-writing task.
Why? Because when humans interpret, we reason about why the speaker chose a pronoun in the first place. We implicitly weigh \(P(\text{pronoun}|\text{referent})\), the likelihood that a speaker would use a pronoun for a given referent, and we know that speakers tend to pronominalize the main subject. LLMs, which only calculate “what comes next,” may not naturally capture this distinction.
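One standard way this is formalized in the psycholinguistics literature (the Bayesian account of pronoun interpretation; the notation here is the textbook form, not a quote from the paper) is:

\[
P(\text{referent} \mid \text{pronoun}) \;\propto\; P(\text{pronoun} \mid \text{referent}) \times P(\text{referent})
\]

The second factor on the right is the production-side next-mention bias; the first is the likelihood that a speaker would choose a pronoun for that referent. Because subjects are pronominalized more often, interpretation gets an extra subject boost that free production does not.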
Methodology: Putting LLMs on the Couch
The researchers sought to answer two main questions:
- Do LLMs show the basic Implicit Causality effect (distinguishing between “infuriate” and “praise”)?
- Do LLMs show the human-like asymmetry (a stronger subject bias in interpretation than production)?
The Setup
The team constructed a dataset using 137 Subject-biased (IC1) verbs and 134 Object-biased (IC2) verbs. They created 541 items, each pairing two distinct same-gender names so that the pronoun stays ambiguous (e.g., “John” and “Bill” are both “he,” whereas “John” and “Mary” would make the pronoun obvious).
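A rough sketch of how such items could be assembled is below. The verb lists and name inventory are placeholders for illustration; the paper’s actual materials comprise 137 IC1 verbs, 134 IC2 verbs, and 541 items.

```python
import itertools

# Placeholder verb lists; the paper used 137 subject-biased (IC1)
# and 134 object-biased (IC2) verbs.
ic1_verbs = ["infuriated", "amazed", "disappointed"]   # cause = subject
ic2_verbs = ["praised", "thanked", "congratulated"]    # cause = object

# Same-gender name pairs keep the pronoun ambiguous:
# "John ... Bill ... He" could refer to either one.
male_names = ["John", "Bill", "Mark"]
female_names = ["Mary", "Anna", "Susan"]

def build_items(verbs, names, bias_label, pronoun):
    items = []
    for subj, obj in itertools.permutations(names, 2):   # ordered pairs of distinct names
        for verb in verbs:
            items.append({
                "sentence": f"{subj} {verb} {obj}.",
                "subject": subj,
                "object": obj,
                "verb": verb,
                "bias": bias_label,   # IC1 or IC2
                "pronoun": pronoun,
            })
    return items

dataset = (build_items(ic1_verbs, male_names, "IC1", "He")
           + build_items(ic1_verbs, female_names, "IC1", "She")
           + build_items(ic2_verbs, male_names, "IC2", "He")
           + build_items(ic2_verbs, female_names, "IC2", "She"))
print(len(dataset), dataset[0])
```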
They tested four major models:
- LLaMA-3.1-Instruct-8B (Smaller open-source)
- Qwen2.5-Instruct-32B (Mid-sized)
- LLaMA-3.3-Instruct-70B (Large open-source)
- GPT-4o (Proprietary state-of-the-art)
Meta-Linguistic Prompting
Since LLMs don’t have “brains” to scan, the researchers used Meta-Linguistic Prompting: they asked the models to perform the tasks using four different prompt styles (sketched in code after the list). This is crucial because, as we will see, how you ask an LLM a question changes how it behaves.
- Binary Choice: “Who is more likely to be the subject? John or Bill?”
- Continuation: “Please reasonably continue the sentence…” (The model writes a sentence, and the researchers analyze who it wrote about).
- Yes/No: “Does the pronoun refer to John? Answer Yes or No.”
- Yes/No Probability: Same as above, but measuring the mathematical probability the model assigns to the “Yes” token.
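The templates below paraphrase the four strategies (interpretation-task versions shown); the paper’s exact wording may differ.

```python
# Paraphrased templates for the four strategies (interpretation-task versions shown);
# the paper's exact wording may differ.
SENTENCE = "John infuriated Bill. He apologized."

prompts = {
    "binary_choice": f"{SENTENCE}\nWho does 'He' refer to, John or Bill? Answer with one name.",
    "continuation":  "Please reasonably continue the sentence: John infuriated Bill. He",
    "yes_no":        f"{SENTENCE}\nDoes 'He' refer to John? Answer Yes or No.",
    # "yes_no_probability" reuses the "yes_no" prompt, but instead of reading the generated
    # answer we measure the probability the model assigns to the "Yes" token.
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

For the Yes/No Probability variant, one would score the prompt with the model and read off the probability of the “Yes” token at the next-token position, much like the `next_token_probs` helper sketched earlier.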
Data Cleaning and Validity
To ensure the “Continuation” prompt results were valid, the authors manually verified the outputs. If a model generated nonsense, ambiguous text, or used “They” (plural), it was excluded.

As shown in Table 1, larger models like GPT-4o and LLaMA-70B were much better at following instructions, producing very few nonsensical or plural responses compared to the smaller LLaMA-8B.
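The authors did this screening by hand. Purely as an illustration of the exclusion criteria described above, here is a rough automatic pre-filter one might run before manual review; the helper and its heuristics are assumptions, not the paper’s procedure.

```python
def prefilter_continuation(text: str, subject: str, obj: str) -> str | None:
    """Rough pre-filter for generated continuations (manual review still required).

    Returns "subject", "object", or None (excluded) based on which referent
    is named first in the continuation. Continuations that only use an
    ambiguous pronoun still need a human annotator.
    """
    lowered = f" {text.lower()} "
    # Exclude plural or collective continuations ("They argued...").
    if lowered.strip().startswith("they") or " they " in lowered:
        return None
    subj_pos = lowered.find(subject.lower())
    obj_pos = lowered.find(obj.lower())
    if subj_pos == -1 and obj_pos == -1:
        return None   # neither referent named explicitly; leave for manual annotation
    if obj_pos == -1 or (subj_pos != -1 and subj_pos < obj_pos):
        return "subject"
    return "object"

print(prefilter_continuation("John kept yelling at him.", "John", "Bill"))  # -> subject
print(prefilter_continuation("They argued for hours.", "John", "Bill"))     # -> None
```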
Experimental Results
The results provide a complex picture of machine cognition. The models do not behave uniformly; their ability to mimic human biases changes based on their size and the specific prompt used.
The Big Picture
Figure 1 (below) visualizes the performance across all models and prompting strategies. The bars represent the proportion of times the Subject was chosen.
- Red Bars: Production Task.
- Teal Bars: Interpretation Task.
- Human Behavior: Shown in the far right column of each cluster. Note that for humans, the Teal bar is almost always higher than the Red bar for IC1 verbs—that is the “Subject Interpretation Bias.”

Finding 1: The IC Bias Exists, But Is Fragile
The first hurdle for cognitive plausibility is simply recognizing that “infuriate” creates a different expectation than “praise.”
Most models successfully captured this. They generally showed a higher preference for the Subject after IC1 verbs than after IC2 verbs. However, this wasn’t universal: Qwen, for example, failed to show the effect in the production task under most prompts.
Finding 2: The Production-Interpretation Asymmetry is Rare
This was the core question: Do LLMs replicate the human “Subject Boost” when resolving pronouns (Interpretation) versus generating text (Production)?
In most cases, no. LLMs generally failed to capture the distinct gap between production and interpretation that characterizes human cognition. In fact, some models showed the reverse pattern, or no difference at all.
However, there were exceptions. The Yes/No prompting strategy was the most successful at revealing this asymmetry. When explicitly asked to judge a reference (“Does ‘he’ refer to John?”), LLaMA models and GPT-4o began to show patterns aligning with human biases.
Finding 3: Scaling Matters
The size of the model played a significant role in cognitive plausibility.
Let’s look at LLaMA-3.1-8B (the smaller model). When using Yes/No prompting, the statistical analysis showed reliable interaction effects.

In Table 6, the verb:task interaction is negative (-1.99), and the credible interval doesn’t cross zero. This suggests the model does differentiate between tasks, but as we look deeper into the pairwise comparisons (Table 7 below), we see the nuance.

The smaller model replicates the direction of the asymmetry, but often gets the magnitude wrong or flips behaviors in complex prompts like “Continuation” (where it performed poorly).
Now compare this to LLaMA-3.3-70B (the larger model).

The 70B model shows much cleaner results in the Yes/No setting (Table 13 in the paper, represented here by the regression summary). It captures the IC verb effect and the production-interpretation asymmetry more reliably than its smaller counterpart. This suggests that as models scale, they may naturally acquire more subtle “human-like” processing traits, or at least better approximate them.
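The interaction terms and credible intervals quoted above come from Bayesian regressions over the models’ scored responses. Below is a sketch of how such a model could be fit in Python; the library (bambi), the formula, and the column names are assumptions for illustration, and the paper’s exact specification may differ.

```python
import bambi as bmb
import arviz as az
import pandas as pd

# Hypothetical file: one row per scored model response, with columns
#   chose_subject (0/1), verb_bias ("IC1"/"IC2"), task ("production"/"interpretation"), item
df = pd.read_csv("responses.csv")

# Bayesian logistic regression with a verb-bias x task interaction and
# random intercepts per item (formula is an assumption, not the paper's exact model).
model = bmb.Model(
    "chose_subject ~ verb_bias * task + (1|item)",
    data=df,
    family="bernoulli",
)
fitted = model.fit(draws=2000, chains=4)

# Inspect the verb_bias:task row: an interaction whose credible interval excludes zero
# indicates that the subject preference differs between the two tasks.
print(az.summary(fitted))
```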
Finding 4: GPT-4o and the “Yes/No” Preference
GPT-4o, currently a gold standard in the industry, also showed that prompt selection is critical.

In the Continuation prompt (Table 22), GPT-4o showed a verb estimate of -3.11. This actually indicates a reversed IC verb effect—it was more subject-biased for Object-biased verbs than Subject-biased verbs, which is cognitively implausible.
However, when switched to Yes/No Prompting, GPT-4o aligned much better with human data. This highlights a critical flaw in how we evaluate models: a model might look “cognitively alien” in one prompt format (Continuation) but “cognitively plausible” in another (Yes/No).
Discussion: What Does This Mean for AI?
This research highlights that LLMs are not currently replicating the dual-process nature of human language (production vs. interpretation) by default.
The Problem with “Probabilistic Unity”
Humans have biological reasons for processing language differently when we speak versus when we listen. LLMs, however, are built on a unified objective: predict the next token. They do not have separate modules for “speaking” and “understanding.”
The fact that they can replicate the asymmetry at all (specifically in Yes/No prompts) is surprising. It suggests that the instruction-tuning process (where models are trained to follow chat instructions) might be creating pseudo-cognitive modes that simulate these human distinctions.
The Importance of Prompt Design
One of the paper’s most practical takeaways for students and researchers is the sensitivity of Meta-Linguistic Prompts.
- Continuation prompts (asking the model to write) were generally the worst at capturing human cognitive biases. The authors suggest this is because models are heavily fine-tuned to be “helpful assistants,” which constrains their creative writing and biases them toward specific, safe response patterns that humans don’t have.
- Yes/No prompts were the best. This challenges previous assumptions that using raw probabilities (the log-odds of tokens) is always the best way to measure a model. Sometimes, explicitly asking the model for a judgment yields a more “human” result.
Conclusion
Are LLMs cognitively plausible? The answer from this study is a “qualified maybe.”
They do not naturally process language with the same production-interpretation distinctions that humans possess. A human creates a mental model of the speaker when interpreting; an LLM calculates probabilities.
However, sufficiently large models (like LLaMA-70B and GPT-4o), when prompted correctly, can simulate this behavior. They show us that “understanding” in AI is malleable—it changes based on the size of the neural network and the specific phrasing of the question.
For the future of AI, this suggests that if we want models to truly interact with us naturally, we may need to move beyond simple next-token prediction and look toward architectures or training methods that respect the fundamental differences between being a speaker and being a listener.