Introduction

Imagine a student taking a physics exam. They get the final answer right. Does that mean they understand physics? Maybe. Or maybe they memorized the answer key. Or perhaps they made two calculation errors that canceled each other out. Without seeing their “scratch work”—the intermediate steps of reasoning—it is impossible to know if they truly understand the material or if they are just good at mimicking the right output.

This is the current crisis in evaluating Large Language Models (LLMs). We generally test models like GPT-4 or LLaMA on massive benchmarks (like MMLU) and look at the final accuracy. We ask, “Did the model get the right answer?” but rarely ask, “Did the model use the right cognitive skills to get there?”

Because LLMs are trained on vast amounts of text, they often conflate linguistic proficiency (knowing how to form sentences and use words) with cognitive capability (reasoning, planning, and world modeling). A model might write a beautiful, grammatically complex sentence that is logically nonsensical. Conversely, a model might have the correct reasoning but fail to articulate it.

In this deep dive, we are exploring a new framework called FAC²E (Fine-grAined and Cognition-grounded Capability Evaluation). This research proposes a fascinating shift: evaluating AI not just by the final output, but by dissociating “language” from “cognition,” inspired by how the human brain processes information. By forcing models to “show their work,” the researchers uncovered a critical gap between what LLMs know and how they use that knowledge.

The Biological Inspiration: Language vs. Cognition

To understand why current benchmarks fail, we first need to look at the human brain. Neuroscience tells us that the brain doesn’t handle all “thinking” in one big lump. There is a distinct separation between:

  1. The Language Network: Areas of the brain sensitive to linguistic regularities, grammar, and sentence structure.
  2. The Multiple-Demand Network: Areas responsible for cognitive challenges, reasoning, memory, and problem-solving.

A human can have perfect grammar but poor reasoning skills, or excellent logical deduction while struggling with vocabulary. The researchers behind FAC²E argue that LLMs should be evaluated with this same distinction. We cannot treat “intelligence” as a single metric. We must dissociate the ability to speak fluent English from the ability to model the world and reason through problems.

The FAC²E Framework

The core contribution of this paper is a structured taxonomy that maps LLM capabilities onto four distinct dimensions, moving from pure language to complex social cognition.

1. The Four Dimensions of Capability

The researchers organized LLM skills into four axes:

  • Linguistic Knowledge: This is the foundational layer. It covers Grammaticality (syntax, subject-verb agreement) and Semantics (word meanings, synonyms, antonyms). It asks: Does the model understand the rules of the language?
  • Formal Knowledge: This moves slightly up the ladder to word-based formal reasoning. It includes Mechanisms (deductive and inductive reasoning) and Skills (numeric logic and symbol manipulation). It asks: Can the model manipulate symbols and follow logical rules?
  • World Modeling: Now we enter the cognitive realm. This covers Remembering (recalling facts and commonsense) and Understanding (grasping narrative structures and events). It asks: Does the model have a coherent internal map of facts and events?
  • Social Modeling: The highest level of complexity. This includes Pragmatics (understanding irony, humor, metaphor) and Theory of Mind (inferring what someone else is thinking or feeling). It asks: Can the model understand the intent behind the words?
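
To keep the four axes straight, here is the taxonomy condensed into a small Python dictionary. This is purely a reader's aid; the key names are my own shorthand, not identifiers from the paper.

```python
# The FAC²E capability taxonomy at a glance.
# Dimension -> {sub-capability: what it covers} (names are illustrative shorthand).
FAC2E_TAXONOMY = {
    "linguistic_knowledge": {
        "grammaticality": "syntax, subject-verb agreement",
        "semantics": "word meanings, synonyms, antonyms",
    },
    "formal_knowledge": {
        "mechanisms": "deductive and inductive reasoning",
        "skills": "numeric logic, symbol manipulation",
    },
    "world_modeling": {
        "remembering": "recalling facts and commonsense",
        "understanding": "grasping narrative structures and events",
    },
    "social_modeling": {
        "pragmatics": "irony, humor, metaphor",
        "theory_of_mind": "inferring what others think or feel",
    },
}
```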

2. The Evaluation Pipeline: Show Your Work

Defining the categories is only half the battle. The real innovation of FAC²E is how it measures them. Instead of a simple Q&A, the framework forces the model to break down its process.

The researchers drew upon Cattell's theory of fluid and crystallized intelligence, which divides intelligence into two types:

  • Crystallized Intelligence: The knowledge you have stored (facts, vocabulary, rules).
  • Fluid Intelligence: The ability to solve novel problems by applying that knowledge.

FAC²E operationalizes this by decomposing every prompt into three sub-steps. It uses a Chain-of-Thought (CoT) style approach to force the model to output three specific things:

  1. Crystallized Step (\(r_1\)): Recall the relevant knowledge.
  2. Fluid Step (\(r_2\)): Apply that knowledge to the specific context.
  3. Problem-Solving Step (\(r_3\)): The final answer.

Figure 1: Illustration of the FAC²E pipeline, showing the decomposition of a question into crystallized, fluid, and problem-solving steps.

As shown in Figure 1 above, the pipeline takes an input (like an analogy question) and forces the model to “talk to itself.” It creates follow-up questions to elicit the relevant knowledge (Crystallized), then asks how that knowledge applies here (Fluid), and finally asks for the answer.

By evaluating \(r_1\), \(r_2\), and \(r_3\) separately against reference answers, the researchers can pinpoint exactly where a model fails. Did it fail because it didn’t know the fact? Or did it know the fact but fail to apply it?
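
To make this concrete, here is a minimal sketch of how such a three-step evaluation loop could be wired up. The prompt wording, the `ask` callable, and the similarity-based `score` function are all my own placeholders; the paper's actual instructions and scoring protocol are more elaborate.

```python
from difflib import SequenceMatcher
from typing import Callable

def score(response: str, reference: str) -> float:
    """Toy similarity score in [0, 1]; the paper uses its own matching scheme."""
    return SequenceMatcher(None, response.lower(), reference.lower()).ratio()

def evaluate_item(ask: Callable[[str], str], question: str, refs: dict) -> dict:
    """Run the three FAC²E-style sub-steps with `ask` (any text-in/text-out model call)
    and score each step separately against its reference."""
    # Step 1 (crystallized): elicit the relevant knowledge.
    r1 = ask(f"{question}\nWhat knowledge is needed to answer this?")
    # Step 2 (fluid): ask how that knowledge applies to this specific case.
    r2 = ask(f"{question}\nKnowledge: {r1}\nHow does this knowledge apply here?")
    # Step 3 (problem solving): request the final answer, given both rationales.
    r3 = ask(f"{question}\nKnowledge: {r1}\nApplication: {r2}\nTherefore, the answer is:")
    return {
        "s1": score(r1, refs["crystallized"]),
        "s2": score(r2, refs["fluid"]),
        "s3": score(r3, refs["answer"]),
    }
```

The point is simply that each sub-response gets its own score, so a failure can be localized to recall, application, or the final answer.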

3. The Instruction Design

To make this work, the prompts given to the models are highly structured. The researchers provide “demonstrations”—examples of how to break down a thought process.

Full-version example of the capability-specific instruction for grammaticality.

In the example above regarding grammaticality, notice how the model isn’t just asked “Is this sentence correct?” It is forced to identify the rule (Negative Polarity Items like “any” need a negative scope) and then check the specific sentence against that rule. This exposes the “black box” reasoning process to scrutiny.
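
To give a flavor of what such an instruction can look like, here is a rough paraphrase assembled in code. The wording is mine, not the paper's verbatim prompt, and the example sentences are hypothetical.

```python
# Illustrative capability-specific instruction for grammaticality (paraphrased, not verbatim).
GRAMMATICALITY_INSTRUCTION = """\
Task: Decide whether the sentence is grammatically acceptable.

Demonstration:
Sentence: "I didn't see any students in the hallway."
Step 1 (knowledge): Negative polarity items such as "any" must occur within
the scope of a negation or a question.
Step 2 (application): Here, "any" is licensed by the negation "didn't".
Step 3 (answer): Acceptable.

Now follow the same three steps for this sentence:
Sentence: "{sentence}"
"""

prompt = GRAMMATICALITY_INSTRUCTION.format(
    sentence="She has seen any movies this year."  # no negation, so "any" is unlicensed
)
print(prompt)
```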

Experiments and Results

The researchers tested a wide range of models, from open-source backbones (like LLaMA and T5) to proprietary giants (like GPT-3.5 and GPT-4). They re-formatted 17 diverse benchmarks into this unified FAC²E format.
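
The paper's exact unified schema is not reproduced here, but conceptually each re-formatted item needs the original question plus a reference answer for every sub-step. A hypothetical record might look like this (field names and contents are my own illustration):

```python
# Hypothetical shape of one re-formatted evaluation item (not the paper's actual schema).
example_item = {
    "capability": "world_modeling/remembering",
    "question": "Which planet is known as the Red Planet?",
    "references": {
        "crystallized": "Mars appears red because of iron oxide on its surface.",
        "fluid": "The nickname 'Red Planet' therefore refers to Mars.",
        "answer": "Mars",
    },
}
```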

The Performance Gap: Open vs. Proprietary

The first major finding is the disparity between open-source models and proprietary ones, particularly in “deep” cognition tasks.

Table 4: Quantitative results across models; proprietary models show higher scores in World and Social Modeling.

As seen in Table 4, while open-source models (in blue) are competitive in Linguistic Knowledge, they fall behind significantly in World Modeling and Social Modeling compared to models like GPT-4 (in red). This confirms the hypothesis that “speaking well” (Linguistic) does not automatically grant “thinking well” (Cognition).

The Knowledge Utilization Gap

Perhaps the most insightful finding involves the difference between Crystallized Performance (\(s_1\)) and Fluid Performance (\(s_2\)).

The researchers found that models often have high Crystallized scores—they recall the facts correctly. However, their Fluid scores drop significantly. They have the knowledge, but they lack the reasoning capability to apply it effectively to the problem at hand.
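
One simple way to put a number on this effect (a rough proxy of my own, not necessarily the paper's exact metric) is to look at how much score is lost between the recall step and the application step:

```python
def utilization_gap(s1: float, s2: float) -> float:
    """How much performance drops between recalling knowledge (crystallized, s1)
    and applying it in context (fluid, s2). Larger values = knowledge is under-used."""
    return s1 - s2

# Illustrative numbers only, not results from the paper.
print(f"{utilization_gap(0.82, 0.61):.2f}")  # prints 0.21
```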

Figure 3: Bar diagram showing the drop-off between knowledge possession (\(s_1\) + \(s_2\)) and final problem solving (\(s_3\)).

Figure 3 illustrates this relationship. In many cases, the intermediate performance (the stacked bars representing knowledge and application) is reasonable, but the final problem-solving capability (\(s_3\)) varies. Notably, GPT-3.5 maintains a much higher fluid performance than the open-source counterparts. This suggests that the “secret sauce” of advanced models isn’t just having more data—it’s having better mechanisms for using that data.

Correlations: Language does not equal Cognition

One of the central theses of the paper is that language and cognition are dissociated. The data backs this up.

Figure 2: Heatmap showing pairwise correlations between different capabilities.

The heatmap in Figure 2 reveals that capabilities within the same dimension (e.g., World Modeling skills) correlate strongly with each other. However, Linguistic Knowledge has a much weaker correlation with Social Modeling or World Modeling. Being good at grammar (Ling.Gram) does not predict that a model will be good at Theory of Mind (Social.ToM). This validates the need to evaluate these dimensions separately rather than averaging them into a single score.
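
For readers who want to reproduce this kind of analysis on their own evaluation results, the correlation structure is straightforward to compute: treat each capability as a variable and each model's scores as observations. The numbers below are synthetic, not the paper's data.

```python
import numpy as np

# Rows: models, columns: per-capability scores (synthetic illustrative numbers).
capabilities = ["Ling.Gram", "Ling.Sem", "World.Rem", "Social.ToM"]
scores = np.array([
    [0.78, 0.74, 0.55, 0.40],  # model A
    [0.81, 0.79, 0.70, 0.62],  # model B
    [0.76, 0.72, 0.48, 0.35],  # model C
    [0.85, 0.83, 0.82, 0.77],  # model D
])

# Pearson correlations between capability columns (rowvar=False -> columns are variables).
corr = np.corrcoef(scores, rowvar=False)
for i, a in enumerate(capabilities):
    for j in range(i + 1, len(capabilities)):
        print(f"{a:>10} vs {capabilities[j]:<10} r = {corr[i, j]:.2f}")
```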

Does Size Matter?

The study also looked at how model scale affects these specific capabilities.

Figure 4: Line graph showing performance of LLaMA models of different sizes across tasks.

As shown in Figure 4, scaling up the model (from 7B to 65B) provides a universal boost, but the gains are most dramatic in the complex cognitive tasks (right side of the graph). Smaller models “collapse” on Social Pragmatics and Theory of Mind, whereas the 65B model and GPT-3.5 maintain competence. This suggests that while linguistic fluency can be achieved at smaller scales, cognitive robustness requires scale (or more advanced architectures).

Instruction Tuning vs. Pre-training

Does fine-tuning make a model smarter? The researchers compared different fine-tuning datasets (Alpaca, Flan, ShareGPT).

Figure 5: Comparison of LLaMA performance with different instruction-tuning datasets.

Surprisingly, Figure 5 shows that the choice of instruction-tuning dataset (human-written vs. model-generated) doesn’t create a massive difference in the capability profile. The capability ceilings seem to be determined largely by the pre-training stage (the base model).

The Solution: Knowledge Injection

The analysis revealed a specific weakness: models fail to utilize the knowledge they possess. If this is the bottleneck, can we fix it by externally supplying the reasoning?

The researchers proposed a knowledge-enhanced method. Instead of just asking the model the question, they injected the reference "Crystallized" (\(R_1\)) and "Fluid" (\(R_2\)) rationales (the gold-standard counterparts of \(r_1\) and \(r_2\)) directly into the input.

Figure 6: The knowledge-enhanced baselines, where reference rationales are fed back into the model.

In Figure 6, we see the process.

  • Process (a) is the standard attempt.
  • Process (b) feeds the correct knowledge fact (\(R_1\)) to the model.
  • Process (c) feeds the correct fact (\(R_1\)) AND the correct application logic (\(R_2\)) to the model.
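
A minimal sketch of how the three conditions could be assembled as prompts; the wording, helper name, and example rationales are mine, not the paper's:

```python
def build_injection_prompts(question: str, R1: str, R2: str) -> dict:
    """Mirror the three conditions: (a) no injection, (b) crystallized rationale only,
    (c) crystallized + fluid rationales prepended to the question."""
    return {
        "a_standard": question,
        "b_inject_R1": f"Relevant knowledge: {R1}\n{question}",
        "c_inject_R1_R2": f"Relevant knowledge: {R1}\nHow it applies: {R2}\n{question}",
    }

prompts = build_injection_prompts(
    question="Is the sentence 'She has seen any movies this year.' acceptable?",
    R1="Negative polarity items like 'any' require a negation or question context.",
    R2="This sentence contains no negation, so 'any' is not licensed.",
)
```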

The Result: When the researchers injected these rationales into a smaller model (LLaMA 2), the performance skyrocketed.

Figure 7: Radar chart comparing LLaMA 2 performance with and without knowledge injection.

Figure 7 is striking. The green line (standard LLaMA 2) is the smallest shape. The red line (LLaMA 2 with injected knowledge) expands outward, covering almost the same area as the highly tuned LLaMA 2-Chat model (orange).

This proves a critical point: The base models are not necessarily “stupid”; they are just “inarticulate” in their reasoning. When guided through the logic explicitly, they can perform at a much higher level.

Conclusion and Implications

The FAC²E framework offers a necessary correction to how we view Artificial Intelligence. By treating LLMs as monolithic “black boxes,” we miss the nuance of their failures and successes.

Key Takeaways:

  1. Language \(\neq\) Cognition: We must evaluate linguistic fluency and cognitive reasoning as separate tracks. A model can write poetry while failing basic logic.
  2. The Utilization Gap: Models often “know” the facts (Crystallized intelligence) but fail to apply them to new problems (Fluid intelligence).
  3. The Stepwise Solution: Decomposing prompts to force specific reasoning steps (Recall \(\rightarrow\) Apply \(\rightarrow\) Solve) provides a better diagnostic tool and, potentially, a way to prompt models for better performance.

For students and researchers entering the field, this paper highlights that the future of LLM development isn’t just about making models larger or feeding them more data. It is about architectural and training improvements that bridge the gap between storing knowledge and using it—helping the machine move from simple recall to genuine fluid intelligence.