Introduction

How does a large language model (LLM) “see” an image? When we feed a photograph of a chest X-ray or a satellite view of a city into a Multimodal Large Language Model (MLLM) like LLaVA or InstructBLIP, we know the architecture: an image encoder converts the picture into visual features, a projector maps those features into the language space, and the LLM generates a response. But what happens in the hidden layers between that initial projection and the final answer?

Does the model process a medical image the same way it processes a picture of a cat? Or does it switch into a “medical mode,” utilizing specific pathways in its neural network designed to handle domain-specific knowledge?

In the field of text-only LLMs, researchers have already discovered “language-specific neurons”: components that activate almost exclusively when the model processes a particular language, such as French, Chinese, or English. This finding reshaped our understanding of how models handle multilingualism. Now, a new study titled “MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model” applies the same forensic lens to vision-language models.

Figure 1: Comparison of language-specific neurons in LLMs versus domain-specific neurons in MLLMs.

As illustrated above, just as a multilingual model has neurons dedicated to specific languages (Figure 1a), the authors of this paper hypothesize that MLLMs possess domain-specific neurons (Figure 1b). These are neurons that activate uniquely when the model encounters visual content from specific fields like Medicine, Remote Sensing, or Autonomous Driving.

In this post, we will tear down the black box of MLLMs. We will explore how the researchers identified these specialist neurons, visualize the “three-stage mechanism” the model uses to understand images, and discuss what this means for the future of AI interpretability.

Background: The Multimodal Landscape

To understand MMNeuron, we first need to contextualize the current state of Multimodal Large Language Models. Representative models like LLaVA-NeXT and InstructBLIP follow a distinct pipeline. They do not train a vision model from scratch; instead, they rely on pre-trained vision encoders (like CLIP or ViT) to extract features from an image. These features are then “projected” into the word embedding space—essentially translating visual data into a language the LLM can understand.

However, “translation” is a metaphor. Mathematically, these are high-dimensional vectors. A key question is whether the visual features from different domains—say, a document versus a driving scene—look different to the model in this space.

Figure 2: PCA visualization of image embeddings extracted through CLIP’s image encoder, showing distinct clusters for different domains.

The researchers gathered data from five distinct domains:

  1. Common Scenes: Everyday photos (dataset: VQAv2).
  2. Medical: X-rays, CT scans (dataset: PMC-VQA).
  3. Remote Sensing: Satellite imagery (dataset: RS-VQA).
  4. Documents: Text-heavy images (dataset: DocVQA).
  5. Auto Driving: Dashboard views (dataset: LingoQA).

As shown in the PCA visualization above (Figure 2), the visual features of these domains naturally cluster into distinct groups. The blue cluster (Auto Driving) and the green cluster (Remote Sensing) sit far apart. This statistical separation suggests that the model should process them differently. The goal of this paper is to find the specific machinery, the artificial “neurons”, responsible for this processing.
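
To make this concrete, here is a minimal sketch of how such a domain-clustering view could be reproduced with off-the-shelf tools: Hugging Face’s CLIP implementation and scikit-learn’s PCA. The image paths, domain labels, and checkpoint choice are illustrative placeholders, not the paper’s exact setup; in practice the images would come from the five datasets listed above.

```python
# Sketch: extract CLIP image features per domain and inspect their PCA projection.
import torch
from PIL import Image
from sklearn.decomposition import PCA
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

samples = [                                   # (image path, domain) -- hypothetical placeholders
    ("common/cat.jpg", "common"),
    ("medical/chest_xray.png", "medical"),
    ("remote_sensing/tile_042.png", "remote_sensing"),
]

features, domains = [], []
with torch.no_grad():
    for path, domain in samples:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        feat = model.get_image_features(**inputs)      # pooled image embedding
        features.append(feat.squeeze(0))
        domains.append(domain)

# Project to 2-D; with enough samples per domain, points should form domain clusters.
coords = PCA(n_components=2).fit_transform(torch.stack(features).numpy())
for (x, y), d in zip(coords, domains):
    print(f"{d:>15s}: ({x:+.3f}, {y:+.3f})")
```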

Core Method: Discovering MMNeurons

The methodology proposed by the authors is a forensic framework designed to identify, quantify, and analyze neurons based on their responsiveness to specific visual domains.

1. The Architecture of Activation

First, let’s formalize the MLLM pipeline. The model takes an image (\(X_v\)), passes it through a vision encoder (\(f_{\Theta}\)), and then through a projector (\(f_{\Pi}\)) to create post-projection features (\(H_v\)).

\[ H_v = f_{\Pi}\big(f_{\Theta}(X_v)\big) \]
Equation 1: Formula for post-projection visual features.

These visual features are concatenated with the text instructions (\(H_q\)) and fed into the language model (\(f_{\Phi}\)) to generate an answer (\(X_a\)).

\[ X_a = f_{\Phi}\big([H_v;\, H_q]\big) \]
Equation 2: Formula for generating the answer based on visual and language features.
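
It may help to see Equations 1 and 2 as code before we look inside the FFNs. The following is a toy sketch with tiny, randomly initialized stand-in modules; the dimensions and modules are invented for illustration and are not the actual LLaVA or InstructBLIP components.

```python
# Toy sketch of the pipeline in Equations 1-2: encode, project, concatenate, forward.
import torch
import torch.nn as nn

d_vision, d_model, n_patches, n_text = 64, 128, 16, 8

f_theta = nn.Linear(3 * 32 * 32, n_patches * d_vision)        # stand-in vision encoder
f_pi    = nn.Linear(d_vision, d_model)                         # stand-in projector
f_phi   = nn.TransformerEncoder(                               # stand-in language-model trunk
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)

X_v = torch.rand(1, 3, 32, 32)                                 # the input image
Z_v = f_theta(X_v.flatten(1)).view(1, n_patches, d_vision)     # visual features from the encoder
H_v = f_pi(Z_v)                                                # Eq. 1: post-projection features
H_q = torch.randn(1, n_text, d_model)                          # embedded text instruction
hidden = f_phi(torch.cat([H_v, H_q], dim=1))                   # Eq. 2: joint forward pass
print(hidden.shape)                                            # torch.Size([1, 24, 128])
```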

The “neurons” we are interested in reside within the Feed-Forward Networks (FFN) of the Transformer layers. In every layer of a Transformer, there is an FFN consisting of two linear transformations with an activation function (like GELU) sandwiched in between.

\[ \operatorname{FFN}(h^i) = \operatorname{act\_fn}\big(h^i W_1^i\big)\, W_2^i \]
Equation 3: Formula for the feed-forward network output.

In this equation, the output of the activation function, \(\operatorname{act\_fn}(h^i W_1^i)\), represents the neuron activation. If the value is positive, the neuron is “firing.” If it is zero or negative (depending on the function), it is silent.
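
In practice, these activation values can be recorded without modifying the model, for example with PyTorch forward hooks. The sketch below assumes a LLaMA-style backbone from Hugging Face transformers, where each decoder layer exposes an `mlp.act_fn` module whose output is exactly \(\operatorname{act\_fn}(h^i W_1^i)\); that module path is an assumption and differs across architectures.

```python
# Sketch: record which FFN neurons fire, using forward hooks on each layer's activation.
fire_counts = {}      # layer index -> tensor of per-neuron activation counts
tokens_seen = 0       # total number of tokens processed so far

def make_hook(layer_idx):
    def hook(module, inputs, output):
        global tokens_seen
        # `output` has shape (batch, seq_len, intermediate_size): act_fn(h^i W_1^i) in Eq. 3
        fired = (output > 0).sum(dim=(0, 1))              # tokens on which each neuron fired
        fire_counts[layer_idx] = fire_counts.get(layer_idx, 0) + fired
        if layer_idx == 0:                                # count each token only once per pass
            tokens_seen += output.shape[0] * output.shape[1]
    return hook

def register_ffn_hooks(model):
    """Attach a hook to every decoder layer's FFN activation (assumed module path)."""
    return [layer.mlp.act_fn.register_forward_hook(make_hook(i))
            for i, layer in enumerate(model.model.layers)]

# Usage sketch: register the hooks, run a batch of images from one domain through the
# MLLM, then fire_counts[l] / tokens_seen gives the per-neuron activation frequency
# for that domain and layer (the quantity behind Equation 4 below).
```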

2. The MMNeuron Framework

The researchers propose a method to calculate how “specific” a neuron is to a particular domain. This involves running thousands of images from the five domains through the model and tracking which neurons fire.

Figure 3: The overall framework of the proposed MMNeuron method.

The process, illustrated in Figure 3, involves three steps:

  1. Activation Detection: Feed domain-specific data and record activations.
  2. Probability Calculation: Determine how often a neuron fires for a specific domain relative to others.
  3. Neuron Selection: Filter for neurons that are highly specialized.

To quantify specificity, they calculate the activation probability (\(p_{u,i}\)) of a neuron \(u\) in domain \(i\): the number of tokens from that domain on which the neuron activates, divided by the total number of tokens in the domain.

\[ p_{u,i} = \frac{N^{\text{act}}_{u,i}}{N_i} \]
Equation 4: Formula for activation probability, where \(N^{\text{act}}_{u,i}\) is the number of domain-\(i\) tokens on which neuron \(u\) activates and \(N_i\) is the total number of tokens in domain \(i\).

This gives us a distribution vector \(P_u\) for every neuron, representing its behavior across all five domains.

\[ P_u = \big(p_{u,1},\, p_{u,2},\, p_{u,3},\, p_{u,4},\, p_{u,5}\big) \]
Equation 5: Probability distribution vector for a neuron.

To make this useful, they normalize this vector so it sums to 1, creating a valid probability distribution \(P'_u\).

\[ p'_{u,i} = \frac{p_{u,i}}{\sum_{j} p_{u,j}} \]
Equation 6: Normalized probability distribution.
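
A minimal numerical sketch of Equations 4-6, using made-up activation counts for two hypothetical neurons across the five domains:

```python
import numpy as np

domains = ["common", "medical", "remote_sensing", "document", "auto_driving"]

# Made-up counts: rows are neurons, columns are domains.
fire_counts = np.array([
    [900, 120,  40, 300,  15],    # neuron 0 fires mostly on Common-scene tokens
    [210, 190, 205, 200, 195],    # neuron 1 fires roughly equally everywhere
])
token_totals = np.array([1000, 1000, 1000, 1000, 1000])   # tokens seen per domain

P = fire_counts / token_totals                 # Eq. 4/5: activation probabilities p_{u,i}
P_norm = P / P.sum(axis=1, keepdims=True)      # Eq. 6: each row now sums to 1

for u, row in enumerate(P_norm):
    print(f"neuron {u}: " + ", ".join(f"{d}={v:.2f}" for d, v in zip(domains, row)))
```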

3. Domain Activation Probability Entropy (DAPE)

How do we decide if a neuron is a “specialist” or a “generalist”? A generalist neuron would fire equally for Medical, Driving, and Common images. Its probability distribution would be flat (e.g., [0.2, 0.2, 0.2, 0.2, 0.2]). A specialist neuron would fire almost exclusively for one domain (e.g., [0.0, 0.95, 0.0, 0.05, 0.0]).

To measure this, the authors use Entropy.

\[ \operatorname{DAPE}(u) = -\sum_{i} p'_{u,i}\, \log p'_{u,i} \]
Equation 7: Formula for Domain Activation Probability Entropy (DAPE).

DAPE (Domain Activation Probability Entropy) is the core metric.

  • High Entropy: The neuron activates for everything. It is domain-agnostic.
  • Low Entropy: The neuron activates for only one or two domains. It is domain-specific.

The researchers identified the neurons whose DAPE scores fall in the bottom 1% as the “Domain-Specific Neurons.”
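
Continuing the numerical sketch above, DAPE is simply the Shannon entropy of each neuron’s normalized domain distribution, and the selection step keeps the lowest-entropy neurons. The numbers below are again invented for illustration:

```python
import numpy as np

def dape(P_norm: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Domain Activation Probability Entropy per neuron (each row of P_norm sums to 1)."""
    return -(P_norm * np.log(P_norm + eps)).sum(axis=1)

P_norm = np.array([
    [0.02, 0.90, 0.02, 0.04, 0.02],   # near one-hot -> low entropy -> specialist
    [0.20, 0.20, 0.20, 0.20, 0.20],   # flat -> high entropy -> generalist
])

scores = dape(P_norm)
threshold = np.quantile(scores, 0.01)             # bottom 1% of DAPE scores
domain_specific = np.where(scores <= threshold)[0]
print(scores, domain_specific)                    # neuron 0 is selected as domain-specific
```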

4. Interpreting Hidden States with Logit Lens

Identifying the neurons is only half the battle. We also want to know what the model is “thinking” as it processes these features. Usually, we only see the final output token. To peek at the intermediate predictions, the authors employed a technique called the Logit Lens.

Figure 4: General framework of logit lens analysis.

In a Transformer, the hidden state \(h_l\) at layer \(l\) is usually passed to layer \(l+1\).

Equation 8: Recursive update of hidden states in a transformer.

The Logit Lens technique (Figure 4) takes that intermediate hidden state \(h_l\) and prematurely forces it through the model’s final “unembedding” layer (\(W_U\)). This decodes the hidden state into a word from the vocabulary, effectively asking the model: “If you had to stop thinking right now and guess the next word, what would it be?”

\[ \operatorname{LogitLens}(h_l) = \operatorname{softmax}\big(h_l\, W_U\big) \]
Equation 10: Formula for Logit Lens.

This allows the researchers to visualize the evolution of the model’s understanding layer by layer.
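
A sketch of the logit lens in code, assuming a LLaMA-style causal LM from Hugging Face transformers; the attribute names `model.model.norm` and `model.lm_head`, and the application of the final norm before unembedding, are assumptions that vary by backbone.

```python
# Sketch: decode every intermediate hidden state with the model's own unembedding matrix.
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, input_ids, position=-1, top_k=1):
    out = model(input_ids=input_ids, output_hidden_states=True)
    guesses = []
    for l, h in enumerate(out.hidden_states):          # embeddings plus one state per layer
        h_l = h[:, position, :]                        # hidden state at the position of interest
        logits = model.lm_head(model.model.norm(h_l))  # force it through the unembedding layer
        top = logits.topk(top_k, dim=-1).indices[0]
        guesses.append((l, tokenizer.decode(top)))     # "what word would you guess right now?"
    return guesses                                     # list of (layer index, decoded top tokens)
```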

Experiments & Results

The researchers applied MMNeuron to LLaVA-NeXT and InstructBLIP. The findings revealed a fascinating narrative about how these massive models process visual information.

1. The Three-Stage Mechanism

When the researchers mapped the distribution of domain-specific neurons across the layers of the large language model, a distinct pattern emerged.

Figure 5: Layer-wise distribution of domain-specific neurons in different modules.

Looking at the graphs above (Figure 5), particularly for the Language Model Module (bottom left), we see a “U” shape or a drop-then-rise pattern in the number of domain-specific neurons. This supports the authors’ hypothesis of a Three-Stage Mechanism:

  1. Alignment (Early Layers): The model receives projected visual features. These features are still “raw” and require significant domain-specific processing to be aligned with the LLM’s internal representation. We see a high number of domain-specific neurons here.
  2. Generalization (Intermediate Layers): The curve drops. The model has successfully embedded the features into a uniform semantic space. The processing becomes more abstract and general; fewer domain-specific neurons are needed because the concepts are now “universal” to the model.
  3. Task Solving (Late Layers): The curve rises again. The model prepares to generate the specific text response (e.g., answering a medical question). It recalls domain-specific knowledge required to form the correct terminology (like “pneumonia” or “left ventricle”), leading to a resurgence of specific neurons.

2. Domain “Difficulty” and Neuron Count

Not all domains are created equal. The sheer number of neurons dedicated to a domain can indicate how “hard” it is for the model to grasp that concept or how distinct it is from the model’s general training data.

Table 1: The number of domain-specific neurons for each domain in different modules.

Table 1 shows that for LLaVA-NeXT, the Remote Sensing domain commands the highest number of neurons in the Vision Encoder and LLM. This suggests that satellite imagery, with its unique top-down perspective and specific objects, requires more specialized processing power than common scenes.

Interestingly, for InstructBLIP, Auto Driving dominates the Q-Former and LLM modules. This might reflect the model’s struggle to interpret the complex, dynamic instructions associated with driving scenarios compared to static document analysis.

3. The Impact of Silencing Neurons

To verify that these neurons truly matter, the researchers performed an ablation study. They “silenced” (deactivated) the domain-specific neurons and measured the impact on performance.
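
Conceptually, “silencing” a neuron just means zeroing its activation during the forward pass. Below is a minimal sketch using PyTorch forward hooks, with hypothetical layer and neuron indices and the same assumed LLaMA-style module path as before; in the paper, the indices would come from the bottom-1%-DAPE selection for a given domain.

```python
# Sketch: "silence" a chosen set of FFN neurons by zeroing their activations.
def silence_neurons(model, neurons_per_layer):
    """neurons_per_layer: dict mapping layer index -> list of neuron indices to zero."""
    handles = []
    for layer_idx, neuron_ids in neurons_per_layer.items():
        act_fn = model.model.layers[layer_idx].mlp.act_fn   # assumed LLaMA-style path
        def hook(module, inputs, output, ids=tuple(neuron_ids)):
            output = output.clone()
            output[..., list(ids)] = 0.0                    # deactivate the selected neurons
            return output                                   # returned value replaces the output
        handles.append(act_fn.register_forward_hook(hook))
    return handles   # call h.remove() on each handle to restore the original model

# Example with hypothetical indices:
# handles = silence_neurons(model, {3: [17, 905], 12: [42]})
```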

Table 2: Accuracy of LLaVA-NeXT and InstructBLIP with deactivated neurons.

The results in Table 2 are nuanced.

  • Performance Drop: Deactivating these neurons does lower accuracy. For example, in LLaVA-NeXT, deactivating Remote Sensing neurons caused accuracy to drop from 42.5% to 38.5% (when looking at the “All” row).
  • Resilience: The drop isn’t catastrophic. The accuracy doesn’t go to zero. This implies the model has redundancy; other generalist neurons can partially compensate.
  • Hidden State Perturbation: While accuracy only dipped slightly, the researchers found that the internal hidden states changed drastically (by over 30% in some cases). This is a critical insight: Current MLLMs do not fully utilize the domain-specific information they possess. The information is there (in the neurons), and removing it changes the model’s internal state, but the final output generation is robust enough to often guess correctly anyway.

4. Visualizing the Thought Process (Case Studies)

Using the Logit Lens, the paper provides a window into the model’s layer-by-layer “thought process.”

Figure 10: Case study of logit lens in InstructBLIP on PMC-VQA.

Consider the medical example in Figure 10 (above). The model is asked about a brain scan.

  • Early Layers (32-24): The model predicts generic tokens related to the image type: “CT,” “scan,” “brain.” It is identifying the domain.
  • Middle Layers (20-10): The predictions shift. It explores related concepts.
  • Late Layers (4-2): The model converges on the specific answer option “B.”

We can also look at the entropy of these predictions. Entropy here measures confusion. If the model is sure the next token is “cat,” entropy is low. If it thinks it could be “cat,” “dog,” or “car,” entropy is high.

Figure 7: Average entropy of the next-token probability distribution for image and text tokens.

Figure 7 (and the heatmap in Figure 6/21) validates the three-stage theory visually.

  • Start: High entropy (confusion/alignment).
  • Middle: Sharp drop in entropy (Understanding/Generalization).
  • End: Slight rise or stabilization as it selects the specific output word.

Crucially, the entropy for image tokens (dashed lines) is generally higher than for text tokens. This suggests that to an LLM, a visual token is a “sparse mixture of concepts”—it’s not as concrete as a word like “table,” but rather a cloud of visual possibilities that collapses into meaning as it moves through the layers.
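
For readers who want to reproduce this kind of curve, here is a sketch that averages next-token entropy separately over image-token and text-token positions at every layer. It reuses the LLaMA-style attribute assumptions from the logit-lens sketch, and how `image_positions` is obtained depends on the particular MLLM’s input layout.

```python
# Sketch: per-layer next-token entropy, averaged over image vs. text token positions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_entropy(model, input_ids, image_positions):
    out = model(input_ids=input_ids, output_hidden_states=True)
    is_image = torch.zeros(input_ids.shape[1], dtype=torch.bool)
    is_image[image_positions] = True
    rows = []
    for h in out.hidden_states:
        logits = model.lm_head(model.model.norm(h))        # (batch, seq, vocab), assumed paths
        logp = F.log_softmax(logits.float(), dim=-1)
        ent = -(logp.exp() * logp).sum(-1)[0]              # per-position entropy, first sample
        rows.append((ent[is_image].mean().item(), ent[~is_image].mean().item()))
    return rows   # one (image_entropy, text_entropy) pair per layer
```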

Conclusion & Implications

The MMNeuron paper provides the first comprehensive, neuron-level map of how Multimodal Large Language Models process different domains. By adapting techniques from multilingual analysis, the authors uncovered that MLLMs are not monolithic; they contain specialized sub-networks for Medicine, Driving, and Remote Sensing.

The discovery of the Three-Stage Mechanism—Alignment, Generalization, and Task Solving—offers a blueprint for understanding the flow of information in these massive networks.

Perhaps the most practical takeaway is the inefficiency revealed by the ablation studies. The fact that deactivating these specialized neurons drastically changes hidden states but only marginally hurts accuracy suggests that current models are under-utilizing their domain-specific knowledge. They are “jacks of all trades” that have the potential to be masters, but their internal wiring isn’t yet fully optimized to leverage that mastery.

For students and researchers, this opens exciting avenues. If we can better target and amplify these domain-specific neurons, we might be able to build “Cross-Domain All-Encompassing” MLLMs that don’t just survive in specialized fields like radiology or autonomous driving, but thrive in them.