![Cover image](https://deep-paper.org/en/paper/2406.11193/images/cover.png)
Inside the Mind of Multimodal Models: Tracking Domain-Specific Neurons with MMNeuron
Introduction

How does a large language model (LLM) "see" an image? When we feed a photograph of a chest X-ray or a satellite view of a city into a Multimodal Large Language Model (MLLM) like LLaVA or InstructBLIP, we know the architecture: an image encoder breaks the image into visual features, a projector maps them into the language space, and the LLM generates a response. But what happens in the hidden layers between that initial projection and the final answer? ...
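To make the pipeline concrete, here is a minimal, illustrative sketch of the encoder → projector → LLM flow described above. It is not LLaVA's or InstructBLIP's actual code; all module names, dimensions, and the toy Transformer body are hypothetical placeholders chosen only to show where visual tokens enter the language model's hidden layers.

```python
# Hypothetical sketch of an MLLM forward pass: encoder -> projector -> LLM.
# Dimensions and modules are placeholders, not the real LLaVA architecture.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=64, hidden_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT) that turns
        # flattened image patches into a sequence of visual features.
        self.image_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        # The projector maps visual features into the LLM's embedding space,
        # so they can be treated as "visual tokens".
        self.projector = nn.Linear(vision_dim, hidden_dim)
        # Stand-in for the LLM body (its hidden layers are where one would
        # probe for domain-specific neurons) and its output head.
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.llm_layers = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_patches, text_tokens):
        # 1. Encode image patches into visual features.
        visual_feats = self.image_encoder(image_patches)      # (B, P, vision_dim)
        # 2. Project them into the language space.
        visual_tokens = self.projector(visual_feats)           # (B, P, hidden_dim)
        # 3. Concatenate with text embeddings and run the LLM body.
        text_embeds = self.text_embed(text_tokens)             # (B, T, hidden_dim)
        hidden = self.llm_layers(torch.cat([visual_tokens, text_embeds], dim=1))
        # 4. Produce next-token logits from the final hidden states.
        return self.lm_head(hidden)

# Usage with dummy data:
model = ToyMultimodalLM()
dummy_patches = torch.randn(1, 16, 3 * 14 * 14)   # 16 flattened image patches
dummy_tokens = torch.randint(0, 1000, (1, 8))      # 8 text token ids
logits = model(dummy_patches, dummy_tokens)
print(logits.shape)  # torch.Size([1, 24, 1000])
```

Everything between step 3 and step 4, the activations inside the LLM's hidden layers, is the opaque middle ground this article is about.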