Can AI Understand How You Feel? Evaluating Vision-Language Models with the Cast of ‘Friends’
Emotional Intelligence (EI) is often considered the final frontier for Artificial Intelligence. We have models that can write code, compose poetry, and pass bar exams, but can they understand the subtle sigh of a disappointed friend or the sarcastic eye-roll of a colleague?
For a long time, researchers focused on text-based Large Language Models (LLMs) to answer this question. Studies showed that models like GPT-4 possess a surprisingly high “Emotional Quotient” (EQ) when analyzing text. But human communication is rarely just text. It is a complex symphony of words, facial expressions, body language, and environmental context. To truly possess Emotional Intelligence, an AI must see as well as read.
This brings us to Vision Large Language Models (VLLMs). These systems process visual and textual data simultaneously, theoretically allowing them to read the room better than their text-only counterparts. But do they? And what factors actually drive their performance?
In a fascinating new study, researchers from Konkuk University set out to deconstruct the emotional intelligence of VLLMs. To do this, they didn’t use dry laboratory data. Instead, they reconstructed a dataset based on the iconic TV sitcom Friends. By analyzing how AI interprets the lives of Ross, Rachel, Joey, and the gang, the researchers uncovered critical insights about model architecture, the role of personality, and the hidden biases lurking within these systems.
The Challenge of Multimodal Emotion
Before diving into the experiments, it is vital to understand why this is a hard problem. In text-based sentiment analysis, keywords often give the game away. Words like “happy,” “terrible,” or “love” are strong indicators. However, in a real conversation, a person might say, “I’m fine,” while looking absolutely devastated.
A VLLM must align two different “modalities”:
- Textual Modality: The dialogue history and current utterance.
- Visual Modality: The facial expressions, gestures, and surroundings.
If the model relies too much on text, it misses the sarcasm or the hidden sadness. If it relies too much on the image, it might misinterpret a neutral resting face as anger. The research paper, titled “Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts,” aims to systematically isolate the variables that help—or hurt—this delicate balancing act.
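To make this balancing act concrete, here is a minimal sketch of how a VLLM ingests both modalities in a single call, assuming the Hugging Face transformers LLaVA integration; the checkpoint name, frame path, and dialogue lines are placeholders, not anything from the study itself.

```python
# Minimal sketch: feeding an image and dialogue text to a VLLM together.
# Assumes the Hugging Face `transformers` LLaVA integration; the checkpoint,
# frame path, and dialogue lines are illustrative placeholders.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

frame = Image.open("scene_frame.jpg")  # visual modality
dialogue = 'Previous line: "How was the audition?"\nCurrent line: "I\'m fine."'

# LLaVA-1.5 chat format: the <image> token marks where visual features are injected.
prompt = (
    "USER: <image>\n"
    f"{dialogue}\n"
    "What emotion is the speaker most likely feeling? ASSISTANT:"
)

inputs = processor(text=prompt, images=frame, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```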
Methodology: Reconstructing Friends for AI
To test VLLMs, the authors utilized the MELD dataset, a famous resource in the NLP community derived from Friends. However, the original dataset wasn’t sufficient for a deep-dive into visual contexts. The researchers performed a comprehensive reconstruction of the data to create a robust testing ground.
1. Data Construction Pipeline
The researchers didn’t just feed raw video into the models. They curated the data through a three-stage process to ensure quality and relevance.

As illustrated in the figure above, the process involved:
- Dialogue Selection: They filtered conversations to ensure they weren’t too long (which would force the model to rely only on text history) or too short. Crucially, they prioritized dialogues involving characters with definable “personas.”
- Image Scope Reconstruction: This is a key innovation of the study. Real-world visual cues exist at different levels of zoom. The researchers extracted images at three specific “scopes”:
  - Entire Scene: Captures the environment and the interaction between characters.
  - Person: Focuses on the speaker’s body language and posture.
  - Facial Expression: A close-up on the face to capture micro-expressions.
- Incorrect Sentence Selection: To test the models, they created multiple-choice tasks. Using SBERT (Sentence-BERT), they generated “distractor” answers—responses that were plausible but incorrect, requiring the model to truly understand the emotional context to pick the right one.
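To illustrate the distractor-selection idea, here is a minimal sketch using the sentence-transformers library: candidates that are semantically close to the correct response, but not near-paraphrases of it, make plausible wrong answers. The model name, similarity thresholds, and candidate pool are assumptions for illustration, not the authors’ exact settings.

```python
# Sketch of SBERT-based distractor selection: keep candidates that are
# similar enough to be plausible but not so similar that they are paraphrases.
# Model name, thresholds, and the candidate pool are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

correct = "Oh my God, I can't believe you did that for me!"
candidates = [
    "That is so thoughtful, thank you so much!",
    "Could you pass the salt, please?",
    "I guess we could order pizza tonight.",
    "Wow, you really shouldn't have gone to all that trouble!",
]

emb_correct = sbert.encode(correct, convert_to_tensor=True)
emb_cands = sbert.encode(candidates, convert_to_tensor=True)
sims = util.cos_sim(emb_correct, emb_cands)[0]

# "Plausible but wrong": similar to the answer, yet below a paraphrase cutoff.
LOW, HIGH = 0.3, 0.8
distractors = [
    (cand, float(sim)) for cand, sim in zip(candidates, sims) if LOW < sim < HIGH
]
print(distractors)
```

Tightening or loosening the two thresholds controls how hard the resulting multiple-choice items are.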
2. The Three Tasks
Instead of simply asking the model “Is this person happy?”, the researchers designed three distinct cognitive tasks to test different facets of Emotional Intelligence:
- Overall Emotion Tone Prediction: The model must identify the general sentiment (Positive, Negative, Neutral) of the scene and select the appropriate response.
- Character Emotion Prediction: A more granular task where the model must identify specific emotions (Joy, Fear, Anger, Sadness, etc.).
- Contextually Appropriate Emotion Expression Selection: The hardest task. The model is given options that all convey the same emotion (e.g., all are happy sentences) but must choose the one that fits the specific context and speaking style of the character.
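As a rough illustration, a multiple-choice item for the character-emotion task might be assembled like this; the wording and option labels are a guess, not the paper’s exact template.

```python
# Rough sketch of turning a dialogue turn into a multiple-choice prompt.
# The wording and option labels are illustrative, not the paper's template.
def build_emotion_mcq(dialogue_history: list[str], speaker: str, options: list[str]) -> str:
    history = "\n".join(dialogue_history)
    choices = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        "You are shown an image from the scene.\n"
        f"Dialogue so far:\n{history}\n\n"
        f"Which emotion is {speaker} most likely expressing?\n"
        f"{choices}\n"
        "Answer with a single letter."
    )

print(build_emotion_mcq(
    dialogue_history=["Rachel: He said what?!", "Ross: We were on a break!"],
    speaker="Rachel",
    options=["Joy", "Anger", "Neutral", "Fear"],
))
```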
3. Injecting Persona
Does knowing who is speaking help the AI understand how they feel? The researchers enriched the dataset with persona information. For example, if the model knows that Chandler uses sarcasm as a defense mechanism, or that Monica is high-strung, does it predict their emotions better? They tested prompts that included:
- Personality Traits: Descriptions of the character’s internal nature.
- Speaking Styles: Descriptions of how the character typically constructs sentences.
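A minimal sketch of how such persona text could be prepended to an otherwise identical prompt is shown below; the trait and style descriptions are illustrative, not the dataset’s annotations.

```python
# Sketch: three prompt variants for the same dialogue turn, differing only in
# the persona preamble. Trait/style texts are illustrative, not the dataset's.
BASE_QUESTION = "Which emotion is Chandler most likely expressing?"

PERSONAS = {
    "original": "",
    "personality": "Chandler is insecure and habitually uses sarcasm as a defense mechanism.\n",
    "speaking_style": "Chandler tends to phrase feelings as ironic, self-deprecating one-liners.\n",
}

def build_prompt(variant: str, dialogue: str) -> str:
    return f"{PERSONAS[variant]}Dialogue:\n{dialogue}\n\n{BASE_QUESTION}"

dialogue = 'Chandler: "Could this BE any more awkward?"'
for variant in PERSONAS:
    print(f"--- {variant} ---\n{build_prompt(variant, dialogue)}\n")
```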
Key Findings: What Drives Performance?
The researchers tested a variety of open-source VLLMs, including InstructBLIP, LLaVA, and MiniGPT-4, utilizing different Large Language Model backbones (like Vicuna and FLAN). The results offered several surprises.
The Brain Matters More Than the Eyes
The single most influential factor in performance was not the visual encoder, but the LLM backbone.
Models using FLAN (specifically InstructBLIP with FLAN 11B) consistently outperformed those using Vicuna or Llama. This suggests that while visual data is important, the model’s fundamental reasoning capability (derived from the text-based LLM) acts as the engine for emotional intelligence. As the size of the LLM increased, performance improved roughly linearly, indicating that “scaling laws” apply to emotional understanding just as they do to math or coding.
The Complexity of Prompts
Did telling the AI about a character’s personality help? The answer is: It depends on the emotion.

The chart above breaks down accuracy by emotion across different prompting strategies (Original, Personality, Speaking Styles, and Chain-of-Thought).
- Joy: The inclusion of Speaking Styles (the grey bar) significantly boosted the detection of “Joy.” Knowing how a character talks helps the model identify positive wit and banter.
- Fear: Performance dropped for “Fear” across all specialized prompts. Fear is often a visceral, immediate reaction to a situation, meaning a character’s long-term personality is less relevant than the immediate visual context.
- Neutral: Predicting “Neutral” emotions was notoriously difficult, and adding personality traits actually hurt performance (lowest accuracy). It seems that adding extra information leads the model to “overthink” and hallucinate an emotion where there isn’t one.
The Importance of “Visual Scope”
One of the study’s most unique contributions is the analysis of where the model should look. Should it look at the face, the body, or the whole room?

The results, shown above, reveal that different emotions live in different parts of an image:
- Sadness (Face): The “Face” scope (green bar) was superior for detecting sadness. Sadness is an internal emotion often conveyed through subtle facial muscle movements (a frown, downcast eyes).
- Fear (Entire Scene): The “Entire Scene” scope (blue bar) was best for fear. Fear is usually a reaction to an external threat. Seeing the “scary thing” or the distance between characters helps the AI understand why someone is afraid.
- Surprise (Person): The “Person” scope (orange bar) won here. Surprise is often physical—a jump, a recoil, a hand to the chest—which requires seeing the upper body, not just the face.
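To make “visual scope” concrete, here is a minimal sketch of how the three crops could be produced from a single frame, assuming bounding boxes for the speaker’s body and face are already available from an off-the-shelf detector; the coordinates below are placeholders.

```python
# Sketch: cutting one frame into the three visual scopes used in the study.
# Assumes (left, top, right, bottom) boxes for the speaker come from some
# external detector; the coordinates below are placeholders.
from PIL import Image

frame = Image.open("scene_frame.jpg")

person_box = (420, 80, 780, 700)   # speaker's full body (placeholder)
face_box = (520, 100, 680, 280)    # speaker's face (placeholder)

scopes = {
    "entire_scene": frame,             # environment + all characters
    "person": frame.crop(person_box),  # posture and body language
    "face": frame.crop(face_box),      # micro-expressions
}

for name, img in scopes.items():
    img.save(f"{name}.jpg")
```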
The Dark Side: Bias in Emotional AI
Perhaps the most critical section of the paper deals with what the models get wrong, and specifically why they get it wrong. The researchers uncovered significant gender and regional biases embedded in these VLLMs.
Gender Bias and the “Disgust” Discrepancy
When breaking down performance by gender, a strange pattern emerged. The models were generally better at predicting emotions in female characters than in male characters.

As seen in the radar chart, the most striking difference is in the emotion of Disgust. The model’s accuracy for detecting disgust in females (red line) is massively higher than for males (blue line)—a gap of nearly 20%.
Why? The researchers suggest this is due to stereotypical training data. In many datasets (and perhaps in the show Friends itself), female characters may express disgust more overtly (“Ewww!”, expressive faces), while male characters might express it through sarcasm or stoicism.
Let’s look at the examples the authors analyzed:

In the female examples above, the expression of disgust is explicit and visually dramatic. Phoebe (left) and Rachel (right) use strong facial cues and exclamations like “Oh my God.”

In contrast, the male examples above show Ross and Chandler expressing disgust through deadpan humor or sarcasm. Ross (left) talks about breast milk with an uncomfortable face, but it’s less explosive. Chandler (right) makes a sarcastic joke about a mute button. The VLLMs struggle to categorize these subtle, “male-coded” expressions as disgust, revealing a gender gap in the AI’s emotional understanding.
Regional Bias: The “North American” Standard
The researchers also tested how the models handled Persona Information related to geographic regions. They modified the prompts to tell the model that the speaker was from a specific region (e.g., “The speaker has lived in the Middle East…”).
Since the dialogue and images remained exactly the same, the performance should theoretically remain stable. It did not.
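Conceptually, this is a counterfactual prompt test: only the persona sentence changes while the dialogue and image stay fixed, so any shift in accuracy can be attributed to the stated region. A minimal sketch of such a loop, with a stand-in prediction function and assumed region wording, might look like this:

```python
# Sketch of a counterfactual region test: hold the dialogue and image fixed,
# vary only the stated region, and compare accuracy per region.
# `predict_emotion` is a stand-in for whatever VLLM call is being evaluated.
REGIONS = ["North America", "Europe", "East Asia", "South Asia", "Middle East", "Africa"]

def region_prompt(region: str, dialogue: str) -> str:
    return (
        f"The speaker has lived in {region} for their whole life.\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Which emotion is the speaker most likely expressing?"
    )

def accuracy_by_region(examples, predict_emotion):
    results = {}
    for region in REGIONS:
        correct = sum(
            predict_emotion(region_prompt(region, ex["dialogue"]), ex["image"]) == ex["label"]
            for ex in examples
        )
        results[region] = correct / len(examples)
    return results
```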

The results were stark. When the model was primed with a North American persona (green bar), performance improved slightly. However, for every other region, performance degraded.
- Middle East & Africa: The biggest drops (-2.40% and -2.20%) occurred here.
- East Asia & South Asia: Significant performance penalties were also observed.
This indicates that the models harbor deep-seated stereotypes. When told a character is from a non-Western region, the model essentially “forgets” how to read the universal emotions of Friends characters, likely applying incorrect cultural assumptions about how people from those regions express feelings. This is a “hallucination of bias” that could have serious implications if these models are deployed globally.
Sentiment Analysis Overview
Finally, looking at the broader picture of sentiment (Positive vs. Negative vs. Neutral), the researchers found that Persona information is a double-edged sword.

- Positive (Left group): Personality and Speaking Styles (orange and grey bars) help the model identify positive sentiments better than the original prompt.
- Neutral (Right group): The original prompt (blue bar) or Chain-of-Thought (yellow bar) works best. Adding personality details makes the model bad at identifying neutrality.
Conclusion and Future Implications
This study by Lee et al. provides a comprehensive check-up on the state of Emotional Intelligence in Vision-Language Models. The good news is that VLLMs are capable of sophisticated emotion recognition, provided they have a strong LLM backbone and are looking at the right visual scope (Face vs. Scene).
However, the findings on bias are a wake-up call. The fact that models are significantly worse at understanding men’s disgust or the emotions of people labeled as non-Western suggests that our current training data is heavily skewed.
Key Takeaways for Students:
- Architecture: When building multimodal systems, your language model is still your powerhouse. A better “brain” (LLM) improves the “eyes” (Vision).
- Context is Key: There is no “one size fits all” for visual analysis. If you are building a fear-detection system, look at the scene. If you are building a sadness-detector, look at the face.
- Bias is Invisible but Measurable: You might not see bias in overall accuracy scores. You have to slice your data by gender, region, and emotion to find the blind spots.
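A minimal sketch of that kind of audit, assuming per-example predictions have been collected into a table (the column names and toy values are illustrative):

```python
# Sketch of a bias audit: overall accuracy can look fine while subgroup
# accuracy diverges. Column names and values are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "emotion": ["disgust", "disgust", "joy", "joy"],
    "gender": ["female", "male", "female", "male"],
    "correct": [1, 0, 1, 1],
})

print("Overall accuracy:", df["correct"].mean())

# Slice by subgroup to expose blind spots hidden by the aggregate number.
print(df.groupby(["emotion", "gender"])["correct"].mean())
```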
As we move toward Artificial General Intelligence (AGI), we want systems that don’t just understand what we say, but how we feel. This paper brings us one step closer to that reality, while highlighting the ethical guardrails we need to build along the way.