Introduction

When you read the phrase “Her heart was racing,” what do you understand? Depending on the context, she could be terrified of a spider, or she could be looking at the love of her life.

This is the challenge of Embodied Emotion. Emotions aren’t just abstract concepts in our brains; they are physical experiences. We clench our fists in anger, our stomachs churn in disgust, and our eyes widen in surprise. In Natural Language Processing (NLP), detecting explicit emotions (e.g., “I am happy”) is largely a solved problem. However, detecting the subtle, physical manifestations of emotion, and correctly classifying them, remains a significant hurdle.

A racing heart is ambiguous. A stomping foot usually signals anger, but it could also be a tantrum of frustration. How do we teach machines to distinguish these nuances without massive supervision?

In the paper “CHEER-Ekman: Fine-grained Embodied Emotion Classification,” researchers from the University of Cincinnati tackle this exact problem. They move beyond simply detecting if a body part is expressing emotion, to determining exactly which emotion is being expressed. Their findings are surprising: sometimes, simpler instructions work better than complex technical definitions, and small language models—when prompted correctly—can outperform supervised giants.

Illustration of embodied emotions classified into six categories.

Background: From Binary to Fine-Grained

To understand the contribution of this paper, we need to look at where the field stood previously. The concept of embodied emotion is rooted in cognitive science, suggesting that our emotional experiences are deeply tied to our bodily states.

Prior to this work, the most relevant resource was the CHEER dataset (Zhuang et al., 2024), which focused on a binary task: Embodied Emotion Detection. It asked models to look at a sentence like “He tapped his fingers on the table” and decide: Is this an emotional expression? (Yes/No).

While useful, this binary approach has a major limitation. Knowing that a “racing heart” is emotional is only half the battle. If an AI assistant detects a user is emotional but can’t tell the difference between Fear and Joy, its response will likely be inappropriate.

The researchers behind CHEER-Ekman decided to bridge this gap by mapping these physical sensations to Ekman’s Six Basic Emotions: Joy, Sadness, Anger, Disgust, Fear, and Surprise.

The CHEER-Ekman Dataset

The first contribution of this paper is the data itself. The authors took the 1,350 positive samples from the original CHEER dataset—sentences where a body part was expressing an emotion—and annotated them with specific emotion labels.

This was not a trivial task. Two annotators were given the sentence, the specific body part involved, and the preceding context. They had to decide which of the six Ekman emotions best fit the physical description.

Table 1: Examples from the CHEER-Ekman dataset.

As shown in Table 1 above, the connections can be vivid. “Frowning and scuffing his feet” maps clearly to Sadness, while “eyes almost fell out of my head” is a classic hyperbole for Surprise.

The resulting distribution of the dataset highlights the complexity of human expression in text.

Figure 2: CHEER-Ekman dataset distribution of emotions.

Fear (24.7%) and Joy (21.2%) are the most distinct and frequent categories, likely because physiological reactions like shaking (fear) or smiling (joy) are very common in literature. Anger (9.0%) appears least frequently in this specific embodied context, perhaps because anger is often expressed through dialogue or aggression rather than purely internal bodily sensations.

Core Method: Teaching LLMs to “Feel”

With the dataset in hand, the researchers faced a classification problem. How do you get a model to reliably predict these labels? They explored two main avenues: Prompt Engineering and Best-Worst Scaling (BWS).

1. The Paradox of Prompting

When using Large Language Models (LLMs) like Llama-3.1 and DeepSeek, the intuitive approach is to give the model a precise, technical definition of the task. The authors started with a “Base” prompt that defined embodied emotion in technical terms: a bodily movement that involves physiological arousal and serves no purpose other than expressing emotion.

However, they discovered something counterintuitive: Simpler is better.

They created a “Simple” prompt that stripped away the academic jargon. Instead of defining “physiological arousal,” they simply asked: Did emotion cause the body part’s movement?
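To make the contrast concrete, here is a rough sketch of the two styles in Python. The wording is paraphrased from the paper’s description, not copied from its actual templates (the real ones are shown in Tables 5 and 6 below):

```python
# Paraphrased sketches of the two prompting styles; the paper's exact
# templates differ in wording.

BASE_PROMPT = """\
Determine whether the sentence describes an embodied emotion: a bodily
movement or sensation that involves physiological arousal and serves no
purpose other than expressing an emotional state.

Sentence: {sentence}
Body part: {body_part}
Answer Yes or No."""

SIMPLE_PROMPT = """\
Read the sentence. Did an emotion cause the {body_part}'s movement?

Sentence: {sentence}
Answer Yes or No."""

# Fill in a concrete example.
print(SIMPLE_PROMPT.format(sentence="Her heart was racing.", body_part="heart"))
```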

Table 5: Zero-shot templates for different tasks.

Table 6: Chain of Thought (CoT) prompt templates for different settings.

The results were dramatic. Using simplified language (“everyday English”) significantly outperformed the complex technical prompts. It seems that LLMs, having been trained on vast amounts of general internet text, resonate better with natural-language instructions than with rigid academic definitions.

Chain-of-Thought (CoT) Reasoning

To further boost performance, especially for smaller models (like the 8B parameter versions), the authors implemented Chain-of-Thought prompting. They broke the reasoning process down into steps:

  1. Identification: Identify the body part.
  2. Causality: Did emotion cause the movement?
  3. Purpose: Was the movement only for emotion?

This structured reasoning allowed an 8B-parameter model to achieve results competitive with models nearly nine times its size (70B), showing that how you ask the model to think is just as important as the model’s size.
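As a rough illustration, a staged prompt following these three steps might look like the sketch below; the paper’s actual CoT templates (Table 6) use their own wording:

```python
# Illustrative three-step Chain-of-Thought prompt (wording is assumed,
# not taken verbatim from the paper).
COT_PROMPT = """\
Sentence: {sentence}

Answer step by step:
1. Identification: Which body part is involved?
2. Causality: Did an emotion cause that body part's movement?
3. Purpose: Was the movement only for expressing the emotion, or did
   it serve another purpose?

Based on steps 1-3, answer Yes if this describes an embodied emotion,
otherwise No."""
```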

2. Best-Worst Scaling (BWS): A Comparative Approach

The most innovative part of their methodology was the use of Best-Worst Scaling (BWS) for classification.

In a standard zero-shot classification task, you would give the LLM a sentence and ask: “Which emotion is this?” The model outputs a label. However, LLMs can be inconsistent with direct labeling. They might hallucinate or struggle with the subtle boundaries between Fear and Surprise.
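Such a direct-labeling prompt might look like the following paraphrase (the paper’s zero-shot templates appear in Table 5):

```python
# A paraphrased direct-labeling prompt for the six-way classification task.
ZERO_SHOT_PROMPT = """\
Sentence: {sentence}
Body part: {body_part}

Which emotion does the sentence express through this body part?
Choose exactly one of: Joy, Sadness, Anger, Disgust, Fear, Surprise."""
```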

The authors instead treated this as a ranking problem.

How BWS Works

  1. The model is presented with a tuple of 4 sentences (e.g., Sentence A, B, C, D).
  2. It is asked: “Which of these sentences MOST represents Joy?” and “Which LEAST represents Joy?”
  3. This is repeated for all six emotions.

By forcing the model to compare sentences against each other rather than judging them in isolation, BWS elicits far more reliable signals.
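In code, a single BWS query might look like the sketch below. The prompt wording and the `ask_llm` helper are hypothetical stand-ins, not the paper’s implementation:

```python
def bws_query(ask_llm, sentences, emotion):
    """Ask the model for the best and worst sentence in a 4-tuple.

    `ask_llm` is a hypothetical callable that sends a prompt to an LLM
    and returns its raw text reply. Returns (best_index, worst_index).
    """
    options = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences))
    prompt = (
        f"Here are 4 sentences:\n{options}\n\n"
        f"Which sentence MOST represents {emotion}? "
        f"Which sentence LEAST represents {emotion}?\n"
        "Reply in the form: BEST=<number>, WORST=<number>"
    )
    reply = ask_llm(prompt)
    # Naive parsing; a real implementation would validate the reply.
    best = int(reply.split("BEST=")[1].split(",")[0])
    worst = int(reply.split("WORST=")[1].strip())
    return best, worst
```

Repeating this query over many 4-tuples and all six emotions yields best/worst counts for every sentence.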

For each of the six emotions, every sentence \(e_i\) receives a score:

\[
\text{score}(e_i) = \frac{\#\text{Best}(e_i) - \#\text{Worst}(e_i)}{\#\text{Comparisons}(e_i)}
\]

Here, the score is the number of times a sentence was chosen as “Best” minus the number of times it was chosen as “Worst,” normalized by the total number of comparisons the sentence appeared in. The emotion under which a sentence scores highest becomes its predicted label.
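Turning those best/worst tallies into scores and a final prediction is then mechanical. A minimal sketch, assuming the counts have been collected per (sentence, emotion) pair:

```python
EMOTIONS = ["Joy", "Sadness", "Anger", "Disgust", "Fear", "Surprise"]

def bws_score(best, worst, appearances):
    """(#Best - #Worst) / #Appearances, as in the formula above."""
    return (best - worst) / appearances

def predict_label(counts, sentence_id):
    """Pick the emotion whose BWS score is highest for a sentence.

    `counts` maps (sentence_id, emotion) -> (best, worst, appearances);
    this layout is an assumption made for illustration.
    """
    return max(EMOTIONS, key=lambda e: bws_score(*counts[(sentence_id, e)]))
```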

Experiments & Results

The researchers compared their LLM approaches against a fine-tuned BERT model (a standard supervised learning baseline). The results highlight the power of their new methods.

Detection Results: The Power of Simplicity

First, looking at the binary detection task (Is this an emotion?), the impact of simplified prompts was undeniable.

Table 3: CoT results for Embodied Emotion Detection.

As seen in the table, the Simple prompts (indicated by the -simple subscript) consistently outperformed the standard prompts. For example, DeepSeek-2-step-simple achieved a significantly higher F1 score than the standard DeepSeek-2-step. This confirms that reducing linguistic complexity lowers the barrier for the model to “understand” the task.

Classification Results: BWS vs. Supervised Learning

The main event was the fine-grained classification (Which emotion is it?). Here, the Best-Worst Scaling (BWS) method shone brightly.

Table 4: Results for Emotion Classification.

Take a look at Table 4. The zero-shot Llama achieved an F1 score of 31.6. However, when using BWS with 36N tuples (36 times as many comparison tuples as there are sentences, N), the score jumped to 50.6.

Crucially, BWS (50.6) outperformed the Supervised BERT model (49.6).

This is a significant finding. It means that an LLM, without any specific training on this dataset (zero-shot inference using BWS), can outperform a smaller model that was explicitly trained on the data. This opens the door for high-quality emotion classification in scenarios where we don’t have enough data to train a supervised model.

Scaling the Comparisons

One question the researchers asked was: “How many comparisons do we need?”

Figure 7: F1-score trends for BWS when increasing the number of tuples

The graph above shows the performance of BWS as the number of tuples increases, expressed in multiples of the dataset size \(N\). The red dashed line is the supervised BERT baseline. You can see the blue BWS line climbing steadily, surpassing BERT at around 36N tuples. However, it plateaus and drops slightly afterward, suggesting there is an optimal “sweet spot” for the number of comparisons before the returns diminish.

What Body Parts Tell Us

Finally, the authors analyzed which body parts are most predictive of emotions.

Figure 3: Frequency of top 10 body parts for each emotion.

The bubble chart reveals that the Face, Eyes, and Head are the universal communicators of emotion. However, distinct patterns emerge:

  • Heart is heavily associated with Fear (racing, pounding).
  • Mouth and Lips lean toward Joy (smiling) and Surprise (gaping).
  • Throat appears frequently in Sadness (lump in throat) and Fear.

Conclusion and Implications

The “CHEER-Ekman” paper provides a fascinating leap forward in how we interpret the “body language” of text. By creating a fine-grained dataset and applying clever prompting techniques, the researchers have demonstrated that machines can learn to distinguish between the racing heart of a lover and the trembling hands of a coward.

Key Takeaways:

  1. Data Matters: The new CHEER-Ekman dataset fills a critical gap, allowing for the study of specific embodied emotions rather than just general emotional presence.
  2. Keep It Simple: In prompt engineering, specialized technical jargon often hurts performance. Plain, everyday language helps the model reason better.
  3. Comparisons > Absolutes: The Best-Worst Scaling (BWS) technique proves that LLMs are much better at comparative reasoning (A is more joyful than B) than absolute classification, allowing them to beat supervised models without training.

This research moves us one step closer to AI that doesn’t just read what we say, but understands how we feel—right down to the butterflies in our stomachs.