If you were asked to describe how you feel after a long, difficult week that ended with a small victory, you probably wouldn’t just say “happy” or “sad.” You might say you feel relieved, drained, accomplished, or bittersweet.
Human emotion is high-dimensional and nuanced. Yet, for years, Natural Language Processing (NLP) has treated emotion analysis as a simple sorting task. Most systems try to force complex human sentences into a tiny set of boxes—usually the “Basic Six” proposed by psychologist Paul Ekman (Anger, Disgust, Fear, Joy, Sadness, Surprise).
While this categorical approach is easy to engineer, it fails to capture the reality of how we write about our feelings. A new paper from Columbia University, titled “MASIVE: Open-Ended Affective State Identification in English and Spanish,” proposes a radical shift. Instead of classifying text into a handful of buckets, the researchers introduce a method to identify a practically unbounded set of “affective states.”
In this post, we will break down how they built a dataset of over 1,000 emotional labels, why small models are beating massive LLMs at this task, and why machine translation might be ruining your multilingual analysis.
The Problem: The “Basic Emotion” Bottleneck
The dominant paradigm in emotion analysis is classification. You take a tweet or a review, feed it into a model, and the model outputs a label like “Positive” or “Angry.”
The problem is that these label sets were rarely designed for text. The Ekman emotions, for example, were originally defined based on facial expressions, not written language. When we apply these rigid categories to text, we lose nuance. Furthermore, this approach is often Anglocentric—assuming that English emotion categories map perfectly onto other languages and cultures.
The researchers argue for a descriptive approach: rather than asking "Which of these six emotions is this?", they ask the model to predict the exact words an author would use to describe their state.

As shown in Figure 1, a standard model (Ekman Output) might label a complex sentence simply as “happy.” However, the text actually conveys feelings of being valued and seen. In the Spanish example, the generic label “feliz” (happy) completely misses the nuance of feeling querido (loved), aceptado (accepted), or desesperado (desperate).
To solve this, the authors define a new task: Affective State Identification (ASI).
Core Method: How to Catch a Feeling
The goal of ASI is to predict specific words—called affective states—that describe an emotional experience. These include emotions (short-term), moods (long-term), and even figurative expressions (like feeling “blue”).
To train a model to do this, you need data. But you can’t just hand-label thousands of distinct emotions; it would be too expensive and subjective. Instead, the researchers used a clever bootstrapping technique to scrape data from Reddit.
The Bootstrapping Loop
The researchers didn’t start with a list of 1,000 emotions. They started with just the basics (Ekman seeds like “happy”, “sad”, “angry”) and let the data teach them the rest.

As illustrated in Figure 2, the process works in four steps:
- Seed Terms: Start with basic adjectives (e.g., “happy”).
- Query Templates: Search Reddit for posts containing phrases like “I feel [seed] and…” or “I don’t feel [seed] and…”.
- Extraction: When a post is found, look at the word connected to the seed. If someone writes “I feel happy and proud,” the system extracts “proud.”
- Expansion: “Proud” becomes a new seed term. The process repeats, searching for “I feel proud and…” to find new words.
This method assumes that if a word is used as an adjective alongside an emotion, it is likely an affective state itself.
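A minimal sketch of this loop is shown below. The regular expression, the `search_posts` query function, and the stopping rule are simplified assumptions for illustration, not the authors' exact pipeline.

```python
# Bootstrapping an affective-state vocabulary from "I feel X and Y" patterns.
import re
from typing import Callable, Iterable

PATTERN = re.compile(r"i (?:do not |don't )?feel (\w+) and (\w+)", re.IGNORECASE)

def bootstrap_affective_states(
    seeds: set[str],
    search_posts: Callable[[str], Iterable[str]],
    max_rounds: int = 3,
) -> set[str]:
    """Expand a seed vocabulary by mining 'I feel X and Y' templates."""
    vocab = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        new_terms: set[str] = set()
        for seed in frontier:
            # Query posts containing templates like "I feel happy and ...".
            for post in search_posts(f'"I feel {seed} and"'):
                for left, right in PATTERN.findall(post):
                    # Keep the word conjoined with the seed as a candidate state.
                    candidate = right.lower() if left.lower() == seed else left.lower()
                    if candidate not in vocab:
                        new_terms.add(candidate)
        if not new_terms:
            break
        vocab |= new_terms
        frontier = new_terms  # newly found terms become the next round's seeds
    return vocab

# Toy usage with a fake corpus standing in for a live Reddit search:
corpus = [
    "I feel happy and proud of what we built.",
    "Honestly I don't feel proud and accomplished, just tired.",
]
print(bootstrap_affective_states({"happy"}, lambda query: corpus))
# -> {'happy', 'proud', 'accomplished'}
```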
The MASIVE Dataset
This process resulted in MASIVE (Multilingual Affective State Identification with Varied Expressions). It is a massive corpus of Reddit posts containing:
- English: ~93,000 training examples with 1,627 unique affective states.
- Spanish: ~31,000 training examples with 1,002 unique affective states.
The dataset captures the “long tail” of emotion—words that are rarely found in standard datasets but are essential for understanding human experience, such as giddy, euphoric, grumpy, terrified, disconnected, or empty.
Crucially, the dataset includes negations (“I don’t feel…”) and grammatical gender in Spanish (masculine/feminine adjectives), pushing models to understand linguistic structure, not just keywords.

Table 9 above provides examples of the raw data. Notice how the texts are not simple sentences; they are complex narratives about waiting lists, relationships, and customer service frustrations. The models must infer feelings like discouraged or cheated from these dense contexts.
Experiments: David vs. Goliath
The researchers tested how well different AI models could perform the ASI task. They framed it as a Masked Span Prediction task.
They took the sentences, hid the affective state word (e.g., “I feel [MASK]”), and asked the model to fill in the blank. They compared two types of models:
- Small, Fine-Tuned Models: T5 and mT5 (Multilingual T5). These are older, smaller models fine-tuned specifically on the MASIVE dataset.
- Large Language Models (LLMs): Llama-3 and Mixtral. These are massive, state-of-the-art generative models used in a zero-shot setting (prompted without specific training).
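A minimal inference sketch of the masked-span setup, assuming an off-the-shelf mT5 checkpoint from Hugging Face (in practice, a MASIVE-fine-tuned model would replace the base weights); the example sentence and decoding settings are illustrative:

```python
# Fill-in-the-blank prediction of an affective state with an mT5-style model.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# T5-family models mark the hidden span with a sentinel token.
text = "After the waiting list finally cleared, I feel <extra_id_0> and relieved."
inputs = tokenizer(text, return_tensors="pt")

# Generate several candidate fills; these are the top-k guesses that the
# Top-k Accuracy / Similarity metrics described next are scored against.
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
    max_new_tokens=5,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```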
Metrics: Accuracy and Similarity
Grading a model on this is hard. If the correct word is “furious” and the model guesses “angry,” strictly speaking, it is wrong. But semantically, it is close.
To handle this, the researchers used two main metrics:
- Top-k Accuracy: Did the correct word appear in the model’s top 1, 3, or 5 guesses?
- Top-k Similarity: How semantically close were the model’s guesses to the correct word?
They calculated similarity using contextual embeddings: essentially, they checked whether the embedding of the predicted word lies close to the embedding of the actual word.
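One plausible formulation, assuming cosine similarity between contextual embeddings and scoring the best match among the top-\(k\) candidates (the paper's exact aggregation may differ), is:

$$\text{Sim@}k = \max_{1 \le i \le k} \; \frac{\mathbf{e}(\hat{y}_i) \cdot \mathbf{e}(y)}{\lVert \mathbf{e}(\hat{y}_i) \rVert \, \lVert \mathbf{e}(y) \rVert}$$

where \(y\) is the gold affective state, \(\hat{y}_1, \dots, \hat{y}_k\) are the model's candidates, and \(\mathbf{e}(\cdot)\) denotes the contextual embedding of a word in its sentence.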

Result 1: Small Specialists Beat Large Generalists
The results were surprising for those riding the LLM hype train.

As Table 4 shows, the fine-tuned mT5 model significantly outperformed the massive Llama-3 and Mixtral models.
- In English, mT5 achieved 17.91% accuracy (Top-1), while Llama-3 achieved only 1.29%.
- In Spanish, mT5 reached 24.51% accuracy, compared to Llama-3's 2.52%.
Why? LLMs are generalists. Without specific fine-tuning, they struggle to pinpoint the exact affective vocabulary an author would use, often generating generic responses or failing to follow the strict formatting of the task. The smaller model, trained specifically on the distribution of Reddit emotional language, became an expert in this domain.
Result 2: Learning MASIVE Helps with Basic Emotions
One might worry that training on 1,000+ loose labels would confuse a model when it needs to do standard classification. The researchers tested this by taking their MASIVE-trained model and fine-tuning it again on standard emotion benchmarks (like GoEmotions, which uses fixed categories).
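A minimal sketch of this two-stage recipe, assuming a text-to-text setup with Hugging Face Transformers; the model name, toy examples, and single-step training loop are illustrative placeholders, not the authors' exact configuration:

```python
# Stage 1: masked-span prediction on MASIVE-style data (open vocabulary).
# Stage 2: re-fine-tune the same weights on a fixed-label benchmark
#          (e.g., GoEmotions recast as text-to-text classification).
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(source: str, target: str) -> float:
    """One gradient step on a single (input text, output text) pair."""
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Stage 1: open-ended affective state identification.
train_step("I feel <extra_id_0> and exhausted after this week.",
           "<extra_id_0> drained")

# Stage 2: fixed-category emotion classification, reusing the same weights.
train_step("emotion: We finally shipped the project!", "joy")
```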

Table 5 demonstrates a clear transfer learning benefit. The rows labeled \(mT5^{MAS}\) (the model pre-trained on MASIVE) generally show higher F1 scores (a balance of precision and recall) than the vanilla mT5 model.
This suggests that learning the nuances of “giddy,” “content,” and “elated” helps the model understand the broader concept of “joy” better than if it had only ever seen the label “joy.”
Deep Dive: Linguistic Nuance and Translation
The study went beyond simple accuracy numbers to look at how models handle the complexities of language.
Gender and Negation
Spanish is a gendered language. If an author writes “Estoy [MASK]” (I am…), the adjective must match the gender of the speaker.

Figure 3 (Left) shows that the fine-tuned mT5 model (the red and blue bars) performs consistently well across feminine and masculine adjectives. The LLMs (green bars), however, show a bias toward masculine forms or fail to align gender correctly.
The charts on the right show negation. Handling “I am not happy” is tricky for AI, which often ignores the “not” and focuses on “happy.” The fine-tuned models successfully learned to predict negative states (predicting “sad” when the input is “not happy”), further validating the dataset quality.
The Failure of Machine Translation
In multilingual NLP, a common shortcut is “Translate-Train” or “Translate-Test.” If you don’t have Spanish data, you translate your English data to Spanish to train the model, or you translate your Spanish test data to English to evaluate it.
The researchers tested this hypothesis and found it lacking.
- Native Data is King: Models trained on native Spanish texts (written by Spanish speakers) vastly outperformed models trained on translated English texts.
- Translation Artifacts: Machine translation often standardizes emotion. It might translate five different nuances of “sadness” into the same generic Spanish word triste. This destroys the very granularity ASI tries to capture.
This finding is a strong warning for NLP practitioners: You cannot rely on Google Translate to build high-quality cultural or emotional AI. You need data authored by native speakers.
The Challenge of Regional Dialects
Finally, the researchers collected a “Challenge Set” of regional Spanish terms (slang and dialects from Mexico, Spain, Venezuela, etc.).

While the models performed well on standard language, performance dropped significantly on unseen or regional terms (as seen in the “Unseen” rows in Table 6). This highlights a remaining gap in the field: current models are still biased toward “standard” or dominant dialects and struggle with the rich diversity of regional emotional expression.
Conclusion: The Future of Affective AI
The MASIVE paper makes a compelling case that we need to stop putting human emotions into boxes. By shifting from classification (picking a category) to identification (generating the descriptive word), we can build AI that understands the difference between being angry and being indignant.
Key takeaways for students and researchers:
- Nuance Matters: Training on open-ended vocabularies improves performance on standard tasks.
- Small Models Can Win: For specific, nuanced tasks, a fine-tuned T5 often beats a zero-shot LLM. Don’t assume bigger is always better.
- Culture Cannot be Translated: To capture true emotional sentiment in other languages, you need native data. Translation erases the cultural fingerprint of emotion.
As AI becomes more integrated into mental health support, customer service, and social analysis, the ability to distinguish grief from despair, or joy from relief, will be the difference between a robotic response and true understanding.