Social media platforms like X (formerly Twitter) are the modern world’s town squares. They are where news breaks, trends are born, and daily lives are documented. However, this town square is global, chaotic, and incredibly noisy. For researchers, data scientists, and companies, making sense of this data—organizing it into coherent topics—is a massive challenge.
While we have decent tools for classifying English content, the rest of the world is often left behind. Traditional methods struggle with the linguistic diversity of global platforms, and existing datasets are often limited to specific domains like news or lack the informal nuances of social media text.
In this deep dive, we are unpacking a significant research paper: “Multilingual Topic Classification in X: Dataset and Analysis.” The researchers behind this work have introduced X-Topic, a high-quality, multilingual dataset designed to benchmark how well artificial intelligence can understand tweets in English, Spanish, Japanese, and Greek.
We will walk through how they built this dataset, the unique challenges of multilingual classification, and how modern Large Language Models (LLMs) like GPT-4o compare to specialized fine-tuned models when put to the test.
The Problem with Current Topic Classification
Before we dive into the solution, we need to understand the problem. If you want to analyze what people are talking about on social media, you generally have two paths: Unsupervised and Supervised learning.
The Unsupervised Route
Unsupervised approaches, like Latent Dirichlet Allocation (LDA) or BERTopic, attempt to find patterns in text without being told what to look for. Imagine dumping a million tweets into a bucket and asking an algorithm to “sort these into piles.”
- Pros: You don’t need labeled data.
- Cons: The “piles” the algorithm creates are often messy. You might get a topic that mixes “cooking” with “politics” just because certain words co-occur. The results are hard to interpret and difficult to compare across different studies.
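To make the unsupervised route concrete, here is a minimal sketch using scikit-learn's LDA implementation on a tiny, made-up tweet sample. The corpus and topic count are illustrative, not from the paper:

```python
# Minimal LDA sketch with scikit-learn (illustrative corpus, not the paper's data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "great match last night, what a goal",
    "new recipe: garlic pasta in 15 minutes",
    "the election results are finally in",
    "training hard for the marathon this weekend",
    "parliament votes on the new budget today",
]

# Bag-of-words counts are the standard input for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

# Ask for 2 latent "piles"; LDA only returns word distributions, not topic names.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for idx, component in enumerate(lda.components_):
    top = [terms[i] for i in component.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top}")  # a human still has to interpret these piles
```

Notice that the output is just ranked word lists per "pile": naming and validating the topics is still manual work, which is exactly the interpretability problem described above.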
The Supervised Route
Supervised learning involves training a model on examples that humans have already labeled.
- Pros: High accuracy and clear, interpretable categories (e.g., “Sports,” “Politics”).
- Cons: You need a lot of high-quality labeled data.
The issue is that most existing supervised datasets are essentially “English News” datasets (like BBC News or Reuters). Social media text is different—it’s short, slang-heavy, and full of emojis. Furthermore, resources for languages like Greek are scarce compared to English.
Introducing X-Topic
To bridge this gap, the researchers created X-Topic. This is not just a collection of tweets; it is a carefully curated benchmark designed to test the limits of multilingual understanding.
The dataset focuses on four languages:
- English (en): The dominant language of NLP research.
- Spanish (es): A widely spoken global language.
- Japanese (ja): Linguistically distant from English, using a completely different writing system.
- Greek (gr): A “lower-resource” language that is less frequently studied in computational linguistics.
The Taxonomy
To classify the tweets, the team used a taxonomy of 19 topics originally proposed in previous work (TweetTopic). These range from “Arts & Culture” and “Politics” to “Diaries & Daily Life.”

As shown in Table 1 above, the classification is multi-label. This is crucial because social media content is rarely one-dimensional. A tweet about a Taylor Swift concert isn’t just about “Music”; it’s also about “Celebrity & Pop Culture.” A tweet about a date at a museum touches on “Relationships,” “Arts & Culture,” and “Diaries & Daily Life.”
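In practice, a multi-label target is just a binary vector over the taxonomy. A minimal sketch using scikit-learn's MultiLabelBinarizer is below; the label subset and examples are illustrative, not taken from the dataset:

```python
# Multi-label targets: each tweet maps to a binary vector over the taxonomy.
from sklearn.preprocessing import MultiLabelBinarizer

# A few of the 19 TweetTopic labels, for illustration only.
topics = ["arts_&_culture", "celebrity_&_pop_culture", "diaries_&_daily_life",
          "music", "relationships"]

examples = [
    {"music", "celebrity_&_pop_culture"},                        # concert tweet
    {"relationships", "arts_&_culture", "diaries_&_daily_life"},  # museum-date tweet
]

mlb = MultiLabelBinarizer(classes=topics)
Y = mlb.fit_transform(examples)
print(Y)
# [[0 1 0 1 0]
#  [1 0 1 0 1]]
```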
Constructing the Benchmark
Creating a high-quality dataset is more science than art. The researchers didn’t just scrape random tweets; they followed a rigorous pipeline to ensure the data represented reality.
1. Collection and Sampling
Unlike previous datasets that used keywords (e.g., searching for “football” to find sports tweets), X-Topic used a random sampling approach. They pulled 50 tweets every two hours for each language over a year-long period (September 2021 to August 2022).
Why does this matter? Keyword-based collection introduces bias. If you only search for specific words, you only find what you’re looking for. Random sampling captures the true distribution of what people actually talk about.
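As a quick sanity check on scale (my own arithmetic, assuming the collection ran without gaps), this schedule lines up with the raw counts reported in the next step:

```python
# Rough volume check for the sampling schedule (assumes no downtime).
tweets_per_pull = 50
pulls_per_day = 24 // 2   # one pull every two hours
days = 365                # September 2021 to August 2022
per_language = tweets_per_pull * pulls_per_day * days
print(per_language)       # 219000 -> roughly the ~220k raw tweets per language
```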
2. Preprocessing and Filtering
Raw social media data is noisy. The team started with roughly 220,000 tweets per language but whittled this down significantly.

As detailed in the table above (top section), the filtering process was aggressive:
- Language Detection: Ensuring a “Spanish” tweet is actually Spanish.
- Quality Control: Removing incomplete sentences or abusive content.
- Near-Duplication: Removing retweets or copy-pasted content to ensure variety.
- Privacy: All user mentions were masked with {USER} to protect privacy.
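As an illustration of that privacy step, a simple regex-based masker along these lines would replace @-mentions with the {USER} placeholder; the authors' exact masking rules may differ:

```python
import re

# Replace @-mentions with a placeholder; the paper's exact masking rules may differ.
MENTION_RE = re.compile(r"@\w+")

def mask_mentions(text: str) -> str:
    return MENTION_RE.sub("{USER}", text)

print(mask_mentions("@alice have you seen this? cc @bob_42"))
# {USER} have you seen this? cc {USER}
```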
From the cleaned pool, they sampled 1,000 tweets per language to be annotated. To ensure they were annotating interesting content, they weighted the sample based on popularity (retweets and followers), assuming that widely shared content is generally higher quality.
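A popularity-weighted draw could look roughly like the sketch below. The weighting formula here (log-scaled retweets plus followers) is my own stand-in for whatever scheme the authors actually used:

```python
import numpy as np

# Hypothetical popularity-weighted sampling; the paper's exact weighting may differ.
def weighted_sample(tweets, retweets, followers, k=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Log-scale so a handful of viral tweets does not dominate the sample.
    weights = np.log1p(np.asarray(retweets) + np.asarray(followers))
    probs = weights / weights.sum()
    idx = rng.choice(len(tweets), size=k, replace=False, p=probs)
    return [tweets[i] for i in idx]
```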
3. Human Annotation
This is the gold standard of the dataset. They didn’t use AI to label the data; they used humans. Specifically, they used the platform Prolific.co rather than Amazon Mechanical Turk, ensuring better fluency in the target languages.
Five different annotators looked at each tweet. A topic was only assigned if at least two annotators agreed on it. This “inter-annotator agreement” is a vital metric for quality.
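The "at least two out of five" rule is easy to express in code. Here is a small sketch of that aggregation; the label names and votes are made up:

```python
from collections import Counter

# Keep a topic only if at least 2 of the 5 annotators selected it.
def aggregate(annotations, min_votes=2):
    votes = Counter(label for ann in annotations for label in ann)
    return {label for label, n in votes.items() if n >= min_votes}

# Five annotators' label sets for one hypothetical tweet:
annotations = [
    {"music", "celebrity_&_pop_culture"},
    {"music"},
    {"diaries_&_daily_life"},
    {"music", "celebrity_&_pop_culture"},
    {"celebrity_&_pop_culture"},
]
print(aggregate(annotations))  # {'music', 'celebrity_&_pop_culture'}
```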

Table 2 highlights that agreement (Alpha) was generally low (around 0.23-0.26). This isn’t a failure of the annotators; rather, it reflects the subjectivity of social media. Deciding whether a tweet is “Diaries & Daily Life” or “Family” can be ambiguous. However, these scores are on par with other complex emotion/sentiment datasets.
Analyzing the Data: What Do We Talk About?
Once the dataset was built, the researchers analyzed the distribution of topics. The results reveal cultural similarities and differences.

As seen in Figure 1, one topic dominates across all four languages: Diaries & Daily Life. This confirms that despite X being a platform for news and politics, its primary function for many users is still a digital journal.
However, cultural nuances appear in the secondary topics:
- In English, Spanish, and Greek, the second most popular topic was News & Social Concern.
- In Japanese, the second most popular topic was Other Hobbies.
This suggests that the usage of the platform varies by region—some cultures use it more for news consumption, while others use it for hobbyist communities.
Topic Overlap
Because this is a multi-label dataset, it allows us to see how topics interact.

The heatmap above (Figure 2) illustrates these connections. Strong correlations exist between:
- Music and Celebrity & Pop Culture (45% overlap).
- Family and Diaries & Daily Life (79% overlap).
These overlaps make intuitive sense but also highlight why the classification task is so difficult for machines—the boundaries between topics are fluid.
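The kind of overlap shown in the heatmap can be computed directly from the multi-label matrix. A rough sketch (conditional co-occurrence, my own formulation rather than the paper's exact definition) looks like this:

```python
import numpy as np

# Y: (n_tweets, n_topics) binary label matrix; overlap[i, j] ~ P(topic j | topic i).
def topic_overlap(Y):
    Y = np.asarray(Y, dtype=float)
    co = Y.T @ Y                  # pairwise co-occurrence counts
    counts = np.diag(co).copy()
    counts[counts == 0] = 1       # avoid division by zero for unused topics
    return co / counts[:, None]

Y = np.array([[1, 1, 0],          # e.g. Music + Celebrity & Pop Culture
              [1, 0, 0],
              [0, 1, 1]])
print(topic_overlap(Y).round(2))
```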
The Experiments: Man vs. Machine vs. Machine
With the X-Topic benchmark established, the researchers conducted extensive experiments to see which AI models could best handle this multilingual challenge.
The Contenders
They compared two main categories of models:
- Fine-Tuned Models: These are pre-trained models (like BERT or RoBERTa) that are then further trained (fine-tuned) specifically on this dataset.
- XLM-R: A massive multilingual model trained on 100 languages using Common Crawl data.
- XLM-T: A version of XLM-R that was further trained on millions of tweets. This gives it domain-specific knowledge.
- Bernice: Another Twitter-specific model.
- Zero/Few-Shot LLMs: These are massive generative models that are not trained on this specific dataset. They are simply given a prompt (instructions) and asked to classify the tweet.
- BLOOMZ & mT0: Open-source multilingual models.
- ChatGPT (GPT-3.5) & GPT-4o: The leading commercial LLMs from OpenAI.
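To make the zero-shot setting concrete, a classification prompt might look like the sketch below, using the current OpenAI Python client. The prompt wording, topic string, and output parsing are illustrative; the paper's actual prompts differ:

```python
# Illustrative zero-shot prompt; not the paper's exact prompt or parsing logic.
from openai import OpenAI

TOPICS = "arts & culture, business & entrepreneurs, celebrity & pop culture, ..."  # all 19 labels

def classify(tweet: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "Assign one or more of the following topics to the tweet.\n"
        f"Topics: {TOPICS}\n"
        f"Tweet: {tweet}\n"
        "Answer with a comma-separated list of topics."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

A few-shot variant simply prepends a handful of labeled example tweets to the prompt, which is how the 5-example setting discussed later works.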
The Setup
They tested these models in various settings:
- Monolingual: Training on Spanish data and testing on Spanish data.
- Cross-lingual: Training on English data (the older TweetTopic dataset) and testing on Spanish/Japanese/Greek. This tests if the model can “transfer” knowledge across languages.
- Multilingual: Training on data from all languages combined.
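Mechanically, the fine-tuned settings all reduce to training a multi-label classifier head on top of the chosen encoder. The sketch below loads an XLM-T checkpoint with Hugging Face transformers; the hyperparameters, label index, and training loop details are illustrative, not the paper's exact configuration:

```python
# Minimal multi-label fine-tuning setup with Hugging Face transformers
# (illustrative configuration; not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-xlm-roberta-base"   # a public XLM-T checkpoint
NUM_TOPICS = 19

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=NUM_TOPICS,
    problem_type="multi_label_classification",  # sigmoid + BCE loss instead of softmax
)

batch = tokenizer(["{USER} what a match tonight!"], return_tensors="pt",
                  truncation=True, padding=True)
labels = torch.zeros((1, NUM_TOPICS))
labels[0, 17] = 1.0   # pretend index 17 is "Sports" for this example

outputs = model(**batch, labels=labels)
print(outputs.loss)   # the loss a fine-tuning loop would backpropagate
```

Swapping the training data behind this setup (one language, English-only, or all languages pooled) is what distinguishes the monolingual, cross-lingual, and multilingual settings above.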
Key Results: What Works Best?
The results, summarized in Table 3, offer several critical insights for the field of NLP.

1. Domain Specificity is King
If you look at the fine-tuned section of Table 3, you will see that XLM-T consistently outperforms XLM-R.
- Why? XLM-R was trained on general web data (Wikipedia, Common Crawl). XLM-T was adapted using tweets. It understands hashtags, mentions, and the informal grammar of social media.
- The Lesson: Even massive multilingual models benefit significantly from being adapted to the specific domain (social media) they are analyzing.
2. Multilingual Training Boosts Performance
Models trained on “All” languages (the Multilingual setting) performed roughly 17 points better (Macro-F1) than those trained only on a single language.
- Why? This is the power of Cross-Lingual Transfer. The model learns what a “Sports” tweet looks like in English and Spanish. When it sees a Greek tweet about sports, it can leverage patterns it learned from the other languages to make a better prediction, even if the Greek training data is sparse.
3. Fine-Tuning vs. LLMs
Here is perhaps the most interesting finding for modern practitioners. Despite the hype surrounding Large Language Models:
- Fine-tuned models (like XLM-T) generally outperformed LLMs (like GPT-4o).
- The best fine-tuned model achieved a Macro-F1 of roughly 60-74% depending on the language.
- GPT-4o in a few-shot setting (given 5 examples) achieved decent scores (around 50-60%) but lagged behind the specialized models.
However, GPT-4o showed remarkable consistency. While smaller open-source LLMs (BLOOMZ) failed catastrophically in non-English languages, GPT-4o maintained respectable performance across Japanese and Greek, proving it has strong multilingual generalization capabilities even without specific training.
4. The “English Bias”
Almost all models performed best on English and worst on Japanese or Greek. This highlights that “Multilingual” models are often still “English-centric” under the hood. The unique script and linguistic structure of Japanese, combined with the lower volume of Greek training data in pre-training corpora, make them harder targets.
Error Analysis: Where Do Machines Fail?
To understand why the models failed, the researchers looked at the specific topics causing trouble.

Table 4 shows the topics with the highest False Negative rates (where the model failed to detect the topic).
- The Hardest Topics: “Arts & Culture” and “Other Hobbies.”
- The Reason: These are broad, diverse categories. “Other Hobbies” could be anything from stamp collecting to skydiving. It lacks a consistent vocabulary for the model to latch onto.
- The Easier Topics: “Sports” and “Gaming” tend to have very specific vocabulary (names of teams, “gameplay,” “match”), making them easier to classify.
Interestingly, GPT-4o struggled specifically with “Business” in Japanese and “Youth” in English, showing that different architectures have different blind spots.

Table 5 provides a look at Precision (accuracy when it predicts a label) vs. Recall (ability to find all correct labels).
- GPT-4o tends to have high precision but lower recall. It is conservative; it doesn’t want to guess wrong, so it often misses labels.
- XLM-T is more balanced.
- The LLMs (ChatGPT in particular) tended to misjudge how many labels to assign. While humans averaged about 1.8 topics per tweet, ChatGPT often predicted noticeably fewer or more, depending on how the prompt was worded.
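For readers who want to reproduce these kinds of numbers, macro precision, recall, and F1 over a multi-label prediction matrix are a single scikit-learn call. The gold and predicted arrays below are made up:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up gold and predicted label matrices of shape (n_tweets, n_topics).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
# Macro-averaging treats every topic equally, which is why rare, hard topics
# (like "Other Hobbies") drag the reported scores down.
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```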
Conclusion and Future Implications
The X-Topic paper is a significant step forward for multilingual Natural Language Processing. It moves us away from English-centric, news-focused benchmarks and into the messy, real world of global social media.
Key Takeaways:
- Context Matters: You cannot rely on generic web models for social media; domain adaptation (like XLM-T) is essential.
- Together is Better: Training on multiple languages simultaneously helps the model improve on all of them.
- LLMs are Good, but Specialist Models are Better: For specific classification tasks, a fine-tuned model is still more accurate (and much cheaper to run) than a massive generative model like GPT-4o.
This dataset provides a playground for future research. It highlights the difficulty of low-resource languages like Greek and the challenge of interpreting “slice of life” content that dominates our feeds. As social media continues to connect the world, tools like X-Topic will be essential for understanding the global conversation.