Introduction
If you spend any time on social media, you know that a picture of a Shiba Inu dog isn’t always just a cute pet photo. In specific corners of the internet, it might be a political statement, a form of activism, or a “bonk” against propaganda. Similarly, a tweet predicting the winner of a song contest might actually be a coded expression of national solidarity during wartime.
These are examples of practices—distinct patterns of behavior and linguistic expression that define online communities. They are rich in “social meaning,” relying on inside jokes, shared values, and specific vernacular that outsiders might miss completely. For computer scientists and sociologists, detecting these practices at scale is a massive challenge. Traditional text classification models often fail to grasp the sarcasm, the context, or the intent behind a post.
So, the question arises: Can we teach Large Language Models (LLMs) like GPT-4 to act like sociologists? Can they learn to decipher the “vibe” and intent of a community, not just the literal text?
In this post, we are deep-diving into the research paper “Detecting Online Community Practices with Large Language Models.” We will explore how researchers built a “gold-standard” dataset of pro-Ukrainian communities (specifically NAFO and Eurovision fans) and used advanced prompting techniques to identify complex social practices. Whether you are a student of NLP or digital sociology, this study offers a fascinating blueprint for bridging the gap between qualitative understanding and computational scale.
Background: What Are “Practices” and Why Do They Matter?
Before we jump into the algorithms, we need to understand the sociological problem. Online communities aren’t just groups of people talking about a topic; they are groups of people doing things together through text.
The authors draw on the concept of speech acts. This is the idea that when we speak (or tweet), we aren’t just transmitting information; we are performing an action.
- Locution: What is said (the literal text).
- Illocution: What is meant (the intent).
- Perlocution: What is done (the effect).
For example, if a user in the NAFO (North Atlantic Fella Organization) community tweets a meme mocking a diplomat, they aren’t just sharing a joke. They are engaging in “Shitposting”—a specific practice designed to derail propaganda and provoke opponents. If a Eurovision Song Contest (ESC) fan tweets detailed statistics about Ukraine’s voting history, they are engaging in “Knowledge Performance”—affirming their status as an expert and valuing the event.
The Challenge of Scale
Traditionally, identifying these practices required human experts to read tweets one by one—an approach known as qualitative analysis. It’s accurate but impossible to scale to millions of tweets. On the other hand, standard machine learning classifiers often miss the social nuance. The researchers in this study aimed to find a middle ground: using the reasoning capabilities of LLMs to replicate human analysis at scale.
The Case Studies
The paper focuses on two distinct communities supporting Ukraine during the Russia-Ukraine war:
- NAFO: A chaotic, humorous, self-mobilized collective known for using “Fella” avatars (Shiba Inus) to mock Russian disinformation.
- Eurovision (ESC) Fans: A more established community that used the song contest to express solidarity with Ukraine (the 2022 winners).
The Core Method: From Human Insight to LLM Prompting
The heart of this paper is its methodological workflow. The authors didn’t just throw data at ChatGPT. They carefully constructed a framework to teach the model how to think like a community member.
Step 1: The Analytical Schema
To detect a practice, you have to break down the cognitive process of understanding a tweet. The researchers adopted a schema from sociologist Silvia Gherardi.
Take a look at the diagram below. It visualizes the mental leap from raw text to a labeled practice.

In Figure 1, you can see two examples of this workflow:
- Left (Eurovision): The user comments on countries going to the finals. The meaning is a prediction based on evaluation. The action is “affirming one’s taste.” The resulting practice is Knowledge Performance.
- Right (NAFO): The user sarcastically apologizes for “NAFO service.” The meaning is that the opponent’s frustration is ridiculous. The action is playfully dismissing the adversary. The practice is Shitposting.
This “Said → Meant → Done” logic becomes the foundation for their prompt engineering later on.
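To make this schema concrete, here is a minimal sketch of how one annotation could be represented as structured data. The field names and wording are illustrative assumptions for this post, not something the paper prescribes.

```python
from dataclasses import dataclass

@dataclass
class PracticeAnnotation:
    """One pass through the 'Said -> Meant -> Done' schema for a single tweet."""
    said: str      # the literal text of the tweet
    meant: str     # the inferred intent
    done: str      # the social action being performed
    practice: str  # the resulting practice label

# The NAFO example from Figure 1, paraphrased for illustration.
nafo_example = PracticeAnnotation(
    said="Sarcastic apology for the 'NAFO service' the opponent received",
    meant="The opponent's frustration is ridiculous",
    done="Playfully dismissing the adversary",
    practice="Shitposting",
)
```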
Step 2: Building the Gold Standard
To train and test the models, the researchers needed ground truth data. They collected over 4 million NAFO tweets and nearly 600,000 ESC tweets. They then filtered these down and performed rigorous human annotation.
Interestingly, they didn’t just guess what the practices were. They interviewed 27 community members, asking them to scroll through their timelines and explain why they posted certain things. This resulted in a codebook of specific practices.

Table 1 shows the distribution of these practices. Notice the variety:
- Mobilising: Calling others to action (very high in NAFO).
- Community Work: Building cohesion among members.
- News Curation: Sharing information (very high in ESC).
- Audiencing: Participating in the live event (the highest category for ESC).
Step 3: Prompt Engineering Strategies
This is where the study shines for NLP students. The authors tested three levels of prompt sophistication to see how well LLMs could learn these sociological concepts.
1. Practice Description (PD) Prompts
This was the baseline approach. The prompt provided the LLM with the name of the practice and a standard definition (e.g., “Shitposting: Tweets that contain humorous… content to highlight flaws of propaganda”). The model was given a few examples (few-shot learning) and asked to classify new tweets.
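To see what this looks like, here is a rough sketch of how a PD few-shot prompt could be assembled. The definitions, example tweets, and the `build_pd_prompt` helper are illustrative placeholders, not the authors' exact wording.

```python
# Illustrative practice definitions, loosely paraphrased from the post.
practice_definitions = {
    "Shitposting": "Tweets that contain humorous or ironic content to "
                   "highlight flaws of propaganda and provoke opponents.",
    "Mobilising": "Tweets that call other community members to action.",
}

# Hypothetical few-shot examples (K examples per practice).
few_shot_examples = [
    ("Time to bonk this vatnik nonsense. #NAFO", "Shitposting"),
    ("Fellas, the fundraiser is live -- donate if you can!", "Mobilising"),
]

def build_pd_prompt(tweet: str) -> str:
    """Assemble a Practice Description prompt: definitions, examples, then the target tweet."""
    lines = ["Classify the tweet into one of the following practices.\n"]
    for name, definition in practice_definitions.items():
        lines.append(f"{name}: {definition}")
    lines.append("\nExamples:")
    for text, label in few_shot_examples:
        lines.append(f'Tweet: "{text}"\nPractice: {label}')
    lines.append(f'\nTweet: "{tweet}"\nPractice:')
    return "\n".join(lines)

print(build_pd_prompt("Another masterclass in Kremlin cope today."))
```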
2. PD + MPE Prompts (Adding Nuance)
The researchers realized that definitions weren’t enough. They enriched the prompts with MPE:
- Markers: Specific slang or hashtags (e.g., “#bonk”, “vatnik”, “Slava Ukraini”).
- Prioritisation: Instructions on which label to choose if a tweet fits two categories (e.g., “If it’s funny but attacks a diplomat, prioritize Shitposting over Play”).
- Exclusion: Specific rules on what not to include.
This mimics the instructions given to human annotators, effectively “uploading” the domain expertise into the prompt.
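Continuing the sketch above, the MPE material could be injected as an extra block between the definitions and the few-shot examples. The marker lists and rules here are loosely paraphrased from this post, not the paper's actual annotation instructions.

```python
# Illustrative MPE block; builds on build_pd_prompt() from the previous sketch.
mpe_block = """
Markers (slang and hashtags typical of each practice):
- Shitposting: #bonk, "vatnik", Fella memes
- Expressing solidarity: "Slava Ukraini"

Prioritisation:
- If a tweet is humorous but attacks a propagandist or diplomat, prioritise Shitposting over Play.

Exclusion:
- Do not label plain links to news articles, without commentary, as Shitposting.
"""

def build_pd_mpe_prompt(tweet: str) -> str:
    """Insert the MPE block between the practice definitions and the examples."""
    base = build_pd_prompt(tweet)
    head, sep, tail = base.partition("\nExamples:")
    return head + mpe_block + sep + tail
```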
3. PD + COT (Chain-of-Thought)
Recall Figure 1? The researchers used Chain-of-Thought (COT) prompting to force the LLM to go through that exact analytical process. Instead of just asking for a label, the prompt required the model to output:
- “What is said”
- “What is meant”
- “What is done”
- Finally, the Practice label.
By forcing the model to “show its work” and reason step-by-step, they hoped to improve accuracy on subtle categories like sarcasm.
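Sketched on top of the same hypothetical prompt builder, a PD + COT prompt might look like the following. The reasoning instructions are paraphrased from the Said/Meant/Done schema; the paper's exact prompt text differs.

```python
# Illustrative COT instructions following the Said -> Meant -> Done schema.
cot_instructions = """
Before giving a label, reason step by step:
1. What is said: restate the literal content of the tweet.
2. What is meant: describe the intent behind it.
3. What is done: describe the social action the tweet performs.
Only then output the final line as "Practice: <label>".
"""

def build_pd_cot_prompt(tweet: str) -> str:
    """Append the reasoning instructions before the target tweet.
    Reuses build_pd_prompt() from the PD sketch above."""
    base = build_pd_prompt(tweet)
    head, sep, target = base.rpartition("\nTweet:")
    return head + "\n" + cot_instructions + sep + target
```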
Experiments & Results
The researchers compared OpenAI’s models (GPT-3.5 and GPT-4) against open-source baselines: a Support Vector Machine (SVM) classifier and SetFit (a few-shot fine-tuning framework built on Sentence Transformers). They measured performance with F1 scores, which balance precision and recall.
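As a rough sketch of this kind of evaluation (not the authors' actual script), per-practice and macro-averaged F1 can be computed with scikit-learn; the gold labels and predictions below are made up for illustration.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up gold labels and model predictions, for illustration only.
gold = ["Shitposting", "Mobilising", "Shitposting", "Not Applicable"]
pred = ["Shitposting", "Shitposting", "Arguing", "Not Applicable"]

macro_f1 = f1_score(gold, pred, average="macro", zero_division=0)
print(f"Macro F1: {macro_f1:.2%}")                         # averages F1 across all practices
print(classification_report(gold, pred, zero_division=0))  # per-practice precision/recall/F1
```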
Baseline vs. LLMs
First, let’s look at how standard models compared to the LLMs.

Table 2 reveals a clear winner.
- Baselines struggle: The SVM and SetFit models (even with 8 examples) hovered around F1 scores of 20-30. They struggled significantly with practices that require inferring intent, like “Self-promotion” or “Expressing solidarity.”
- LLMs dominate: GPT-4 (using basic PD prompts) achieved scores near 50 immediately, even with zero examples (K=0). This demonstrates that the pre-training of these massive models contains enough “world knowledge” to understand social context better than smaller, fine-tuned models.
The Power of Prompting
The most exciting finding is how much better GPT-4 got when the prompts were improved with sociological insights.

Table 3 shows the progression:
- PD (Basic): ~46-49% F1 score.
- PD + MPE: Adding the Markers, Prioritisation, and Exclusion criteria raised the score significantly (e.g., to 52.39% for NAFO).
- PD + COT: Adding reasoning steps also provided a major boost.
- The Combination (PD+COT+MPE): When everything was combined—definitions, slang markers, exclusion rules, and step-by-step reasoning—the performance peaked (56.88% for NAFO, 58.71% for ESC).
This confirms that for complex social tasks, how you ask is just as important as what you ask. The model needs the “vernacular” (MPE) to recognize the text and the “reasoning” (COT) to understand the intent.
Detailed Performance Analysis
Let’s look at exactly which practices benefited from these strategies.

In Table 11, look at the NAFO section:
- Expressing Solidarity: The score jumped from 39.75 (PD) to 63.66 (Combined). Why? Likely because the MPE prompts included specific slang markers like “Slava Ukraini” that the base model might have missed or miscategorized.
- Meme Creation: Jumped from 49.37 to 64.46.
- Shitposting: This is a notoriously hard category because it relies on irony. The combination prompt raised it from 34.56 to 42.03.
For ESC (Eurovision):
- Audiencing: This practice involves live-tweeting the event. The score skyrocketed from 44.14 to 70.17 with the combined prompt.
- Expressing Emotions: Jumped from 44.68 to 64.30.
This data suggests that Chain-of-Thought helps the model “think through” the emotional state of the user, while Markers help it identify the specific topic.
Error Analysis: Where do LLMs Fail?
Despite the success, the models weren’t perfect. The researchers analyzed the confusion matrices to see where GPT-4 was making mistakes.

Figure 2 shows the confusion matrices. The diagonal cells represent correct predictions; the distinct spots off the diagonal represent errors.
- Humor and Sarcasm: In the NAFO matrix (top), notice the confusion around “Shitposting.” The model often misclassified it as “Arguing.”
  - Example: A user tweets, “I can’t see you riding a bike!” to mock a leader backpedaling. The model might take this literally or just see it as an argument, missing the “shitposting” humor.
- Stance Detection: A major issue was distinguishing between pro-Ukrainian users and pro-Russian trolls. If a Russian troll used NAFO-like language to mock the movement, the model sometimes classified it as valid “Shitposting” or “Arguing” rather than “Not Applicable.” The model struggled to detect the political stance of the speaker without explicit instruction.
- Overlapping Practices: In the ESC matrix (bottom), there is confusion between “Expressing Solidarity” and “Community Imagining.”
  - Example: “We stand with Ukraine! Go UK!” contains both solidarity and national community imagining. Human coders struggled with this too, and the model often picked the “wrong” primary label.
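For readers who want to run a similar diagnosis on their own predictions, here is a minimal sketch of producing such a confusion matrix with scikit-learn. The labels are made up; they are not the paper's data.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Made-up gold labels and predictions illustrating the error patterns above.
gold = ["Shitposting", "Arguing", "Shitposting", "Expressing solidarity", "Not Applicable"]
pred = ["Arguing", "Arguing", "Shitposting", "Community imagining", "Shitposting"]

# Off-diagonal cells (e.g., Shitposting predicted as Arguing) correspond
# to the confusions discussed above.
ConfusionMatrixDisplay.from_predictions(gold, pred, xticks_rotation=45)
plt.tight_layout()
plt.show()
```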
Conclusion and Implications
This study provides a roadmap for analyzing online communities without spending years manually reading tweets. The authors demonstrated that while LLMs are not perfect sociologists, they are incredibly capable assistants when guided correctly.
Key Takeaways:
- Context is King: You cannot just ask an LLM to “classify this tweet.” You must provide the “social code”—the markers, slang, and rules of the community.
- Prompting as Programming: The significant performance boost from PD+COT+MPE prompts proves that domain expertise (knowing the community) can be effectively transferred to the model via prompt engineering.
- The “Human” Touch: The best results came from replicating the human analytical process (Said -> Meant -> Done) inside the model via Chain-of-Thought.
The Future of Practice Mapping: This methodology opens the door for “Practice Mapping” at scale. Researchers could potentially track how a community’s practices change over time, how radicalization happens, or how humor drives political engagement.
However, the difficulty with sarcasm and stance detection serves as a warning. We are not yet at the point where we can fully automate the moderation of complex political speech. The ambiguity of human language—where a joke can be a weapon and a song contest can be a geopolitical battlefield—still requires human oversight. But with tools like GPT-4 and clever prompting, we are getting closer to computers that understand not just what we say, but what we do with our words.