Language is a living, breathing entity. It shifts, morphs, and reinvents itself daily. If you spend any time on social media, you know that a phrase like “The sunset is beautiful, isn’t it?” can mean something entirely different today than it did ten years ago. To a human, context makes the meaning clear. To a Large Language Model (LLM), however, these shifts are a nightmare.

LLMs are typically trained on static datasets with a specific “knowledge cutoff.” This means they are often frozen in time. When a new slang term, meme, or metaphorical expression explodes onto the scene after the model was trained, the AI is left guessing. It might interpret “The Winter Arc” as a literal trajectory of the sun rather than a period of self-improvement, or miss the subtle heartbreak hidden in a seemingly mundane sentence.

In a recent paper titled “SLANG: New Concept Comprehension of Large Language Models,” researchers tackle this problem head-on. They introduce a new benchmark (SLANG) to test how well models adapt to new words, and a novel method (FOCUS) that uses causal inference to help models figure out what new phrases mean without expensive retraining.

Figure 1: Comparative analysis of LLMs’ understanding of new phrases. On the left, standard Chain-of-Thought interpretation fails to grasp the metaphor. On the right, the FOCUS method correctly identifies the deeper meaning.

As shown in Figure 1, standard prompting methods often result in literal, surface-level interpretations. The researchers’ proposed method, FOCUS, digs deeper to find the metaphorical truth. In this post, we will break down how they achieved this, exploring the construction of their unique dataset and the mechanics of their causal inference engine.

The Problem with Static Models in a Dynamic World

The core issue is the disconnect between the static nature of AI training and the dynamic nature of human language. Traditional methods to fix this involve:

  1. Continual Retraining: Constantly feeding the model new data. This is computationally expensive and slow.
  2. Retrieval-Augmented Generation (RAG): Giving the model access to external search results. While useful for facts, RAG often struggles with the nuance of slang. It might find a definition, but fail to apply it correctly in a specific emotional context.

Furthermore, LLMs are prone to taking “shortcuts.” They rely on superficial correlations they learned during training. If a sentence contains the word “sunset,” the model biases its output toward nature or weather, ignoring the surrounding context that might imply a breakup or a philosophical realization.

To solve this, the researchers needed two things: a way to measure the problem (SLANG) and a way to fix it (FOCUS).

Part 1: SLANG – A Benchmark for the New Wave

To test whether an AI understands new concepts, you cannot use concepts it has already seen. You need data that is genuinely novel. The researchers turned to Urban Dictionary, a crowdsourced dictionary known for capturing real-time language evolution.

Building the Dataset

Creating the SLANG benchmark wasn’t as simple as scraping the website. The team employed a rigorous filtering process to ensure quality and novelty (see the sketch after this list):

  • Temporal Filtering: They selected entries added after the training cutoff dates of major models (like GPT-4). This ensures the AI hasn’t memorized the answer.
  • Quality Control: Urban Dictionary is full of noise. The team used upvote/downvote ratios to filter out low-quality or offensive definitions.
  • Novelty Checks: They ran “Needle in a Haystack” tests to confirm the models genuinely didn’t know these terms.
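
To make the filtering concrete, here is a minimal sketch of what such a pipeline might look like. The field names (`date_added`, `upvotes`, `downvotes`), cutoff date, and thresholds are illustrative assumptions, not the paper’s exact criteria, and the “Needle in a Haystack” novelty check is omitted.

```python
from datetime import date

# Hypothetical cutoff and quality thresholds -- illustrative only.
KNOWLEDGE_CUTOFF = date(2023, 4, 30)
MIN_UPVOTE_RATIO = 0.7
MIN_TOTAL_VOTES = 50

def keep_entry(entry: dict) -> bool:
    """Apply temporal and quality filters to a raw crowdsourced entry."""
    # Temporal filtering: keep only terms added after the model's knowledge cutoff.
    if entry["date_added"] <= KNOWLEDGE_CUTOFF:
        return False
    # Quality control: use community votes as a proxy for a reliable definition.
    total = entry["upvotes"] + entry["downvotes"]
    if total < MIN_TOTAL_VOTES:
        return False
    return entry["upvotes"] / total >= MIN_UPVOTE_RATIO

raw_entries = [
    {"term": "The Winter Arc", "date_added": date(2023, 10, 1), "upvotes": 940, "downvotes": 120},
    {"term": "old joke", "date_added": date(2019, 5, 2), "upvotes": 300, "downvotes": 20},
]
benchmark = [e for e in raw_entries if keep_entry(e)]
print([e["term"] for e in benchmark])  # -> ['The Winter Arc']
```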

Figure 2: Comparative histograms illustrating the distribution of upvotes before and after data cleaning.

As Figure 2 illustrates, the raw data (blue) was noisy. The cleaning process (orange) normalized the distribution, ensuring that the benchmark consisted of high-quality, widely accepted community terms rather than obscure inside jokes.

Diversity of Language

One of the strengths of SLANG is its variety. Internet speak isn’t just abbreviations (like “LOL”). It includes complex metaphors, gaming jargon, and pop culture references.

Figure 6: Bar chart showing the distribution of internet slang categories in the dataset, including abbreviations, metaphors, and technical terms.

The researchers categorized their final dataset (Figure 6) to ensure it covered a broad spectrum of linguistic shifts. From “The Winter Arc” (metaphor) to new gaming terminology, the dataset challenges the model to understand various forms of expression.

Crucially, they created two versions of the data (an illustrative example follows the list):

  1. Factual: Real-world examples and definitions.
  2. Counterfactual: Modified examples where the context is changed to mean something else. This tests if the model is actually reasoning or just guessing based on keywords.
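
As a rough illustration of this pairing, a single benchmark item might look like the sketch below. The field names and the counterfactual scenario are assumptions for illustration, not the paper’s actual schema; the phrase and factual context reuse the example from earlier in this post.

```python
# Illustrative only: field names and contents are assumptions, not the paper's schema.
item = {
    "phrase": "The sunset is beautiful, isn't it?",
    "factual": {
        "context": "Adrian texts Jamie after contemplating irreconcilable differences...",
        "meaning": "A gentle, indirect way of saying goodbye to the relationship.",
    },
    "counterfactual": {
        # The context is rewritten so the same phrase must be read differently.
        "context": "Two painters compare notes on capturing evening light for a commission.",
        "meaning": "A literal remark about the evening sky as a subject to paint.",
    },
}
```

A model that merely keys on the word “sunset” will give the same answer for both versions; a model that actually reasons from context will not.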

Part 2: FOCUS – Understanding via Causal Inference

This is where the paper introduces its most significant technical contribution. FOCUS (Factual cOntext CaUsal analysiS) is a method designed to make LLMs “think” more like a detective.

Instead of just predicting the next word, FOCUS uses Causal Inference. It tries to disentangle the context from the phrase to understand the true relationship between them.

The Structural Causal Model (SCM)

To understand FOCUS, we have to look at how LLMs usually process information versus how we want them to process it.

Figure 3: Structural Causal Model diagram showing relationships between Context (X), Phrase (W), and Explanation (Y).

In a standard interaction (Figure 3), the model (\(\mathcal{M}\)) takes a Phrase (\(W\)) and a Context (\(X\)) to produce an Explanation (\(Y\)).

  • X (Context): “Adrian texts Jamie after contemplating irreconcilable differences…”
  • W (Phrase): “The sunset is beautiful, isn’t it?”
  • Y (Explanation): The output meaning.

The problem is that the model often ignores the subtle causal link between \(X\) and the meaning, and over-relies on the literal meaning of \(W\).
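
In causal notation, a simplified reading of this setup (our sketch, not the paper’s exact formulation) might be written as:

```latex
% X: context, W: phrase, Y: explanation, \mathcal{M}: the language model
\[
  Y = \mathcal{M}(X, W)
\]
% Shortcut behavior: the output leans on the literal phrase and under-uses the context,
% i.e. the model behaves roughly as if
\[
  P(Y \mid X, W) \approx P(Y \mid W).
\]
```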

The Intervention

To fix this, FOCUS applies a “do-operation”—a concept from causal theory where you intervene in a system to see what changes.

Figure 4: SCM analysis diagram showing how interventions (do-operations) replace specific variables to isolate causal effects.

As shown in Panel (d) of Figure 4, the method mathematically intervenes by substituting parts of the context and the phrase. The goal is to calculate the probability of the explanation \(Y\) given the context \(X\), while removing the bias introduced by the literal words in the phrase \(W\).

The mathematical formulation for this intervention is captured below. While the equation might look intimidating, it essentially says: calculate the probability of the meaning, considering the context and linguistic factors, but treating the specific phrase as a variable we can control.

Equation representing the causal effect calculation after the do-operation intervention.
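
As a hedged sketch of what the equation expresses (not a verbatim reproduction of the paper’s formula), the key move is to contrast the ordinary conditional with an interventional one:

```latex
% Observational reading: condition on the phrase as written, inheriting its literal bias.
\[
  P(Y \mid X, W = w)
\]
% Interventional reading: set the phrase via a do-operation (masking or replacement),
% so the explanation must be driven by the context X.
\[
  P\bigl(Y \mid X, \mathrm{do}(W = w')\bigr)
\]
```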

The 4-Stage Pipeline

You don’t need to be a mathematician to understand how FOCUS works in practice. The researchers translated their causal theory into a four-stage pipeline that prompts the LLM in a specific sequence.

Figure 5: The four-stage pipeline of FOCUS: Direct Inquiry, Masked Entity Inquiry, Entity Replacement Inquiry, and Synthesis.

Let’s walk through the stages shown in Figure 5 using the example of “The Winter Arc” (a minimal code sketch follows the walkthrough).

  1. Direct Inquiry (DI): The model is asked to interpret the phrase normally.
  • Result: Often a literal or generic guess.
  2. Masked Entity Inquiry (MEI): The specific slang phrase is hidden (masked). The model must look at the rest of the sentence (the context) and guess what the missing phrase should mean.
  • Logic: This forces the model to rely entirely on context (\(X\)), ignoring the literal bias of the phrase (\(W\)).
  3. Entity Replacement Inquiry (ERI): The model replaces other nouns or entities in the sentence with similar ones (e.g., changing “gym” to “library” or “winter” to “night”).
  • Logic: This tests robustness. If the meaning holds up even when the scenery changes, the model has found the core concept.
  4. Synthesis (SY): The model combines the insights from the Direct, Masked, and Replacement steps. It reconciles the literal meaning with the context-derived meaning to produce a final, accurate explanation.
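
Here is a minimal sketch of how these four stages could be orchestrated around a generic chat-completion function. The prompts and the `complete` callable are illustrative assumptions; the paper’s exact prompt templates are not reproduced here, and we assume the phrase appears verbatim inside the context string.

```python
from typing import Callable

def focus_pipeline(context: str, phrase: str, complete: Callable[[str], str]) -> str:
    """Run the four FOCUS-style stages with a generic LLM completion function."""
    # 1. Direct Inquiry: ask for an interpretation of the phrase as-is.
    direct = complete(
        f"Context: {context}\nWhat does the phrase '{phrase}' mean here?"
    )
    # 2. Masked Entity Inquiry: hide the phrase so only the context can drive the answer.
    masked = complete(
        f"Context: {context.replace(phrase, '[MASKED PHRASE]')}\n"
        "Judging only from the context, what would the masked phrase most plausibly mean?"
    )
    # 3. Entity Replacement Inquiry: swap surrounding entities to test robustness.
    replaced = complete(
        f"Rewrite the context with different but comparable entities, then re-interpret "
        f"'{phrase}':\n{context}"
    )
    # 4. Synthesis: reconcile the three partial readings into one explanation.
    return complete(
        "Combine the following analyses into a single, context-faithful explanation "
        f"of '{phrase}':\n1) Direct: {direct}\n2) Masked: {masked}\n3) Replaced: {replaced}"
    )
```

In practice, `complete` would wrap whichever chat API is available. The point is that FOCUS is a prompting strategy: no weights are updated, which is exactly why it sidesteps the retraining cost discussed earlier.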

Experimental Results

Does this complex causal reasoning actually work better than just asking GPT-4 to “think step by step”? The results suggest a resounding yes.

The researchers tested FOCUS against several baselines:

  • Direct: Standard prompting.
  • CoT: Chain-of-Thought (asking the model to explain its reasoning).
  • CauView: A previous causal method.

Performance on Factual Data

Table 4: Performance results on the factual dataset. FOCUS outperforms Direct and CoT methods across almost all metrics.

In Table 4, we see the results on the standard SLANG dataset.

  • GPT-4: The FOCUS method achieved an F1 score of 0.4446, significantly higher than Chain-of-Thought (0.4123) and standard prompting (0.2308).
  • Claude 3 Opus: The results were even more impressive, with FOCUS reaching an F1 score of 0.4596 and an accuracy of nearly 90%.

The “Accuracy” column is particularly telling. On GPT-4, standard prompting only got the definition right 47% of the time. FOCUS bumped that up to 88.2%. This is the difference between a model that is guessing and a model that understands.

Performance on Counterfactual Data

The true test of intelligence is adaptability. In the counterfactual experiments, the researchers tweaked the contexts to force new meanings.

Table 5: Performance results on the counterfactual dataset. FOCUS maintains high accuracy even when contexts are manipulated.

As Table 5 shows, FOCUS continued to dominate. While standard methods fell apart (Mistral-7B’s Direct accuracy dropped to 20%), models using FOCUS maintained high accuracy (Mistral-7B jumped to 77.5% with FOCUS). This proves that the method isn’t just memorizing definitions; it is actively analyzing the context to derive meaning.

Why It Works (Ablation Study)

The researchers also performed an “ablation study,” which involves removing parts of the pipeline to see what breaks.

  • Removing the Masked Entity Inquiry (MEI) caused a significant drop in performance. This confirms that forcing the model to look at the context without the phrase is the key to breaking literal bias.
  • Removing Entity Replacement (ERI) also hurt performance, showing that checking for robustness helps refine the final answer.

Conclusion and Implications

The “SLANG” paper highlights a critical limitation in current AI: the inability to keep up with the speed of human culture. However, it also provides a roadmap for the future.

By using FOCUS, we don’t necessarily need to retrain massive models every time a new meme goes viral. Instead, we can use smarter inference techniques—specifically Causal Inference—to help the model reason its way to the correct answer using context clues.

This approach has broad implications beyond just slang. It could help models understand:

  • Industry-specific jargon in legal or medical documents.
  • Evolving political or social discourse.
  • Ambiguous instructions in complex tasks.

The takeaway is clear: To build AI that truly understands us, we need to teach it not just to read the words, but to understand the cause behind them. As our language evolves, our AI must evolve with it—not just by learning more data, but by thinking better.