Imagine you are learning a new language. You pick up a newspaper, start reading, and suddenly hit a wall. There is a word you just don’t understand. It disrupts your flow and comprehension. Now, imagine a computer system that could scan that text before you read it, identify those difficult words, and automatically replace them with simpler synonyms.
This is the goal of Lexical Simplification, and its first, most crucial step is Complex Word Identification (CWI).
For years, researchers have built specialized machine learning models to detect which words are “complex” for non-native speakers. But recently, the Artificial Intelligence landscape shifted tectonically with the arrival of Large Language Models (LLMs) like GPT-4 and Llama.
In this post, we are going to tear down a fascinating research paper titled “Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups.” We will explore whether these massive, general-purpose “brains” can outperform specialized, lightweight models at judging word difficulty, or if they are just over-engineered hammers looking for a nail.
The Problem: What Makes a Word “Complex”?
Before we look at the models, we need to define the task. Complexity is subjective. To a native English speaker, the word “ubiquitous” might be standard; to a beginner, it’s alien.
The research focuses on two specific variations of this problem:
- Complex Word Identification (CWI): A binary classification task. Is this word complex? (Yes/No).
- Lexical Complexity Prediction (LCP): A probabilistic task. How complex is this word on a continuous scale from 0 to 1?
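To make the distinction concrete, here is a toy example (the field names are illustrative, not any dataset’s actual schema) showing how the same target word might be annotated under each formulation:

```python
# Hypothetical annotation of one target word under both task formulations.
# Field names and values are illustrative only.
sample = {
    "sentence": "The ubiquitous use of smartphones has changed communication.",
    "target": "ubiquitous",
    "cwi_label": 1,      # CWI: binary -> 1 = complex, 0 = not complex
    "lcp_score": 0.62,   # LCP: continuous complexity score in [0, 1]
}
```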
Previous approaches relied on feature engineering—manually calculating word length, syllable count, or frequency in a corpus—fed into algorithms like Random Forests or LSTMs. This paper investigates whether we can ditch the manual features and just ask an LLM: “Is this word hard?”
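For context, a classic feature-based baseline of this kind might look roughly like the sketch below (the specific features, frequency table, and toy data are my own illustration, not a system from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hand-crafted features of the kind older CWI systems relied on (illustrative).
def word_features(word, corpus_freq):
    return [
        len(word),                           # word length
        sum(ch in "aeiouy" for ch in word),  # crude syllable proxy: vowel count
        corpus_freq.get(word.lower(), 0),    # frequency in a reference corpus
    ]

corpus_freq = {"cat": 5000, "ubiquitous": 12}   # toy frequency table
X = np.array([word_features(w, corpus_freq) for w in ["cat", "ubiquitous"]])
y = np.array([0, 1])                            # 0 = simple, 1 = complex

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.predict([word_features("dog", corpus_freq)]))  # -> likely [0]
```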
The Methodology: Teaching LLMs to Grade Difficulty
The researchers tested a variety of models, ranging from open-source options like Llama 2, Llama 3, and Vicuna, to closed-source giants like ChatGPT (GPT-3.5) and GPT-4.
But you can’t simply ask a text-generation model to output a calibrated numerical score. The researchers had to design a clever evaluation protocol.
From Text to Numbers
Since LLMs are designed to generate text, not regression scores, the authors used a Likert scale mapping. They prompted the models to classify words into five discrete categories: Very Easy, Easy, Neutral, Difficult, or Very Difficult.
To get a precise numerical score (for the LCP task), they didn’t just take the first answer. Instead, they exploited the probabilistic nature of LLMs: by running inference multiple times at a higher “temperature” (which increases randomness), they obtained a distribution of answers.
They calculated the final complexity score using the expectation formula:

\[
\text{complexity} = \sum_{s \in S} s \cdot p(s)
\]

Here, \(p(s)\) is the probability of the model outputting a specific score \(s\) (mapped from the Likert scale, e.g., Easy = 0.25, Difficult = 0.75), and \(S\) is the set of five mapped score values. This allows a text-generating model to produce a nuanced, continuous complexity score.
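Here is a minimal sketch of this sampling-and-averaging idea. It assumes a generic `generate` callable that returns one Likert label per call; the function name, temperature value, and label-to-score mapping are illustrative, not the paper’s exact implementation:

```python
from collections import Counter

# Map the five Likert labels to numeric scores in [0, 1] (assumed mapping).
LIKERT_TO_SCORE = {
    "very easy": 0.0,
    "easy": 0.25,
    "neutral": 0.5,
    "difficult": 0.75,
    "very difficult": 1.0,
}

def expected_complexity(generate, prompt, n_samples=20):
    """Estimate a continuous LCP score as E[s] = sum_s s * p(s),
    where p(s) is approximated by repeated sampling at temperature > 0."""
    counts = Counter()
    for _ in range(n_samples):
        label = generate(prompt, temperature=0.7).strip().lower()
        if label in LIKERT_TO_SCORE:   # ignore malformed outputs
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return None                    # the model never produced a valid label
    return sum(LIKERT_TO_SCORE[l] * c / total for l, c in counts.items())
```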
The Evaluation Pipeline
The researchers didn’t just type questions into ChatGPT. They built a robust pipeline to handle different prompting strategies and model architectures.

As shown in Figure 4, the system processes datasets (like CWI 2018 or CompLex LCP 2021) through specific templates. They tested three main strategies (sketched in code after this list):
- Zero-Shot: Asking the model directly without examples.
- Few-Shot: Giving the model a few examples of complex and non-complex words before asking the question.
- Chain-of-Thought (CoT): Asking the model to explain its reasoning (provide a proof) before giving the final label.
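Below is a rough sketch of what these three prompting strategies might look like as templates. The wording is my own paraphrase, not the paper’s exact prompts:

```python
ZERO_SHOT = (
    "Sentence: {sentence}\n"
    "How difficult is the word '{target}' for a non-native speaker? "
    "Answer with one of: very easy, easy, neutral, difficult, very difficult."
)

FEW_SHOT = (
    "Sentence: The cat sat on the mat.\n"
    "Word: 'cat' -> very easy\n"
    "Sentence: The physician prescribed an anticoagulant.\n"
    "Word: 'anticoagulant' -> very difficult\n"
    "Sentence: {sentence}\n"
    "Word: '{target}' ->"
)

CHAIN_OF_THOUGHT = (
    "Sentence: {sentence}\n"
    "First explain, step by step, why the word '{target}' may or may not be "
    "hard for a non-native speaker. Then give a final label from: "
    "very easy, easy, neutral, difficult, very difficult."
)

prompt = ZERO_SHOT.format(
    sentence="The results were ubiquitous across all trials.",
    target="ubiquitous",
)
```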
Meta-Learning: Learning to Learn
The paper goes a step further by exploring Meta-Learning. This is a technique where a model learns an “initialization” that is good at adapting to new tasks quickly. They used an algorithm called FOMAML (First-Order Model-Agnostic Meta-Learning).

Essentially, they trained the model on 45 diverse tasks from the “BIG-bench” benchmark (tasks that require reasoning, logic, etc.) to prime it, hoping this “intrinsic knowledge” would transfer to the specific task of identifying complex words.
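To give a feel for the algorithm, here is a condensed FOMAML training step in PyTorch-style code. The task sampler and the model’s `.loss(batch)` method are placeholders I am assuming for the sketch, and details like batching and learning-rate schedules are simplified away:

```python
import copy
import torch

def fomaml_step(model, tasks, inner_lr=1e-3, meta_lr=1e-4, inner_steps=3):
    """One meta-update with First-Order MAML.

    `tasks` is a list of (support_batch, query_batch) pairs, and `model`
    is assumed to expose a .loss(batch) method returning a scalar loss.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for support, query in tasks:
        # Inner loop: adapt a copy of the model on the task's support set.
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            fast.loss(support).backward()
            opt.step()

        # Outer gradient: evaluate the adapted copy on the query set and,
        # in the first-order approximation, treat its gradients as gradients
        # with respect to the original parameters.
        fast.zero_grad()
        fast.loss(query).backward()
        for g, p in zip(meta_grads, fast.parameters()):
            g += p.grad / len(tasks)

    # Meta-update on the original model.
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g
```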
Experimental Setup
The researchers didn’t limit themselves to English. They evaluated the models in a multilingual setting (English, German, Spanish) and a multidomain setting (News, Wikipedia, Bible, Biomedical).
Here is a breakdown of the datasets used:

And here are the specific model checkpoints they pitted against each other:

Key Results: The Reality Check
So, did the LLMs crush the competition? The answer is nuanced: No, not out of the box.
1. Zero-Shot Struggle vs. Fine-Tuning Success
In the zero-shot setting (where the model sees no examples), LLMs generally performed worse than older, lighter baseline models (like standard Random Forests or RoBERTa-based ensembles).
However, Fine-tuning changed the game. When the authors explicitly trained the LLMs on the CWI datasets, the performance skyrocketed, becoming competitive with state-of-the-art methods.
Let’s look at the confusion matrices for Llama 2 7B to visualize this improvement.

In the image above, look at the difference between (a) Zero-shot and (b) Fine-tune.
- In Zero-shot (a), the model’s predictions are scattered across the categories, and it frequently misclassifies words.
- In Fine-tune (b), the diagonal (correct predictions) becomes much darker and more defined. The model learned the specific boundary of what constitutes “complexity” for this dataset.
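If you are wondering what “fine-tuning on the CWI datasets” might look like in practice, here is one plausible, simplified setup using Hugging Face Transformers. The checkpoint name, hyperparameters, prompt format, and toy training data are assumptions for illustration, not the paper’s exact configuration:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

train_examples = [  # tiny illustrative training set
    {"sentence": "The cat sat on the mat.", "target": "cat", "label": "very easy"},
    {"sentence": "He suffered an intracranial hemorrhage.", "target": "intracranial", "label": "very difficult"},
]

class CwiDataset(Dataset):
    """Prompt plus gold Likert label, serialized as plain text (illustrative format)."""
    def __init__(self, examples):
        self.texts = [
            f"Sentence: {ex['sentence']}\nWord: {ex['target']}\nComplexity: {ex['label']}"
            for ex in examples
        ]
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, i):
        return tokenizer(self.texts[i], truncation=True, max_length=256)

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cwi-llama", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=CwiDataset(train_examples),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```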
2. The “Safe Bet” Bias (Probability Distribution)
One of the most interesting findings is how LLMs distribute their predictions. Humans know that some words are “Very Difficult.” LLMs, however, seem afraid to use the extreme ends of the scale.

In Figure 1 (shown above), examine the Fine-tuned row (b). You will notice the model almost never predicts a probability in the 0.8–1.0 range (Very Difficult), even though the dataset contains such words. The models prefer to hedge their bets, clustering around “Easy” or “Neutral.” This “safety bias” prevents them from achieving perfect correlation with human annotators, who are more willing to label a word as extremely hard.
3. The Hallucination Problem
A major risk with LLMs is hallucination. In this context, hallucination doesn’t just mean making up facts; it means the model fails to follow the instruction. For example, the model might analyze the wrong word in the sentence, or rewrite the sentence entirely.

Table 5 reveals a clear trend: Smaller models (like Llama-2-7b) hallucinate significantly more than larger ones (like GPT-4).
- Llama-2-7b-chat (Zero-shot) had a word error rate of 3.8% on the CompLex dataset.
- GPT-4 had 0%.
Interestingly, Few-shot prompting (giving examples) drastically reduced these hallucinations for the open-source models. It effectively “grounded” the models, reminding them of the exact format required.
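One simple way to catch this kind of instruction-following failure (my own illustration, not the paper’s exact procedure) is to validate each reply before scoring it:

```python
VALID_LABELS = {"very easy", "easy", "neutral", "difficult", "very difficult"}

def is_hallucinated(reply: str, target_word: str) -> bool:
    """Flag replies that discuss the wrong word or skip the required label."""
    reply_lower = reply.lower()
    # The reply must actually mention the word we asked about...
    mentions_target = target_word.lower() in reply_lower
    # ...and must contain one of the allowed Likert labels.
    has_valid_label = any(label in reply_lower for label in VALID_LABELS)
    return not (mentions_target and has_valid_label)

# Example: the model analyzed a different word entirely.
print(is_hallucinated("The word 'amplitude' is difficult.", "ubiquitous"))  # True
```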
4. Chain-of-Thought: Does Reasoning Help?
The researchers tested whether asking the model to provide a “proof” (reasoning) improved accuracy.

In the table above, you can see the model explaining why “ft” (feet) is complex: “The abbreviation ‘ft’… may be challenging for beginners.”
While the reasoning looks sound to a human, the quantitative results showed mixed benefits. CoT improved performance in some zero-shot settings but didn’t always beat standard fine-tuning. Sometimes, the model’s reasoning was flawed—it would invent a justification for a wrong label, reinforcing its own error.
Discussion: The Cost of Intelligence
The paper concludes with a critical realization regarding efficiency.
State-of-the-art baselines (like DeepBlueAI or RoBERTa ensembles) are relatively small models. They run fast and require less hardware.
In contrast, the LLMs evaluated here are massive. Llama 2 13B has 13 billion parameters, and GPT-4 is even larger and costs money per token. Despite this massive computational overhead, the LLMs barely outperformed (and often underperformed) the smaller, specialized models unless they were heavily fine-tuned.
Meta-Learning Insights
The meta-learning experiments (using FOMAML) were computationally expensive and yielded results that were comparable to, but not significantly better than, standard fine-tuning or prompt-tuning. This suggests that for the specific task of CWI, generic “reasoning” training on the BIG-bench dataset might not transfer perfectly to the nuance of lexical complexity.
Conclusion and Future Outlook
This research provides a vital “reality check” in the era of Generative AI. While LLMs are incredibly versatile, they are not magic wands that solve every NLP task instantly.
Key Takeaways:
- Fine-tuning is essential: You cannot rely on zero-shot LLMs for reliable complex word identification. They need to be taught the specific threshold of difficulty.
- Size matters for stability: Larger models (GPT-4) follow instructions better and hallucinate less than smaller open-source models (Llama 7B).
- Efficiency wins: If your goal is purely CWI, existing lightweight models are still the most efficient choice. You don’t need a sledgehammer (LLM) to crack a nut (identify a hard word).
- The “Average” Bias: LLMs struggle to commit to extreme labels (“Very Difficult”), which limits their accuracy in probabilistic LCP tasks.
For students and researchers, this paper highlights that while LLMs are powerful tools, rigorous evaluation and comparison against traditional baselines are necessary to determine if they are actually the right tool for the job. Future work lies in reducing hallucinations and perhaps using LLMs not as the judges, but as the generators of simplified text once the complex words have been identified by more specialized models.