Introduction

Imagine you are learning a new language. You’ve mastered the basics, and you want to practice reading. You pick up a news article, but the vocabulary is too dense. You try a children’s book, but the grammar is surprisingly complex. This frustration is a common hurdle in second language acquisition, and it highlights a critical task in Natural Language Processing (NLP): Readability Assessment.

Readability assessment is the automated process of determining how difficult a text is to comprehend. For decades, this field relied on simple statistics—counting syllables per word or words per sentence. Today, Large Language Models (LLMs) promise to revolutionize this by “reading” the text and understanding its semantic depth.

However, there is a problem. Most existing benchmarks for training these AI models are severely limited. They tend to focus almost exclusively on English, rely on specific domains like Wikipedia, or use entire documents rather than specific sentences. This leaves a massive gap: How do we know if a model can judge the readability of a Hindi poem, a French legal contract, or an Arabic tweet?

Enter README++, a groundbreaking new dataset and benchmark introduced by researchers from Georgia Tech. This paper presents a massive step forward, offering a multilingual, multi-domain resource that challenges how we evaluate language models.

Figure 1: Language distribution per each domain in README++. Example sentences from each language are shown along with their human-annotated readability levels on a 6-point scale.

In this post, we will break down the README++ paper, exploring how the dataset was built, the novel “Rank-and-Rate” annotation method, and the surprising insights revealed when modern LLMs were put to the test across five different languages.

Background: The Readability Gap

To understand why README++ is necessary, we first need to look at the state of the field.

Traditionally, readability was measured by formulas like the Flesch-Kincaid Grade Level. These formulas assume that long sentences and long words equal “hard” text. While useful, they fail to capture nuance. A short sentence using obscure jargon can be far harder to read than a long sentence using simple words, but a formula may say otherwise.
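To make that concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level computation. The syllable counter is a crude vowel-group heuristic assumed purely for illustration; real implementations use pronunciation dictionaries or better rules.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels as syllables.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Short legal jargon scores as "easier" than a longer everyday sentence.
print(flesch_kincaid_grade("The lien on the deed was void."))                                 # ≈ -1
print(flesch_kincaid_grade("I went to the shop because we had run out of bread and milk."))   # ≈ 3
```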

Neural networks and LLMs have moved us beyond simple counting. They can analyze context and meaning. However, a model is only as good as the data it is trained on.

The Limitations of Current Datasets

Prior to README++, the landscape of sentence-level readability datasets suffered from three main issues:

  1. Domain Bias: Most datasets scrape Wikipedia or news articles. Models trained on this data struggle to generalize to other types of text, like technical manuals, social media, or dialogue.
  2. Anglocentrism: There is a scarcity of high-quality labeled data for languages other than English, particularly those using non-Latin scripts like Arabic, Hindi, or Russian.
  3. Label Quality: Many datasets use arbitrary scales (e.g., 1-100) or document-level labels applied loosely to sentences, which introduces noise.

README++ was designed to fill these voids. It specifically targets the CEFR (Common European Framework of Reference for Languages) scale, which rates proficiency from A1 (Beginner) to C2 (Mastery). This grounds the AI’s predictions in a real-world educational standard.
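Throughout the rest of this post, those six levels are treated as a 1-6 numeric scale. The exact encoding used in the released data files is an assumption here, but the ordering itself is the standard CEFR one:

```python
# CEFR proficiency levels mapped onto the 6-point scale used for annotation
# and modeling (assumed encoding; the ordering is standard CEFR).
CEFR_TO_SCORE = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
SCORE_TO_CEFR = {v: k for k, v in CEFR_TO_SCORE.items()}

print(SCORE_TO_CEFR[4])  # "B2"
```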

Core Method: Constructing README++

The researchers didn’t just scrape the web and call it a day. They undertook a rigorous construction process to ensure diversity and quality.

1. Unprecedented Diversity

The dataset consists of 9,757 sentences across five languages: Arabic, English, French, Hindi, and Russian.

What makes this truly unique is the source material. The authors collected text from 112 different data sources covering a wide array of domains.

Table 2: List of domains and example data sources in README++.

As shown in the table above, the domains range from Literature and Legal texts to User Reviews, Captions, and even Jokes. This ensures that a model trained on README++ isn’t just learning “Wikipedia style” but is learning the intrinsic properties of difficult text, regardless of the format.

To visualize the difference in coverage, look at the comparison below between README++ and a previous standard dataset, CEFR-SP.

Figure 2: Distribution of sentence lengths across readability levels in the English portion of README++, compared with CEFR-SP.

In Figure 2, notice the spread. The previous dataset (CEFR-SP) clusters tightly around specific lengths and readability scores. README++ (the blue line) spans a much wider range of sentence lengths and difficulty levels, providing a more realistic picture of language as it is actually used.

2. The “Rank-and-Rate” Annotation System

One of the hardest parts of building a subjective dataset is getting humans to agree. If you ask three people to rate a sentence from 1 to 6, you often get three different answers based on their personal biases.

To solve this, the authors utilized a Rank-and-Rate approach.

Instead of showing an annotator one sentence in isolation, they showed them a batch of 5 sentences. The annotators were asked to:

  1. Rank the 5 sentences from easiest to hardest relative to each other.
  2. Rate each sentence individually on the CEFR scale.

Figure 16: Screenshot of the annotation interface developed for rating the readability of English sentences. Annotators first rank the sentences by readability level by simply dragging the boxes.

This method anchors the annotator’s judgment. By comparing sentences, the relative difficulty becomes clearer, leading to much higher agreement and quality in the final labels.
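As a rough illustration of what each annotation batch carries, the record below stores both the relative ranks and the absolute CEFR ratings, and checks that the two judgments agree. The field names and the consistency check are assumptions for this sketch, not the authors' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class RankAndRateBatch:
    sentences: list[str]   # sentences shown together (5 per batch in the paper)
    ranks: list[int]       # relative order: 1 = easiest ... N = hardest
    ratings: list[int]     # absolute CEFR rating per sentence: 1 (A1) ... 6 (C2)

    def is_consistent(self) -> bool:
        # A sentence ranked harder should never receive a lower CEFR rating
        # than a sentence ranked easier within the same batch.
        order = sorted(range(len(self.sentences)), key=lambda i: self.ranks[i])
        in_rank_order = [self.ratings[i] for i in order]
        return all(a <= b for a, b in zip(in_rank_order, in_rank_order[1:]))

batch = RankAndRateBatch(
    sentences=[
        "The dog runs fast.",
        "Despite the rain, we decided to leave early.",
        "Notwithstanding prior covenants, the lessee shall indemnify the lessor.",
    ],  # three shown here for brevity
    ranks=[1, 2, 3],
    ratings=[1, 3, 6],
)
print(batch.is_consistent())  # True
```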

Benchmarking: How do Models Perform?

With the dataset constructed, the authors moved to the experimental phase. They wanted to answer a crucial question: How well do current state-of-the-art models understand readability across these languages and domains?

They tested three distinct approaches:

  1. Supervised Learning: Fine-tuning models like BERT, XLM-R, and mT5 specifically on the README++ training data.
  2. Few-Shot Prompting: Asking LLMs like GPT-4 and Llama 2 to rate sentences after seeing just a few examples.
  3. Unsupervised Methods: Using mathematical formulas derived from model probabilities without explicit training.
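For the supervised approach, a minimal sketch of what fine-tuning could look like with Hugging Face Transformers is shown below: XLM-R with a single regression output trained on sentence/score pairs. The toy data, hyperparameters, and output path are illustrative, not the paper's actual setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1, problem_type="regression")

# Toy stand-in for README++ sentence/score pairs (CEFR levels mapped to 1-6).
data = Dataset.from_dict({
    "text": ["The dog runs fast.",
             "Notwithstanding prior covenants, the lessee shall indemnify the lessor."],
    "label": [1.0, 6.0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="readability-xlmr", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
trainer.save_model("readability-xlmr")  # reused in the cross-lingual sketch later
```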

1. Supervised vs. Prompting Results

The results highlighted a significant gap between fine-tuned “smaller” models and massive general-purpose LLMs.

Figure 4: Pearson correlation of fine-tuned multilingual and monolingual LMs, as well as prompted GPT3.5, GPT4, Aya23-8b, Llama2-7b, and Llama3.1-8b models.

As seen in Figure 4, fine-tuned models (like mT5 and XLM-R) generally outperformed prompting methods, even powerful ones like GPT-4.

  • mT5 (Large) was a standout performer across all languages.
  • Prompting models struggled to match the precision of supervised models, although GPT-4 showed respectable performance.

However, the authors found a way to boost the performance of prompting: Context Diversity.

When they prompted Llama 2 with examples all drawn from a single domain (e.g., only News), the model performed poorly on other domains. But when the few-shot examples were selected from diverse domains (e.g., one News, one Poem, one Contract), the model’s correlation skyrocketed.

Figure 5: Effect of domain diversity of in-context examples on Llama2-7b performance. Correlation is greatly improved when examples are sampled from an increasing number of domains.

This finding, illustrated in Figure 5, offers a practical lesson for prompt engineers: when asking an LLM to perform a subjective task like readability assessment, the diversity of your examples matters, not just their quantity.
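A small sketch of that idea: build the few-shot prompt by drawing each in-context example from a different domain. The example pool, prompt wording, and rating scale below are assumptions for illustration, not the paper's exact prompt.

```python
import random

# Tiny, made-up pool of (sentence, CEFR level) examples keyed by domain.
example_pool = {
    "news":   [("The council approved the new budget on Tuesday.", 3)],
    "poetry": [("And miles to go before I sleep.", 4)],
    "legal":  [("The party of the first part hereby waives all claims.", 6)],
}

def build_prompt(target_sentence: str, k: int = 3) -> str:
    # Pick k distinct domains, then one in-context example from each.
    domains = random.sample(list(example_pool), k=min(k, len(example_pool)))
    shots = [random.choice(example_pool[d]) for d in domains]
    lines = ["Rate the readability of each sentence on a 1-6 CEFR scale."]
    for sent, level in shots:
        lines.append(f"Sentence: {sent}\nLevel: {level}")
    lines.append(f"Sentence: {target_sentence}\nLevel:")
    return "\n\n".join(lines)

print(build_prompt("The mitochondria is the powerhouse of the cell."))
```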

2. The Unsupervised Challenge: The Transliteration Trap

The researchers also evaluated an unsupervised metric called RSRS (Ranked Sentence Readability Score). This method uses a language model to calculate how “surprised” it is by the words in a sentence. The logic is: if a model assigns a low probability to a word (high loss), that word is likely rare and difficult.

The formula for RSRS looks like this:

RSRS(S) = ( Σ_{i=1}^{|S|} √i · WNLL(i) ) / |S|

where the words of sentence S are sorted in ascending order of their word-level negative log-likelihood (WNLL) under the language model, i is a word's rank in that ordering, and |S| is the number of words.

Here, the metric gives the greatest weight to the words the model finds hardest to predict. This works well for English. However, it failed spectacularly for languages with non-Latin scripts like Arabic, Hindi, and Russian.
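Before looking at why, here is a rough sketch of how such a score can be computed with an off-the-shelf causal LM. RSRS is defined over word-level negative log-likelihoods; this simplified version works directly with subword tokens, and GPT-2 is just a stand-in model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def rsrs_like_score(sentence: str) -> float:
    # Per-token negative log-likelihoods under the LM.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = enc["input_ids"][:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).squeeze(0)
    # Sort losses ascending and weight each by the square root of its rank,
    # so the most "surprising" tokens dominate the score.
    sorted_nll, _ = torch.sort(nll)
    ranks = torch.arange(1, len(sorted_nll) + 1, dtype=torch.float)
    return float((torch.sqrt(ranks) * sorted_nll).sum() / len(sorted_nll))

print(rsrs_like_score("The cat sat on the mat."))
print(rsrs_like_score("Quantum chromodynamics exhibits asymptotic freedom."))
```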

Why? Transliterations.

In languages like Arabic or Hindi, proper nouns (like “Facebook”, “Internet”, or foreign names) are transliterated phonetically. To a language model, these transliterated words look statistically “rare” because they are foreign loanwords. The RSRS metric assumes “rare = difficult.”

But for a human reader, the word “Facebook” written in Arabic script is not difficult—it’s a very common concept.

Figure 7: Effect of increasing the penalty factor on the Pearson correlation between RSRS scores and human ratings for Arabic, Hindi, and Russian sentences that contain transliterations.

The authors tested this hypothesis by applying a “penalty factor” to sentences containing transliterations (Figure 7). As they discounted the difficulty scores of these sentences (effectively telling the metric to ignore the rare loanwords), the correlation with human judgment improved significantly. This reveals a major blind spot in current unsupervised readability metrics for languages written in non-Latin scripts.
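The sketch below only illustrates the general idea of discounting sentences that contain known transliterations; the lexicon, the penalty value, and the way the penalty is applied are assumptions, and the paper's exact mechanism may differ.

```python
from typing import Iterable

def apply_transliteration_penalty(score: float, sentence: str,
                                  loanwords: Iterable[str],
                                  penalty: float = 0.5) -> float:
    # Discount the difficulty score if the sentence contains a transliterated
    # loanword that the LM finds rare but readers find easy.
    if any(term in sentence for term in loanwords):
        return score * penalty
    return score

# "فيسبوك" (Facebook) and "إنترنت" (Internet) are rare to an LM, easy for readers.
arabic_loanwords = {"فيسبوك", "إنترنت"}
print(apply_transliteration_penalty(4.2, "أنشأ حسابًا جديدًا على فيسبوك", arabic_loanwords))
```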

Cross-Domain and Cross-Lingual Power

The true strength of README++ lies in its ability to teach models to generalize. If you train a model on Wikipedia, it learns to judge Wikipedia. If you train it on README++, it learns to judge language.

Generalizing to Unseen Domains

To test this, the researchers set up a “Zero-Shot Domain Transfer” experiment. They trained a model on README++ and tested it on domains it had never seen before. They compared this against a model trained on the CEFR-SP dataset (which is mostly Wikipedia and News).

Figure 11: Pearson Correlation per domain for XLMR_L trained using README++ and CEFR-SP. The model trained with README++ achieves better domain generalization.

The results in Figure 11 are striking. The model trained on README++ (blue bars) outperforms the CEFR-SP model (orange bars) in almost every category.

  • Look at Poetry: The CEFR-SP model has almost zero correlation, meaning its predictions carry essentially no signal. The README++ model maintains a strong correlation.
  • Look at Letters and Medical: Again, the diversity of the training data allows the model to adapt to these distinct styles.
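The per-domain numbers behind a figure like this come down to grouping held-out predictions by domain and computing a correlation within each group. A minimal sketch with made-up data and column names:

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up held-out predictions: gold = human CEFR rating (1-6), pred = model score.
df = pd.DataFrame({
    "domain": ["news"] * 4 + ["poetry"] * 4,
    "gold":   [2, 3, 5, 6, 1, 3, 4, 6],
    "pred":   [2.2, 3.1, 4.6, 5.8, 2.5, 2.9, 3.1, 4.0],
})

for domain, group in df.groupby("domain"):
    r, _ = pearsonr(group["gold"], group["pred"])
    print(f"{domain:>8}: Pearson r = {r:.2f}")
```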

Cross-Lingual Transfer

Finally, the researchers asked: Can we train a model on English data and have it assess readability in Hindi or Russian?

Using multilingual models like XLM-R, they found that training on the English portion of README++ provided a massive boost to cross-lingual performance compared to previous datasets.

Table 5: Zero-shot cross-lingual transfer results using XLMR_L.

Table 5 shows that training on README++ (English) resulted in correlations (ρ) of 0.70+ for Hindi and Russian, and 0.76 for French. This is a remarkable capability, suggesting that the model learned a universal concept of “readability” (e.g., complexity of thought, structural density) that transcends specific vocabulary.
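In practice, the zero-shot cross-lingual evaluation amounts to scoring held-out sentences in another language with a model fine-tuned only on English, then correlating the scores with the human labels. The checkpoint path (reusing the hypothetical "readability-xlmr" output from the fine-tuning sketch above), the Hindi sentences, and the gold ratings are all illustrative.

```python
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint produced by the English-only fine-tuning sketch above.
ckpt = "readability-xlmr"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
model.eval()

# Tiny made-up Hindi test set with human ratings on the 1-6 scale.
hindi_sentences = [
    "कुत्ता तेज़ दौड़ता है।",
    "हम कल बाज़ार गए थे।",
    "संविधान के अनुच्छेदों की व्याख्या न्यायपालिका का दायित्व है।",
]
gold_ratings = [1, 2, 5]

with torch.no_grad():
    inputs = tokenizer(hindi_sentences, return_tensors="pt",
                       padding=True, truncation=True)
    preds = model(**inputs).logits.squeeze(-1).tolist()

print(f"Pearson r = {pearsonr(gold_ratings, preds)[0]:.2f}, "
      f"Spearman rho = {spearmanr(gold_ratings, preds)[0]:.2f}")
```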

Conclusion

The README++ paper represents a significant maturation in the field of readability assessment. By moving beyond Anglocentric, single-domain datasets, the authors have exposed the limitations of previous metrics and provided a robust path forward.

Key Takeaways:

  1. Diversity is King: Models trained on diverse domains (README++) generalize far better than those trained on narrow domains (Wikipedia), even on text types they haven’t seen before.
  2. Context Matters in Prompting: For LLMs, providing few-shot examples from varied domains significantly improves performance.
  3. Scripts Change the Rules: Unsupervised metrics that work for English fail for languages like Arabic and Hindi due to phenomena like transliteration, requiring new, linguistically aware metrics.
  4. Universal Readability: High-quality training data allows models to transfer readability judgments across languages, paving the way for educational tools in low-resource languages.

For students and researchers in NLP, README++ serves as a reminder: before you build a model, look deeply at your data. The richness of the input defines the intelligence of the output.