Imagine a high school English teacher sitting at their desk on a Sunday evening. In front of them is a stack of 150 essays on “The Effects of Computers on Society.” Grading each one takes at least 10 minutes. That is 25 hours of work—just for one assignment.
This scenario is the driving force behind Automated Essay Scoring (AES). For over 50 years, researchers have been chasing the holy grail of Natural Language Processing (NLP): a system that can read a student’s essay and instantly assign a score that matches what a human expert would give.
In the last decade, we have seen massive leaps in performance. Fueled by deep learning and large datasets, modern AES systems are achieving correlation scores with human graders that were previously thought impossible. However, a recent position paper by researchers Shengjie Li and Vincent Ng from the University of Texas at Dallas suggests that the field might be running in the wrong direction.
In this deep dive, we will explore their paper, “Automated Essay Scoring: A Reflection on the State of the Art.” We will unpack how the obsession with beating benchmark numbers might be stalling real progress, and we will look at the seven distinct recommendations the authors propose to fix the future of essay grading.
The Status Quo: How AES Works Today
To understand the critique, we first need to understand the current “recipe” for AES research.
At its core, AES is a supervised learning problem. You have an input (an essay) and an output (a holistic score, usually an integer on a small scale such as 1–6). The goal is to build a model that maps the text to the score.
The Ingredients
- The Prompt: The specific question or topic the student is writing about.
- The Essay: The student’s response.
- The Rubric: The guidelines humans use to determine the quality of the writing.
- The Score: The final integer value assigned to the work.
Let’s look at a concrete example from the paper to see what this data looks like in practice.

Figure 1 of the paper shows a prompt asking students to write a letter to a newspaper about the effects of computers. The sample essay provided attempts to argue that computers benefit people by providing entertainment and communication.
However, if you read closely, you’ll notice the writing is repetitive (“So as you can see…”), the arguments are shallow, and the transitions are awkward. Based on the rubric in the bottom half of Figure 1, this essay received a Score of 3. It takes a position, but the support is inadequate.
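Concretely, each training example bundles the four ingredients listed above. Here is a minimal sketch of how such a record might be represented in code; the field values are illustrative, loosely based on the computers prompt just described:

```python
# A minimal sketch of a single AES training example. Field values are illustrative.
from dataclasses import dataclass

@dataclass
class EssayRecord:
    prompt: str   # the question the student responds to
    essay: str    # the student's full response text
    rubric: str   # the scoring guidelines human raters follow
    score: int    # the holistic score, e.g. on a 1-6 scale

example = EssayRecord(
    prompt="Write a letter to your local newspaper about the effects computers have on people.",
    essay="Dear newspaper, I think computers have a positive effect on people...",
    rubric="Score 6: takes a clear position with well-developed support; ...",
    score=3,
)
```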
A typical AES system ingests thousands of these examples. In the early days, researchers used heuristics—manually coding rules like “if the essay has 5 paragraphs, give it a higher score for Organization.” Later, the field moved to Machine Learning (ML) approaches, counting features like sentence length, vocabulary richness, and grammar errors.
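As a rough illustration of that feature-based recipe, the sketch below extracts a few hand-crafted features (length, vocabulary richness, average sentence length) and fits a simple regressor. The features and toy data are illustrative, not the exact features used by historical systems:

```python
# A minimal sketch of the classic feature-based approach with toy data.
import re
from sklearn.linear_model import LinearRegression

def extract_features(essay: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    unique_ratio = len({w.lower() for w in words}) / max(len(words), 1)  # vocabulary richness
    avg_sentence_len = len(words) / max(len(sentences), 1)               # sentence complexity proxy
    return [len(words), unique_ratio, avg_sentence_len]

essays = [
    "Computers help people talk to friends. Computers are fun.",
    "While computers offer entertainment, their deeper value lies in connecting "
    "distant communities and widening access to information.",
]
scores = [2, 5]  # hypothetical human scores

model = LinearRegression().fit([extract_features(e) for e in essays], scores)
print(model.predict([extract_features("Computers are good. They help a lot.")]))
```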
Today, the standard is Deep Learning. Researchers use neural networks (like LSTMs or Transformers such as BERT) to learn a “representation” of the essay automatically. These models read the text and predict the score without needing manual feature engineering.
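A minimal sketch of that neural recipe, using the Hugging Face transformers library: a pre-trained BERT encoder with a single regression head that outputs a real-valued score. The head below is untrained; in practice the whole model is fine-tuned on thousands of scored essays:

```python
# A minimal sketch of a BERT-based scorer: encoder + one regression output.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)
model.eval()

essay = "Dear newspaper, I think computers have a positive effect on people..."
inputs = tokenizer(essay, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    predicted = model(**inputs).logits.item()  # a real number, later rounded onto the score scale
print(predicted)
```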
The Metric: Quadratic Weighted Kappa (QWK)
How do we know if a model is good? The industry standard metric is Quadratic Weighted Kappa (QWK).
QWK measures agreement between the computer and the human, but it’s smarter than simple accuracy. It penalizes “far misses” much more than “near misses.”
- If a human gives a score of 4 and the computer guesses 3, that’s a small penalty.
- If a human gives a score of 4 and the computer guesses 1, that’s a massive penalty.
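In practice, QWK is straightforward to compute. A minimal sketch using scikit-learn's cohen_kappa_score with quadratic weights (the scores below are made up for illustration):

```python
# A minimal sketch of computing QWK between human and model scores.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 3, 1, 5]   # hypothetical human grades (1-6 scale)
model_scores = [4, 3, 4, 2, 3, 3, 2, 5]   # hypothetical model predictions

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```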
This metric dominates the field. A successful paper in AES usually follows a predictable pattern: propose a new neural architecture, run it on the standard dataset, and show that your QWK is 0.01 higher than the previous state-of-the-art model.
The Problem: The “Leaderboard” Trap
The authors argue that this focus on incrementally improving QWK on a single dataset is dangerous. It encourages “thinking small.”
If the entire field is optimizing for a single number, we risk overfitting to the quirks of a specific dataset rather than solving the actual problem of evaluating writing. We might build systems that are great at guessing scores on a specific set of 8th-grade essays but are useless in a real-world classroom or for a student learning English as a second language.
Li and Ng propose that we need to stop looking at the scoreboard and start looking at the game. They categorize their reflections into several key areas: Evaluation, Data, Tasks, and the role of Large Language Models (LLMs).
Reflection 1: Evaluation and Interpretability
Recommendation #1: Look beyond the QWK score.
Deep learning models are often “black boxes.” A BERT-based model might process the essay in Figure 1 and correctly output a “3,” but can it tell us why?
In the pre-neural era, models were interpretable by design. If a regression model gave a low score, you could look at the weights: “Oh, the model penalized this essay because the sentence complexity score was low.”
With modern neural networks, the model learns a dense vector representation of the text. We don’t know if it gave the essay a “3” because of the repetitive transitions (“So as you can see”) or simply because the essay was short.
The authors urge researchers to perform rigorous error analysis. Instead of just reporting an average score, researchers should ask:
- Does the model perform better on persuasive essays than narrative ones?
- Is it biased against minority classes (e.g., very high or very low scoring essays)?
- Is it actually detecting “coherence,” or is it just counting words?
Without understanding why a model improves performance, the improvement is scientifically hollow.
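One simple starting point for that kind of error analysis is to break evaluation down by slice rather than reporting a single aggregate number. A minimal sketch, assuming each essay record carries a genre label (the data below is illustrative):

```python
# A minimal sketch of slice-based error analysis: QWK per essay genre.
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

records = [  # (genre, human_score, model_score) -- toy data
    ("persuasive", 4, 4), ("persuasive", 2, 3), ("persuasive", 5, 4),
    ("narrative", 3, 3), ("narrative", 5, 3), ("narrative", 1, 2),
]

by_genre = defaultdict(lambda: ([], []))
for genre, human, model in records:
    by_genre[genre][0].append(human)
    by_genre[genre][1].append(model)

for genre, (human, model) in by_genre.items():
    qwk = cohen_kappa_score(human, model, weights="quadratic")
    print(f"{genre:12s} QWK = {qwk:.3f}")
```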
Reflection 2: The Monoculture of Data
Recommendation #2: Evaluate on more than just the ASAP corpus.
In the world of AES, one dataset rules them all: the Automated Student Assessment Prize (ASAP) corpus. Released in a 2012 Kaggle competition, it contains thousands of essays written by US students in grades 7–10.
While ASAP is a great resource, relying on it exclusively creates blind spots:
- Demographics: ASAP essays were written mostly by native English speakers (or students proficient in English) in the US school system. They don’t reflect the specific struggles of English as a Second Language (ESL) learners, who make different types of grammatical and lexical errors.
- The Length Confound: In timed test settings (like the ones where ASAP was collected), the length of an essay is highly correlated with its score. Longer essays usually get better grades. Models trained on ASAP often learn a simple heuristic: “Longer = Better.”
If you take a model trained on ASAP and apply it to a homework assignment where students had unlimited time (and thus everyone wrote a long essay), the model might fail spectacularly because length is no longer a useful signal.
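A quick sanity check for this confound is to measure how strongly essay length alone correlates with the human score; if that correlation is very high, a model can score well without really reading anything. A minimal sketch with illustrative numbers:

```python
# A minimal sketch of checking the length confound via Pearson correlation.
import numpy as np

word_counts = np.array([120, 310, 450, 95, 520, 260])   # hypothetical essay lengths
human_scores = np.array([2, 3, 5, 2, 5, 3])             # hypothetical grades

r = np.corrcoef(word_counts, human_scores)[0, 1]
print(f"Pearson r between length and score: {r:.2f}")
```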
The authors recommend that researchers test their systems on diverse corpora, such as CLC-FCE (Cambridge Learner Corpus) or ICLE (International Corpus of Learner English), to show that their models are robust across different writing conditions and writer backgrounds.
Reflection 3: The Challenge of Cross-Prompt Scoring
Recommendation #3: Tackle the hard problem of Cross-Prompt AES.
Most AES research is done in a Within-Prompt setting. This means:
- You have 1,000 essays written for the prompt “Effect of Computers.”
- You train on 800 of them.
- You test on the remaining 200 for the exact same prompt.
This is “easy” mode. The model doesn’t need to understand the question; it just needs to recognize the vocabulary patterns of high-scoring essays for that specific topic.
The real world doesn’t work like that. A teacher wants a system that can grade an essay on a new topic that the model has never seen before. This is Cross-Prompt AES, and it is notoriously difficult.
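The difference between the two setups comes down to how the data is split. A minimal sketch, assuming each essay is tagged with a prompt ID (shapes and data are illustrative):

```python
# A minimal sketch contrasting within-prompt and cross-prompt evaluation splits.
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneGroupOut

essay_ids = np.arange(12)              # stand-ins for essay feature vectors
prompt_ids = np.repeat([1, 2, 3], 4)   # three prompts, four essays per prompt

# Within-prompt: a random split, so every prompt appears in both train and test.
train_ids, test_ids = train_test_split(essay_ids, test_size=0.25, random_state=0)

# Cross-prompt: hold out an entire prompt the model has never seen during training.
for train_idx, test_idx in LeaveOneGroupOut().split(essay_ids, groups=prompt_ids):
    held_out = int(prompt_ids[test_idx][0])
    print(f"Test on unseen prompt {held_out} ({len(test_idx)} essays), "
          f"train on the other {len(train_idx)} essays")
```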
To grade a new prompt effectively, a model needs two things current systems lack:
- World Knowledge: If the new prompt is about “Capital Punishment,” the model needs to know what arguments are relevant to that topic to judge if the essay is persuasive.
- Rubric Awareness: What counts as a “good” essay might change. A narrative essay about a summer vacation has different criteria than a persuasive essay about politics.
The authors critique current cross-prompt attempts for ignoring the prompt and rubric entirely. They argue we need models that explicitly read the new prompt and rubric (perhaps using LLMs to extract knowledge) to adjust their scoring criteria dynamically.
Reflection 4: The Power of Traits
Recommendation #4: Use specific traits to solve interpretability and generalization.
Currently, most systems output a Holistic Score—a single number summarizing quality. But looking back at Figure 1, the rubric isn’t just a single number. It describes specific dimensions:
- Position Taking
- Organization
- Sentence Fluency
- Audience Awareness
These are called Traits.
The authors argue that traits are the key to solving the problems mentioned in Reflections 1 and 3.
- For Interpretability: If a model predicts a holistic score of 3, but also predicts a “Low” score for Organization and a “High” score for Grammar, the student knows exactly what to fix.
- For Cross-Prompt Generalization: While the vocabulary of an essay changes from topic to topic, the concept of “good organization” remains relatively stable. A well-structured paragraph looks similar whether you are writing about computers or climate change.
By focusing on modeling these specific traits rather than just the final score, researchers can build systems that are more robust and more helpful to students.
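One common way to model traits is multi-task learning: a shared encoder with one scoring head per trait plus a holistic head. The sketch below is a toy PyTorch version, assuming essays have already been encoded into fixed-size vectors; the trait names follow the rubric dimensions listed above:

```python
# A minimal sketch of a multi-trait scorer: shared encoder, one head per trait.
import torch
import torch.nn as nn

TRAITS = ["position", "organization", "sentence_fluency", "audience_awareness"]

class MultiTraitScorer(nn.Module):
    def __init__(self, encoder_dim: int = 768):
        super().__init__()
        # In practice this would be a fine-tuned BERT-style encoder.
        self.encoder = nn.Sequential(nn.Linear(encoder_dim, 256), nn.ReLU())
        self.trait_heads = nn.ModuleDict({t: nn.Linear(256, 1) for t in TRAITS})
        self.holistic_head = nn.Linear(256, 1)

    def forward(self, essay_embedding: torch.Tensor) -> dict[str, torch.Tensor]:
        h = self.encoder(essay_embedding)
        scores = {t: head(h).squeeze(-1) for t, head in self.trait_heads.items()}
        scores["holistic"] = self.holistic_head(h).squeeze(-1)
        return scores

model = MultiTraitScorer()
fake_embeddings = torch.randn(2, 768)   # two essays, pre-encoded
print({name: s.shape for name, s in model(fake_embeddings).items()})
```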
Reflection 5: The Data Bottleneck
Recommendation #5: Prioritize corpus development and shared visions.
Why are we still using a dataset from 2012 (ASAP)? Because building essay datasets is incredibly hard.
Unlike scraping tweets or news articles, you cannot easily scrape student essays due to privacy concerns. Furthermore, annotating them is expensive. You need trained teachers to read them, not just random crowd-workers.
The authors highlight a “Catch-22”:
- We need better models (like Multi-Trait scoring).
- To build them, we need data annotated with traits.
- But annotating traits is too expensive, so no one creates the data.
- So researchers stick to the old data (ASAP).
To break this cycle, the authors call on large organizations (like ETS or universities) to release raw, unannotated essays or models pre-trained on them. They also suggest the community needs to agree on a “shared vision” for annotation. If we are going to spend money annotating, what layers do we need? Errors? Argument strength? Coherence?
Reflection 6: The Role of LLMs
Recommendation #6: Use LLMs to assist, not just to replace.
With the rise of ChatGPT and GPT-4, the immediate reaction in the AES community was, “Can we just prompt the LLM to grade the essay?”
Early results show that while LLMs are impressive, they are not yet better than fine-tuned, specialized models for scoring. They are sensitive to the specific wording of the prompt and can be inconsistent.
However, Li and Ng suggest a smarter way to use LLMs: Augmentation.
Instead of asking the LLM to give the final score, we can use it to handle the “grunt work” of research:
- Corpus Creation: Ask LLMs to generate synthetic essays at specific proficiency levels to train smaller, specialized models.
- Annotation Assistance: Ask an LLM to “pre-score” essays or highlight potential grammar errors. A human expert then reviews the LLM’s work. It is much faster to verify an LLM’s critique than to grade an essay from scratch.
This “Human-in-the-loop” approach leverages the generative power of LLMs to solve the data scarcity problem discussed in Reflection 5.
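As a concrete illustration of that human-in-the-loop workflow, the sketch below asks an LLM to draft a provisional score and a list of issues that a human grader then verifies. It assumes access to an OpenAI-compatible chat API; the model name and prompt wording are illustrative:

```python
# A minimal sketch of LLM-assisted pre-annotation for a human grader to review.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_annotation(essay: str, rubric: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You assist human essay graders."},
            {"role": "user", "content": (
                f"Rubric:\n{rubric}\n\nEssay:\n{essay}\n\n"
                "Suggest a provisional score (1-6) and list grammar or "
                "coherence issues for a human grader to review."
            )},
        ],
    )
    return response.choices[0].message.content

# A human expert then reviews and corrects the draft, which is typically
# faster than grading the essay from scratch.
```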
Reflection 7: Beyond Grading
Recommendation #7: Integrate AES into education systems.
Finally, the paper reminds us that a score is not an endpoint. In a classroom, a grade is meant to be part of a feedback loop.
Current AES research treats the score as the final product. The authors encourage researchers to think about Intelligent Tutoring Systems. How can a scoring model be embedded into a writing assistant that helps a student while they are drafting?
If we combine trait-specific scoring (Reflection 4) with feedback generation, we move from “Automated Essay Scoring” to “Automated Writing Evaluation,” which has a much higher potential to improve educational outcomes.
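As a toy illustration of that shift, trait predictions (like those from the multi-trait scorer sketched earlier) can be mapped to formative feedback instead of stopping at a number. The thresholds and messages below are illustrative:

```python
# A minimal sketch of turning trait scores into formative feedback messages.
FEEDBACK = {
    "organization": "Try grouping related ideas into paragraphs with clear transitions.",
    "sentence_fluency": "Vary sentence length and avoid repeating the same openers.",
    "audience_awareness": "Address the readers directly and anticipate counterarguments.",
}

def generate_feedback(trait_scores: dict[str, float], threshold: float = 3.0) -> list[str]:
    """Return feedback for every trait scored below the threshold."""
    return [msg for trait, msg in FEEDBACK.items()
            if trait_scores.get(trait, threshold) < threshold]

print(generate_feedback({"organization": 2.0, "sentence_fluency": 4.0, "audience_awareness": 2.5}))
```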
Conclusion: Moving the Field Forward
The field of Automated Essay Scoring stands at a crossroads. We have mastered the art of training neural networks to predict a number on the ASAP dataset. But as Li and Ng argue, maximizing QWK is a short-term game.
To build the next generation of writing technologies, researchers need to accept that the problem is messier than a leaderboard. We need:
- Deeper analysis of why models behave the way they do.
- Broader datasets that reflect the reality of global learners.
- Trait-based models that provide actionable feedback.
- Collaborative use of LLMs to build the infrastructure the field lacks.
By shifting focus from “beating the state of the art” to “advancing the understanding of writing,” the AES community can deliver on the promise made 50 years ago: saving teachers time and helping students write better.