Grading an essay is subjective, nuanced, and exhausting. A teacher doesn’t just look at a paper and say “8 out of 10.” They evaluate structure, vocabulary, grammar, and content simultaneously. Automated Essay Scoring (AES) systems attempt to replicate this, but they have historically faced a significant technical hurdle: a mismatch between how they are trained and how they are evaluated.
Most AES systems are trained to minimize simple error measures (like Mean Squared Error), but they are evaluated in the real world using a metric called Quadratic Weighted Kappa (QWK). QWK measures how well the AI agrees with a human grader, heavily penalizing large discrepancies. The problem? QWK is mathematically “non-differentiable,” meaning you can’t easily use it to train a neural network with standard backpropagation.
In this post, we explore a new framework proposed by researchers at POSTECH, dubbed SaMRL (Scoring-aware Multi-reward Reinforcement Learning). This method bridges the gap by using Reinforcement Learning to optimize for QWK directly, treating essay scoring not as a classification problem, but as a text generation task.
The Core Problem: The Metric Mismatch
To understand why SaMRL is necessary, we first need to look at how traditional AES models work. Generally, they fall into two camps:
- Regression Models: These treat the score as a continuous number. They are trained using Mean Squared Error (MSE). They are good at getting close to the target but don’t inherently understand the “categories” of grades.
- Classification Models: These treat every possible score (e.g., 1, 2, 3, 4) as a distinct class. They output probabilities for each class.
However, the gold standard for evaluating these systems is the Quadratic Weighted Kappa (QWK). QWK is sophisticated: it cares about the ordering of scores. Confusing a score of 1 with a 2 is a minor error; confusing a 1 with a 10 is a massive error. Standard Cross-Entropy loss (used in classification) treats all errors roughly the same.
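To see the quadratic penalty in action, here is a minimal sketch using scikit-learn’s `cohen_kappa_score` with quadratic weights; the scores below are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative human and model scores on a 1-10 scale.
human       = [1, 4, 5, 8, 9, 10]
model_close = [2, 4, 5, 7, 9, 10]   # only off-by-one disagreements
model_far   = [10, 4, 5, 1, 9, 10]  # two large disagreements

# weights="quadratic" penalizes a disagreement between scores i and j
# by (i - j)^2, so confusing a 1 with a 10 hurts far more than a 1 with a 2.
print(cohen_kappa_score(human, model_close, weights="quadratic"))  # high agreement
print(cohen_kappa_score(human, model_far, weights="quadratic"))    # much lower
```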

As shown in Figure 1, the researchers propose moving away from pure regression or classification. Instead, they utilize an Autoregressive Framework. This means the model generates the score as a sequence of text tokens (e.g., generating the text “Trait 1 Score 3”). This generation process produces probability distributions, which unlocks the ability to use Reinforcement Learning (RL).
With RL, we don’t need a differentiable loss function. We just need a “Reward.” If the model generates a score that results in a high QWK agreement with humans, we give it a positive reward. If it fails, we give it a negative one.
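As a rough illustration of that idea (a REINFORCE-style sketch, not the paper’s exact PPO objective), the reward simply scales the log-probability of the score tokens the model generated:

```python
import torch

def policy_gradient_loss(token_log_probs: torch.Tensor, reward: float) -> torch.Tensor:
    """REINFORCE-style loss for one generated score sequence.

    token_log_probs holds the log-probabilities of the tokens the model
    actually generated (one value per token). A positive reward (good QWK
    agreement) makes those tokens more likely in the future; a negative
    reward makes them less likely.
    """
    return -reward * token_log_probs.sum()

# Toy example: three generated tokens and a positive reward.
log_probs = torch.tensor([-0.2, -0.5, -0.1], requires_grad=True)
loss = policy_gradient_loss(log_probs, reward=0.8)
loss.backward()  # gradients now favor regenerating this score sequence
```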
The SaMRL Method: A Deep Dive
The heart of the paper is SaMRL. It is designed to score essays on multiple traits (e.g., Content, Organization, Fluency) simultaneously.
1. The Architecture
The model uses a T5 (Text-to-Text Transfer Transformer) architecture. Given an essay, it doesn’t just output a number. It generates a string of text describing the traits and scores, such as:
Content 3, Organization 4, Fluency 3...
This turns scoring into a language generation task. Because the model predicts the probability of the next token, the researchers can use a policy gradient algorithm (specifically PPO) to train it.
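A minimal sketch of that pipeline with Hugging Face Transformers might look like this; the checkpoint name, prompt prefix, and output format are placeholders, since the actual model is fine-tuned on essay data:

```python
import re
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" is a stand-in; the paper fine-tunes a T5 model on essay data.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

essay = "Computers let students explore ideas far beyond the classroom..."
inputs = tokenizer("score the essay: " + essay, return_tensors="pt", truncation=True)

# A fine-tuned scoring model emits text such as "Content 3, Organization 4, Fluency 3".
output_ids = model.generate(**inputs, max_new_tokens=32)
score_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Parse "<trait> <score>" pairs back into numbers for evaluation and rewards.
scores = {trait: int(value) for trait, value in re.findall(r"([A-Za-z]+) (\d+)", score_text)}
print(score_text, scores)
```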
2. The Reinforcement Learning Loop
The training process is visualized below. The “Policy” (the AI being trained) looks at an essay and generates scores. These scores are compared against the human labels to calculate rewards.

There are two critical components here ensuring the model learns correctly:
- The Anchor Model: A frozen copy of the original model (shown in blue). We use KL Regularization to ensure the trained model (the Policy) doesn’t drift too far from the Anchor. This keeps the model generating valid text formats and stops it from devolving into gibberish just to game the reward system (a minimal sketch of this penalty follows the list).
- Scoring-Aware Rewards: This is the primary contribution of the paper.
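To make the anchoring idea concrete, here is a minimal sketch of a per-token KL-style penalty between the Policy and the frozen Anchor, in the spirit of PPO-based fine-tuning; the penalty coefficient and tensor shapes are illustrative assumptions, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(policy_logits: torch.Tensor,
                        anchor_logits: torch.Tensor,
                        generated_ids: torch.Tensor,
                        reward: float,
                        beta: float = 0.1) -> torch.Tensor:
    """Subtract a KL-style drift penalty from the scoring reward.

    policy_logits / anchor_logits: [seq_len, vocab] logits at the positions
    of the generated tokens. The penalty grows as the Policy's token
    probabilities drift away from the frozen Anchor's, discouraging the
    model from emitting degenerate text just to chase a higher reward.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    anchor_logp = F.log_softmax(anchor_logits, dim=-1)
    idx = generated_ids.unsqueeze(-1)
    # Per-token log-probability ratio of the tokens that were generated.
    drift = (policy_logp.gather(-1, idx) - anchor_logp.gather(-1, idx)).squeeze(-1)
    return reward - beta * drift.sum()

# Toy example: 4 generated tokens over a 10-token vocabulary.
policy = torch.randn(4, 10)
anchor = torch.randn(4, 10)
ids = torch.tensor([1, 3, 5, 2])
print(kl_penalized_reward(policy, anchor, ids, reward=0.7))
```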
3. Designing the Perfect Reward
You might think, “Just use QWK as the reward.” However, QWK is usually calculated over a batch of essays, not a single one. Using a single batch-level metric can lead to unstable training because every essay in the batch gets the exact same reward, regardless of individual quality.
To solve this, SaMRL uses a Multi-Reward System consisting of three distinct signals.
Signal A: Batch-wise QWK (\(Q_B\)). This is the standard metric. It checks the agreement between the model’s predictions and human labels across the entire batch of essays.
\[ Q_B = 1 - \frac{\sum_{i,j} W_{i,j} C_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}} \]
Signal B: Trait-wise QWK (\(Q_T\)). This is a novel addition. Instead of just looking at the batch, the model calculates QWK within a single essay across its multiple traits (Content, Organization, etc.). This provides a sample-specific reward signal, allowing the model to understand which specific essays it scored well.
The researchers combine these two into a Bidirectional QWK Reward (\(r_Q\)):
\[ r_Q(S, \hat{S}) = \lambda \cdot Q_B + (1 - \lambda) \cdot Q_T \]
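A minimal sketch of this bidirectional reward, assuming integer trait scores for a small batch and reusing scikit-learn’s QWK; the λ value, the array shapes, and the exact way the paper aggregates batch-wise QWK are simplifications:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bidirectional_qwk_reward(pred: np.ndarray, gold: np.ndarray, lam: float = 0.5) -> float:
    """pred, gold: integer score matrices of shape [num_essays, num_traits]."""
    # Batch-wise QWK (Q_B): agreement over all scores in the batch
    # (the paper may aggregate per trait; this flattens everything for brevity).
    q_b = cohen_kappa_score(pred.ravel(), gold.ravel(), weights="quadratic")
    # Trait-wise QWK (Q_T): agreement across traits within each essay, averaged.
    per_essay = [cohen_kappa_score(p, g, weights="quadratic") for p, g in zip(pred, gold)]
    q_t = float(np.mean(per_essay))
    return lam * q_b + (1 - lam) * q_t

pred = np.array([[3, 4, 3], [2, 2, 3], [4, 4, 5]])
gold = np.array([[3, 4, 4], [2, 3, 3], [4, 5, 5]])
print(bidirectional_qwk_reward(pred, gold))
```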
Signal C: The MSE Penalty (\(r_M\)). Reinforcement learning can sometimes “game” metrics, losing sight of the absolute numerical values. To ground the model, the researchers introduce a Mean Squared Error (MSE) reward. This acts as a stabilizer, punishing the model if the raw numerical distance between the predicted score and the human score is too large.
\[ r_M(S, \hat{S}) = -\frac{1}{m} \sum_{j=1}^{m} \sqrt{\frac{1}{n} \sum_{i=1}^{n} (s_{ij} - \hat{s}_{ij})^2} \]
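Translating the formula directly (assuming \(j\) indexes the \(m\) traits and \(i\) indexes the \(n\) essays in the batch), the penalty is the negative root-mean-squared error averaged over traits:

```python
import numpy as np

def mse_penalty_reward(pred: np.ndarray, gold: np.ndarray) -> float:
    """pred, gold: score matrices of shape [n_essays, n_traits].

    For each trait, compute the RMSE over the essays in the batch, then
    average the trait-level RMSEs and negate, so larger numerical errors
    produce a more negative reward.
    """
    rmse_per_trait = np.sqrt(np.mean((pred - gold) ** 2, axis=0))
    return float(-np.mean(rmse_per_trait))

pred = np.array([[3, 4, 3], [2, 2, 3], [4, 4, 5]], dtype=float)
gold = np.array([[3, 4, 4], [2, 3, 3], [4, 5, 5]], dtype=float)
print(mse_penalty_reward(pred, gold))  # roughly -0.46 for these toy scores
```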
4. Putting It All Together
The final training objective treats this as a Multi-Task Learning problem. The total loss combines the Policy Gradient loss (driven by the rewards above) with the KL divergence regularization.
The weights for these losses (\(w_Q\) and \(w_M\)) are not fixed constants. They are learnable parameters. This means the model dynamically decides whether it needs to focus more on improving QWK or reducing MSE at any given point during training.
\[ \mathrm{loss}_{total} = w_Q \cdot \mathrm{loss}_{R_Q} + w_M \cdot \mathrm{loss}_{R_M} \]
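A minimal sketch of such learnable weighting in PyTorch; the exact parameterization and any constraints the authors place on the weights are not reproduced here:

```python
import torch
import torch.nn as nn

class RewardLossWeights(nn.Module):
    """Learnable trade-off between the QWK-driven and MSE-driven losses."""

    def __init__(self):
        super().__init__()
        # Registered as parameters, so the optimizer updates them together with
        # the rest of the model. (In practice raw weights like these usually need
        # some constraint or regularization so they are not simply driven toward
        # zero; the paper's exact scheme is not shown in this sketch.)
        self.w_q = nn.Parameter(torch.tensor(1.0))
        self.w_m = nn.Parameter(torch.tensor(1.0))

    def forward(self, loss_rq: torch.Tensor, loss_rm: torch.Tensor) -> torch.Tensor:
        return self.w_q * loss_rq + self.w_m * loss_rm

weights = RewardLossWeights()
total_loss = weights(torch.tensor(0.42), torch.tensor(0.17))
print(total_loss, list(weights.parameters()))
```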
Experimental Results
The researchers tested SaMRL on the ASAP and ASAP++ datasets, which are standard benchmarks for essay scoring. They compared their method against the previous state-of-the-art model, ArTS (Autoregressive Trait Scoring).
State-of-the-Art Performance
The results were compelling. SaMRL achieved higher QWK scores across almost every trait.

In Table 2, we see the results broken down by trait (Content, Language, Organization, etc.). SaMRL (the rows labeled “Ours”) consistently beats the ArTS baseline. The improvement is statistically significant, validating that optimizing for the evaluation metric (QWK) directly yields better final performance than indirect optimization.
Robustness Across Prompt Types
One of the most interesting findings was how the model behaved on different types of essays. The dataset contains “Argumentative” essays (Prompts 1, 2, 8) and “Source-Dependent” essays (Prompts 3-6).
Argumentative essays in this dataset typically have a much wider score range (e.g., 0 to 60) compared to source-dependent ones (0 to 4). Classification models notoriously struggle with wide ranges because the number of classes becomes unmanageable.

As shown in Figure 3 (Left Chart), SaMRL (red triangles) significantly outperforms the baseline (blue circles) on prompts P1, P2, and P8. This indicates that the RL approach effectively handles the complexity of wide score ranges, a historical weakness of classification-based AES.
Why the Multi-Reward System Matters
Was the complex reward system necessary? The researchers performed an ablation study to find out. They tested the model using only MSE rewards (\(SaSRL_M\)), only QWK rewards (\(SaSRL_Q\)), and then split the QWK into unidirectional components.

Table 4 reveals two key insights:
- MSE helps significantly: The model using only MSE (\(SaSRL_M\)) actually performed quite well, proving that simply adding a regression penalty to a generative model is beneficial.
- Combination is King: The full SaMRL model (\(SaMRL_{biQ}\)), which combines Bidirectional QWK and MSE, achieved the highest scores. The interaction between the “agreement” metric (QWK) and the “precision” metric (MSE) creates a synergistic effect.
Comparisons to Classification RL
Finally, it is worth looking at how this generative RL approach compares to older attempts that applied RL to classification models.

Figure 4 paints a clear picture. The classification-based RL models (green circles and blue triangles) struggle significantly on Prompt 8 (P8), which has the widest score range (0-60). SaMRL (red star) maintains high performance. This confirms that treating scoring as a generation task is superior to treating it as a classification task when applying Reinforcement Learning.
Dynamic Learning Dynamics
A subtle but cool feature of SaMRL is the learnable weights for the loss functions. Instead of hard-coding how much the model should care about QWK vs. MSE, the researchers let the model learn these weights.

The graph on the left shows the evolution of these weights. Interestingly, the weight for MSE (\(W_{MSE}\), red line) increases over time, while the weight for QWK (\(W_{QWK}\), blue line) decreases. This suggests that in the early stages, the model benefits from the structural feedback of QWK, but as it converges, precise error minimization via MSE becomes the dominant driver for fine-tuning.
Conclusion and Key Takeaways
The SaMRL paper presents a significant step forward in Automated Essay Scoring. By shifting the paradigm from simple regression or classification to Autoregressive Generation optimized via Reinforcement Learning, the authors have solved the long-standing problem of the “non-differentiable QWK.”
Key Takeaways for Students:
- Align Training with Evaluation: If your metric is QWK, try to train on QWK. RL allows you to optimize for metrics that you can’t differentiate.
- Don’t Ditch MSE: Even in complex RL setups, simple rewards like Mean Squared Error provide necessary grounding and stability.
- Generative Models can do Regression: Using text-generation models (like T5) to output numbers allows for more flexible probability distributions than standard regression heads.
This research highlights how “old” techniques like Reinforcement Learning can be creatively applied to “new” architectures like Transformers to solve specific, nuanced problems in Natural Language Processing.