Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with text. They can write poetry, summarize emails, and even code. However, when you ask an LLM to perform a task that requires analyzing a complex, structured document—like an academic paper with dozens of citations—and assign it a specific numerical rating (such as a “disruption score”), the model often falters.

The struggle stems from two main issues: structure and precision. First, standard LLMs read text linearly, but real-world documents are often hierarchical (trees of information). Second, LLMs are probabilistic text generators, not calculators; they struggle to output precise, continuous numerical values directly.

In this post, we will take a deep dive into a fascinating paper titled “Recurrent Alignment with Hard Attention for Hierarchical Text Rating.” The researchers propose a novel framework called RAHA that allows LLMs to “read” hierarchical structures efficiently and “refine” their numerical predictions using a technique inspired by Markov Chains.

The Problem: When Linear Reading Fails

Imagine you are trying to determine if a scientific paper is “disruptive”—meaning it changes the trajectory of its field. To do this, you can’t just read the abstract. You need to understand the relationship between the paper (the root) and its references (the leaves).

If you feed a massive string of text containing the paper and all its references into an LLM, you run into the “Lost in the Middle” phenomenon. The model gets overwhelmed by the length and noise, losing track of the subtle connections that matter. Furthermore, simply asking an LLM to “rate this from 0 to 1” usually yields inconsistent results because the model is optimized for next-token prediction, not for regression.

Figure 1: A comparison between a typical LLM and RAHA on a hierarchical text rating task. While a typical LLM treats the input as plain text, RAHA captures the hierarchical structure and can directly provide a task-specific rating score.

As shown in Figure 1 above, a standard LLM (left) treats the input as a flat sequence, often missing the mark (indicated by the red X). The RAHA framework (right), however, respects the tree structure of the data, processing the root and leaves in a way that yields a highly accurate numerical rating.

The Solution: RAHA Architecture

The RAHA framework tackles these challenges using a two-pronged approach: Hard Attention for processing input and Recurrent Alignment for refining output.

Let’s break down the architecture.

Figure 2: An overview of the RAHA architecture. A frozen LLM determines connections and generates updates with hard attention scores to filter noise. RAHA incorporates an adapter and a fully connected layer within a trainable LLM to predict text rating scores after aggregating the updates.

As illustrated in Figure 2, the process is split into two distinct phases involving two different LLMs.

Phase 1: Tree-Based Hard Attention

The first challenge is noise reduction. Not every reference in a paper is crucial for understanding its disruptiveness. Some are just background noise.

RAHA employs a frozen LLM (an LLM whose weights are not updated during training) to act as a filter. Instead of feeding the whole document tree at once, the system breaks it down into pairs: <Root, Leaf> (e.g., <Main Paper, Reference 1>).

For each pair, the frozen LLM is prompted to produce two things:

  1. Hard Attention Score (\(a\)): A binary value (0 or 1). Is this leaf relevant?
  2. Symbolic Representation (\(d\)): A text summary or update vector describing the relationship.

This process can be mathematically represented as:

\[
p_{ij} = \mathrm{Prompt}\big(r_i,\ l_{ij}\big), \qquad \big(a_{ij},\ d_{ij}\big) = \mathcal{F}\big(p_{ij}\big)
\]

Here, \(p_{ij}\) is the prompt built from root \(r_i\) and its \(j\)-th leaf \(l_{ij}\), and \(\mathcal{F}\) is the frozen LLM. This step effectively “prunes” the tree: if the attention score \(a_{ij}\) is 0, that leaf is discarded; if it is 1, the update \(d_{ij}\) is kept.

We then aggregate only the useful information. This filters out the noise before the heavy lifting begins:

\[
D^*_i = \big\{\, d_{ij} \;\big|\; a_{ij} = 1 \,\big\}
\]

By the end of Phase 1, the massive hierarchical document has been compressed into a clean set of relevant insights (\(D^*_i\)), ready for the next step.
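
To make Phase 1 concrete, here is a minimal Python sketch of the pairwise filtering loop. The `ask_llm` callable stands in for whatever interface the frozen LLM exposes, and the prompt wording is illustrative; the paper's actual prompt is not shown in this post.

```python
from typing import Callable, Tuple

def filter_leaves(
    root_text: str,
    leaf_texts: list[str],
    ask_llm: Callable[[str], Tuple[int, str]],
) -> list[str]:
    """Tree-based hard attention: keep only updates whose score a equals 1.

    `ask_llm` is a stand-in for the frozen LLM: given a prompt, it returns
    a binary relevance score a and a textual update d.
    """
    kept_updates = []
    for leaf in leaf_texts:
        prompt = (
            "Root document:\n" + root_text + "\n\n"
            "Reference:\n" + leaf + "\n\n"
            "Answer with a relevance score (0 or 1) and a short description "
            "of how this reference relates to the root."
        )
        a, d = ask_llm(prompt)      # hard attention score and symbolic update
        if a == 1:                  # a == 0 means the leaf is pruned as noise
            kept_updates.append(d)
    return kept_updates             # the filtered set D*_i for this root


# Toy usage with a dummy "LLM" that keeps only the first reference.
if __name__ == "__main__":
    dummy = lambda prompt: (1, "cites a key method") if "ref-1" in prompt else (0, "")
    print(filter_leaves("main paper text", ["ref-1 text", "ref-2 text"], dummy))
```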

Phase 2: The Trainable Aggregator

Now that we have filtered the noise, we need to generate a rating. RAHA uses a second, trainable LLM.

Fine-tuning a massive LLM is computationally expensive. To get around this, the researchers use Parameter-Efficient Fine-Tuning (PEFT). They freeze the main weights of the LLM and inject small, trainable adapter layers (matrices \(A\) and \(B\)).

\[
W' = W + \Delta W = W + BA
\]
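
The post does not name the exact PEFT variant, but freezing \(W\) and learning a low-rank update \(BA\) is the LoRA recipe, so here is a minimal PyTorch-style sketch under that assumption (the rank and initialization are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W' = W + B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the original weights stay frozen
            p.requires_grad = False
        # Trainable low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank correction x @ (B A)^T.
        return self.base(x) + x @ self.A.t() @ self.B.t()
```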

Finally, a fully connected layer is added to the end of the LLM to project the high-dimensional hidden states into a single numerical score (\(y\)):

\[
y = \mathrm{FC}(h)
\]

where \(h\) is the hidden state produced by the adapted LLM.
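
A sketch of that final projection is below, assuming the last token's hidden state is used as the pooled summary; the post only says hidden states are projected to a single score, so the pooling choice here is an assumption.

```python
import torch
import torch.nn as nn

class RatingHead(nn.Module):
    """Projects the trainable LLM's hidden states to a single rating score y."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        pooled = hidden_states[:, -1, :]       # last token's state as the summary (assumed)
        return self.fc(pooled).squeeze(-1)     # (batch,) continuous rating scores
```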

The “Aha!” Moment: Recurrent Alignment

If the paper stopped there, it would be a solid engineering improvement. But the researchers introduced a third concept that makes this work truly unique: Recurrent Alignment.

Human reasoning is rarely “one-shot.” When we evaluate something complex, we form an initial opinion, review the evidence, adjust our opinion, and repeat until we are confident. RAHA mimics this using a Markov-like process.

How It Works

During inference, the model doesn’t just predict the score once. It performs multiple iterations.

  1. Iteration 1: The model receives the Root, the Filtered Leaves, and a placeholder for the previous score (initialized to “None”). It predicts a score, say \(0.3\).
  2. Iteration 2: The model receives the same text input, but the prompt now includes: “The previous predicted rating was 0.3”. The model re-evaluates and adjusts the score to \(0.45\).
  3. Iteration K: This continues for \(K\) steps.

The prompt construction for this iterative process looks like this:

\[
p_i^{(t)} = \mathrm{Prompt}\big(r_i,\ D^*_i,\ y_i^{(t-1)}\big), \qquad y_i^{(0)} = \text{None}
\]

And the iterative cycle is defined as:

\[
y_i^{(t)} = \mathrm{FC}\Big(\mathcal{G}\big(p_i^{(t)}\big)\Big), \qquad t = 1, \dots, K
\]

where \(\mathcal{G}\) denotes the trainable, adapter-augmented LLM.
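
Put together, the inference-time loop might look like this sketch; `model_predict`, the prompt wording, and the default number of iterations are placeholders rather than the paper's exact implementation.

```python
from typing import Callable

def recurrent_alignment(
    root: str,
    filtered_updates: list[str],
    model_predict: Callable[[str], float],
    num_iterations: int = 3,
) -> float:
    """Iteratively re-predict the rating, feeding the previous score back into the prompt.

    `model_predict(prompt) -> float` stands in for the trainable LLM plus its
    fully connected rating head.
    """
    previous = "None"                   # iteration 1 starts with no prior prediction
    score = 0.0
    for _ in range(num_iterations):
        prompt = (
            "Document:\n" + root + "\n\n"
            "Relevant reference updates:\n" + "\n".join(filtered_updates) + "\n\n"
            f"The previous predicted rating was {previous}."
        )
        score = model_predict(prompt)   # y_t, conditioned on y_{t-1}
        previous = f"{score:.2f}"       # the new score seeds the next pass
    return score
```

Note that only the scalar feedback changes between passes; the document and the filtered updates stay fixed, which is what lets the loop behave like a transition over prediction states.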

Why This Matters (The Math of Stability)

By feeding the output back into the input, the system behaves like a Markov Chain. In probability theory, a Markov Chain transitions from state to state until it reaches a “stationary distribution”: a stable point that further transitions no longer change.

The researchers provide a theoretical proof that this iterative process helps the model converge toward a stable, accurate representation. The prediction at step \(K\) can be viewed as a summation of previous transformations:

Equation showing the expansion of the iterative prediction process.

Assuming the neural network’s parameters behave well (specifically, that the spectral radius of the relevant weight matrix is less than 1, which is common in trained networks), this process converges mathematically:

Equation showing the limit of the prediction as t approaches infinity.

This implies that with enough iterations, the model naturally “settles” on the most mathematically consistent answer, bridging the gap between discrete text generation and continuous numerical rating.
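
As a toy illustration of the stability argument (an analogy, not the paper's actual dynamics): iterating an affine map whose weight matrix has spectral radius below 1 settles onto a unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W *= 0.5 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale so the spectral radius is 0.5
b = rng.standard_normal(4)

h = np.zeros(4)
for _ in range(60):
    h = W @ h + b                                  # one refinement step

fixed_point = np.linalg.solve(np.eye(4) - W, b)    # closed-form limit of the iteration
print(np.allclose(h, fixed_point))                 # True: the iterates have converged
```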

Training the Beast

To train this system, the researchers use Mean Squared Error (MSE) as the loss function. They compare the predicted score at each iteration against the ground truth.

\[
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \big( y_i^{(t)} - \hat{y}_i \big)^2
\]

where \(\hat{y}_i\) is the ground-truth rating for document \(i\).

Interestingly, even though the model iterates multiple times during testing, it is trained to minimize the error at every step, reinforcing the ability to correct itself.
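
In code, the per-step objective could be accumulated like the following sketch; averaging over the iterations is an assumption on my part, since the post only says the error is minimized at every step.

```python
import torch

def recurrent_mse_loss(predictions: list[torch.Tensor], target: torch.Tensor) -> torch.Tensor:
    """Mean squared error accumulated over every refinement step.

    `predictions` holds the score y_t from each of the K iterations and
    `target` is the ground-truth rating; penalizing every step (not just the
    last one) is what trains the model to correct itself at each pass.
    """
    return torch.stack([(y_t - target) ** 2 for y_t in predictions]).mean()


# Example: three refinement steps scored against a ground-truth rating of 0.5.
preds = [torch.tensor(0.30), torch.tensor(0.45), torch.tensor(0.52)]
loss = recurrent_mse_loss(preds, torch.tensor(0.50))
```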

Experiments and Results

Does this actually work? The researchers tested RAHA on three hierarchical datasets:

  1. DBLP: Computer science paper citation network.
  2. PubMed: Biomedical literature citation network.
  3. PatentsView: Patent citation network.

They compared RAHA against standard pre-trained models (SciBERT, RoBERTa) and massive LLMs (Llama3, GLM3).

Main Performance

The results, summarized in Table 1, show that RAHA consistently outperforms the baselines.

Table 1: Comparative results of various language models.

Key Takeaways from the Data:

  • LLMs > PLMs: Large models generally beat smaller Pre-trained Language Models (PLMs).
  • RAHA Boosts Everything: When RAHA is applied to any base model (e.g., Llama3-RAHA vs. Llama3), the performance improves significantly.
  • Ablation Studies: Removing “Hard Attention” hurts performance (proving filtering is necessary). Removing “Recurrent Alignment” also increases error (proving the loop works).

Visualizing the Iterative Improvement

The most compelling evidence for the Recurrent Alignment strategy comes from watching the error rates drop over time.

Figure 3: Comparison of predictions over multiple iterations during recurrent alignment across three datasets.

Look at Figure 3 (specifically graphs a, c, and e). The y-axis represents the Mean Absolute Error (MAE).

  • Initialization matters: When the model starts with “None” (graphs a, c, e), the error drops sharply after the first iteration and stabilizes. This confirms the model is learning to refine its guess.
  • Randomization hurts: When initialized with a random value (graphs b, d, f), the model struggles to converge effectively. This suggests that starting from a “blank slate” allows the model to build a logical reasoning path, whereas random values introduce bias that is hard to shake.

Convergence of Representation

Finally, the researchers looked at the “brain” of the model—the hidden representations. They measured the Kullback-Leibler (KL) divergence between the model’s current state and the “target” state (the state if the model knew the perfect answer).

Figure 4: A detailed analysis based on the Kullback-Leibler (KL) divergence over testing iterations across three datasets.

In Figure 4, the bars represent the difference (divergence) between the model’s thought process and the ideal thought process. Across all three datasets, this divergence shrinks as iterations progress. This empirical data backs up the theoretical Markov Chain claim: the model is literally “aligning” its internal representation with the truth, step by step.
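
For intuition, one way to compare two hidden states with KL divergence is to normalize each into a probability distribution first. The softmax normalization below is an assumption on my part, since the post does not spell out how the comparison is made.

```python
import torch
import torch.nn.functional as F

def hidden_state_kl(current: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """KL(target || current) between softmax-normalized hidden vectors."""
    log_p = F.log_softmax(current, dim=-1)       # current representation as log-probabilities
    q = F.softmax(target, dim=-1)                # "ideal" representation as probabilities
    return F.kl_div(log_p, q, reduction="sum")   # smaller means better aligned
```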

Broader Implications

While this paper focuses on hierarchical text (papers and patents), the researchers also tested RAHA on plain text datasets (ASAP and Splunk) and found it still performed well.

Table 2: The performance of various language models on two text rating datasets, ASAP and Splunk.

This suggests that Recurrent Alignment is a general-purpose technique that could improve LLM performance on many regression or rating tasks, not just those with tree structures.

Conclusion

The RAHA framework offers a sophisticated solution to the limitations of current LLMs in handling complex, structured evaluations. By combining Hard Attention to filter structural noise and Recurrent Alignment to iteratively refine predictions, it turns a standard LLM into a precise rating machine.

For students and researchers in NLP, this paper highlights two critical lessons:

  1. Structure matters: Treating all text as a flat sequence is suboptimal for real-world documents.
  2. Iterative reasoning works: Allowing a model to “re-think” its output based on its own previous guess creates a feedback loop that drives accuracy.

As we continue to push LLMs into scientific and analytical domains, techniques like RAHA will be essential for moving beyond simple text generation toward reliable, quantitative reasoning.