Beyond Chatbots: Unlocking the Hidden Classification Power of Large Language Models
When we think of Large Language Models (LLMs) like GPT-4 or Llama, we usually think of generation. We use them to write emails, debug code, or compose poetry. But there is a massive subset of Natural Language Processing (NLP) where generation takes a back seat to precision: Classification.
Can a model designed to chatter actually be a rigorous classifier?
This is the central question of a fascinating research paper titled “Are Large Language Models Good Classifiers?” The researchers dive deep into Edit Intent Classification (EIC)—a complex task that involves understanding why a writer changed a sentence.
In this post, we will break down their novel framework for turning generative models into discriminative classifiers, explore the architectures they tested, and look at how they used their best model to create a massive dataset of scientific revisions.

As shown in Figure 1 above, the study isn’t just about modeling; it’s a complete pipeline. The researchers (1) developed a framework to test LLMs as classifiers, (2) used the winning model to build a dataset called Re3-Sci2.0, and (3) analyzed human editing behavior in scientific papers.
Let’s dig in.
The Problem: Why Classification is Hard for LLMs
Historically, if you wanted to classify text (e.g., sentiment analysis), you used a model like BERT. You fine-tuned it, and it gave you a label.
With the rise of LLMs, the paradigm shifted. The standard way to use an LLM for classification is Generative (Gen): you give the model a prompt like “Is this sentence happy or sad?” and hope it generates the text string “Happy.”
However, this approach has flaws:
- Hallucination: The model might generate text that isn’t one of the allowed labels.
- Inefficiency: Generating text token-by-token is computationally slower than just outputting a probability vector.
- Prompt Sensitivity: Performance varies wildly based on how you word the instructions.
The researchers chose Edit Intent Classification (EIC) as their testing ground. EIC is notoriously difficult because the model must compare two versions of a sentence (Old vs. New) and decide if the change was for Grammar, Clarity, Fact/Evidence, Claim, or Other. It requires a deep understanding of nuance, making it the perfect stress test for LLMs.
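To make this concrete, here is a purely hypothetical example of what a single EIC data point might look like. The sentences and field names below are my own illustration; only the label set comes from the task description above.

```python
# Hypothetical EIC example: the classifier sees the old and new sentence
# and must predict one of the five intent labels.
example = {
    "old_sentence": "The results was significant in all three settings.",
    "new_sentence": "The results were significant in all three settings.",
    "label": "Grammar",  # alternatives: Clarity, Fact/Evidence, Claim, Other
}
```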
The Core Method: A Framework for Fine-Tuning
The heart of this paper is a systematic framework that moves beyond simple prompting. The researchers propose four distinct approaches to fine-tuning LLMs for classification.

Figure 2 outlines these four approaches. Let’s break down how each one works.
1. Approach Gen: The Standard Generative Route
Shown as (a) in the figure above, this is the “classic” LLM approach. You feed the model the instruction, the old sentence (\(S_o\)), and the new sentence (\(S_n\)). The model is fine-tuned to generate the label string (e.g., “Grammar”) as its output.
While intuitive, this method suffers from a low Answer Inclusion Rate (AIR): sometimes the model simply doesn’t output a valid label at all.
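As a rough sketch, the Gen setup could look like the following with Hugging Face transformers. The model name, prompt wording, and label-matching logic are illustrative assumptions rather than the paper's exact recipe, but the failure mode is visible: nothing forces the generated text to be one of the five labels.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-hf"  # illustrative choice
LABELS = ["Grammar", "Clarity", "Fact/Evidence", "Claim", "Other"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def classify_gen(old_sent, new_sent):
    # Prompt wording is an assumption for illustration only.
    prompt = (
        "Classify the intent of the edit as Grammar, Clarity, Fact/Evidence, "
        f"Claim, or Other.\nOld: {old_sent}\nNew: {new_sent}\nIntent:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5)
    completion = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # The generated string may not match any label -- this is exactly the
    # "answer inclusion" failure mode discussed above.
    for label in LABELS:
        if label.lower() in completion.lower():
            return label
    return None  # no valid label produced
```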
2. Approach SeqC: Sequence Classification
Shown as (b), this approach treats the LLM like an encoder (similar to BERT). Instead of asking the model to write a word, the researchers take the hidden state of the last token (often the end-of-sequence token).
They attach a simple linear classification layer to this embedding. The model projects the high-dimensional understanding of the text directly into a probability distribution over the labels. This completely removes the risk of the model “chatting” instead of classifying.
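Here is a minimal PyTorch sketch of the idea, assuming Hugging Face transformers and right-padded batches. Pooling on the last token and the single linear head follow the description above; the model name and input format are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SeqCClassifier(nn.Module):
    """Approach SeqC: the LLM as an encoder plus a linear classification head."""

    def __init__(self, model_name, num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Hidden state of the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.head(pooled)  # logits over the five edit intents

# Usage sketch:
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
# batch = tokenizer([f"Old: {s_o} New: {s_n}"], return_tensors="pt", padding=True)
# logits = SeqCClassifier("meta-llama/Llama-2-13b-hf")(batch["input_ids"], batch["attention_mask"])
```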
3. Approach SNet: The Siamese Network
Shown as (c), this architecture is designed specifically for comparing two inputs.
- The Old Sentence (\(S_o\)) goes through the LLM.
- The New Sentence (\(S_n\)) goes through the same LLM (conceptually a “twin” or Siamese network).
- We extract the embeddings for both.
- These two embeddings are combined using a Transformation Function (more on this below) before going to the classifier.
This separates the processing of the two sentences, forcing the model to understand them individually before comparing them.
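A sketch of the Siamese variant under the same assumptions as the SeqC example; using \(f_{n-diffABS}\) as the transformation function here is just one of the options described in the next section.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SiameseClassifier(nn.Module):
    """Approach SNet: one shared encoder applied to each sentence separately."""

    def __init__(self, model_name, num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared "twin" weights
        # Head size assumes the f_{n-diffABS} combination: [new; |new - old|].
        self.head = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def _embed(self, input_ids, attention_mask):
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1
        return states[torch.arange(states.size(0)), last_idx]

    def forward(self, old_inputs, new_inputs):
        o = self._embed(**old_inputs)  # embedding of the old sentence
        n = self._embed(**new_inputs)  # embedding of the new sentence
        u = torch.cat([n, (n - o).abs()], dim=-1)  # transformation function
        return self.head(u)
```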
4. Approach XNet: The Cross Network
Shown as (d), this is a hybrid. Both sentences are fed into a single LLM simultaneously (allowing the self-attention mechanism to look at both sentences at once). However, instead of generating text, the system extracts the embeddings for the specific end-tokens of the old and new sentences. These two embeddings are then combined via a transformation function.
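A sketch of the Cross variant; how the end-token positions of the two sentences are identified is an implementation detail I'm assuming here (for instance, explicit marker tokens appended after each sentence).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class CrossClassifier(nn.Module):
    """Approach XNet: both sentences in one pass, embeddings read at their end tokens."""

    def __init__(self, model_name, num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, old_end_pos, new_end_pos):
        # old_end_pos / new_end_pos: index of each sentence's end token inside
        # the joint sequence (assumed to be precomputed during tokenization).
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        batch = torch.arange(states.size(0))
        o = states[batch, old_end_pos]  # representation of the old sentence
        n = states[batch, new_end_pos]  # representation of the new sentence
        u = torch.cat([n, (n - o).abs()], dim=-1)  # transformation function
        return self.head(u)
```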
The Secret Sauce: Transformation Functions
For SNet and XNet, the model ends up with two vectors: \(o\) (representation of the old sentence) and \(n\) (representation of the new sentence). How do we combine them to find the “intent” of the edit?
The researchers proposed five mathematical functions to merge these vectors (\(u\) is the final combined vector).
1. Difference (\(f_{diff}\))
This captures the directional change vector, \(u = n - o\). If you think of sentences as points in space, this vector points from the old meaning to the new meaning.

2. Absolute Difference (\(f_{diffABS}\))
Sometimes the direction matters less than the magnitude of the change. Here \(u = |n - o|\) captures the element-wise “distance” between the sentences without direction.

3. New + Absolute Difference (\(f_{n-diffABS}\))
This concatenates (joins) the new sentence representation with the absolute difference, \(u = [n; |n - o|]\). This gives the classifier context about the final state of the sentence plus the magnitude of the change.

4. New + Old (\(f_{n-o}\))
A simple concatenation of both embeddings, \(u = [n; o]\).

5. All Combined (\(f_{n-diffABS-o}\))
This throws everything into the mix: the new sentence, the absolute difference, and the old sentence, i.e. \(u = [n; |n - o|; o]\).

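In code, all five are one-liners over the two embedding tensors; this sketch follows directly from the descriptions above, with concatenation implemented as torch.cat along the feature dimension.

```python
import torch

def f_diff(o, n):
    return n - o                                     # directional change vector

def f_diffABS(o, n):
    return (n - o).abs()                             # magnitude only, no direction

def f_n_diffABS(o, n):
    return torch.cat([n, (n - o).abs()], dim=-1)     # new state + magnitude of change

def f_n_o(o, n):
    return torch.cat([n, o], dim=-1)                 # plain concatenation

def f_n_diffABS_o(o, n):
    return torch.cat([n, (n - o).abs(), o], dim=-1)  # everything combined
```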
Experiments & Results: Who Won?
The researchers tested these approaches using Llama-2 (7B and 13B), Llama-3 (8B), Mistral, and older PLMs like RoBERTa and T5.
The Verdict on Approaches
The results were compelling. Approach SeqC (Sequence Classification) emerged as the winner.
Why? It turns out that LLMs are incredibly strong encoders. When you strip away the generation head and just use the internal representations with a linear classifier, you get:
- State-of-the-art Accuracy: It outperformed the Generative approach and traditional PLMs.
- 100% Reliability: Because it’s a classifier layer, it always outputs a valid label (Perfect Answer Inclusion Rate).
- Speed: Inference is significantly faster, since a single forward pass replaces token-by-token generation.

Figure 3 illustrates this trade-off perfectly using Llama2-13B. Look at the SeqC point:
- Performance (Blue): Highest among the group (around 85%).
- Efficiency (Red): Drastically higher than Gen. It processes many more samples per second because it doesn’t have to generate tokens auto-regressively.
- AIR (Yellow): A perfect 100%.
In contrast, look at Gen. It’s slow (negative efficiency score on this relative scale) and has lower performance.
The Verdict on Transformation Functions
For the Siamese and Cross networks, the Absolute Difference (\(f_{diffABS}\)) and New + Absolute Difference (\(f_{n-diffABS}\)) functions performed best. This confirms the intuition: to understand an edit, the classifier needs an explicit representation of the difference between the two embedding vectors.
Generalization
To prove this wasn’t a fluke, they tested the findings on five other datasets (like identifying duplicate questions or classifying emotions). The pattern held: LLMs fine-tuned via SeqC consistently achieved State-of-the-Art results, beating fully fine-tuned RoBERTa models.
Application: The Re3-Sci2.0 Dataset
Having identified the best model (Llama2-13B using SeqC), the researchers put it to work. They processed thousands of scientific papers to create Re3-Sci2.0, a dataset of 1,780 document revisions containing over 94,000 labeled edits.
This dataset allows us to peek into the minds of scientists. How do they edit their papers?
Where do scientists edit?

Figure 4 shows the distribution of edits across different sections of a paper (normalized from 0% to 100% of the document length).
- NLP papers (Top row, first column): Notice the spike at the very end? NLP researchers tend to heavily revise their Conclusions and Discussion sections, often changing Claims (Green) and Evidence (Yellow).
- Medical papers (Middle row): Edits are more distributed but show intensity in the “Results” sections (70-90% range).
What do they edit?

Figure 5 breaks down the types of edits.
- Clarification (Red) and Grammar (Blue) are the most common changes across all fields.
- Social Sciences (soc) and NLP differ in deletions. NLP authors delete facts/evidence frequently (perhaps removing outdated baselines?), while Social Science authors focus heavily on refining Claims (Green).
Success vs. Failure
Perhaps the most interesting finding is the correlation between editing and success. The researchers compared papers that improved their review scores vs. those that didn’t.
They found that successful revisions involve:
- Significantly more edits in total.
- A focus on Clarity and modifying Claims.
- Adding new Fact/Evidence.
Simply fixing grammar was not a statistically significant predictor of a paper’s success. Reviewers want clearer arguments and more evidence, not just better punctuation!
Conclusion
This paper challenges the assumption that LLMs are just for chatting. It demonstrates that when we treat LLMs as powerful encoders and wrap them in a classification framework (specifically SeqC), they outperform traditional methods in accuracy, reliability, and inference speed.
The study offers a clear takeaway for students and practitioners: If you need to classify complex text data, don’t just ask ChatGPT to generate a label. Fine-tuning an open-source LLM (like Llama-3 or Mistral) using a classification head is a far more robust strategy.
Furthermore, the release of the Re3-Sci2.0 dataset opens new doors for “Science of Science”—using NLP to understand how scientific knowledge evolves through the revision process.
This post summarizes the research by Ruan, Kuznetsov, and Gurevych. For the full mathematical details and experimental setup, refer to the original paper.