Language is a tricky beast. Consider these two sentences:

  1. She ran to the mountains.
  2. She ran in the mountains.

Syntactically, they look almost identical. They both follow a “Subject + Verb + Prepositional Phrase” structure. A basic parser might look at these and see the exact same tree: a noun, a verb, and a modifier.

But as a human reader, you know they mean fundamentally different things. The first sentence describes motion toward a goal; the prepositional phrase “to the mountains” is an argument required to complete the meaning of the movement. The second sentence describes an activity happening in a location; “in the mountains” just sets the scene.

This distinction is at the heart of Argument Structure Constructions (ASCs). While humans pick up on these nuances effortlessly, teaching computers to distinguish them—especially for evaluating language learners—is a massive challenge in Natural Language Processing (NLP).

In a fascinating paper titled “Leveraging pre-trained language models for linguistic analysis,” researchers Hakyung Sung and Kristopher Kyle investigate whether modern AI—specifically Pre-trained Language Models (PLMs) like RoBERTa and GPT-4—can finally crack this code.

The Problem: When Syntax Isn’t Enough

For years, researchers trying to model language learning have relied on automatic tools to parse sentences. The goal is often to assess “linguistic complexity”—how advanced a student’s writing is. However, traditional tools often fail when form and meaning don’t align perfectly.

As illustrated below, standard dependency parsers might label the structures identically, missing the semantic “flavor” of the clause.

Figure 1: Distinguishing semantic roles in similar dependency structures of two different types of ASCs, visualized by DisplaCy

In the top example of Figure 1, the construction implies a destination. In the bottom, it implies a location. This is where the concept of ASCs comes in. ASCs are patterns that carry meaning independent of the specific words used. For example, the “Caused-Motion” construction (e.g., She put the book on the table) always implies X causing Y to move to Z.
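
To see this flatness concretely, here is a minimal sketch that parses both sentences from the introduction with spaCy (the library behind the displaCy visualization in Figure 1), assuming the en_core_web_sm model is installed. Both parses attach the prepositional phrase to the verb in exactly the same way, with nothing marking “destination” versus “location.”

```python
# Minimal sketch: a surface-level dependency parser treats the two sentences
# identically. Assumes spaCy and the en_core_web_sm model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

for text in ["She ran to the mountains.", "She ran in the mountains."]:
    doc = nlp(text)
    print(text)
    # Print each token with its dependency label and head word.
    for token in doc:
        print(f"  {token.text:10} {token.dep_:10} -> {token.head.text}")
    print()

# spacy.displacy.render(doc, style="dep") would draw trees like those in
# Figure 1; the prepositional phrase hangs off "ran" in both cases.
```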

To model how humans (both native speakers and language learners) acquire language, we need tools that can identify these specific constructions reliably. The researchers posed a critical question: Can we use the “knowledge” stored inside massive AI models to identify these constructions better than before?

The Contenders: Encoder vs. Decoder

The study sets up a showdown between two different philosophies of using Large Language Models:

  1. Supervised Learning with Encoders (RoBERTa): Taking a model designed to understand context and fine-tuning it on a high-quality, human-annotated dataset.
  2. Prompting with Decoders (GPT-4): Asking a massive generative model to act as an annotator or a data generator, utilizing its “zero-shot” or “few-shot” capabilities.

The authors designed three distinct experiments to see which method reigns supreme.

Figure 2: Experiment overview

As shown in the experiment overview above, the study explores three paths:

  • Experiment 1: Human annotations are fed into RoBERTa to train a specialist model.
  • Experiment 2: GPT-4 is given prompts to directly tag new sentences.
  • Experiment 3: GPT-4 is asked to write sentences (generate data), which are then used to train RoBERTa.

Understanding the Target: What Are We Looking For?

Before diving into the results, it is helpful to understand exactly what the models were asked to find. The researchers used a dataset called the ASC Treebank, which contains sentences from native speakers (L1) and language learners (L2).

They focused on mapping specific semantic frames to ASC tags. Here is the breakdown of the constructions they were hunting for:

Table 1: ASC representations

For instance, if a model sees “I gave him the address,” it needs to label it as Ditransitive (DITRAN) because it follows the Agent-Verb-Recipient-Theme pattern. If it sees “Money may become tight,” it labels it Intransitive Resultative (INTRAN_RES).
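
As a concrete illustration, here is a toy sketch of what verb-level ASC labels look like for a few example sentences. The tag names follow the post (DITRAN, INTRAN_RES, CAUS_MOT), but the dictionary layout is an assumption made for illustration, not the actual ASC Treebank format.

```python
# Toy illustration of verb-level ASC labels, in the spirit of the ASC Treebank.
# The data layout here is an assumption for illustration only.
examples = [
    {"sentence": "I gave him the address.",
     "verb": "gave", "asc": "DITRAN"},        # Agent-Verb-Recipient-Theme
    {"sentence": "Money may become tight.",
     "verb": "become", "asc": "INTRAN_RES"},  # Intransitive resultative
    {"sentence": "She put the book on the table.",
     "verb": "put", "asc": "CAUS_MOT"},       # Caused-motion
]

for ex in examples:
    print(f'{ex["sentence"]!r:40} -> {ex["verb"]}: {ex["asc"]}')
```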

Experiment 1: The Specialist (RoBERTa)

In the first experiment, the researchers utilized RoBERTa, a transformer-based encoder model. Unlike GPT, which generates text, RoBERTa is exceptional at looking at a whole sentence at once and understanding the relationship between words.

They treated ASC identification as a Named Entity Recognition (NER) task. Just as an AI might be trained to highlight “New York” as a Location, RoBERTa was trained to highlight specific verbs and tag them with the correct ASC label (like CAUS_MOT or TRAN_S).

They trained the model using “Gold Standard” data—sentences painstakingly annotated by humans. They tested different combinations of training data, including native English (L1) and learner English (L2).
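
Below is a minimal sketch of what this NER-style setup looks like with Hugging Face transformers: RoBERTa with a token-classification head over a small set of ASC labels. The label subset, the example sentence, and the untrained head are illustrative assumptions; the authors fine-tuned the model on the Gold Standard annotations before evaluating it.

```python
# Sketch of treating ASC tagging as token classification (an NER-style task)
# with RoBERTa via Hugging Face transformers. The label set below is a subset
# of the paper's tags, used here only for illustration. The classification
# head is untrained, so predictions are meaningless until the model is
# fine-tuned on the human-annotated (Gold Standard) data.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "DITRAN", "INTRAN_RES", "CAUS_MOT", "TRAN_S"]
id2label = dict(enumerate(labels))
label2id = {lab: i for i, lab in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id
)

sentence = "She put the book on the table ."
inputs = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0]       # best label id per subword token
for tok, pred in zip(inputs.tokens(), predictions):
    print(f"{tok:12} {id2label[int(pred)]}")
```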

The Result: The model trained on combined L1 and L2 Gold data was a powerhouse. It achieved F1 scores (the harmonic mean of precision and recall) above 0.90 across the board. This shows that, with high-quality training data, encoder models like RoBERTa can learn to identify these subtle linguistic constructions with near-human precision.

Table 2: F1-scores across ASC types, models, and registers

Experiment 2: The Generalist (GPT-4 as Annotator)

Collecting “Gold Standard” data is expensive and slow. It requires trained linguists to read thousands of sentences. So, the researchers asked: Can we just ask GPT-4 to do the tagging?

They tested GPT-4 in three settings:

  1. Zero-shot: “Here is a sentence. Tag the ASCs. No examples provided.”
  2. 3-shot: “Here are 3 examples of how to do it. Now tag this sentence.”
  3. 10-shot: “Here are 10 examples. Now tag this sentence.”

Figure 3: Example of prompting GPT-4 to generate ASC labels in a zero-shot setting
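
Here is a hedged sketch of what such a zero-shot request could look like through the OpenAI Python client; the instruction text and the tag list are illustrative stand-ins, not the authors’ actual prompt from Figure 3.

```python
# Sketch of prompting GPT-4 to tag ASCs in a zero-shot setting via the OpenAI
# Python client (pip install openai; requires an API key in OPENAI_API_KEY).
# The instruction below is an illustrative stand-in for the authors' prompt,
# and the tag list is only the subset of Table 1 mentioned in the post.
from openai import OpenAI

client = OpenAI()

ZERO_SHOT_INSTRUCTIONS = (
    "You are a linguistic annotator. For the sentence below, identify each "
    "main verb and label it with one argument structure construction tag "
    "from this set: DITRAN, TRAN_S, CAUS_MOT, INTRAN_RES. "
    "Return verb:TAG pairs only."
)

sentence = "She ran to the mountains."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": ZERO_SHOT_INSTRUCTIONS},
        {"role": "user", "content": sentence},
    ],
    temperature=0,
)
print(response.choices[0].message.content)

# A 3-shot or 10-shot variant would simply prepend labelled example sentences
# to the user message before the target sentence.
```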

The Result: GPT-4 struggled.

  • Zero-shot performance was poor (F1 score of 0.434). It simply didn’t grasp the specific linguistic definitions required without guidance.
  • Few-shot helped significantly. Providing 10 examples bumped the score up to 0.631.

However, even with examples, GPT-4 could not touch the benchmark set by the supervised RoBERTa model (which sat comfortably at ~0.88-0.91). While GPT-4 is smart, it lacks the consistent, granular precision of a fine-tuned specialist model for this specific linguistic task.

Table 3: F1-scores for use of GPT-4 as an annotator

Experiment 3: The Synthetic Solution?

If GPT-4 isn’t accurate enough to be the judge, maybe it can be the writer?

In the final experiment, the researchers tried to solve the “data scarcity” problem. Instead of paying humans to annotate data, they asked GPT-4 to generate thousands of sentences containing specific ASCs. They then used this synthetic data to train RoBERTa.

They tested two scenarios:

  1. Training RoBERTa only on GPT-generated data.
  2. Augmenting real Human data with GPT-generated data.
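
A rough sketch of the generation step might look like the following; the prompt wording and the helper function are illustrative assumptions, not the authors’ actual pipeline.

```python
# Sketch of using GPT-4 as a data generator rather than an annotator: ask it to
# write sentences that instantiate a specific ASC, then collect them as
# synthetic training data for RoBERTa. The prompt and helper are assumptions
# made for illustration.
from openai import OpenAI

client = OpenAI()

def generate_asc_sentences(asc_tag: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for n sentences containing the given construction."""
    prompt = (
        f"Write {n} English sentences, one per line, each containing a verb "
        f"used in the {asc_tag} argument structure construction."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Collect synthetic examples for a few of the constructions from Table 1.
for tag in ["DITRAN", "CAUS_MOT", "INTRAN_RES"]:
    for sent in generate_asc_sentences(tag):
        print(tag, "\t", sent)
```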

The Result: Training on synthetic data alone yielded mediocre results. The model trained on 10-shot generated data achieved an F1 of 0.605—far lower than the 0.812 achieved by training on a (smaller) set of real human data.

Even more surprisingly, augmenting the human data with synthetic data didn’t really help. As shown in the table below, adding GPT data (the “+3-shot” and “+10-shot” columns) often resulted in lower or equal scores compared to just using the gold data alone (the “gold1” column). The biggest jump in performance came from simply adding more human data (“+gold2”).

Table 5: Comparison of F1-scores for ASC tagging using different training sets

Why did the synthetic data fail?

This is perhaps the most interesting finding of the paper. You might assume that data is data. If GPT-4 writes a transitive sentence, it should be good for training, right?

The problem lies in linguistic complexity. The researchers analyzed the sentences written by GPT-4 and compared them to the human sentences from the web corpus.

Human vs. GPT-4 generated Sentences

Look at the difference in the image above.

  • GPT-4 (Right): “He threw the ball to his dog.” “The bird flew out of the cage.”
  • Human (Left): “I doubt the very few who actually read my blog have not come across this yet…”

GPT-4’s sentences were grammatically correct but structurally simple. They lacked the messiness, the clauses, and the “real-world” noise of natural human language. When RoBERTa was trained on the clean, simple GPT sentences, it failed to learn how to handle the complex, messy sentences found in the real test set.
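
As a rough illustration of the kind of comparison involved, the sketch below computes two simple complexity proxies (token count and number of embedded clauses, via spaCy) for one GPT-4 sentence and one human sentence quoted above. These proxies are assumptions chosen for illustration, not necessarily the measures used in the paper.

```python
# Rough sketch of comparing surface complexity of human vs. GPT-generated
# sentences. Token count and clausal-dependent count are simple proxies
# chosen for illustration only.
import spacy

nlp = spacy.load("en_core_web_sm")
CLAUSE_DEPS = {"ccomp", "xcomp", "advcl", "acl", "relcl", "csubj"}

def complexity(text: str) -> tuple[int, int]:
    doc = nlp(text)
    n_tokens = sum(not t.is_punct for t in doc)
    n_clauses = sum(t.dep_ in CLAUSE_DEPS for t in doc)
    return n_tokens, n_clauses

gpt_sentence = "He threw the ball to his dog."
human_sentence = ("I doubt the very few who actually read my blog "
                  "have not come across this yet.")

for label, sent in [("GPT-4", gpt_sentence), ("Human", human_sentence)]:
    tokens, clauses = complexity(sent)
    print(f"{label:6} tokens={tokens:2} embedded clauses={clauses}")
```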

Conclusion: The Value of Human Annotation

This research tells a compelling story about the current state of NLP in linguistics.

  1. RoBERTa is highly effective: When provided with high-quality human data, encoder models can reliably identify complex argument structures, solving the ambiguity problem that plagued earlier parsers.
  2. There is no substitute for Gold Data: While Large Language Models like GPT-4 are impressive, they cannot yet replace human-annotated datasets for high-precision linguistic tasks. They struggle to replicate the complexity of natural language when generating training data, leading to the “garbage in, garbage out” (or rather, “simple in, simple out”) problem.

For students and researchers in applied linguistics, the takeaway is clear: Don’t throw away your manual annotation protocols just yet. Automated tools are becoming powerful allies for analyzing learner language and assessing proficiency, but they still rely heavily on the ground truth established by human experts.

As AI models continue to evolve, we may see better synthetic data generation or improved zero-shot capabilities. But for now, the “Gold Standard” remains the gold standard.