Introduction

We live in an era where Large Language Models (LLMs) like GPT-4, LLaMA, and Mistral are passing bar exams, working through complex mathematical proofs, and writing code. We judge them based on “leaderboards”—massive lists of benchmarks that test their reasoning capabilities, world knowledge, and problem-solving skills.

But there is a fundamental question that often gets lost in the excitement over these high-level cognitive tasks: How good are these models at the basic mechanics of language?

Can a model that explains quantum physics actually construct a sentence with exactly three verbs if you ask it to? Can it control the depth of a syntactic tree? These might sound like simple questions, but they probe the deep linguistic proficiency of these AI systems.

A recent paper titled “Evaluating Large Language Models via Linguistic Profiling” by researchers from the ItaliaNLP Lab takes a step back from the task-oriented hype. Instead of asking models to solve riddles, they ask them to perform specific linguistic gymnastics. The results offer a fascinating look under the hood of how LLMs generate text, revealing that while they are powerful, they struggle significantly with precise structural control.

Figure 1: Illustrated examples of the evaluation methodology. An LLM is prompted to generate a sentence while adhering to a targeted linguistic constraint.

As shown in the figure above, the core problem is simple: If you ask an LLM to “Generate a sentence with 3 verbs,” does it actually do it? Or does it get distracted by the content and fail the structural constraint?

Background: From Profiling Humans to Profiling Machines

To understand this research, we need to understand Linguistic Profiling. Traditionally, this is a technique used in computational linguistics to analyze texts written by humans. By counting features—like the number of adjectives, the length of clauses, or the complexity of the syntax—linguists can determine authorship, identify a writer’s native language, or assess the complexity of a text.
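
To make the idea concrete, here is a minimal sketch of a linguistic profile computed with spaCy. The tool choice and the exact feature set are assumptions made for illustration; they are not the authors' profiling pipeline.

```python
# A toy linguistic profile: count POS categories and tokens in a single text.
# spaCy and the feature names here are illustrative assumptions, not the paper's tooling.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

CONTENT = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}     # content-word POS tags
FUNCTIONAL = {"AUX", "CCONJ", "SCONJ", "DET", "ADP"}  # functional-word POS tags

def profile(text: str) -> dict:
    """Return simple counts of POS categories and non-punctuation tokens."""
    doc = nlp(text)
    pos = Counter(tok.pos_ for tok in doc if not tok.is_punct)
    features = {tag: pos[tag] for tag in sorted(CONTENT | FUNCTIONAL)}
    features["n_tokens"] = sum(1 for t in doc if not t.is_punct)
    return features

print(profile("Although it rained, the keeper of the old lighthouse calmly lit the lamp."))
```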

The authors of this paper flipped the script. Instead of profiling existing text, they used this concept to test the generation capabilities of LLMs.

The Gap in Current Evaluations

Most current evaluations (like the Open LLM Leaderboard) focus on what the model says (the answer). This study focuses on how the model constructs language.

There is a sub-field called “Controllable Text Generation” (CTG), where researchers try to make models generate text with a specific sentiment (e.g., “write a happy review”) or style. However, very few studies have rigorously tested whether models can handle strict morpho-syntactic (grammar and word type) and syntactic (sentence structure) constraints.

The hypothesis driving this work is intriguing: Just because a model has “learned” grammar implicitly during training doesn’t guarantee it can explicitly manipulate those grammatical rules on command.

The Core Method: Stress-Testing Syntax

The researchers devised a comprehensive methodology to profile five popular open-source LLMs: Gemma (2B and 7B), LLaMA-2 (7B and 13B), and Mistral (7B).

The approach was systematic. They defined a set of linguistic properties, selected specific target values for those properties, and then prompted the models to generate sentences matching those constraints.

1. The Linguistic Constraints

The team chose 20 specific properties to test, divided into two main categories:

Morpho-syntactic Properties (Word Level) These constraints deal with the types of words used (Part-of-Speech or POS tags).

  • Content Words: Nouns, Verbs, Adjectives, Adverbs, Proper Nouns.
  • Functional Words: Auxiliaries, Conjunctions, Determiners, Prepositions (ADP).

Syntactic Properties (Sentence Structure Level) These are much harder. They deal with how words relate to each other in a sentence structure (a dependency tree); the first two measures below are sketched in code right after this list.

  • Tree Depth (max_depth): How deep is the syntactic tree? (A measure of sentence complexity).
  • Link Length (max_link): The distance between related words (e.g., a subject and its verb separated by a long clause).
  • Word Order: Controlling pre-verbal subjects or post-verbal objects.
  • Subordination: How many dependent clauses (like “although it rained”) are used, and where they are placed.
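
The two tree-level measures above (max_depth and max_link) can be read directly off a dependency parse. Below is a minimal sketch using spaCy; the precise definitions the authors use may differ in detail, so treat this as an approximation of the idea rather than their implementation.

```python
# Approximate the two syntactic measures from a spaCy dependency parse.
# spaCy and these exact definitions are assumptions for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")

def max_depth(doc) -> int:
    """Length of the longest root-to-token path in the dependency tree."""
    def depth(tok):
        d = 0
        while tok.head is not tok:   # spaCy marks the root as its own head
            tok, d = tok.head, d + 1
        return d
    return max((depth(t) for t in doc), default=0)

def max_link(doc) -> int:
    """Longest linear distance (in token positions) between a word and its head."""
    return max((abs(t.i - t.head.i) for t in doc if t.head is not t), default=0)

doc = nlp("The report that the committee published last year was finally discussed.")
print(max_depth(doc), max_link(doc))
```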

2. The Prompting Strategy

The researchers didn’t use vague requests. They used rigid prompt templates to ensure every model faced the exact same challenge.

Table 4: Prompts used for the generation of sentences with the LLMs.

As you can see in Table 4, the prompts are direct: “Generate a sentence with [value] [property].”
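
In code, such a template is trivial to instantiate. The wording below mirrors the pattern quoted above, but the authors' exact prompt strings are the ones listed in their Table 4.

```python
# Fill the rigid template "Generate a sentence with [value] [property]."
def build_prompt(value: int, prop: str) -> str:
    return f"Generate a sentence with {value} {prop}."

print(build_prompt(3, "verbs"))          # Generate a sentence with 3 verbs.
print(build_prompt(5, "prepositions"))   # Generate a sentence with 5 prepositions.
```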

3. Selecting Authentic Values

You can’t just ask a model for a sentence with 50 verbs—that’s not natural language. To keep the test fair, the researchers analyzed the Universal Dependencies English Web Treebank (EWT), a gold-standard collection of real English sentences. They filtered for sentences between 5 and 40 words and extracted realistic ranges for these properties.

Table 5: The sets of property values used for the experiments.

Table 5 shows the increasing difficulty levels. For example, for Verbs, they asked models to generate sentences with 0, 1, 3, 5, or 7 verbs. This creates a “ladder” of difficulty to see if the model can handle increasing complexity.
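
Extracting such value ladders from a treebank is straightforward. Here is a sketch using the conllu library; the library choice and the local file path are assumptions, but the 5-40 word filter follows the description above.

```python
# Collect realistic per-sentence verb counts from a UD treebank in CoNLL-U format.
from collections import Counter
import conllu  # third-party CoNLL-U parser (illustrative choice)

verb_counts = []
with open("en_ewt-ud-train.conllu", encoding="utf-8") as f:  # hypothetical local path
    for sent in conllu.parse_incr(f):
        words = [tok for tok in sent if isinstance(tok["id"], int)]  # skip multiword ranges
        if not 5 <= len(words) <= 40:      # keep only sentences of 5-40 words
            continue
        verb_counts.append(sum(1 for tok in words if tok["upos"] == "VERB"))

# Inspect the empirical distribution to pick a plausible ladder of target values.
print(Counter(verb_counts).most_common(10))
```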

4. Zero-Shot vs. Few-Shot

The experiment was run in two modes:

  1. Zero-Shot: The model is just given the instruction (e.g., “Generate a sentence with 2 verbs”).
  2. Few-Shot (5-shot): The model is given 5 examples of sentences that meet the criteria before being asked to generate a new one. This tests whether the model can learn the pattern “in-context”; a sketch of both prompt styles follows this list.
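
The sketch below shows how the two prompt styles might be assembled. The demonstration layout is an assumption made for illustration; the paper specifies its own few-shot format.

```python
# Assemble a 0-shot and a 5-shot prompt for the same constraint.
def build_prompt(value: int, prop: str) -> str:
    return f"Generate a sentence with {value} {prop}."

def few_shot_prompt(value: int, prop: str, examples: list[str]) -> str:
    """Prepend five demonstrations: the instruction plus a sentence satisfying it."""
    demos = "\n".join(f"{build_prompt(value, prop)}\n{ex}" for ex in examples[:5])
    return f"{demos}\n{build_prompt(value, prop)}"

zero_shot = build_prompt(2, "verbs")
five_shot = few_shot_prompt(2, "verbs", ["She sang while he listened."] * 5)  # toy examples
print(zero_shot)
print(five_shot)
```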

Experiments & Results

So, how did the models perform? The researchers used two different metrics to grade the models (a short worked example of both follows this list):

  1. Success Rate (SR): Did the model get the exact number right? (Pass/Fail).
  2. Spearman Correlation (\(\rho\)): Did the model follow the trend? (i.e., if asked for more adjectives, did it produce more, even if the exact count was wrong?)
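
A tiny worked example makes the difference between the two metrics clear. The numbers below are invented, but they reproduce the pattern the paper reports: exact matches are rare while the trend is followed.

```python
# Success Rate vs. Spearman correlation on invented target/observed counts.
from scipy.stats import spearmanr

def success_rate(targets, observed):
    """Fraction of generations whose measured count exactly matches the request."""
    return sum(t == o for t, o in zip(targets, observed)) / len(targets)

targets  = [0, 1, 3, 5, 7]   # e.g. requested number of verbs
observed = [0, 2, 4, 6, 8]   # counts measured in the generated sentences (toy values)

print("Success rate:", success_rate(targets, observed))  # 0.2 -> only one exact hit
rho, _ = spearmanr(targets, observed)
print("Spearman rho:", rho)                               # 1.0 -> the trend is perfect
```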

Insight 1: Exact Precision is Hard

If you want an LLM to write a sentence with exactly 5 prepositions, you might be disappointed. The Success Rate was generally low across the board.

Figure 2: Success rate (%) for each linguistic property and each model in the 0- and 5-shot scenarios.

Figure 2 visualizes these success rates. Here are the key takeaways:

  • Mistral (Purple bars) is generally the superior model. Despite having only 7 billion parameters, it often outperformed the larger LLaMA-13B model.
  • Zero-Shot (Left panel): Most models struggle. Look at the low bars for syntactic features like max_depth and max_link. Models find it incredibly difficult to plan the geometry of a sentence tree in advance.
  • Few-Shot (Right panel): Performance generally improves when models are given examples. However, curiously, Mistral’s performance actually dropped in some few-shot scenarios, suggesting that extra context might sometimes confuse highly optimized smaller models or make them overfit to the specific examples provided.

Insight 2: Models Understand “More” and “Less”

While they failed the strict pass/fail test, the models showed high Spearman correlation.

This means that if you ask the model for “Value 1,” then “Value 3,” then “Value 5,” the model successfully increases the quantity of that feature, even if the absolute numbers are off (e.g., it might give you 2, 4, and 6).

  • Morpho-syntax is easier: Models are very good at adding more nouns, adjectives, or adverbs when asked.
  • Syntax is distinct: The correlation for syntactic depth (max_depth) was much lower. This confirms that categorical knowledge (what is a noun?) is easier for LLMs than relational knowledge (how do these clauses nest?).

Insight 3: The Ripple Effect (How Constraints Shape Sentences)

Language is interconnected. You usually can’t add more nouns without also adding more adjectives to describe them, or verbs to give them actions. The researchers analyzed how constraining one property affected all the others.

Figure 3: Correlation matrices between controlled and predicted values.

Figure 3 is a heatmap of these relationships. The Y-axis represents what the model was asked to control, and the X-axis is what actually appeared in the sentence.

  • The Diagonal: The dark red squares on the diagonal show that models generally increased the feature they were asked to increase.
  • Sentence Length (\(n\_tokens\)): Look at the far-right column of the heatmaps. It is almost entirely red. This indicates a strong positive correlation between almost any linguistic constraint and sentence length. If you ask for more of anything—verbs, depth, links—the model almost always solves the problem by just writing a longer sentence.

However, in the Few-Shot (bottom row) scenarios, notice how the “heat” (redness) changes. The models begin to specialize. For example, when Mistral is constrained to increase subordinating conjunctions (SCONJ), it correctly increases subordinate clauses (subord_prop), showing it understands the grammatical relationship between a conjunction and a clause.
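
One row of such a matrix can be reproduced with a few lines of code. The numbers below are invented toy values, but they illustrate the mechanics: fix the controlled property (here, verbs) and correlate the requested values with every property measured in the output.

```python
# One row of the controlled-vs-observed correlation matrix (toy data).
from scipy.stats import spearmanr

requested_verbs = [0, 1, 3, 5, 7]   # values the model was asked to produce
measured = {                        # profiles of the five generated sentences (invented)
    "VERB":      [1, 2, 4, 6, 8],
    "SCONJ":     [0, 0, 1, 2, 3],
    "max_depth": [3, 4, 6, 7, 9],
    "n_tokens":  [8, 11, 17, 23, 30],
}

row = {prop: spearmanr(requested_verbs, vals)[0] for prop, vals in measured.items()}
print(row)  # all values near 1.0: asking for more verbs drags every other property up
```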

Insight 4: Naturalness vs. Artificiality

Are these generated sentences “normal”? To check this, the researchers compared the statistical profiles of the AI-generated sentences against the gold-standard English text (EWT Treebank).

Figure 4: Correlation matrix of the EWT Treebank.

Figure 4 shows the “natural” correlations in English. For example, in real English, as sentences get longer (\(n\_tokens\)), the tree depth (max_depth) naturally increases (dark blue at the bottom right).

The researchers calculated the “Cosine Distance” between the AI’s patterns and this natural human pattern.
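
Conceptually, the comparison flattens each correlation matrix into a vector and measures the cosine distance between them (lower means the model's co-variation pattern looks more like the treebank's). Here is a minimal sketch with invented 2x2 matrices, which may differ from the authors' exact aggregation.

```python
# Cosine distance between two (toy) correlation matrices.
import numpy as np
from scipy.spatial.distance import cosine

ewt_corr   = np.array([[1.0, 0.8],
                       [0.8, 1.0]])   # "natural" correlations measured on EWT
model_corr = np.array([[1.0, 0.5],
                       [0.5, 1.0]])   # correlations measured on a model's generations

distance = cosine(ewt_corr.ravel(), model_corr.ravel())
print(f"cosine distance: {distance:.3f}")  # 0.0 would mean identical patterns
```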

Table 3: Average cosine distances between the correlation matrix of EWT and predicted correlation matrices for each model.

Table 3 reveals a critical finding: Few-shot prompting makes sentences more natural. The distance scores (where lower is better) drop significantly from 0-shot to 5-shot. By seeing just five examples of real English sentences, the models adjusted their generation strategies to produce text that statistically resembles human language much more closely.

Conclusion & Implications

This paper provides a reality check for the capabilities of Large Language Models. While they can write poetry and code, their ability to adhere to strict, low-level linguistic constraints is still imperfect.

Here are the main takeaways for students and practitioners:

  1. Size isn’t everything: The Mistral 7B model consistently outperformed the larger LLaMA 13B. This suggests that architecture and training data quality matter more than raw parameter count for linguistic precision.
  2. Use specific metrics for specific goals:
  • If you need a model to follow a strict rule (e.g., “Write a Haiku” or “Summarize in exactly 10 words”), current LLMs might struggle. You should measure this with Success Rate.
  • If you want to control the style or complexity (e.g., “Make this text simpler” or “Make this more descriptive”), LLMs are excellent at following the trend. You should measure this with Correlation.
  3. Prompting matters: Providing examples (few-shot) doesn’t just help the model get the right answer; it helps the model generate language that is statistically more “natural” and human-like.

As we move forward, evaluating LLMs shouldn’t just be about whether they get the right answer on a math test. It should also be about whether they truly command the building blocks of language. This “Linguistic Profiling” approach offers a robust new way to test the next generation of AI.