Introduction
We are currently living in the “golden age” of Large Language Models (LLMs). From drafting emails to generating code snippets, models like GPT-4 and Llama-2 have integrated themselves into our daily workflows. When we benchmark these models, however, we often treat them like search engines: we ask a single question, get a single answer, and grade the result.
But is this how we actually use AI?
In the real world, interaction is rarely a one-shot event. We chat. We ask for revisions. We change the topic slightly, then circle back to an earlier point. We ask the model to “remember what I said three messages ago.” This is the domain of multi-turn interaction, and it is a significantly harder challenge for an AI than answering a standalone query.
While benchmarks like MMLU (Massive Multitask Language Understanding) tell us how much knowledge a model has, they don’t tell us if the model can hold a coherent conversation without losing the plot. To address this blind spot, a team of researchers from The Chinese University of Hong Kong and Huawei Noah’s Ark Lab introduced MT-Eval.
This paper proposes a comprehensive benchmark specifically designed to break down and evaluate the multi-turn capabilities of LLMs. In this post, we will dissect their methodology, explore the four types of conversational patterns they identified, and look at the sobering results: most models, even powerful ones, struggle significantly when the conversation keeps going.
The Context: Why Multi-Turn is Hard
Before diving into the specific methodology of MT-Eval, it is helpful to understand why multi-turn conversations are computationally and linguistically difficult for LLMs.
When a model processes a single prompt, it only has to attend to the immediate instructions. In a multi-turn conversation, the “context window”—the amount of text the model has to consider—grows with every exchange. The model must:
- Retain History: Remember facts stated at the very beginning of the chat.
- Ignore Irrelevance: Sift through previous turns to find what is relevant to the current query.
- Maintain Consistency: Ensure the new answer doesn’t contradict a previous one.
- Adapt to Change: Handle instructions that modify or refine previous constraints.
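To make the growing-context point concrete, here is a minimal sketch (plain Python, not tied to any particular model or to MT-Eval itself) of how a chat history accumulates: every new turn is appended to the same message list, and the model must re-read all of it on every call.

```python
# Minimal sketch of an accumulating chat history. It only illustrates
# why later turns carry more and more context for the model to attend to.

history = []  # list of {"role": ..., "content": ...} messages

def add_turn(user_msg: str, assistant_msg: str) -> None:
    """Append one user/assistant exchange to the shared history."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})

add_turn("Start every response with the letter 'C'.", "Certainly, understood.")
add_turn("What is the capital of France?", "Capital of France: Paris.")

# On turn 3 the model has to attend to everything accumulated so far.
approx_words = sum(len(m["content"].split()) for m in history)
print(f"{len(history)} messages, roughly {approx_words} words of context")
```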
Existing benchmarks like MT-Bench attempt to measure this, but they are often limited to just two turns (a question and a follow-up). MT-Eval expands this horizon significantly, testing models over longer sessions to see where they break.
MT-Eval: The Methodology
The core contribution of this paper is the taxonomy of multi-turn interactions. The researchers analyzed real-world user data (from the LMSYS-Chat-1M dataset) and categorized human-AI interactions into four distinct patterns.

As illustrated in Figure 1 above, the benchmark is structured around these four pillars. Let’s explore each one in detail.
1. Recollection
The Challenge: Can the model remember a rule set at the start of the conversation?
In this task, the user gives a global instruction in the very first turn. For example, “Start every response with the letter ‘C’” or “Don’t use any commas.” The conversation then proceeds with unrelated questions (distractors). The model fails if it answers the question correctly but forgets the formatting rule it agreed to ten turns ago. This tests long-term memory and instruction adherence over time.
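As a rough illustration (not the benchmark's actual scoring, which relies on a GPT-4 judge), a Recollection failure can be pictured as a simple per-turn rule check:

```python
# Hypothetical check for the global rule "start every response with 'C'".
# MT-Eval scores responses with a GPT-4 judge; this is only a toy stand-in.

def follows_rule(response: str) -> bool:
    """True if the response begins with the letter 'C'."""
    return response.lstrip().upper().startswith("C")

responses = [
    "Certainly, here is a short answer.",     # early turn: rule respected
    "Canals in Venice date back centuries.",  # mid-conversation: still respected
    "The answer is 42.",                      # late turn: rule forgotten
]

for turn, resp in enumerate(responses, start=1):
    print(f"turn {turn}: {'OK' if follows_rule(resp) else 'FORGOT RULE'}")
```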
2. Expansion
The Challenge: Can the model discuss a single topic from multiple angles?
Here, the user stays on one main subject (e.g., “The Hobbit”) but asks for various types of information—summaries, character details, or related facts. The model needs to understand that the context remains the same without the user explicitly repeating the book title in every prompt. It tests the model’s ability to maintain a topical “state.”
3. Refinement
The Challenge: Can the model handle increasingly complex constraints?
This mimics a very common workflow: iterating on a draft.
- Turn 1: “Write a summary of this text.”
- Turn 2: “Make it JSON format.”
- Turn 3: “Remove all adjectives.”
Each turn adds a new constraint or modifies an old one. The model must mentally stack these instructions. If it focuses only on the latest instruction (“Remove adjectives”) but forgets the previous one (“Make it JSON”), it fails. This measures the ability to manipulate context dynamically.
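One way to picture this stacking, under the assumption that each constraint can be reduced to a simple predicate (the real benchmark uses a GPT-4 judge instead), is as a growing list of checks that every later response must pass in full:

```python
import json

# Each turn contributes one predicate; a response only passes if it
# satisfies every constraint accumulated so far. The "no commas" check
# is a hypothetical substitute for "remove all adjectives", which is
# hard to verify with a simple rule.

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

constraints = [
    is_json,                  # turn 2: "Make it JSON format."
    lambda t: "," not in t,   # turn 3 (hypothetical): "Don't use commas."
]

response = '{"summary": "A hobbit leaves home and finds a ring."}'
print(all(check(response) for check in constraints))  # True only if all hold
```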
4. Follow-up
The Challenge: Can the model answer questions that depend on its own previous output?
In a Follow-up task, the user asks a question like “Why did you say that?” or “Tell me more about the second person you mentioned.” These queries are impossible to answer without understanding the model’s previous generation. This tests conversational coherence and self-reference.
Constructing the Benchmark
To ensure the benchmark was robust and did not suffer from data leakage (where models have already seen the test data during training), the authors constructed new queries using a hybrid approach. They used GPT-4 to generate synthetic tasks and documents—ensuring novelty—and then subjected them to human review.

As shown in Table 1, the resulting dataset is substantial. It includes 1,170 turns across 168 dialogue sessions. The average prompt length is quite high (over 700 words), reflecting the complexity of the documents models are asked to process.
Experiments and Results
The researchers evaluated 11 popular Large Language Models. These included:
- Closed-Source: GPT-3.5-Turbo, GPT-4.
- Open-Source: Llama-2-chat (7B, 13B), Vicuna-v1.5, ChatGLM3, Qwen-chat, Mistral-Instruct, and Mixtral-Instruct.
They used GPT-4 as a judge to score responses on a scale of 1-10, a method that has been shown to correlate highly with human evaluation.
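A minimal sketch of that judging setup is shown below. The prompt wording is a paraphrase rather than the paper's exact template, and the snippet assumes the openai Python client with an OPENAI_API_KEY set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrased judge instruction; the actual template is given in the paper.
JUDGE_TEMPLATE = (
    "You are evaluating an AI assistant's reply in a multi-turn dialogue.\n"
    "Conversation so far:\n{history}\n\n"
    "Assistant's latest reply:\n{reply}\n\n"
    "Rate the reply from 1 to 10 for correctness, coherence, and adherence "
    "to earlier instructions. Respond with a single integer."
)

def judge(history: str, reply: str) -> int:
    """Ask GPT-4 to score one response given the conversation so far."""
    prompt = JUDGE_TEMPLATE.format(history=history, reply=reply)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```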
The Leaderboard
The overall results provide a snapshot of the current LLM landscape regarding conversational ability.

Table 2 reveals several key insights:
- GPT-4 Dominance: Unsurprisingly, GPT-4 holds the crown with an average score of 9.03. It is the only model to consistently score high across all categories, particularly in Recollection (9.61), where other models struggle to remember instructions.
- The Rise of Open Source: While closed models generally lead, models like Mixtral-Instruct-8x7B and Mistral-Instruct-7B are putting up a serious fight. In the Follow-up task, Mixtral actually scored 9.52, outperforming GPT-3.5-Turbo.
- The “Recollection” Bottleneck: Look at the scores in the “Recollection” column. While GPT-4 scores a 9.61, otherwise capable models like ChatGLM3 and Llama-2-chat drop to the 2.9–3.8 range. This indicates a systemic failure in many models to hold onto constraints over long conversations.
The Performance Gap: Single vs. Multi-Turn
The most critical contribution of this paper is the comparison between single-turn and multi-turn performance. The researchers created single-turn versions of the queries to see how much performance degrades simply because a conversation is happening.

Table 3 (above) tells a concerning story. The numbers in brackets show the performance drop.
- Llama-2-chat-13B drops by over 2 full points when moving from single to multi-turn.
- ChatGLM3 and Mixtral also see significant degradation.
- GPT-4 is the most robust, losing only 0.33 points.
This shows that a model can be excellent at answering a question in isolation (single-turn, ST) yet fall apart when that same question appears as part of a conversation history (multi-turn, MT). This “gap” is a metric of conversational fragility.
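The gap itself is just the difference between the two scores. The numbers below are placeholders chosen to match the drops quoted above, not values copied from the paper's tables:

```python
# "Conversational fragility" as the single-turn (ST) minus multi-turn (MT) score.
# Placeholder scores, consistent with the reported drops but not exact.
scores = {
    "GPT-4":            {"ST": 9.36, "MT": 9.03},  # ~0.33-point drop
    "Llama-2-chat-13B": {"ST": 8.40, "MT": 6.30},  # >2-point drop
}

for model, s in scores.items():
    gap = s["ST"] - s["MT"]
    print(f"{model}: ST={s['ST']:.2f}  MT={s['MT']:.2f}  gap={gap:.2f}")
```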
Why Do Models Fail?
The paper digs into why this degradation happens. The analysis points to several contributing factors, chief among them distance (forgetting) and error propagation.
1. The Distance Problem (Forgetting)
As a conversation progresses, the distance between the original instruction (e.g., “start sentences with C”) and the current turn increases.

Figure 3 illustrates how long models can “hold on” to an instruction.
- GPT-4 (Green bars): Remains consistent across almost all turns.
- Open-Source Models: Often fail immediately or after just a few turns.
For example, in the “json_format” task, weaker models might provide JSON in the first turn but revert to plain text by turn 3. They simply “forget” the constraint as new tokens flood their context window.
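A toy version of this “json_format” reversion, with made-up outputs, looks like the following per-turn validity check:

```python
import json

# Made-up per-turn outputs illustrating a model that drops the
# "always answer in JSON" constraint after a couple of turns.
turn_outputs = [
    '{"answer": "Paris"}',                     # turn 1: still JSON
    '{"answer": "roughly 2.1 million"}',       # turn 2: still JSON
    "The population is roughly 2.1 million.",  # turn 3: reverted to prose
]

for turn, out in enumerate(turn_outputs, start=1):
    try:
        json.loads(out)
        print(f"turn {turn}: valid JSON")
    except json.JSONDecodeError:
        print(f"turn {turn}: constraint dropped")
```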
2. The Refinement Complexity
In the Refinement task, instructions pile up. The model has to juggle the current request plus all valid previous requests.

Figure 2 shows the performance trajectory over turns in the Refinement task. You can see a general downward trend for almost every model. As the stack of constraints grows, the models struggle to satisfy all of them simultaneously. (The jump at turn 7 occurs because the task resets to a new topic, clearing the accumulated difficulty).
3. Error Propagation (The Snowball Effect)
One of the most fascinating findings is Error Propagation. In a multi-turn conversation, the model’s input for Turn 3 includes its own response from Turn 2. If the response in Turn 2 was wrong, Turn 3 is now conditioned on bad data.
The researchers tested this by feeding the models “Gold” context (manually corrected history) versus “Predicted” context (the model’s own previous history).
- Result: Models performed significantly better when provided with the “Gold” history.
- Implication: A large part of multi-turn failure isn’t just about understanding the current query; it’s about being misled by previous mistakes. Once a model hallucinates or makes an error, it tends to double down on that error in subsequent turns.
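In sketch form, the two conditions differ only in which answers get interleaved into the history the model sees at each turn (the function below is illustrative, not the authors' code):

```python
# "gold": condition each turn on reference answers from the benchmark.
# "predicted": condition each turn on the model's own earlier outputs.

def build_history(queries, gold_answers, model_answers, mode="predicted"):
    """Interleave user queries with either reference or model answers."""
    answers = gold_answers if mode == "gold" else model_answers
    history = []
    for query, answer in zip(queries, answers):
        history.append({"role": "user", "content": query})
        history.append({"role": "assistant", "content": answer})
    return history

# Scoring the same queries under both modes isolates how much of the
# multi-turn drop is caused by earlier mistakes rather than the query itself.
```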
4. The Distraction Factor
How easily can a model be distracted? The researchers injected irrelevant “chitchat” turns between a document and the question about that document.

Table 6 shows the impact of these distractions. When irrelevant turns are inserted “Between” the document and the query, performance drops for most models (except GPT-4, which remains stoic). This confirms that increasing the distance to relevant content makes retrieval harder, even if the intervening content is just noise.
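Constructing such a probe is straightforward; a sketch of the “Between” setting, with invented filler content, might look like this:

```python
# Irrelevant chitchat inserted between the document and the question
# about it, pushing the relevant content further back in the context.

document_turn = {
    "role": "user",
    "content": "Here is a document I want to discuss later: <long text>",
}
chitchat = [
    {"role": "user", "content": "Unrelated question: any tips for brewing coffee?"},
    {"role": "assistant", "content": "Use freshly ground beans and water just off the boil."},
]
question_turn = {
    "role": "user",
    "content": "Back to the document: who is its author?",
}

dialogue = [document_turn] + chitchat + [question_turn]  # the "Between" setting
```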
Alignment with Human Judgment
A common criticism of using GPT-4 as a judge is whether it truly reflects human preference. To validate their automated scoring, the authors had human annotators review a subset of the data.

As shown in Table 5, there is a strong correlation (Pearson 0.65) between human and GPT-4 ratings. This suggests that the automated metrics used in MT-Eval are a reliable proxy for how a human would perceive the quality of the conversation.
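For reference, a Pearson correlation like the one reported is computed from paired ratings; the scores below are invented placeholders, purely to show the computation:

```python
from scipy.stats import pearsonr

# Invented paired ratings for a handful of responses.
human_scores = [8, 6, 9, 3, 7, 5]
gpt4_scores  = [9, 6, 8, 4, 7, 6]

r, p_value = pearsonr(human_scores, gpt4_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```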
Conclusion and Implications
MT-Eval serves as a reality check for the LLM community. While we celebrate high scores on knowledge benchmarks, this research highlights that conversational robustness is a separate and harder capability.
Key Takeaways:
- The Multi-Turn Gap: Don’t trust single-turn benchmarks to predict chatbot performance. Models degrade as conversations lengthen.
- Memory is Fragile: Unless you are using state-of-the-art closed models, expect the AI to forget constraints set early in the chat.
- One Slip Ruins the Trip: Error propagation means that a single hallucination can derail an entire session.
- Open Source is Catching Up: Models like Mixtral are closing the gap, specifically in tasks like Follow-up questions, though they still lag in long-term recollection.
For students and developers, this paper underscores the importance of testing applications in realistic, long-context scenarios. Building a reliable AI assistant isn’t just about getting the first answer right—it’s about keeping the conversation on track, turn after turn.