Imagine you are texting a friend about a movie.
Friend: “Have you seen Oppenheimer yet?” You: “Who is the director?” Friend: “Nolan.” You: “Oh, I love him.”
To a human, this conversation is crystal clear. When you say “him,” you mean Christopher Nolan. When your friend says “Nolan,” they actually mean “Christopher Nolan is the director.” We constantly omit words (ellipsis) or use pronouns (coreference) because the context makes the meaning obvious.
However, for an AI—like a chatbot or a customer service agent—this “incomplete” way of speaking is a nightmare. If a system analyzes just your last sentence (“Oh, I love him”) without resolving the context, it has no idea who you love.
This problem is known as Incomplete Utterance Rewriting (IUR). The goal is to take a context-dependent sentence and rewrite it into a standalone sentence that means the same thing.
In this post, we will deep dive into a fascinating paper titled “Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation”. The researchers propose a novel framework called EO-IUR that combines graph neural networks with a clever “editing” mechanism to teach AI how to focus on the right words in a conversation.
The Core Problem: Why is IUR Hard?
In multi-turn dialogues, speakers aim for efficiency. Research shows that over 70% of utterances in conversations contain ellipsis (omitting words) or coreference (using words like “he”, “it”, or “that” to refer to previous nouns).
To solve this, an AI must rewrite the sentence to make it self-contained. Let’s look at an example from the paper.

In Table 1, the user asks “Who is Tolstoy?” followed by “He is the author.” To make the final sentence complete, the model must:
- Resolve the coreference: “He” \(\rightarrow\) “Tolstoy”.
- Resolve the ellipsis: Insert “of Anna Karenina” at the end.
The Struggle of Existing Methods
Historically, there have been two main ways to tackle this:
- Sequence Labeling: The model tags each word in the dialogue history to decide whether it should be copied into the rewrite. This is good at reusing exact words but struggles with grammar and with placing new words where they belong.
- Generation (Seq2Seq): Using models like BART or GPT to generate the new sentence from scratch. These are fluent but prone to “hallucinations”—they often generate redundant words or miss the critical entities entirely because they lose focus on the context.
The researchers behind EO-IUR realized that to get the best of both worlds, they needed a model that understands the deep structure of the dialogue and knows exactly which words to edit.
The Solution: EO-IUR Framework
The proposed model, EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting), is a multi-task learning framework. It doesn’t just try to rewrite the sentence; it simultaneously learns the grammatical structure of the dialogue and predicts exactly which “editing operations” (insert or replace) are needed.
Here is the high-level architecture:

Let’s break down the three main components that make this architecture tick: the Graph, the Editing Guidance, and the Augmentation.
1. Dialogue as a Heterogeneous Graph
Standard Transformer models (like BERT or BART) treat text as a linear sequence of words. However, dialogues have a complex structure. The researchers argue that a linear sequence isn’t enough to capture relationships like “who said what” or “which pronoun refers to which noun.”
To fix this, they construct a Token-Level Heterogeneous Graph.
They treat every word (token) and every speaker as a node in a graph. Then, they connect these nodes using four specific types of edges (\(\mathcal{R}\)):
- Intra-utterance edge: Connects words within the same sentence based on their syntactic tree (e.g., connecting a verb to its object).
- Inter-utterance edge: Connects the root of one sentence to the root of the next, establishing the flow of conversation.
- Speaker-utterance edge: Connects a generic “Speaker” node to the words that speaker said.
- Pseudo-coreference edge: This is crucial. It connects all pronouns in the incomplete utterance to nouns and pronouns in the history. This explicitly tells the model, “Hey, this ‘he’ might be related to ‘Tolstoy’.”
Once the graph is built, they use a Graph Convolutional Network (GCN) to pass information between nodes. The representation of a word is updated by aggregating information from its neighbors.

In this equation, \(H^{enc}\) is the initial encoding from BART. The GCN updates these features to create a structure-aware representation. Finally, the graph features are fused with the original BART features to yield a rich, context-aware representation \(\hat{H}^{enc}\).
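To make the idea concrete, here is a minimal NumPy sketch of a relational GCN layer over the four edge types. The relation names, random adjacency matrices, and the simple additive fusion at the end are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def relational_gcn_layer(H, adjacency, weights):
    """One relational GCN update: for each edge type, average over
    neighbors and apply a per-relation weight matrix, then a ReLU."""
    out = np.zeros_like(H)
    for rel, A in adjacency.items():
        deg = A.sum(axis=1, keepdims=True)
        A_norm = A / np.maximum(deg, 1)   # row-normalize the adjacency
        out += A_norm @ H @ weights[rel]
    return np.maximum(out, 0)             # ReLU

rng = np.random.default_rng(0)
num_nodes, d = 6, 8
H = rng.normal(size=(num_nodes, d))       # stand-in for BART encodings H^enc
relations = ["intra", "inter", "speaker", "pseudo_coref"]
adjacency = {r: (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)
             for r in relations}
weights = {r: rng.normal(scale=0.1, size=(d, d)) for r in relations}

H_graph = relational_gcn_layer(H, adjacency, weights)
# Fuse graph features with the original encoder features (a simple sum here).
H_fused = H + H_graph
print(H_fused.shape)  # (6, 8)
```

Each relation gets its own weight matrix, so a pseudo-coreference edge can influence a pronoun's representation differently from an ordinary syntactic edge.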
2. Editing Operation Guidance
This is the “special sauce” of the paper. Instead of asking the generation model to figure out everything on its own, the researchers introduce an auxiliary task: Editing Operation Labeling.
They define four specific labels that tell the model what to do with a token:

- NA: Do nothing (most words fall here).
- RP (Replace): This word is a pronoun that needs to be replaced (Coreference).
- NW (New Word): This word is the replacement for a pronoun.
- IN (Insert): This word was omitted and needs to be inserted (Ellipsis).
Look at how this applies to the Tolstoy example:

The model learns to predict these labels using a simple Multi-Layer Perceptron (MLP) on top of the encoder:
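A toy version of that labeling head, assuming a two-layer MLP with a ReLU hidden layer and a softmax over the four labels (the hidden size and random weights are placeholders, not the paper's configuration):

```python
import numpy as np

LABELS = ["NA", "RP", "NW", "IN"]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def editing_label_probs(H, W1, b1, W2, b2):
    """Two-layer MLP head: encoder state -> 4-way label distribution."""
    hidden = np.maximum(H @ W1 + b1, 0)   # ReLU hidden layer
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(1)
seq_len, d, hid = 5, 8, 16
H = rng.normal(size=(seq_len, d))         # stand-in for encoder states
W1 = rng.normal(scale=0.1, size=(d, hid)); b1 = np.zeros(hid)
W2 = rng.normal(scale=0.1, size=(hid, len(LABELS))); b2 = np.zeros(len(LABELS))

probs = editing_label_probs(H, W1, b1, W2, b2)
pred = [LABELS[i] for i in probs.argmax(axis=-1)]
print(pred)  # one editing label per token
```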

How does this guide generation?
This is where the magic happens. The model calculates the probability of a token being a “critical” token (RP, NW, or IN). It then uses this probability to modify the attention mechanism of the decoder.
Normally, the decoder pays attention to all words based on semantic similarity. In EO-IUR, the attention scores are weighted by the “Editing” probability. If a word is tagged as NA (not important for rewriting), the model suppresses attention to it. If a word is tagged as IN (it needs to be inserted), the model forces the decoder to focus on it.

Here, \(\lambda\) represents the “importance” of a token based on its editing label. This forces the generation model to focus specifically on the entities and words that fill the gaps, significantly reducing hallucinations.
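One plausible form of this reweighting, sketched in NumPy: compute ordinary cross-attention, scale each source token's weight by its probability of carrying an editing label (RP + NW + IN), and renormalize. The exact combination rule in the paper may differ; this is an assumption-laden illustration of the mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def edit_guided_attention(Q, K, V, critical_prob):
    """Cross-attention rescaled by each source token's editing probability:
    NA-dominated tokens get suppressed, insert/replace tokens get boosted."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])   # (tgt_len, src_len)
    attn = softmax(scores)
    attn = attn * critical_prob[None]           # lambda-style reweighting
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ V, attn

rng = np.random.default_rng(2)
tgt_len, src_len, d = 3, 5, 8
Q = rng.normal(size=(tgt_len, d))
K = rng.normal(size=(src_len, d))
V = rng.normal(size=(src_len, d))
# P(RP)+P(NW)+P(IN) per source token; token 2 is clearly "critical".
critical_prob = np.array([0.05, 0.10, 0.95, 0.15, 0.05])

out, attn = edit_guided_attention(Q, K, V, critical_prob)
print(attn.sum(axis=-1))  # each row renormalized to 1
```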
The final loss function combines the standard generation loss (\(\mathcal{L}_{gen}\)) with the editing classification loss (\(\mathcal{L}_{eol}\)):
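Written out, a plausible form of this combination is the following (the trade-off coefficient \(\alpha\) is an assumed symbol; the paper may weight the terms differently):

```latex
\mathcal{L} \;=\; \mathcal{L}_{gen} \;+\; \alpha \, \mathcal{L}_{eol}
```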

3. Two-Dimensional Utterance Augmentation
A major challenge in IUR research is data scarcity. Datasets are small. To combat this, the authors propose two augmentation strategies:
- Editing Operation-based Augmentation: The authors treat ellipsis and coreference as two faces of the same omission problem, so they convert training samples from one form into the other. If a sentence contains an ellipsis, they insert a pronoun to turn it into a coreference problem, and vice versa. This doubles the training variety.
- LLM-based Historical Augmentation: They use a Large Language Model (like GPT-3.5) to rewrite the context (history) of the dialogue without changing the meaning. This helps the model become robust to different phrasing styles in conversation history.
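A toy sketch of the first strategy, swapping a coreference sample into an ellipsis sample and back. The pronoun list and the naive string manipulation are illustrative assumptions; the paper's conversion is driven by its editing-operation labels, not by this heuristic.

```python
# Hypothetical helpers, not the paper's procedure.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def coref_to_ellipsis(utterance: str) -> str:
    """Drop pronouns so the referent is omitted instead of referenced."""
    kept = [w for w in utterance.split() if w.lower() not in PRONOUNS]
    return " ".join(kept)

def ellipsis_to_coref(utterance: str, pronoun: str = "it") -> str:
    """Append a pronoun standing in for the omitted referent."""
    return f"{utterance} {pronoun}"

print(coref_to_ellipsis("I love him"))         # "I love"
print(ellipsis_to_coref("I already watched"))  # "I already watched it"
```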
Experiments and Results
The researchers tested EO-IUR on three major datasets: TASK (English task-oriented dialogue), REWRITE (Chinese open-domain), and RES200K (a massive Chinese dataset).
The results were impressive. Let’s look at the English TASK dataset first.

EO-IUR achieves an Exact Match (EM) score of 80.8%. This is a massive leap—nearly 10 points higher than the previous state-of-the-art (SGT and QUEEN). Exact Match is a very hard metric; it requires the generated sentence to be word-for-word identical to the ground truth.
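Exact Match is simple to state in code, which makes clear why it is so unforgiving: a single differing token scores zero for that example. A minimal implementation (the example sentences are invented for illustration):

```python
def exact_match(predictions, references):
    """Fraction of predictions that are string-identical to the reference."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Tolstoy is the author of Anna Karenina", "I love Nolan"]
refs  = ["Tolstoy is the author of Anna Karenina", "I love Christopher Nolan"]
print(exact_match(preds, refs))  # 0.5
```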
The results are consistent across Chinese datasets as well.

On the REWRITE dataset (Table 5), EO-IUR dominates in BLEU scores and ROUGE scores, indicating that the rewritten sentences are both fluent and accurate.
Ablation Study: Does the “Fancy Stuff” Work?
You might wonder, “Do we really need the graph? Do we really need the editing labels?” The authors performed an ablation study to find out.

As shown in Table 7, removing the Editing Guidance (w/o ED guidance) causes the EM score to drop by 3.8 points. Removing the Heterogeneous Graph causes a similar drop. This proves that both the structural understanding (Graph) and the explicit focus mechanism (Editing Labels) are crucial for the model’s high performance.
Human Evaluation and GPT-4
Automatic metrics are great, but human judgment is the gold standard. The authors conducted a blind test where human annotators compared EO-IUR against other models.

EO-IUR (the blue bars) wins the majority of the time against strong baselines like HCT and BART.
But can it beat GPT-4?
This is perhaps the most surprising result. General-purpose LLMs like ChatGPT are powerful, but they often struggle with the strict constraints of rewriting without adding conversational fluff.

On the REWRITE dataset, EO-IUR achieved an Exact Match of 79.9%, while GPT-4 only managed 35.7%.
Why the huge difference? GPT-4 often “under-rewrites” (leaving ambiguity) or “hallucinates” (adding polite conversational filler that wasn’t in the original meaning). The specialized EO-IUR model, guided by its editing labels, is surgical in its precision.
Here is a comparison of outputs:

In the first example, GPT-4 gets confused by the history and asks for an “expensive restaurant,” contradicting the user’s latest request for “Korean food” (which implies they are switching topics or refining). EO-IUR correctly combines the incomplete utterance “Korean food” with the constraint “expensive price range.”
Conclusion
The EO-IUR paper demonstrates a powerful concept in AI: specialized, structured models can still outperform massive generalist models on specific tasks.
By treating dialogue as a graph, the researchers allowed the AI to “see” the grammatical connections between sentences. By introducing Editing Operation Guidance, they gave the AI a “highlighter pen,” teaching it to focus only on the specific words that need to be changed or moved.
This work is a significant step forward for conversational AI. It brings us closer to virtual assistants that don’t just listen to the last thing you said, but understand the entire flow of conversation just like a human friend would.