Have you ever opened an email or a chat message and seen those little “Smart Reply” bubbles at the bottom of the screen? They offer quick, canned responses like “Sounds good!” or “I’ll take a look.”

Sometimes, they are helpful. But often, they are completely off the mark. You ignore them and start typing your own response manually.

In the world of AI research, that moment—where you look at the suggestions and decide not to click them—is usually treated as a dead end. It is a failed interaction. However, researchers Benjamin Towle and Ke Zhou from the University of Nottingham and Nokia Bell Labs see it differently. They view that rejection as a valuable signal. By ignoring the suggestions, you have implicitly told the system what you don’t want to say.

In their paper, “Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback,” the authors introduce a system called NIFTY. It is a clever method that takes that “negative feedback” (the act of ignoring Smart Replies) and uses it to guide a Generative AI model to write a much better draft.

This post explores how NIFTY works, the math behind “classifier guidance,” and why learning from mistakes is just as important for AI as it is for humans.

The Problem: Disconnected Systems

To understand the innovation here, we first need to look at the landscape of AI-mediated communication (AI-MC). Currently, there are two main ways AI helps us communicate:

  1. Smart Reply Systems: These are retrieval-based. They look at the incoming message and pull the best matches from a database of pre-written, canned responses. They are fast but rigid.
  2. AI Writing Assistants: These are generative models (like GPT or T5). They generate text token-by-token. They are flexible and creative but computationally heavier.

In most applications, these two systems operate in silos. If you don’t click a Smart Reply, the system essentially shrugs and waits for you to type. If you then engage an AI writing assistant (like an autocomplete feature), that assistant usually starts from scratch, completely unaware that you just rejected a set of specific intents suggested by the Smart Reply system.

This is a missed opportunity. If the Smart Reply suggested “No, I can’t make it,” and you didn’t click it, the AI Writer should know that your intent is probably not to decline the invitation.

The Proposed Workflow

The researchers propose a pipeline where these two systems talk to each other.

Figure 1: Example of how an agent may utilise either a smart reply system or an AI writing system to speed up communication with a customer. Our approach, NIFTY, uses implicit negative feedback from the rejected suggestions to improve the AI writer’s prediction.

As shown in Figure 1, the workflow changes:

  1. Incoming Message: A customer asks, “I want to return this shirt.”
  2. Smart Reply: The system suggests responses like “What is the address?” or “Would you like to buy a different shirt?”
  3. User Action: The agent (user) ignores these suggestions because they actually need to ask for a receipt.
  4. Implicit Negative Feedback: This rejection is passed to the AI Writer.
  5. NIFTY Generation: The AI Writer generates a response, specifically steering away from the intents found in the rejected Smart Replies.

The result? The model generates “Do you have the receipt?” instead of hallucinating a random response or repeating the bad suggestions.
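To make the hand-off concrete, here is a toy, runnable sketch of that workflow in Python. The canned replies, intent labels, and the trivial “generator” below are illustrative stand-ins, not the paper’s actual models or API.

```python
def smart_reply(message: str) -> list[tuple[str, str]]:
    """Retrieval-style suggestions as (text, intent) pairs (toy values)."""
    return [("What is the address?", "request-address"),
            ("Would you like to buy a different shirt?", "offer-alternative")]

def nifty_generate(message: str, rejected_intents: set[str]) -> str:
    """Stand-in for the guided generator: return a reply whose intent is
    *not* among the rejected ones."""
    candidates = {"request-address": "What is the address?",
                  "offer-alternative": "Would you like a different shirt?",
                  "request-receipt": "Do you have the receipt?"}
    for intent, text in candidates.items():
        if intent not in rejected_intents:
            return text
    return "Could you tell me a bit more?"

incoming = "I want to return this shirt."
suggestions = smart_reply(incoming)

# The agent ignores every suggestion -> implicit negative feedback.
rejected = {intent for _, intent in suggestions}
print(nifty_generate(incoming, rejected))  # -> "Do you have the receipt?"
```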

The Core Method: NIFTY

NIFTY stands for Negative Implicit Feedback from SmarT replY.

The engineering challenge here is “heterogeneity.” Smart Reply systems usually output a list of sentences or embeddings, while Generative models output probability distributions over vocabulary tokens. How do you force a Generative model to “listen” to the fact that a user ignored a retrieval model?

The authors solve this using Classifier Guidance.

1. The Standard Generator

First, let’s look at how a standard AI writer works. It is an autoregressive model. It predicts the next word (token) based on the input message (\(m\)) and all the words it has generated so far (\(r_{<t}\)).

Mathematically, the probability of generating a full reply \(r\) is the product of the probabilities of each token:

Equation 1:

\[
\mathbf{p}_{\Theta}(r \mid m) = \prod_{t=1}^{|r|} \mathbf{p}_{\Theta}(r_t \mid m, r_{<t})
\]

This model (\(\Theta\)) just wants to write fluent, likely text. It doesn’t know anything about the user’s rejection of other ideas.
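As a rough sketch of Equation 1, the snippet below sums per-token log-probabilities to score a full reply. The hard-coded next_token_prob is a stand-in for the real generator \(\mathbf{p}_{\Theta}\) (e.g., a fine-tuned T5); the numbers are made up.

```python
import math

def next_token_prob(token: str, message: str, prefix: list[str]) -> float:
    """Toy stand-in for p_Theta(r_t | m, r_<t); a real model would condition
    on the message and the prefix generated so far."""
    toy = {"Do": 0.4, "you": 0.8, "have": 0.7, "the": 0.9, "receipt": 0.5, "?": 0.95}
    return toy.get(token, 0.01)

def reply_log_prob(message: str, reply: list[str]) -> float:
    """log p_Theta(r | m) = sum over t of log p_Theta(r_t | m, r_<t)."""
    return sum(math.log(next_token_prob(tok, message, reply[:t]))
               for t, tok in enumerate(reply))

print(reply_log_prob("I want to return this shirt.",
                     ["Do", "you", "have", "the", "receipt", "?"]))
```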

2. Adding Classifier Guidance

To inject the “negative feedback,” the authors don’t retrain the massive generator. Instead, they use a smaller, separate model—a Classifier—to guide the generation process at run-time.

This is based on Bayes’ rule. We want to calculate the probability of a token (\(r_t\)) given the message, the history, and a specific control attribute (\(c\)).

Equation 2:

\[
\hat{\mathbf{p}}(r_t \mid m, r_{<t}, c) \propto \mathbf{p}_{\Theta}(r_t \mid m, r_{<t}) \cdot \mathbf{p}_{\Phi}(c \mid m, r_{\leq t})
\]

Here is how to read this equation:

  • \(\mathbf{p}_{\Theta}(r_t|m, r_{< t})\) is the Generator’s opinion. (e.g., “I think the next word should be ‘No’.”)
  • \(\mathbf{p}_{\Phi}(c|m, r_{\leq t})\) is the Classifier’s opinion. It asks: “If we pick this word, how likely is it that we will satisfy condition \(c\)?”
  • The final probability \(\hat{\mathbf{p}}\) is a combination of both.

If the Generator wants to say “No,” but the Classifier calculates that saying “No” violates the condition \(c\), the combined probability drops, and the model chooses a different word.
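Here is a minimal sketch of that combination step, assuming both models can score every candidate token at the current position. Treating the alpha hyperparameter (the one swept in Table 2) as an exponent that controls guidance strength is our assumption, not necessarily the paper’s exact formulation.

```python
def guided_next_token_dist(gen_probs: dict[str, float],
                           clf_probs: dict[str, float],
                           alpha: float = 1.0) -> dict[str, float]:
    """p_hat(r_t | m, r_<t, c) is proportional to
    p_Theta(r_t | m, r_<t) * p_Phi(c | m, r_<=t) ** alpha, then renormalized."""
    unnorm = {tok: gen_probs[tok] * (clf_probs.get(tok, 1e-9) ** alpha)
              for tok in gen_probs}
    total = sum(unnorm.values())
    return {tok: p / total for tok, p in unnorm.items()}

# Toy example: the generator prefers "No", but the classifier says that picking
# "No" makes the target condition c unlikely, so the guided choice flips.
gen = {"No": 0.6, "Your": 0.3, "Sorry": 0.1}    # the Generator's opinion
clf = {"No": 0.05, "Your": 0.7, "Sorry": 0.2}   # the Classifier's opinion about c
dist = guided_next_token_dist(gen, clf)
print(max(dist, key=dist.get))  # -> "Your"
```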

3. Defining the Condition (\(c\))

The critical question is: What is \(c\)? What exactly are we telling the model to aim for? The paper explores two approaches, but the Intent-Based Approach proved to be the most powerful.

In this approach, the system first identifies the intents of the rejected Smart Replies. Let’s say the rejected suggestions were:

  1. “Refund processed.” (Intent: Booking-Inform)
  2. “Sorry about that.” (Intent: General-Apology)

The user ignored these. Therefore, the goal is to generate a response that has a different intent.

The Classifier is trained to predict the intent of a sentence as it is being written. At every step of generation, the system looks for the intent (\(z_j\)) that maximizes the probability of being the correct intent, strictly filtering out the intents that were in the rejected suggestions (\(\mathbf{z_s}\)).

Equation 4:

\[
\mathbf{p}_{\Phi}(c \mid m, r_{\leq t}) = \max_{z_j \notin \mathbf{z_s}} \mathbf{p}_{\Phi}(z_j \mid m, r_{\leq t})
\]

In simpler terms: The system asks, “Of all the possible intentions a user might have here, which is the most likely one excluding the ones they just rejected?”
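A minimal sketch of that selection, assuming the classifier exposes a probability for every intent at the current generation step (the intent names and numbers below are invented):

```python
def best_allowed_intent(intent_probs: dict[str, float],
                        rejected_intents: set[str]) -> tuple[str, float]:
    """Return the most likely intent outside the rejected set, plus its
    probability (the value that feeds the classifier term in Equation 2)."""
    allowed = {z: p for z, p in intent_probs.items() if z not in rejected_intents}
    best = max(allowed, key=allowed.get)
    return best, allowed[best]

intent_probs = {"Booking-Inform": 0.40, "General-Apology": 0.25,
                "Request-Receipt": 0.20, "Booking-NoBook": 0.15}
rejected = {"Booking-Inform", "General-Apology"}
print(best_allowed_intent(intent_probs, rejected))  # -> ('Request-Receipt', 0.2)
```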

Experimental Setup

To test NIFTY, the authors could not immediately rely on live users, so they built a simulation around two well-known dialogue datasets: MultiWOZ and Schema-Guided Dialog (SGD). Both are full of customer service interactions (booking hotels, trains, restaurants).

The User Simulator

They created a “User Simulator” to mimic the behavior described in the introduction.

  1. The simulator looks at the ground-truth response (what the human actually said in the dataset).
  2. It looks at the Smart Reply suggestions.
  3. If none of the suggestions match the intent of the ground-truth response, the simulator “rejects” them.
  4. This filtered dataset represents the “hard cases” where Smart Reply failed, which are exactly the cases NIFTY is designed to solve.
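A toy sketch of that filtering rule is below; the field names (true_intent, suggestions, intent) are illustrative and not the datasets’ actual schema.

```python
def simulate_rejections(examples: list[dict]) -> list[dict]:
    """Keep only the 'hard cases' where no Smart Reply matched the true intent,
    attaching the suggested intents as the negative-feedback signal."""
    hard_cases = []
    for ex in examples:
        suggested_intents = {s["intent"] for s in ex["suggestions"]}
        if ex["true_intent"] not in suggested_intents:
            ex["rejected_intents"] = suggested_intents
            hard_cases.append(ex)
    return hard_cases

examples = [
    {"true_intent": "Request-Receipt",
     "suggestions": [{"intent": "Booking-Inform"}, {"intent": "General-Apology"}]},
    {"true_intent": "Booking-Inform",
     "suggestions": [{"intent": "Booking-Inform"}]},
]
print(len(simulate_rejections(examples)))  # -> 1 (only the first example is kept)
```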

See Table 5 below for the dataset statistics. Notice how many intents exist in MultiWOZ (685 intents)—this makes the task incredibly difficult because the model has to choose the right specific intent out of hundreds.

Table 5: Statistics for the MultiWOZ v2.2 and SGD datasets.

Metrics

The researchers used two main automated metrics:

  • ROUGE-L: Measures how much the words in the generated text overlap with the ground truth.
  • R@1 (Recall at 1): Measures if the generated response had the correct intent. This is often more important than word overlap in customer service (e.g., saying “I can help” vs “How can I help” is semantically the same, even if words differ).
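For intuition, here are simplified, self-contained versions of both metrics: R@1 reduces to an intent-match check, and ROUGE-L is computed here as a plain F1 over the longest common subsequence of tokens (the official implementation adds details such as stemming).

```python
def recall_at_1(predicted_intent: str, true_intent: str) -> int:
    """R@1: did the generated response carry the correct intent?"""
    return int(predicted_intent == true_intent)

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as an F1 over the longest common subsequence of whitespace tokens."""
    p, r = prediction.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, ptok in enumerate(p):
        for j, rtok in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ptok == rtok else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("do you have the receipt", "do you still have the receipt"))  # ~0.91
```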

Results: Does it Work?

The results show substantial improvements. The authors compared NIFTY against several baselines, including a standard AI Writer (“Base”) and methods that “rerank” finished sentences rather than guiding the generation word by word.

Quantitative Performance

Table 1: Results on MultiWOZ and SGD test sets using ROUGE-L and R@1 metrics.

Table 2: Results on MultiWOZ and SGD test sets under different values of alpha.

Table 3: Human evaluation using crowdworkers.

Looking at Table 1:

  1. Intent Accuracy (R@1): This is the massive win. On the MultiWOZ dataset, the standard T5 Baseline got the intent right 16.4% of the time. NIFTY Intent jumped to 28.5%. On the SGD dataset, it went from 28.1% to 53.2%. That is nearly a 2x improvement in understanding what the user actually wants to do.
  2. ROUGE-L: The word-overlap scores also improved consistently (e.g., from 18.0 to 24.2 on SGD).

The table compares “Action” vs “Intent” guidance. “Action” means the classifier just tries to predict “Rejection.” “Intent” means the classifier tries to predict the specific next intent. The Intent-based approach is clearly superior, proving that telling the AI why the user rejected the suggestion (i.e., “they didn’t want intent X”) is better than just telling it “they rejected it.”

Human Evaluation

Metrics like ROUGE can only tell us so much. The gold standard is human preference. The authors conducted a blind “win-rate” study (Table 3).

  • Win Rate: When humans compared NIFTY’s output to the Baseline, NIFTY won 86% of the time on MultiWOZ and 78% of the time on SGD.

A Qualitative Example

To really see NIFTY in action, let’s look at a specific conversation from the test set.

Table 4: Qualitative example from the MultiWOZ test set.

In Table 4, we see a difficult scenario:

  • Context: The previous turn implies a booking attempt.
  • Suggestions: The Smart Reply system thinks the booking failed. It suggests: “Booking was unsuccessful” or “Would you like to find another hotel?”
  • User Action: The user ignores these.
  • Baseline Model: Without guidance, the Baseline model gets confused by the context and hallucinates: “I’m sorry, I can’t make that for you.” It assumes failure, just like the Smart Reply did.
  • NIFTY: Because the user rejected the “Booking-NoBook” (failure) intent, NIFTY infers the opposite must be true. It generates: “Your booking was successful. Your reference number is…”

This perfectly aligns with the Target (Ground Truth). By knowing what the user didn’t click, NIFTY avoided the trap that the Baseline fell into.

Conclusion and Implications

The “NIFTY” paper provides a compelling argument for better integration between the different AI tools we use daily. It highlights a few key takeaways for students and practitioners of AI:

  1. Data is Everywhere: We often think of training data as explicit labels or positive clicks. This paper shows that inaction (not clicking) is a high-signal data point that is often wasted.
  2. Decoupling is Key: The authors didn’t build one giant “Super Model” that does smart replies and writing. They kept the systems separate and used a lightweight Classifier to bridge them. This is good engineering—it allows the Smart Reply system to be updated without breaking the AI Writer, and vice versa.
  3. Guidance > Reranking: The experiments showed that guiding the model token-by-token (deciding the next word based on the constraint) works much better than generating five sentences and trying to pick the best one (Reranking).

As AI assistants become more ubiquitous, the ability to infer intent from implicit cues—like silence, rejection, or hesitation—will mark the difference between a frustrating chatbot and a truly helpful assistant. NIFTY is a significant step toward that smarter future.