Have you ever opened an email or a chat message and seen those little “Smart Reply” bubbles at the bottom of the screen? They offer quick, canned responses like “Sounds good!” or “I’ll take a look.”
Sometimes, they are helpful. But often, they are completely off the mark. You ignore them and start typing your own response manually.
In the world of AI research, that moment—where you look at the suggestions and decide not to click them—is usually treated as a dead end. It is a failed interaction. However, researchers Benjamin Towle and Ke Zhou from the University of Nottingham and Nokia Bell Labs see it differently. They view that rejection as a valuable signal. By ignoring the suggestions, you have implicitly told the system what you don’t want to say.
In their paper, “Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback,” the authors introduce a system called NIFTY. It is a clever method that takes that “negative feedback” (the act of ignoring Smart Replies) and uses it to guide a Generative AI model to write a much better draft.
This post explores how NIFTY works, the math behind “classifier guidance,” and why learning from mistakes is just as important for AI as it is for humans.
The Problem: Disconnected Systems
To understand the innovation here, we first need to look at the landscape of AI-mediated communication (AI-MC). Currently, there are two main ways AI helps us communicate:
- Smart Reply Systems: These are retrieval-based. They look at the incoming message and pull the best matches from a database of pre-written, canned responses. They are fast but rigid.
- AI Writing Assistants: These are generative models (like GPT or T5). They generate text token-by-token. They are flexible and creative but computationally heavier.
In most applications, these two systems operate in silos. If you don’t click a Smart Reply, the system essentially shrugs and waits for you to type. If you then engage an AI writing assistant (like an autocomplete feature), that assistant usually starts from scratch, completely unaware that you just rejected a set of specific intents suggested by the Smart Reply system.
This is a missed opportunity. If the Smart Reply suggested “No, I can’t make it,” and you didn’t click it, the AI Writer should know that your intent is probably not to decline the invitation.
The Proposed Workflow
The researchers propose a pipeline where these two systems talk to each other.

As shown in Figure 1, the workflow changes:
- Incoming Message: A customer asks, “I want to return this shirt.”
- Smart Reply: The system suggests responses like “What is the address?” or “Would you like to buy a different shirt?”
- User Action: The agent (user) ignores these suggestions because they actually need to ask for a receipt.
- Implicit Negative Feedback: This rejection is passed to the AI Writer.
- NIFTY Generation: The AI Writer generates a response, specifically steering away from the intents found in the rejected Smart Replies.
The result? The model generates "Do you have the receipt?" instead of hallucinating a random response or repeating the bad suggestions.
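To make this hand-off concrete, here is a minimal sketch of the proposed workflow. The function names (`suggest_smart_replies`, `nifty_generate`) and the hard-coded suggestions are hypothetical placeholders standing in for a real retrieval model and a real guided decoder, not the paper's actual API.

```python
# Minimal sketch of the combined pipeline: Smart Reply -> implicit rejection -> guided writer.
# suggest_smart_replies and nifty_generate are illustrative stubs, not the paper's code.

def suggest_smart_replies(message):
    """Retrieval step (stubbed): return canned replies with their intent labels."""
    return [
        {"text": "What is the address?", "intent": "Request-Address"},
        {"text": "Would you like to buy a different shirt?", "intent": "Offer-Alternative"},
    ]

def nifty_generate(message, rejected_intents):
    """Generation step (stubbed): a real system would run a guided decoder here."""
    return "Do you have the receipt?"

def handle_turn(message, user_clicked=None):
    suggestions = suggest_smart_replies(message)
    if user_clicked is not None:
        return user_clicked["text"]  # The user accepted a Smart Reply.
    # The user ignored every suggestion: pass their intents on as negative feedback.
    rejected_intents = {s["intent"] for s in suggestions}
    return nifty_generate(message, rejected_intents)

print(handle_turn("I want to return this shirt."))  # -> "Do you have the receipt?"
```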
The Core Method: NIFTY
NIFTY stands for Negative Implicit Feedback from SmarT replY.
The engineering challenge here is “heterogeneity.” Smart Reply systems usually output a list of sentences or embeddings, while Generative models output probability distributions over vocabulary tokens. How do you force a Generative model to “listen” to the fact that a user ignored a retrieval model?
The authors solve this using Classifier Guidance.
1. The Standard Generator
First, let's look at how a standard AI writer works. It is an autoregressive model: it predicts the next word (token) based on the input message (\(m\)) and all the words it has generated so far (\(r_{<t}\)). Mathematically, the probability of generating a full reply \(r\) is the product of the probabilities of each token:

\[
P_{\Theta}(r \mid m) = \prod_{t=1}^{|r|} P_{\Theta}(r_t \mid m, r_{<t})
\]

This model (\(\Theta\)) just wants to write fluent, likely text. It doesn't know anything about the user's rejection of other ideas.
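As a quick illustration of that factorization, here is a toy sketch that scores a reply by summing per-token log-probabilities. The uniform `next_token_probs` table is a placeholder for a real language model's next-token distribution.

```python
import math

# Toy illustration of the autoregressive factorization:
# log P(r | m) = sum_t log P(r_t | m, r_<t).

def next_token_probs(message, prefix):
    """Placeholder next-token distribution (uniform over a tiny vocabulary)."""
    vocab = ["do", "you", "have", "the", "receipt", "?"]
    return {tok: 1.0 / len(vocab) for tok in vocab}

def reply_log_prob(message, reply_tokens):
    total = 0.0
    for t, token in enumerate(reply_tokens):
        probs = next_token_probs(message, reply_tokens[:t])
        total += math.log(probs[token])  # one factor of the product, in log space
    return total

print(reply_log_prob("I want to return this shirt.", ["do", "you", "have", "the", "receipt", "?"]))
```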
2. Adding Classifier Guidance
To inject the "negative feedback," the authors don't retrain the massive generator. Instead, they use a smaller, separate model, a Classifier, to guide the generation process at run-time. This is based on Bayes' rule: we want the probability of a token (\(r_t\)) given the message, the history written so far, and a specific control attribute (\(c\)):

\[
P(r_t \mid m, r_{<t}, c) \propto P_{\Theta}(r_t \mid m, r_{<t}) \cdot P(c \mid m, r_{\leq t})
\]

Here is how to read this equation: the first factor is the ordinary Generator, and the second is the Classifier judging how compatible the partially written reply is with the condition \(c\). If the Generator wants to say "No," but the Classifier calculates that saying "No" violates the condition \(c\), the combined probability drops, and the model chooses a different word.
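Here is a minimal sketch of what that guidance looks like at decoding time, assuming toy `generator_probs` and `classifier_prob` functions in place of the real models.

```python
import math

# Sketch of classifier-guided next-token selection, in the spirit of the equation above.
# generator_probs and classifier_prob are toy stand-ins for the real Generator and Classifier.

def generator_probs(message, prefix):
    """P_Theta(r_t | m, r_<t): the base writer's next-token distribution (toy numbers)."""
    return {"no": 0.5, "do": 0.3, "sure": 0.2}

def classifier_prob(condition, message, prefix_plus_token):
    """P(c | m, r_<=t): how compatible the partial reply is with condition c (toy rule)."""
    return 0.05 if prefix_plus_token[-1] == "no" else 0.9

def guided_next_token(message, prefix, condition):
    scores = {}
    for token, p_gen in generator_probs(message, prefix).items():
        p_cls = classifier_prob(condition, message, prefix + [token])
        # Combine the two factors (Bayes' rule, up to normalization) in log space.
        scores[token] = math.log(p_gen) + math.log(p_cls)
    return max(scores, key=scores.get)

# The unguided writer would pick "no"; with the condition applied, "do" wins instead.
print(guided_next_token("I want to return this shirt.", [], "avoid-decline"))
```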
3. Defining the Condition (\(c\))
The critical question is: what is \(c\)? What exactly are we telling the model to aim for? The paper explores two approaches, but the Intent-Based Approach proved to be the most powerful.
In this approach, the system first identifies the intents of the rejected Smart Replies, for instance suggestions labeled with intents such as Booking-Inform and General-Apology. The user ignored these. Therefore, the goal is to generate a response that has a different intent.
The Classifier is trained to predict the intent of a sentence as it is being written. At every step of generation, the system looks for the intent (\(z_j\)) that maximizes the probability of being the correct intent, strictly filtering out the intents that were in the rejected suggestions (\(\mathbf{z_s}\)):

\[
z^{*} = \arg\max_{z_j \notin \mathbf{z_s}} P(z_j \mid m, r_{\leq t})
\]

In simpler terms, the system asks: "Of all the possible intentions a user might have here, which is the most likely one, excluding the ones they just rejected?"
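Here is a small sketch of that filtered argmax. The probabilities are made up; only the two rejected intent names come from the example above.

```python
# Sketch of the filtered argmax over intents: pick the most probable intent
# that is NOT among the rejected Smart Reply intents.

def best_allowed_intent(intent_probs, rejected_intents):
    """z* = argmax over z_j not in z_s of P(z_j | m, r_<=t)."""
    allowed = {z: p for z, p in intent_probs.items() if z not in rejected_intents}
    return max(allowed, key=allowed.get)

intent_probs = {
    "Booking-Inform": 0.40,   # intent behind a rejected suggestion
    "General-Apology": 0.25,  # intent behind a rejected suggestion
    "Request-Receipt": 0.20,
    "General-Greet": 0.15,
}
rejected = {"Booking-Inform", "General-Apology"}

print(best_allowed_intent(intent_probs, rejected))  # -> Request-Receipt
```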
Experimental Setup
To test NIFTY, the authors couldn't just use live users immediately, so they built a robust simulation using two famous dialogue datasets: MultiWOZ and Schema-Guided Dialogue (SGD). These datasets are full of customer service interactions (booking hotels, trains, restaurants).
The User Simulator
They created a "User Simulator" to mimic the behavior described in the introduction. See Table 5 for the dataset statistics; notice how many intents exist in MultiWOZ (685 intents), which makes the task incredibly difficult because the model has to choose the right specific intent out of hundreds.
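Below is a minimal simulator sketch, under the assumption that the user ignores all suggestions whenever none of the suggested intents matches the intent of the ground-truth reply, and those intents then become one-shot negative feedback. The paper's actual simulator may differ in its details.

```python
# User-simulator sketch (assumption-based): if none of the suggested Smart Reply intents
# matches the ground-truth intent, the user ignores them all and the suggested intents
# are passed along as implicit negative feedback.

def simulate_user(suggested_intents, ground_truth_intent):
    """Return (clicked_intent, rejected_intents) for one turn."""
    if ground_truth_intent in suggested_intents:
        return ground_truth_intent, []
    return None, list(suggested_intents)

clicked, rejected = simulate_user(
    suggested_intents=["Booking-Inform", "General-Apology"],
    ground_truth_intent="Request-Receipt",
)
print(clicked, rejected)  # -> None ['Booking-Inform', 'General-Apology']
```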
Metrics
The researchers used two main automated metrics to score the generated replies; one of them is ROUGE, which measures word overlap between the generated reply and the ground-truth response.
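For reference, here is how ROUGE overlap can be computed with the open-source `rouge-score` package; the reference and generated sentences below are illustrative, not taken from the paper.

```python
# Computing ROUGE overlap with the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Do you have the receipt?"
generated = "Could you tell me if you have the receipt?"

scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # F-measure of longest-common-subsequence overlap
```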
Results: Does it Work?
Quantitative Performance
The results were highly significant. The authors compared NIFTY against several baselines, including a standard AI Writer ("Base") and methods that try to "rerank" finished sentences rather than guiding the generation word-by-word.
Table 1 also compares "Action" versus "Intent" guidance: "Action" means the classifier just tries to predict "Rejection," while "Intent" means the classifier tries to predict the specific next intent. The Intent-based approach is clearly superior, showing that telling the AI why the user rejected the suggestion (i.e., "they didn't want intent X") is better than just telling it "they rejected it."

Human Evaluation
Metrics like ROUGE can only tell us so much; the gold standard is human preference. The authors conducted a blind "win-rate" study, reported in Table 3.
A Qualitative Example
To really see NIFTY in action, let's look at a specific conversation from the test set. Table 4 shows a difficult scenario: NIFTY's reply aligns perfectly with the Target (Ground Truth), and by knowing what the user didn't click, it avoided the trap that the Baseline fell into.

Conclusion and Implications
The "NIFTY" paper provides a compelling argument for better integration between the different AI tools we use daily, and it holds a clear takeaway for students and practitioners of AI. As AI assistants become more ubiquitous, the ability to infer intent from implicit cues (like silence, rejection, or hesitation) will mark the difference between a frustrating chatbot and a truly helpful assistant. NIFTY is a significant step toward that smarter future.