Introduction
In the current landscape of Large Language Models (LLMs), we are witnessing a tug-of-war between two highly desirable traits. On one side, we want models that are helpful and conversational—models that can follow open-ended instructions, write poetry, and chat like a human. On the other side, we desperately need models that are faithful and grounded—models that, when given a specific document or context, answer questions based only on that information without making things up.
For students and practitioners working with Retrieval-Augmented Generation (RAG) or building chatbots, this is a familiar pain point. You might fine-tune a model to be a better chatterbox, only to find it starts hallucinating facts. Conversely, if you strictly train it to adhere to provided documents, it might lose its ability to understand nuanced instructions or engage in fluid conversation.
A recent research paper, titled “Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models,” investigates this exact phenomenon. The authors provide concrete evidence that there is a fundamental trade-off between Instruction Following (the “Dance”) and Faithfulness (the “Chains” of the context).

As shown in Figure 1 above, most models drift toward one extreme or the other. However, the researchers also propose a novel solution called RESET (Rejection Sampling for Continued Self-instruction Tuning), which allows models to approach the “North Star”—high performance in both creativity and accuracy.
In this post, we will break down why this trade-off exists, the methodology used to prove it, and the elegant mechanism behind RESET that helps reconcile these conflicting objectives.
Background: Defining the Conflict
Before we dive into the experiments, we need to clarify the two antagonists in this story.
1. Instruction Following
This refers to an LLM’s ability to respond to open-ended user prompts. Training datasets for this include Alpaca, ShareGPT, and Dolly. These datasets teach the model to be a helpful assistant. The goal is to satisfy the user’s intent, whether that’s “Write a poem about rust” or “Explain quantum physics like I’m five.” The metric for success here is usually an evaluation by a stronger model (like GPT-4), often referred to as “LLM-as-a-Judge.”
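In practice, the judge is handed the instruction and the candidate response together with a scoring rubric. Here is a minimal sketch of what such a judge prompt might look like; the rubric wording and scale below are hypothetical, not the paper’s actual prompt:

```python
# Hypothetical LLM-as-a-Judge prompt template; the paper's exact rubric may differ.
JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

User instruction:
{instruction}

Assistant response:
{response}

Rate how well the response follows the instruction on a scale from 0.0 to 1.0,
considering helpfulness, relevance, and completeness. Reply with only the number."""

def build_judge_prompt(instruction: str, response: str) -> str:
    """Fill the template; the result is then sent to a strong judge model such as GPT-4."""
    return JUDGE_PROMPT.format(instruction=instruction, response=response)
```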
2. Faithfulness (Context-Dependent Grounding)
This is the ability of the model to answer questions based strictly on a provided context (a set of passages or documents). This is critical for RAG systems. If you provide a medical study and ask a question, you do not want the model to use its pre-trained knowledge to answer; you want it to use the study. Datasets here include Natural Questions (NQ), MS MARCO, and CNN/DailyMail. The metric for success is groundedness—did the answer come from the text, or was it hallucinated?
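The paper scores groundedness with external judges, including NLI models (more on that later). As a rough illustration of the idea, a faithfulness check can be sketched with an off-the-shelf NLI model, treating the context as the premise and the answer as the hypothesis. The checkpoint below is an illustrative choice, not necessarily the one used in the paper:

```python
# Minimal sketch: score groundedness as the NLI entailment probability of (context => answer).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # illustrative MNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def faithfulness_score(context: str, answer: str) -> float:
    """Entailment probability of (context => answer); higher means more grounded."""
    inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment class from the model config instead of hardcoding an index.
    entail_idx = next(i for i, lbl in nli_model.config.id2label.items() if lbl.lower() == "entailment")
    return probs[entail_idx].item()

# A grounded answer should score higher than a hallucinated one.
ctx = "The study followed 120 patients for six months."
print(faithfulness_score(ctx, "The study lasted six months."))
print(faithfulness_score(ctx, "The study lasted five years."))
```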
The authors suspected that the data distributions for these two objectives are so distinct that they might actually hurt each other during training.
The Evidence: A Tale of Two Pipelines
To prove this trade-off is real and not just anecdotal, the researchers set up a “Two-Stage Fine-Tuning Paradigm.” They essentially forced models to specialize in one skill and then tried to teach them the other, observing the degradation in the original skill.

As illustrated in Figure 2, the experiment involves two distinct pipelines:
- Context-Dependent \(\rightarrow\) Instruction Following: Train a model to be faithful first, then fine-tune it to be chatty.
- Instruction Following \(\rightarrow\) Context-Dependent: Take a chatty model (like Vicuna) and fine-tune it to be faithful.
Let’s look at the results of these experiments.
Pipeline 1: Does Chat Training Hurt Faithfulness?
In the first experiment, the researchers took a LLaMA-7B model and fine-tuned it on context-dependent datasets (like QA and summarization) until it was highly faithful. Then, they further fine-tuned it on instruction datasets (like Alpaca).
The results were stark.

In Figure 3, look at the Faithfulness with Abstractive Datasets bars. The blue bar represents the model before instruction tuning (0.82), and the orange bar shows the performance after (0.55). That is a massive relative drop of roughly 33% in faithfulness.
While the model became much better at following instructions (Instruction Following score jumped from 0.30 to 0.74), it lost its ability to stay grounded in the context. It started making things up, likely because instruction tuning datasets often encourage creative generation rather than strict adherence to a source text.
Is it just about length?
A common counter-argument is that instruction-tuned models just write longer answers, and longer answers are harder to keep faithful. The researchers investigated this by comparing faithfulness scores across generated responses that were either short or long.

Figure 5 shows that even for shorter responses (the blue line), faithfulness degrades significantly as training progresses. The drop in quality isn’t just a side effect of verbosity; the model fundamentally “forgets” how to be faithful.
Pipeline 2: Does Faithfulness Training Hurt Chatting?
In the reverse experiment, the researchers took Vicuna-7B (a model already fine-tuned for chatting) and trained it on context-dependent tasks.

As seen in Figure 6 (labeled here as Figure 9 in the image deck), the trade-off strikes again. While faithfulness improved, the Instruction Following score (the third group of bars) plummeted from 0.79 to 0.49.
By forcing the model to strictly adhere to contexts, it became rigid. It lost the “human touch” required to answer open-ended user queries effectively.
The “Similar Length” Control
Again, one might ask: “Did the model just become too brief?” The researchers controlled for this by filtering for examples where the response length remained similar before and after training.

Figure 7 confirms that even when the response lengths are similar, the instruction following quality drops significantly (from 0.82 to 0.64 in the filtered setting). The model simply becomes worse at being a helpful assistant.
The Solution: RESET
So, we have a problem. If you train for A, you lose B. If you train for B, you lose A.
The standard industry solution is Multi-Task Learning (MTL)—mixing both datasets together and training the model on everything at once. While the researchers found that MTL helps (as seen in the “North Star” Figure 1), it is still suboptimal. It acts as a compromise rather than a synergy.
To do better, the authors propose RESET: Rejection Sampling for Continued Self-instruction Tuning.
The intuition behind RESET is simple: Instead of just feeding the model raw data, let the model try to answer questions itself, grade its own homework (using external judges), and then retrain only on the questions it got right. This creates a high-quality, filtered dataset that aligns with both objectives.
How RESET Works
The RESET process is a cycle consisting of four main steps, illustrated below:

Step 1: Initial MTL Training
First, train a base model (like Vicuna) on a mixture of both instruction and context-dependent datasets. This gives the model a baseline capability in both areas.
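As a rough picture of what this mixture looks like in code, here is a minimal sketch, assuming each example has already been flattened into a prompt/response pair; the toy examples and field names are placeholders, not the paper’s recipe:

```python
import random

# Toy stand-ins; real MTL would mix Alpaca/ShareGPT-style instruction data with
# NQ/MS MARCO/CNN-DailyMail-style context-dependent data.
instruction_data = [
    {"source": "instruction", "prompt": "Write a poem about rust.", "response": "..."},
]
context_data = [
    {"source": "context", "prompt": "Context: <passage>\nQuestion: <question>", "response": "..."},
]

def build_mtl_mixture(instr_examples, ctx_examples, seed=0):
    """Interleave both objectives into a single shuffled training stream (vanilla MTL)."""
    mixture = instr_examples + ctx_examples
    random.Random(seed).shuffle(mixture)
    return mixture

train_examples = build_mtl_mixture(instruction_data, context_data)
```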
Step 2: Sample Generations
Take a subset of the training data and ask the model to generate answers. Crucially, generate multiple answers for the same prompt by varying decoding strategies (e.g., changing the temperature for randomness or top-k sampling). This creates a diverse pool of potential responses—some creative, some robotic, some faithful, some hallucinated.
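Concretely, the sampling step might look like the following minimal sketch with Hugging Face `transformers`; the checkpoint name, temperatures, and decoding settings are illustrative assumptions rather than the paper’s exact configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # illustrative checkpoint; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def sample_candidates(prompt, temperatures=(0.3, 0.7, 1.0), top_k=50, max_new_tokens=256):
    """Generate one candidate per decoding configuration for the same prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    candidates = []
    for temperature in temperatures:
        output_ids = model.generate(
            **inputs,
            do_sample=True,           # stochastic decoding -> diverse candidates
            temperature=temperature,  # vary randomness across samples
            top_k=top_k,
            max_new_tokens=max_new_tokens,
        )
        # Keep only the newly generated continuation, not the echoed prompt.
        continuation = output_ids[0, inputs["input_ids"].shape[1]:]
        candidates.append(tokenizer.decode(continuation, skip_special_tokens=True))
    return candidates
```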
Step 3: External Judging and Weighted Scoring
This is the quality control step. Use external evaluators (like GPT-4 for instruction following, and NLI models for faithfulness) to score every single generated response.
The authors use a specific formula to rank these generations:
\[ \text{score} = s_{task} + 2.0 \cdot \left( \mathbb{I}_{instr} \cdot s_{instr} + \mathbb{I}_{faith} \cdot s_{faith} \right) \]
In this equation:
- \(s_{task}\): The basic task performance (e.g., ROUGE score).
- \(s_{instr}\): The instruction following score (how helpful/chatty it is).
- \(s_{faith}\): The faithfulness score (how grounded it is).
- The indicators \(\mathbb{I}\) ensure that only the alignment score relevant to the example’s dataset type (instruction vs. context-dependent) is counted.
The heavy weighting (\(2.0\)) on instruction and faithfulness ensures that the selected best response isn’t just accurate—it effectively balances the specific conflict we are trying to solve.
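To make the indicator terms concrete, here is a small helper that transcribes the formula, assuming each prompt comes from either an instruction dataset or a context-dependent dataset so that only the matching indicator fires; the `source` flag and function name are illustrative, not the authors’ code:

```python
def reset_score(s_task: float, s_instr: float, s_faith: float, source: str) -> float:
    """Weighted selection score; `source` says which dataset the prompt came from,
    so only the matching indicator term is active."""
    indicator_instr = 1.0 if source == "instruction" else 0.0
    indicator_faith = 1.0 if source == "context" else 0.0
    return s_task + 2.0 * (indicator_instr * s_instr + indicator_faith * s_faith)

# Example: a context-dependent prompt with ROUGE 0.4 and faithfulness 0.9.
print(reset_score(s_task=0.4, s_instr=0.0, s_faith=0.9, source="context"))  # 0.4 + 2.0 * 0.9 = 2.2
```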
Step 4: Continued Fine-Tuning
Finally, select the single best response (Top-1) for each prompt. Compile these “winner” responses into a new, smaller, high-quality dataset. Fine-tune the model one last time on this curated data.
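A minimal sketch of that selection step, assuming the scored candidates from Steps 2 and 3 have already been collected per prompt (the data structure and names below are illustrative):

```python
def select_top1(scored_candidates):
    """Keep only the best-scoring response per prompt to form the curated RESET dataset.

    `scored_candidates` maps each prompt to a list of (response, score) pairs
    produced in Steps 2 and 3.
    """
    curated = []
    for prompt, pairs in scored_candidates.items():
        best_response, _ = max(pairs, key=lambda pair: pair[1])
        curated.append({"prompt": prompt, "response": best_response})
    return curated

curated_set = select_top1({
    "Describe a baby duck.": [
        ("A baby duck is a small, fluffy bird often seen on ponds.", 2.1),
        ("Ducks are birds. They quack.", 0.8),
    ],
})
```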
Experiments & Results: Does It Work?
The researchers compared RESET against standard baselines, including vanilla Vicuna and standard Multi-Task Learning (MTL).

Table 4 presents the key findings:
- MTL closes the gap but leaves room for improvement: The `w/ MTL` row shows decent scores, acting as a strong baseline.
- RESET wins: The `w/ MTL+RESET` row consistently outperforms standard MTL. For example, on the Abstractive Datasets (Overall), faithfulness jumps from 0.80 (MTL) to 0.85 (RESET).
- Less is More (RESET-S): The authors introduced a “Supercharged” version called `RESET-S`. Here, they sampled even more generations but filtered the final dataset more aggressively, keeping only the absolute best examples. This resulted in a dataset three times smaller than the standard RESET dataset. Despite having less data, `RESET-S` achieved the highest scores (e.g., 0.92 faithfulness on Abstractive Datasets). This proves that data quality matters far more than quantity when reconciling these objectives.
A Qualitative Look
Numbers are great, but what does this look like in practice? Let’s look at an example where the model is asked to describe a baby duck.

In Table 8, notice the progression:
- Vicuna Zero-Shot: Very descriptive (“fluffy, downy creature… quacking and waddling”). Good chatting, but maybe too verbose for a specific constraint.
- Context-Dependent Tuning: “A baby duck is a young duck…” Very dry, encyclopedic. It lost the charm.
- RESET-S: “A baby duck is a small, fluffy, and adorable bird that is often seen swimming in ponds or lakes.”
The RESET model manages to be concise and faithful (it didn’t hallucinate features) while retaining the “human” tone (“adorable,” “fluffy”) that users expect from a chatbot.
Conclusion and Implications
The “Dancing in Chains” paper sheds light on a critical friction point in modern AI development. We often assume that feeding a model more data of different types will simply make it better at everything. However, this research shows that Instruction Following and Faithfulness are effectively competing objectives.
- Training for creativity can induce hallucinations.
- Training for strict grounding can induce robotic rigidity.
The proposed RESET method offers a path forward. By leveraging the model’s own ability to generate diverse options and using rigorous rejection sampling, we can curate a training set that embodies the best of both worlds.
The most significant takeaway for students and developers is the concept that less is more. You don’t necessarily need massive datasets to fix alignment issues. A small, highly curated dataset—where every example perfectly balances helpfulness and faithfulness—can outperform a massive, noisy dataset.
As we continue to build systems that integrate RAG with conversational personas, techniques like RESET will likely become standard practice to ensure our AI assistants can dance beautifully without breaking the chains of truth.