Large Language Models (LLMs) have become incredibly adept at understanding vast amounts of text. Give them a 100-page document, and they can summarize it, answer questions about it, and find needles in the haystack. But when you flip the script and ask them to generate a long, high-quality document—like a detailed report, a compelling story, or a legal brief—they often stumble. The output might be coherent at a sentence level, yet can quickly lose focus, become repetitive, or fail to meet the specific, nuanced requirements of the prompt.
The core of the problem lies in how we teach these models to write well. Traditional methods often rely on scarce, high-quality human-written examples or use generic, coarse-grained feedback during training. A model might be rewarded for being “helpful” or “coherent,” but these are blunt instruments. Imagine your instruction is to write a story with an O. Henry-style ending—a highly specific stylistic demand that a generic “helpfulness” score can’t capture.
This is the challenge tackled by the new research paper “ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning.” The authors propose a clever new framework that shifts the training paradigm. Instead of depending on subjective, high-level feedback, they teach the model to satisfy a detailed checklist of concrete, verifiable constraints derived directly from the user’s instruction. It’s a move from “write a good story” to “write a story that meets these 10 specific criteria.”
This method—called ACE-RL—not only improves the quality of long-form generation but does so without requiring expensive, manually curated preference datasets. Let’s dive into how it works.
Figure 1: Traditional reward mechanisms (middle panel) focus on broad qualities like relevance and helpfulness. ACE-RL (bottom panel) decomposes instructions into specific, fine-grained constraints, producing a more targeted and effective reward signal.
The Trouble with Teaching Long-Form Writing
Before unpacking ACE-RL, let’s briefly review the two dominant methods for training LLMs on specific tasks—and why they fall short for long-form generation.
Supervised Fine-Tuning (SFT)
This is like showing a student thousands of example essays and hoping they infer what makes a good one. You take a pre-trained LLM and fine-tune it on a dataset of high-quality instruction-response pairs. The problem? Getting a large, diverse dataset of excellent long-form writing is incredibly difficult and expensive. More often than not, these datasets are synthesized by other proprietary LLMs, introducing biases and limitations. SFT is fundamentally about imitation, and its performance is capped by the quality and scope of its training data.
Reinforcement Learning (RL) with Preference Rewards
This is more like having a teacher compare two student essays and pick the better one. An LLM generates multiple responses to a prompt, and a “reward model” (often another LLM or a human) provides a preference-based score. This is the basis of Reinforcement Learning from Human Feedback (RLHF). While powerful, this approach typically relies on coarse-grained, holistic judgments for qualities like relevance, coherence, and helpfulness. As with the O. Henry example, such general labels fail to capture the specific, instruction-adaptive details that define high-quality writing for diverse scenarios. It also requires massive amounts of preference data (pairs of “good” vs. “bad” responses), creating a costly bottleneck.
The authors argue: to truly master long-form generation, we need a training signal that is fine-grained and instruction-adaptive. ACE-RL delivers precisely that.
The ACE-RL Framework: Step-by-Step
At its heart, ACE-RL transforms the subjective task of evaluating writing quality into an objective process of constraint verification. The system is built in three stages: Data Preparation, Adaptive Constraint Construction, and Reward-Guided Training.
Figure 2: ACE-RL workflow. An instruction is first broken down into a verifiable checklist. The policy model generates responses (rollouts), scored by a reward model on how well they meet each constraint. This reward signal updates the policy model.
Step 1: Data Preparation
The researchers began with WildChat-1M, a large dataset of real-world human–LLM conversations, and filtered for queries requiring long-form responses (e.g., reports, stories, detailed plans). Using Qwen3-235B, they estimated a target word count for each instruction, since meeting length requirements is a critical measure of success.
The outcome: a high-quality dataset of 32,000 instructions, each tagged with an explicit target length.
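The paper's filtering pipeline isn't reproduced here, but a minimal sketch of this kind of preparation step might look like the following. The keyword heuristic and the generic `call_llm(prompt) -> str` helper (standing in for the Qwen3-235B length estimator) are assumptions for illustration, not the authors' exact procedure.

```python
# Minimal sketch of the data-preparation step described above.
# `call_llm` is a hypothetical helper standing in for an LLM call;
# the keyword heuristic is an illustrative assumption, not the paper's filter.
from typing import Callable, Iterable

LONG_FORM_HINTS = ("report", "story", "essay", "plan", "article", "brief")

def is_long_form(instruction: str) -> bool:
    """Crude heuristic: keep queries that ask for long-form writing."""
    text = instruction.lower()
    return any(hint in text for hint in LONG_FORM_HINTS)

def tag_target_length(instruction: str, call_llm: Callable[[str], str]) -> int:
    """Ask an LLM to estimate how many words a good response should have."""
    prompt = (
        "Estimate the ideal word count for a complete answer to the "
        f"following request. Reply with a single integer.\n\n{instruction}"
    )
    return int(call_llm(prompt).strip())

def build_dataset(instructions: Iterable[str],
                  call_llm: Callable[[str], str]) -> list[dict]:
    """Filter for long-form queries and attach an explicit target length."""
    return [
        {"instruction": ins, "target_length": tag_target_length(ins, call_llm)}
        for ins in instructions if is_long_form(ins)
    ]
```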
Step 2: Adaptive Constraint Construction
This is the pivotal stage. For each instruction, the team prompted an LLM to act as an instruction analyst, breaking the request into a checklist of verifiable constraints. Each checklist captures two kinds of constraints:
- Explicit constraints: directly stated in the prompt (e.g., “use the two smallest disks for the boot device”).
- Implicit constraints: unstated expectations inferred from context or domain knowledge (e.g., a disk topology should allow efficient storage and future expansion).
Figure 3: Real-world constraint deconstruction. Direct requests become explicit constraints (red), while underlying goals and best practices become implicit constraints (green).
This transforms vague instructions into concrete, measurable objectives—perfect for guiding RL.
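To make the "instruction analyst" idea concrete, here is a minimal sketch of what the decomposition step might look like. The prompt wording, the JSON schema, and the generic `call_llm(prompt) -> str` helper are all illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of adaptive constraint construction. The prompt text and
# JSON schema below are illustrative assumptions; `call_llm` is hypothetical.
import json
from typing import Callable

ANALYST_PROMPT = """You are an instruction analyst. Decompose the request
below into a checklist of verifiable constraints. Return JSON:
{{"explicit": ["..."], "implicit": ["..."]}}

Explicit constraints are stated directly in the request.
Implicit constraints are unstated expectations implied by context
or domain knowledge.

Request:
{instruction}
"""

def build_checklist(instruction: str,
                    call_llm: Callable[[str], str]) -> list[str]:
    """Return a flat list of verifiable constraints for one instruction."""
    raw = call_llm(ANALYST_PROMPT.format(instruction=instruction))
    parsed = json.loads(raw)
    return parsed["explicit"] + parsed["implicit"]
```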
Step 3: Reward Design and RL Training
With checklists in hand, the final task is to train the policy model using RL. The authors used Group Relative Policy Optimization (GRPO), but the magic lies in the reward design.
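As a quick refresher on the optimizer (this is standard GRPO, not something introduced by this paper): for each instruction, the policy samples a group of \(G\) rollouts, and each rollout's advantage is its reward standardized within that group:

\[ \hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})} \]

What ACE-RL changes is how each per-rollout reward \(r_i\) is computed, via the components below.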
1. Length Reward (\(R_L\))
Encourages responses close to the target length \(L_t\). If the response length \(L_{\hat{y}}\) is within tolerance \(\Delta\) of the target, it scores 1.0; otherwise, the score decays exponentially as the response drifts further from \(L_t\).
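A plausible piecewise form consistent with that description is shown below; normalizing the decay by the target length is an assumption here, and the paper's exact formula may differ.

\[ R_L(\hat{y}) = \begin{cases} 1.0 & \text{if } |L_{\hat{y}} - L_t| \le \Delta \\ \exp\!\left(-\frac{|L_{\hat{y}} - L_t| - \Delta}{L_t}\right) & \text{otherwise} \end{cases} \]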
2. Constraint Reward (\(R_C\))
A separate verifier LLM checks each constraint \(c_i\) on the list, scoring:
- 1.0 if Fully Met
- 0.5 if Partially Met
- 0.0 if Not Met
The average score across all \(N\) checklist items is:
\[ R_C(\hat{y}) = \frac{1}{N} \sum_{i=1}^{N} s(\hat{y}, c_i) \]
3. Overall Reward
The final reward combines the length reward \(R_L\) and the constraint reward \(R_C\) into a single scalar for each rollout.
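A simple multiplicative combination is one plausible sketch; the exact weighting used in the paper may differ.

\[ R(\hat{y}) = R_L(\hat{y}) \cdot R_C(\hat{y}) \]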
This precise, verifiable signal lets the RL algorithm push the model toward genuine quality improvements.
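Putting the pieces together, here is a minimal, hedged sketch of the reward computation as described in this post. The verifier interface (`verify(response, constraint)` returning "Fully Met", "Partially Met", or "Not Met"), the decay normalization, and the multiplicative combination are illustrative assumptions rather than the authors' exact code.

```python
# Hedged sketch of the ACE-RL reward as described in this post.
# The verifier interface, the exponential decay normalization, and the
# multiplicative combination of R_L and R_C are illustrative assumptions.
import math
from typing import Callable

SCORE_MAP = {"fully met": 1.0, "partially met": 0.5, "not met": 0.0}

def length_reward(resp_len: int, target_len: int, tolerance: int) -> float:
    """1.0 inside the tolerance window, exponential decay outside it."""
    gap = abs(resp_len - target_len)
    if gap <= tolerance:
        return 1.0
    return math.exp(-(gap - tolerance) / target_len)

def constraint_reward(response: str, constraints: list[str],
                      verify: Callable[[str, str], str]) -> float:
    """Average the verifier's per-constraint verdicts over the checklist."""
    scores = [SCORE_MAP[verify(response, c).lower()] for c in constraints]
    return sum(scores) / len(scores)

def overall_reward(response: str, constraints: list[str], target_len: int,
                   tolerance: int, verify: Callable[[str, str], str]) -> float:
    """Combine length and constraint rewards into one scalar per rollout."""
    r_l = length_reward(len(response.split()), target_len, tolerance)
    r_c = constraint_reward(response, constraints, verify)
    return r_l * r_c  # multiplicative combination is an assumption
```

In GRPO, this scalar would then be standardized within each group of rollouts to produce the advantage used for the policy update.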
Experiments and Results
The team trained several open-source models of different sizes and compared them against a wide set of baselines across WritingBench and Arena-Write.
Table 1: The dataset features a high average required length and multiple constraints per instruction, well-suited for robust training.
WritingBench: Stunning Performance
WritingBench evaluates long-form writing in six domains (Academic, Finance, Literature, etc.) and three requirement types (Style, Format, Length).
Table 2: ACE-RL fine-tuned models (highlighted) achieve top scores, surpassing strong baselines, larger models, and proprietary systems like GPT-4o.
Examples:
- Qwen-2.5-7B: 57.04 → 78.57 after ACE-RL.
- Qwen-3-4B-thinking: 82.56, outperforming most proprietary models.
These results show that the training paradigm can trump model size.
Arena-Write: Head-to-Head Wins
In Arena-Write, each model’s response goes head-to-head against six strong baselines.
Table 3: ACE-RL-trained models achieve dramatically higher win rates in direct comparisons.
The Qwen-3-4B-thinking ACE-RL model scored a 67.73% win rate, decisively beating top-tier competitors.
Why ACE-RL Wins
The difference lies in the discriminative power of the reward signal. Traditional judge models produce clustered scores and struggle to distinguish subtle quality differences, while ACE-RL's multi-constraint verification yields higher reward variance:
Figure 4: ACE-RL’s reward signal (blue bars) offers greater variance than LLM-as-a-Judge (yellow), making it more discriminative and effective for learning.
Efficiency and Self-Improvement
ACE-RL works even with smaller verifier models.
Table 5: ACE-RL with a 4B reward model beats the LLM-as-a-Judge approach using an 8B model.
The team even tested a self-reward setting, in which the model verifies its own outputs. This self-trained model still beat the traditional RL baseline, demonstrating potential for self-alignment without a larger supervisor model.
Human Preference Alignment
To confirm the metrics align with human judgment, the authors conducted human evaluation:
Figure 5: Human judges consistently preferred ACE-RL outputs over both the base and LLM-as-a-Judge-trained models.
Conclusion: A New Path for Long-Form Generation
ACE-RL represents a breakthrough in training LLMs for complex, nuanced writing tasks. By replacing coarse, subjective preference scores with fine-grained, instruction-adaptive constraints, it delivers:
- Strong Constraints over Vague Preferences: Turning instructions into checklists yields a clearer, more reliable training signal.
- Quality without Scarce Data: Removes the need for costly preference datasets, enabling scalability.
- Paradigm over Scale: A better training method can outclass much larger models.
ACE-RL points toward teaching models not just to be broadly helpful but to be precisely, verifiably correct according to the complex demands of any given task.