If you have ever tried to use a Large Language Model (LLM) like GPT-4 or LLaMA for a strict data-processing task, you have likely encountered a frustrating phenomenon. You provide a prompt with a list of specific requirements—perhaps ten different demographic facts that must appear in a generated user profile—and the model confidently produces a fluent, professional-sounding paragraph.
But when you check the details, something is wrong. It mentioned the name and the occupation but forgot the credit score. Or perhaps it hallucinated the location.
This problem is known as Lexically Constrained Generation (LCG). It is the task of forcing an LLM to include a specific set of keywords or phrases in its output. While LLMs are masters of fluency, they are surprisingly bad at strict compliance when the constraints get complex.
In the research paper “Control Large Language Models via Divide and Conquer,” researchers from UCLA and UC Merced perform a comprehensive autopsy on why LLMs fail at these tasks. More importantly, they propose a novel, algorithm-agnostic solution called Divide and Conquer (DnC) that boosts success rates by over 90% in complex scenarios.
In this deep dive, we will explore why modern models struggle with simple checklists, the hidden biases in how they read your prompts, and how a recursive “divide and conquer” strategy can turn a forgetful model into a compliant one.
The Problem: The Illusion of Control
We often assume that because LLMs can pass the Bar Exam or write poetry, they can easily handle a task like: “Write a sentence using these 10 words.”
However, LLMs are probabilistic engines, not logic machines. They predict the next token based on statistical likelihood, not based on a checklist of constraints held in working memory. When you increase the number of constraints, the model’s “cognitive load” (to borrow a human term) increases, and it begins to prioritize fluency over accuracy.
What is Lexically Constrained Generation?
In formal terms, Lexically Constrained Generation involves an input prompt containing a set of constraints \(X = [x_1, ..., x_m]\) (keywords). The goal is to generate a sentence \(Y\) such that every keyword in \(X\) appears in \(Y\).
To measure success, the researchers define two critical metrics.
1. Instance Success Rate: This is a binary pass/fail metric. Did the model include every single keyword requested?

2. Keyword Coverage Rate: If the model failed, how badly did it fail? This measures the percentage of requested keywords that actually made it into the final text.
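To make these metrics concrete, here is a minimal Python sketch (my own, not from the paper) that scores a single generation using a simple whole-word match; the paper's exact matching rules may differ.

```python
import re

def keyword_present(keyword: str, text: str) -> bool:
    """Strict check: the keyword must appear as a whole word or phrase."""
    return re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE) is not None

def evaluate(keywords: list[str], generated: str) -> dict:
    """Compute both metrics for a single instance."""
    hits = [kw for kw in keywords if keyword_present(kw, generated)]
    return {
        "instance_success": len(hits) == len(keywords),    # binary pass/fail
        "keyword_coverage": len(hits) / len(keywords),     # fraction of keywords included
        "missing": [kw for kw in keywords if kw not in hits],
    }

# Example: the model remembered the name and age but dropped the credit score.
print(evaluate(
    ["Ben Smith", "age 42", "credit score"],
    "Ben Smith, age 42, works as a software engineer in Ohio.",
))
# -> instance_success=False, keyword_coverage≈0.67, missing=['credit score']
```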

Visualizing the Failure
The researchers illustrate this problem with a “Profile Writing” task. In the example below, the model is asked to write a profile for “Ben Smith” containing roughly 10 specific data points (Name, Age, FICO score, etc.).

As seen in Panel (a) of Figure 1, the “Vanilla” approach (asking the model once) results in a fluent paragraph, but it misses the “Housing” and “Education” constraints. The model got distracted by the narrative flow and dropped the data.
The Diagnosis: Why Do LLMs Fail?
Before introducing their solution, the researchers conducted a “sensitivity analysis” to understand the root causes of these failures. They tested models ranging from LLaMA-7b up to GPT-4. The results revealed three major weaknesses in how LLMs process instructions.
1. The Complexity Bottleneck
The most obvious finding is that performance collapses as you add more constraints. It is not a linear decline; it is a cliff.
The researchers used the CommonGen benchmark, asking models to construct sentences from lists of concepts. They expanded the difficulty by increasing the number of keywords from 3 up to 15.

As shown in Figure 2, while GPT-4 (the purple bar) remains robust for small sets (3-5 keywords), even it begins to struggle as complexity rises. The smaller models, like LLaMA-7b, are essentially useless once the keyword count passes 10.
When we look at the trend line for success rates across a wider range of keywords, the picture is even grimmer:

Key Takeaway: You cannot simply prompt your way out of this problem with a standard instruction. Once the constraint set exceeds a model’s “working memory,” the instance success rate for smaller models approaches zero.
2. Position Bias: The “Middle Child” Syndrome
Where you place your keywords in the prompt matters.
The researchers discovered that LLMs exhibit significant Position Bias. They don’t pay equal attention to every word in the input sequence.
- Primacy Effect: Some models (like GPT-4) prioritize keywords that appear early in the prompt.
- Recency Effect: Other models (like LLaMA2-7b) prioritize keywords that appear at the very end of the prompt.
Keywords buried in the middle of a list are the most likely to be ignored.

Figure 4 illustrates this sensitivity for LLaMA3. The downward trend indicates that as a keyword is placed later in the list (higher position index), the model is less likely to include it in the output. This means prompt engineering is incredibly fragile; simply shuffling your list of keywords could change the output entirely.
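A rough way to probe this bias yourself is to hold the keyword sets fixed, generate many outputs, and track how often a keyword survives as a function of where it sat in the prompt. The sketch below assumes a placeholder `generate()` for your model call and a `check()` such as the whole-word matcher shown earlier; it illustrates the idea rather than reproducing the paper's exact protocol.

```python
from collections import defaultdict

def positional_coverage(keyword_lists, generate, check):
    """Estimate keyword inclusion rate as a function of prompt position."""
    hits, totals = defaultdict(int), defaultdict(int)
    for keywords in keyword_lists:
        text = generate(keywords)                 # placeholder: one LLM call per keyword list
        for pos, kw in enumerate(keywords):
            totals[pos] += 1
            hits[pos] += int(check(kw, text))     # did the keyword at this slot survive?
    # A pronounced slope across positions signals position bias.
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}
```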
3. The Compound Word Trap
LLMs do not see words; they see tokens. This causes a unique failure mode for compound words.
Consider the keyword “courthouse”.
To a human, this is one concept. To an LLM, this might be tokenized as court and house.
The researchers found that models often “satisfy” the constraint by breaking the word apart. If the prompt asks for “courthouse,” the model might generate: “The basketball player hosted a tournament at the court built beside his house.”
Technically, the tokens are there. Semantically, the constraint was violated. LLaMA-13b incorrectly split 65% of compound words, and even GPT-4 failed on 42% of them. This suggests that the inherent complexity of the vocabulary itself is a barrier to control.
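The paper does not spell out its evaluation script, but the failure mode is easy to reproduce conceptually: a strict whole-word check rejects the split, while a naive piece-by-piece check is fooled. A small illustrative sketch:

```python
import re

def whole_word(keyword: str, text: str) -> bool:
    """Strict check: the keyword must appear as its own word."""
    return re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE) is not None

sentence = ("The basketball player hosted a tournament at the court "
            "built beside his house.")

# What was actually requested: the compound word itself.
print(whole_word("courthouse", sentence))                                # False -> violated

# What a lenient, piece-by-piece evaluator would accept.
print(whole_word("court", sentence) and whole_word("house", sentence))   # True -> fooled
```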
4. Low Responsiveness to Decoding Parameters
A common trick for developers is to tweak “decoding parameters”—specifically Temperature, Top-k, and Top-p. The assumption is that lowering the temperature makes the model more deterministic and focused, potentially improving constraint adherence.
The research refutes this.

As Figure 5 demonstrates, the performance lines are remarkably flat. Whether the temperature is 0.1 or 0.9, the LLaMA models (columns a and b) perform almost exactly the same regarding keyword coverage. GPT-4 (column c) shows some fluctuation, but the difference between the best and worst settings is less than 4%.
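If you want to reproduce this check against your own model, the experiment is just a grid sweep over decoding settings with the prompt held fixed; the `generate` and `coverage` calls below are placeholders for your model API and the coverage metric sketched earlier.

```python
import itertools

def sweep_decoding_params(generate, keywords, coverage):
    """Measure keyword coverage across a grid of decoding parameters."""
    results = {}
    for temp, top_p in itertools.product([0.1, 0.5, 0.9], [0.5, 0.9, 1.0]):
        text = generate(keywords, temperature=temp, top_p=top_p)   # placeholder model call
        results[(temp, top_p)] = coverage(keywords, text)
    return results   # in the paper's experiments, these numbers barely move
```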
The verdict: You cannot tune hyperparameters to fix a fundamental lack of reasoning capability.
The Solution: Divide and Conquer (DnC)
Since the models struggle with complexity (too many keywords at once), the researchers propose a strategy based on a fundamental computer science principle: Divide and Conquer.
If the model cannot handle 15 keywords, do not ask it to. Ask it to handle what it can, identify what it missed, and try again.
The DnC Algorithm
The strategy is surprisingly simple yet effective. It works as an iterative loop (refer back to Figure 1b for the visual representation).
- Initial Generation: Ask the LLM to generate a sentence \(s'\) using the full set of keywords \(X\).
- Assessment: Extract the keywords that actually appeared in \(s'\). Let’s call the satisfied keywords \(Y\).
- Identify Failures: Calculate the set of missing keywords: \(X_{miss} = X \setminus Y\).
- The “Conquer” Step: If \(X_{miss}\) is not empty, prompt the LLM again. But this time, ask it specifically to generate content containing only the missing keywords.
- Merge: Take the new content (which contains the previously missing words) and merge it with the existing draft \(s'\).
- Repeat: Continue this cycle for a fixed number of iterations (\(k\)) or until all keywords are present.
This approach transforms a “hard” parallel processing task (doing 15 things at once) into a series of “easy” serial tasks (doing 3 things, then the next 3).
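Expressed as code, the loop is only a few lines. The sketch below is my own reading of the procedure, with `generate`, `merge`, and `check` as placeholders: `generate(kws)` prompts the model to write text containing `kws`, `merge(text, patch)` prompts it to weave the patch into the existing draft, and `check` is a keyword matcher like the one shown earlier.

```python
def divide_and_conquer(generate, merge, check, keywords, max_iters=5):
    """Iteratively regenerate only the missing keywords and merge them back in."""
    text = generate(keywords)                                      # initial attempt on the full set X
    for _ in range(max_iters):
        missing = [kw for kw in keywords if not check(kw, text)]   # X_miss = X \ Y
        if not missing:
            break                                                  # every constraint satisfied
        patch = generate(missing)                                  # "conquer": target only the misses
        text = merge(text, patch)                                  # fold the patch into the draft
    return text
```

In practice, each of the three callables can be a single chat-completion call, e.g. "write a short passage using all of these words" for `generate` and "rewrite this passage so it also mentions these words, keeping it fluent" for `merge`.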
Why It Works
The breakdown succeeds because it respects the model’s limitations.
- It mitigates Complexity by reducing the number of active constraints in any single inference step.
- It mitigates Position Bias because missing words effectively move to the “front” of the line in the next iteration.
- It is Model Agnostic. It works for “Black Box” models (like GPT-4 via API) just as well as open-source models because it doesn’t require access to model weights or logits.
Experimental Results: A Quantum Leap
The researchers compared the Divide and Conquer (DnC) strategy against a baseline of Rejection Sampling (RJ). Rejection sampling simply asks the model to generate the text again and again until it (hopefully) gets it right.
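For contrast, rejection sampling is the same kind of loop but with no feedback: the model is re-prompted with the full keyword set and the output is simply discarded on failure. A sketch, using the same placeholder calls as above:

```python
def rejection_sampling(generate, check, keywords, max_attempts=5):
    """Baseline: regenerate from scratch until every keyword appears (or attempts run out)."""
    for _ in range(max_attempts):
        text = generate(keywords)                    # identical prompt every time, no feedback
        if all(check(kw, text) for kw in keywords):
            return text                              # got lucky: all constraints satisfied
    return text                                      # still missing keywords after the last try
```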
Success Rates
The difference in performance is staggering.

Figure 6 plots the Instance Error Rate (lower is better).
- Rejection Sampling (Green/Grey lines): Even after 5 attempts, the models still fail 40-80% of the time. They are just repeating the same mistakes.
- Divide and Conquer (Blue/Gold lines): The error rate plummets. Within 4 iterations, both LLaMA-7b and GPT-3.5 achieve near-perfect performance (0% error rate).
Real-World Applications
The researchers tested DnC on three realistic downstream tasks: Recipe Generation, Table-to-Text Summarization, and Profile Writing.
The results, shown in Table 1 below, highlight the robustness of the method.

Look at the LLaMA2-7b-chat row. In the “Recipe Generation (\(n=15\))” task, the base model had a success rate of 5%. It was essentially incapable of the task. With the DnC strategy applied (“LLaMA2-7b-chat (DnC-5)”), that success rate jumped to 98%.
This transforms a small, 7-billion parameter model—which usually isn’t smart enough for complex constraints—into a tool that outperforms a raw GPT-3.5 query.
What About Text Quality?
A concern with iterative merging is that the final text might look like a “Frankenstein” monster—choppy sentences stitched together.
To verify this, the researchers conducted human evaluations and automatic evaluations using GPT-4-turbo. They rated the outputs on Coherence, Fluency, and Readability.
- Readability/Fluency: Both methods scored 5.0/5.0.
- Coherence: The DnC text scored 4.88, practically identical to the vanilla text’s 4.94.
The iterative merging process (likely handled by the LLM itself during the “merge” step) maintains the narrative flow while injecting the missing data.
Conclusion
The paper “Control Large Language Models via Divide and Conquer” offers a vital reality check for anyone building applications on top of LLMs. We cannot assume that prompts alone act as functional code. When constraints increase, probability distributions fail to guarantee inclusion, leading to dropped data and hallucinations.
The limitations identified—complexity bottlenecks, position bias, and tokenization issues—are inherent to the transformer architecture’s current usage. We cannot simply “tune” our way out of them with temperature settings.
However, the proposed Divide and Conquer strategy proves that we don’t necessarily need smarter models to get better results; we need smarter workflows. By wrapping the LLM in a control loop that iteratively checks for errors and prompts specifically to fix them, we can achieve nearly 100% reliability on tasks that previously broke the models.
For students and developers, the lesson is clear: Don’t trust the model to handle a complex checklist in one shot. Break it down, verify the output, and iterate. Control comes not from the prompt, but from the process.