If you have ever tried to use a Large Language Model (LLM) like GPT-4 or LLaMA for a strict data-processing task, you have likely encountered a frustrating phenomenon. You provide a prompt with a list of specific requirements—perhaps ten different demographic facts that must appear in a generated user profile—and the model confidently produces a fluent, professional-sounding paragraph.

But when you check the details, something is wrong. It mentioned the name and the occupation but forgot the credit score. Or perhaps it hallucinated the location.

This problem is known as Lexically Constrained Generation (LCG). It is the task of forcing an LLM to include a specific set of keywords or phrases in its output. While LLMs are masters of fluency, they are surprisingly bad at strict compliance when the constraints get complex.

In the research paper “Control Large Language Models via Divide and Conquer,” researchers from UCLA and UC Merced perform a comprehensive autopsy on why LLMs fail at these tasks. More importantly, they propose a simple, model-agnostic solution called Divide and Conquer (DnC) that pushes success rates above 90% in scenarios where a single prompt fails most of the time.

In this deep dive, we will explore why modern models struggle with simple checklists, the hidden biases in how they read your prompts, and how a recursive “divide and conquer” strategy can turn a forgetful model into a compliant one.


The Problem: The Illusion of Control

We often assume that because LLMs can pass the Bar Exam or write poetry, they can easily handle a task like: “Write a sentence using these 10 words.”

However, LLMs are probabilistic engines, not logic machines. They predict the next token based on statistical likelihood, not based on a checklist of constraints held in working memory. When you increase the number of constraints, the model’s “cognitive load” (to borrow a human term) increases, and it begins to prioritize fluency over accuracy.

What is Lexically Constrained Generation?

In formal terms, Lexically Constrained Generation involves an input prompt containing a set of constraints \(X = [x_1, ..., x_m]\) (keywords). The goal is to generate a sentence \(Y\) such that every keyword in \(X\) appears in \(Y\).

To measure success, the researchers define two critical metrics.

1. Instance Success Rate: This is a binary pass/fail metric. Did the model include every single keyword requested?

\[
\text{Success}(X, Y) =
\begin{cases}
1 & \text{if } X \subseteq Y \\
0 & \text{otherwise}
\end{cases}
\]

2. Keyword Coverage Rate: If the model failed, how badly did it fail? This measures the percentage of requested keywords that actually made it into the final text.

\[
\text{Coverage}(X, Y) = \frac{\lvert \{\, x_i \in X : x_i \text{ appears in } Y \,\} \rvert}{\lvert X \rvert} = \frac{\text{number of satisfied constraints}}{\text{total number of constraints}}
\]
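To make these metrics concrete, here is a minimal Python sketch of how a single output might be scored. The whole-word regex matching and the example keywords are my own simplifications for illustration, not the paper’s evaluation code.

```python
import re

def keyword_present(keyword: str, text: str) -> bool:
    """True if the keyword appears in the text as a whole word or phrase (case-insensitive)."""
    return re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE) is not None

def instance_success(keywords: list[str], text: str) -> int:
    """Binary pass/fail: 1 only if every requested keyword appears in the output."""
    return int(all(keyword_present(k, text) for k in keywords))

def keyword_coverage(keywords: list[str], text: str) -> float:
    """Fraction of requested keywords that made it into the output."""
    return sum(keyword_present(k, text) for k in keywords) / len(keywords)

# Example: the profile mentions the name and occupation but drops the FICO score.
keywords = ["Ben Smith", "teacher", "FICO score"]
output = "Ben Smith works as a teacher and enjoys hiking on weekends."
print(instance_success(keywords, output))            # 0  (one keyword missing)
print(round(keyword_coverage(keywords, output), 2))  # 0.67
```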

Visualizing the Failure

The researchers illustrate this problem with a “Profile Writing” task. In the example below, the model is asked to write a profile for “Ben Smith” containing roughly 10 specific data points (Name, Age, FICO score, etc.).

Figure 1: Vanilla generation vs. Divide and Conquer. Panel (a): a vanilla prompt fails to include the housing and education data. Panel (b): the Divide and Conquer method catches the missing information and merges it back in.

As seen in Panel (a) of Figure 1, the “Vanilla” approach (asking the model once) results in a fluent paragraph, but it misses the “Housing” and “Education” constraints. The model got distracted by the narrative flow and dropped the data.


The Diagnosis: Why Do LLMs Fail?

Before introducing their solution, the researchers conducted a “sensitivity analysis” to understand the root causes of these failures. They tested models ranging from LLaMA-7b up to GPT-4. The results revealed three major weaknesses in how LLMs process instructions.

1. The Complexity Bottleneck

The most obvious finding is that performance collapses as you add more constraints. It is not a linear decline; it is a cliff.

The researchers used the CommonGen benchmark, asking models to construct sentences from lists of concepts. They expanded the difficulty by increasing the number of keywords from 3 up to 15.

Figure 2: Instance success rates by number of concepts. GPT-4 starts high but declines; the LLaMA models crash to near zero as the concept count grows.

As shown in Figure 2, while GPT-4 (the purple bar) remains robust for small sets (3-5 keywords), even it begins to struggle as complexity rises. The smaller models, like LLaMA-7b, are essentially useless once the keyword count passes 10.

When we look at the trend line for success rates across a wider range of keywords, the picture is even grimmer:

Figure 3: Instance success rates drop dramatically as the number of keywords increases from 3 to 15.

Key Takeaway: You cannot simply prompt your way out of this problem with a standard instruction. Once the constraint set exceeds a model’s “working memory,” the instance success rate for smaller models approaches zero.

2. Position Bias: The “Middle Child” Syndrome

Where you place your keywords in the prompt matters.

The researchers discovered that LLMs exhibit significant Position Bias. They don’t pay equal attention to every word in the input sequence.

  • Primacy Effect: Some models (like GPT-4) prioritize keywords that appear early in the prompt.
  • Recency Effect: Other models (like LLaMA2-7b) prioritize keywords that appear at the very end of the prompt.

Keywords buried in the middle of a list are the most likely to be ignored.

Figure 4: Keyword coverage rate for LLaMA3-8b by keyword position. Coverage drops for keywords placed later in the sequence.

Figure 4 illustrates this sensitivity for LLaMA3. The downward trend indicates that as a keyword is placed later in the list (higher position index), the model is less likely to include it in the output. This means prompt engineering is incredibly fragile; simply shuffling your list of keywords could change the output entirely.
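If you want to see this fragility on your own model, you can shuffle the keyword order across many trials and track how often the keyword at each position survives. The sketch below reuses the `keyword_present` checker from the metrics snippet; `generate` is a stand-in for whatever model call you use, not an API from the paper.

```python
import random
from typing import Callable

def build_prompt(keywords: list[str]) -> str:
    """Assemble a simple constrained-generation prompt from an ordered keyword list."""
    return ("Write one coherent paragraph that includes all of these keywords: "
            + ", ".join(keywords) + ".")

def coverage_by_position(keywords: list[str],
                         generate: Callable[[str], str],
                         trials: int = 20) -> list[float]:
    """Estimate how often the keyword shown at each prompt position survives into the output."""
    hits = [0] * len(keywords)
    for _ in range(trials):
        order = random.sample(keywords, k=len(keywords))   # random presentation order
        text = generate(build_prompt(order))               # your model call goes here
        for pos, kw in enumerate(order):
            hits[pos] += keyword_present(kw, text)
    return [h / trials for h in hits]                      # one coverage value per position
```

A strong downward slope in the returned list reproduces a primacy effect; an upward slope indicates a recency effect.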

3. The Compound Word Trap

LLMs do not see words; they see tokens. This causes a unique failure mode for compound words.

Consider the keyword “courthouse”. To a human, this is one concept. To an LLM, this might be tokenized as court and house.

The researchers found that models often “satisfy” the constraint by breaking the word apart. If the prompt asks for “courthouse,” the model might generate: “The basketball player hosted a tournament at the court built beside his house.”

Technically, the tokens are there. Semantically, the constraint was violated. LLaMA-13b incorrectly split 65% of compound words, and even GPT-4 failed on 42% of them. This suggests that the inherent complexity of the vocabulary itself is a barrier to control.
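This failure mode is easy to miss if your evaluation only checks that the pieces of a compound word occur somewhere in the text. A small illustration (my own, not the paper’s checker):

```python
import re

sentence = ("The basketball player hosted a tournament at the court "
            "built beside his house.")

# Checking the pieces separately "passes": both tokens appear somewhere.
pieces_present = all(part in sentence for part in ("court", "house"))          # True

# Checking the compound keyword as a whole word shows the constraint is violated.
compound_present = re.search(r"\bcourthouse\b", sentence, re.I) is not None    # False

print(pieces_present, compound_present)  # True False
```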

4. Low Responsiveness to Decoding Parameters

A common trick for developers is to tweak “decoding parameters”—specifically Temperature, Top-k, and Top-p. The assumption is that lowering the temperature makes the model more deterministic and focused, potentially improving constraint adherence.

The research refutes this.

Figure 5: Varying Top-k, Temperature, and Top-p has almost no impact on keyword coverage across the LLaMA and GPT models.

As Figure 5 demonstrates, the performance lines are remarkably flat. Whether the temperature is 0.1 or 0.9, the LLaMA models (columns a and b) perform almost exactly the same regarding keyword coverage. GPT-4 (column c) shows some fluctuation, but the difference between the best and worst settings is less than 4%.

The verdict: You cannot tune hyperparameters to fix a fundamental lack of reasoning capability.
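If you want to confirm this flatness for your own setup, a simple sweep is enough. The sketch below reuses the helpers from the earlier snippets; `generate(prompt, temperature=...)` is a placeholder for whatever sampling interface your model exposes, not an API from the paper.

```python
from statistics import mean
from typing import Callable

def coverage_vs_temperature(keywords: list[str],
                            generate: Callable[..., str],
                            temps=(0.1, 0.3, 0.5, 0.7, 0.9),
                            trials: int = 10) -> dict[float, float]:
    """Average keyword coverage at each sampling temperature; flat results echo Figure 5."""
    prompt = build_prompt(keywords)
    return {
        t: mean(keyword_coverage(keywords, generate(prompt, temperature=t))
                for _ in range(trials))
        for t in temps
    }
```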


The Solution: Divide and Conquer (DnC)

Since the models struggle with complexity (too many keywords at once), the researchers propose a strategy based on a fundamental computer science principle: Divide and Conquer.

If the model cannot handle 15 keywords, do not ask it to. Ask it to handle what it can, identify what it missed, and try again.

The DnC Algorithm

The strategy is surprisingly simple yet effective. It works as an iterative loop (refer back to Figure 1b for the visual representation).

  1. Initial Generation: Ask the LLM to generate a sentence \(s\) using the full set of keywords \(X\).
  2. Assessment: Extract the keywords that actually appear in \(s\). Call this satisfied set \(Y\).
  3. Identify Failures: Compute the set of missing keywords: \(X_{miss} = X \setminus Y\).
  4. The “Conquer” Step: If \(X_{miss}\) is not empty, prompt the LLM again, this time asking it to generate new content \(s'\) containing only the missing keywords.
  5. Merge: Take the new content \(s'\) (which contains the previously missing words) and merge it with the original sentence \(s\).
  6. Repeat: Continue this cycle for a fixed number of iterations (\(k\)) or until all keywords are present.

This approach transforms a “hard” parallel processing task (doing 15 things at once) into a series of “easy” serial tasks (doing 3 things, then the next 3), as sketched below.
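Here is a minimal Python sketch of that loop, reusing the `build_prompt` and `keyword_present` helpers from the earlier snippets. The prompt wording, especially for the merge step, is my paraphrase of the idea rather than the paper’s exact prompts, and `generate` again stands in for your model call.

```python
from typing import Callable

def divide_and_conquer(keywords: list[str],
                       generate: Callable[[str], str],
                       max_iters: int = 5) -> str:
    """Iteratively patch and merge until every keyword is covered (or the budget runs out)."""
    # Step 1: initial generation with the full constraint set.
    sentence = generate(build_prompt(keywords))

    for _ in range(max_iters):
        # Steps 2-3: assess the output and collect the keywords that were dropped.
        missing = [k for k in keywords if not keyword_present(k, sentence)]
        if not missing:
            break  # every constraint satisfied

        # Step 4 ("conquer"): ask only for the missing keywords, a much easier sub-task.
        patch = generate(build_prompt(missing))

        # Step 5 (merge): let the model stitch the patch back into the original text.
        sentence = generate(
            "Merge the following two texts into one coherent passage, "
            "keeping every fact from both.\n"
            f"Text A: {sentence}\nText B: {patch}"
        )
    return sentence
```

The crucial design choice is that each follow-up call only ever sees the small set of missing keywords, so every individual prompt stays well inside the model’s comfort zone.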

Why It Works

The breakdown succeeds because it respects the model’s limitations.

  • It mitigates Complexity by reducing the number of active constraints in any single inference step.
  • It mitigates Position Bias because missing words effectively move to the “front” of the line in the next iteration.
  • It is Model Agnostic. It works for “Black Box” models (like GPT-4 via API) just as well as open-source models because it doesn’t require access to model weights or logits.

Experimental Results: A Quantum Leap

The researchers compared the Divide and Conquer (DnC) strategy against a baseline of Rejection Sampling (RJ). Rejection sampling simply asks the model to generate the text again and again until it (hopefully) gets it right.
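Under the same assumptions as the earlier sketches, rejection sampling is just a blind retry loop; the model never receives feedback about which keywords it dropped.

```python
from typing import Callable

def rejection_sampling(keywords: list[str],
                       generate: Callable[[str], str],
                       max_attempts: int = 5) -> str:
    """Re-prompt from scratch until the full constraint set happens to be satisfied."""
    sentence = generate(build_prompt(keywords))
    for _ in range(max_attempts - 1):
        if all(keyword_present(k, sentence) for k in keywords):
            break                                    # got lucky on this attempt
        sentence = generate(build_prompt(keywords))  # retry blind, with no hint of what failed
    return sentence                                  # may still be missing keywords
```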

Success Rates

The difference in performance is staggering.

Figure 6: Rejection Sampling (RJ) vs. Divide and Conquer (DnC). The DnC error rate drops to near zero after 4 iterations, while RJ plateaus.

Figure 6 plots the Instance Error Rate (lower is better).

  • Rejection Sampling (Green/Grey lines): Even after 5 attempts, the models still fail 40-80% of the time. They are just repeating the same mistakes.
  • Divide and Conquer (Blue/Gold lines): The error rate plummets. Within 4 iterations, both LLaMA-7b and GPT-3.5 drive the instance error rate to essentially zero.

Real-World Applications

The researchers tested DnC on three realistic downstream tasks: Recipe Generation, Table-to-Text Summarization, and Profile Writing.

The results, shown in Table 1 below, highlight the robustness of the method.

Table 1: Performance on Recipe Generation, Table-to-Text Summarization, and Profile Writing. DnC-5 (Divide and Conquer with 5 iterations) reaches 98-100% success on almost every task, vastly outperforming the base models.

Look at the LLaMA2-7b-chat row. In the “Recipe Generation (\(n=15\))” task, the base model had a success rate of 5%. It was essentially incapable of the task. With the DnC strategy applied (“LLaMA2-7b-chat (DnC-5)”), that success rate jumped to 98%.

This transforms a small, 7-billion parameter model—which usually isn’t smart enough for complex constraints—into a tool that outperforms a raw GPT-3.5 query.

What About Text Quality?

A concern with iterative merging is that the final text might look like a “Frankenstein” monster—choppy sentences stitched together.

To verify this, the researchers conducted human evaluations and automatic evaluations using GPT-4-turbo. They rated the outputs on Coherence, Fluency, and Readability.

  • Readability/Fluency: Both methods scored 5.0/5.0.
  • Coherence: The DnC text scored 4.88, practically identical to the vanilla text’s 4.94.

The iterative merging process (likely handled by the LLM itself during the “merge” step) maintains the narrative flow while injecting the missing data.


Conclusion

The paper “Control Large Language Models via Divide and Conquer” offers a vital reality check for anyone building applications on top of LLMs. We cannot assume that prompts alone act as functional code. When constraints increase, probability distributions fail to guarantee inclusion, leading to dropped data and hallucinations.

The limitations identified—complexity bottlenecks, position bias, and tokenization issues—are inherent to how today’s transformer-based models process prompts. We cannot simply “tune” our way out of them with temperature settings.

However, the proposed Divide and Conquer strategy proves that we don’t necessarily need smarter models to get better results; we need smarter workflows. By wrapping the LLM in a control loop that iteratively checks for errors and prompts specifically to fix them, we can achieve nearly 100% reliability on tasks that previously broke the models.

For students and developers, the lesson is clear: Don’t trust the model to handle a complex checklist in one shot. Break it down, verify the output, and iterate. Control comes not from the prompt, but from the process.