Introduction

If you have ever used tools like GitHub Copilot or Amazon CodeWhisperer, you know the magic of watching a Large Language Model (LLM) turn a simple comment into a functioning Python function or a complex Java class. These models, trained on massive repositories of general-purpose code, have revolutionized software development.

However, the magic often fades when you step away from mainstream languages like Python or C++ and into the world of Domain-Specific Languages (DSLs).

Consider an enterprise DevOps engineer writing Ansible YAML configurations or complex Bash scripts for server automation. These languages are governed by strict, specific schemas that change from module to module. When an LLM tries to generate this code, it often “hallucinates”: inventing flags that don’t exist, fabricating parameter names, or failing to adhere to the strict indentation rules of YAML.

Why does this happen? General-purpose LLMs simply haven’t seen enough of these specialized enterprise libraries during their pre-training. While standard fine-tuning or in-context learning (showing the model examples) can help, they often fail when data is scarce—which is almost always the case for custom enterprise DSLs.

In this post, we are deep-diving into DocCGen (Document-based Controlled Code Generation). This framework, proposed by researchers from IIT Bombay and IBM Research, introduces a robust solution to the “DSL gap.” By treating code generation as a two-step process involving retrieval and constrained decoding, DocCGen forces LLMs to adhere to the strict documentation of libraries they may have never even seen before.

Figure 1: Illustration of shortcomings with fine-tuning and DocPrompting approaches compared to DocCGen.

As illustrated in Figure 1, standard methods (like DocPrompting or simple fine-tuning) often generate commands with invalid arguments. In contrast, DocCGen successfully identifies the correct parameters (like chcon --role) by strictly following the documentation.

The Problem with DSLs and LLMs

To understand why DocCGen is necessary, we first need to understand the unique challenges of Domain-Specific Languages.

Languages like Python are “resource-rich”: there are billions of lines of publicly available Python code for models to learn from. In contrast, DSLs like Ansible YAML, JSON, HCL (HashiCorp Configuration Language), or specific Bash utilities are low-resource and highly structured. They are used to configure systems, manage infrastructure, and automate workflows.

These languages present two major hurdles for standard LLMs:

  1. Strict Schema: A Python function might work even if you write it in slightly different ways. A YAML configuration for a specific Ansible module, however, will crash if you provide a key called user_name when the schema expects username (see the short sketch after this list).
  2. Unseen Libraries: Enterprises often create custom internal libraries. An LLM cannot memorize the syntax of a library that didn’t exist in its training data.
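To make the first point concrete, here is a toy validator in Python. The module schema (the set of allowed keys) is invented for this example rather than taken from a real Ansible module; it only illustrates why near-miss key names fail.

```python
# Toy illustration: strict schemas reject near-miss keys.
# The allowed keys below are a hypothetical module schema, invented for this example.
ALLOWED_KEYS = {"username", "password", "state"}

def validate_task(params: dict) -> list[str]:
    """Return the parameter names that the (hypothetical) module schema does not accept."""
    return [key for key in params if key not in ALLOWED_KEYS]

# "user_name" is a statistically plausible guess for an LLM,
# but the schema only knows "username", so validation fails.
print(validate_task({"user_name": "alice", "state": "present"}))  # ['user_name']
```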

Attempts to fix this using Retrieval-Augmented Generation (RAG)—where the model is given relevant documentation in the prompt—are helpful but insufficient. Even with the documentation in front of it, an LLM generates tokens probabilistically. It might still “guess” a wrong keyword if that keyword is statistically probable in its training distribution, ignoring the documentation provided in the prompt.

The DocCGen Framework

The researchers propose a neuro-symbolic approach that doesn’t just ask the model to follow the documentation—it forces it to.

DocCGen operates in two distinct stages:

  1. Information Retrieval (IR): Detecting the correct library and retrieving its documentation.
  2. Constrained Generation: Using the grammar and schema rules from that documentation to constrain the model’s output logits.

Let’s break down the architecture.

Figure 2: Overview of DocCGen. The process flows from user query to retrieval, template creation, and finally constrained generation.

Stage 1: Information Retrieval

Everything starts with the user’s Natural Language (NL) query, such as “log in to your lastpass account.”

The first challenge is identifying which tool or library is needed. The system searches through a pool of library documentation. The authors experimented with two types of retrieval systems:

  • Sparse Retrieval (BM25): This traditional method matches keywords between the query and the documents (a toy scorer is sketched after this list).
  • Dense Retrieval (ColBERTv2): This neural retriever compares the query and the documents in embedding space, capturing semantic similarity beyond exact keyword overlap.
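To ground the sparse option, here is a minimal, self-contained BM25 scorer. It illustrates the keyword-matching idea only; it is not the retriever used in the paper, and the toy “manuals” below are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}  # document frequency
    idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            numerator = tf[t] * (k1 + 1)
            denominator = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf.get(t, 0.0) * numerator / denominator
        scores.append(score)
    return scores

# Invented toy "manuals": one short description per candidate library.
docs = [
    "lpass log in to and manage your lastpass password vault".split(),
    "last show a listing of last logged in users".split(),
    "gopass a slightly more advanced password store".split(),
]
query = "log in to your lastpass account".split()
scores = bm25_scores(query, docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0 -> the lpass manual wins
```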

As shown in the table below, Dense Retrieval (specifically fine-tuned ColBERTv2) significantly outperforms sparse retrieval in identifying the correct library (Hits@1), which is a critical first step. If you retrieve the wrong manual, you will inevitably generate the wrong code.

Table 4: Performance of sparse and dense retrieval across NL-to-Code tasks.

Once the relevant documents (e.g., the manual for lpass) are retrieved, the system extracts the grammar and schema of the library.

Stage 2: Constrained Decoding

This is the core innovation of DocCGen. Instead of letting the LLM generate the next token freely, the framework restricts the model’s choices based on the retrieved schema.

This is achieved through a concept called Templates. A template encodes the structure of the code snippet and consists of two parts (a minimal sketch in code follows the list):

  1. Static Part: Code that must be there (e.g., the command name git mv).
  2. Variable Part: The dynamic elements that the model needs to generate (e.g., flags like --force or arguments like filename).
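One plausible way to picture a template in code is shown below. The field names and structure are our own illustration, not the paper’s data format; the git mv flags come from the standard man page.

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    """Illustrative encoding of a code template (field names are ours, not the paper's)."""
    static: str                                            # code that must appear verbatim
    allowed_flags: set[str] = field(default_factory=set)   # variable part: valid flags
    takes_arguments: bool = True                           # variable part: free-form arguments

# Template for the "git mv" utility: the command itself is static,
# while the flags and the file names are variable parts.
git_mv = Template(
    static="git mv",
    allowed_flags={"--force", "-f", "-k", "-n", "-v"},
)
```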

How Constrained Decoding Works

When the LLM is generating code, it produces a probability distribution (logits) for every possible word in its vocabulary. Usually, we pick the word with the highest probability.

In Constrained Decoding, the system looks at the current template and the library schema, identifies the set of valid next tokens, and sets the logits of all invalid tokens to negative infinity (\(-\infty\)). After the softmax, those tokens have zero probability, so the model simply cannot generate a syntax error or a hallucinated flag.
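In code, the masking step itself is short. The sketch below assumes PyTorch logits over the vocabulary and a set of allowed token ids already computed from the template; valid_next_tokens is a hypothetical helper standing in for the algorithms described next.

```python
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: set[int]) -> torch.Tensor:
    """Keep only the tokens permitted by the current template and schema.

    `logits` has shape (vocab_size,). Every token outside the allowed set gets
    a logit of -inf, i.e. probability zero after the softmax.
    """
    masked = torch.full_like(logits, float("-inf"))
    allowed = torch.tensor(sorted(allowed_token_ids), dtype=torch.long)
    masked[allowed] = logits[allowed]
    return masked

# At each decoding step (greedy decoding shown for simplicity):
#   allowed = valid_next_tokens(template, generated_so_far)   # hypothetical helper
#   next_id = torch.argmax(mask_logits(logits, allowed)).item()
```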

The process involves several algorithms working in tandem:

1. String Selection Algorithm

This algorithm constrains the model to generate a string exactly from a pre-defined set. For example, if a specific Ansible module only allows the keys src, dest, and mode, the String Selection algorithm restricts the model’s vocabulary to only tokens that form these specific words.
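One common way to realize this constraint (not necessarily the paper’s exact implementation) is prefix matching: a token is allowed only if the text generated so far, extended by that token, is still a prefix of some permitted string.

```python
def allowed_next_tokens(partial: str, allowed_strings: set[str], vocab: dict[int, str]) -> set[int]:
    """Token ids that keep `partial + token` a prefix of some allowed string.

    A simple prefix-matching sketch of string selection; a real system would
    precompute a trie over the tokenized allowed strings instead of scanning.
    """
    return {
        token_id
        for token_id, token_text in vocab.items()
        if any(s.startswith(partial + token_text) for s in allowed_strings)
    }

# Toy vocabulary and the keys allowed by a hypothetical Ansible module:
vocab = {0: "sr", 1: "c", 2: "dest", 3: "mode", 4: "user", 5: "name"}
keys = {"src", "dest", "mode"}
print(allowed_next_tokens("", keys, vocab))    # {0, 2, 3}: "sr", "dest" and "mode" can start a key
print(allowed_next_tokens("sr", keys, vocab))  # {1}: only "c" completes "src"
```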

2. Library Selection

When the process begins, the system has retrieved \(k\) potential libraries (e.g., lpass, last, gopass). The model is first constrained to select one of these library templates. Once selected, the model is locked into that library’s specific grammar.
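Library selection can be viewed as the same fixed-string constraint applied one level up: the allowed strings are simply the k retrieved library names. Reusing the allowed_next_tokens helper from the previous sketch (the tiny vocabulary is again invented):

```python
# Constrain the first tokens to one of the k retrieved library names.
retrieved_libraries = {"lpass", "last", "gopass"}
print(allowed_next_tokens("", retrieved_libraries, vocab={0: "l", 1: "go", 2: "rm"}))
# {0, 1}: decoding can start "lpass"/"last" or "gopass", but never "rm"
```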

3. Dynamic Template Evolution & Trigger Signals

The most impressive part of DocCGen is its ability to handle complex, nested structures. Code isn’t always linear; a decision made early in a command can change what is valid later on.

The authors introduce Trigger Signals: rules that detect specific tokens and change the guiding template on the fly (a small sketch follows the two examples below).

  • Bash Example: In the command git commit -m "message", the flag -m expects a string argument. However, if the model generates a pipe symbol |, this acts as a trigger signal. It tells the system, “We are starting a new process,” and the system resets to allow the selection of a new utility for the next part of the pipeline.
  • YAML Example: In Ansible, indentation matters. If the model generates a newline and an indent, strict rules apply. If the indentation indicates a nested object, the Trigger Signal switches the active schema to the nested object’s rules. If the indentation is invalid, the system backtracks and forces correct indentation.
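A trigger signal can be as simple as a check on the text generated so far that swaps the active constraint. The control flow below is our own illustration of the Bash pipe case, not the paper’s implementation.

```python
def apply_trigger(generated: str, active_template: str, library_templates: set[str]) -> dict:
    """Swap the guiding constraint when a trigger token appears in the output.

    Illustrative control flow only: a '|' in Bash starts a new process, so
    decoding resets to choosing among all utility templates again.
    """
    if generated.rstrip().endswith("|"):
        return {"mode": "select_library", "choices": set(library_templates)}
    # No trigger fired: keep following the current utility's flag/argument schema.
    return {"mode": "follow_template", "template": active_template}

state = apply_trigger("git log --oneline |", active_template="git-log",
                      library_templates={"grep", "head", "wc"})
print(state["mode"])  # select_library
```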

Experimental Setup

To validate this framework, the authors conducted extensive experiments on two complex structured languages: Bash Commands and Ansible YAML.

The Datasets

  1. Bash: They used the TLDR dataset, which contains thousands of natural-language-to-Bash command pairs. They augmented this data by scraping Linux man-pages to get the strict schema and descriptions for every utility.
  2. Ansible YAML: The authors created a brand new benchmark dataset. They scraped Google BigQuery and Ansible Galaxy to curate over 18,000 NL-to-YAML pairs covering more than 2,500 modules. This is the first publicly available benchmark for NL-to-Structured-Code generation of this scale.

Table 5: Statistics for NL to Ansible-YAML dataset.

Evaluation Settings

The experiments were divided into two rigorous settings:

  • In-Domain (ID): The test set contains libraries that were present in the training set, but with very few examples (simulating a low-resource environment).
  • Out-of-Domain (OOD): The test set contains completely unseen libraries. This is the “holy grail” of DSL generation—can the model generate code for a tool it has never been trained on, simply by reading the documentation?

Results and Analysis

The results show that DocCGen consistently outperforms state-of-the-art baselines, including fine-tuned models and standard RAG approaches like DocPrompting.

1. Syntactic and Semantic Correctness

The primary metrics used were Token F1 (overlap with ground truth), Schema Correct (is the code valid?), and Ansible Aware (does it capture the right keys/values?).
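For reference, token-level F1 is the harmonic mean of precision and recall over the overlap between predicted and reference tokens. Below is a minimal sketch assuming simple whitespace tokenization (the paper’s exact tokenization may differ).

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the token overlap."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("chcon --role user_r file", "chcon --role=user_r file"), 2))  # 0.57
```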

In the In-Domain (ID) setting, where the model has seen the libraries before, DocCGen (represented as “base + IR + CD” in the table below) shows massive improvements.

Table 2: Results for each fine-tuned language model for ID setting.

Take a look at the Schema Correct score for the StarCoder2 3B model.

  • Base fine-tuning achieves only 4.65% correctness.
  • Adding Retrieval (IR) bumps it to 6.11%.
  • Adding Constrained Decoding (DocCGen) rockets the score to 51.08%.

This proves that simply retrieving documentation isn’t enough; you must constrain the generation to ensure the model adheres to the retrieved schema.

2. Out-of-Domain (OOD) Performance

The OOD results are perhaps the most significant. Here, the models are generating code for libraries they have never seen during training.

Standard fine-tuning fails catastrophically here because the model tries to hallucinate commands based on similar words it knows. DocCGen, however, retrieves the new library’s manual and enforces its template. This allows the model to generate syntactically correct code for completely novel tools.

3. Performance in Low-Resource Scenarios

One of the key motivations for DocCGen is that DSLs often have very few training samples. The researchers analyzed how performance changes as the number of training samples per module increases (from 4 to 7).

Figure 3: Performance of StarCoder 1B for NL to Ansible-YAML over varying number of train samples.

In Figure 3, the green line (DocCGen) consistently stays above the blue (Fine-Tuning + Retrieval) and orange (Vanilla Fine-Tuning) lines. Even with very few samples, the constrained decoding ensures a baseline of high performance because the “grammar” is derived from documentation, not learned solely from scarce examples.

This trend holds true across different model sizes, from GPT Neo 1.3B to StarCoder2 7B.

Figure 4: Performance of various models in different configurations.

As shown in Figure 4, the gap between the green line (DocCGen) and the others is distinct. The orange line (pure fine-tuning) often hovers near zero for metrics like “Schema Correctness,” emphasizing that without constraints, LLMs struggle to memorize complex schemas from just a handful of examples.

Conclusion

The “DocCGen” paper highlights a critical limitation in current Generative AI: while LLMs are creative, they are not naturally compliant with strict rules. For general prose or Python, this creativity is a feature. For configuring a firewall via Ansible or executing a file operation via Bash, it is a bug.

DocCGen bridges this gap effectively. By decomposing the problem into Library Retrieval and Constrained Decoding, the framework ensures that:

  1. The model knows what tool to use (via Retrieval).
  2. The model knows how to use it strictly (via Templates and Schema constraints).

Key Takeaways for Students

  • Retrieval is not enough: Providing context (RAG) helps, but models can still ignore it. Constrained decoding is necessary for strict adherence.
  • Neuro-Symbolic is powerful: Combining the “fuzzy” reasoning of Neural Networks (LLMs) with the “strict” logic of Symbolic AI (templates/grammars) provides the best of both worlds.
  • Documentation is Data: For enterprises, maintaining high-quality documentation for internal DSLs is just as valuable as maintaining code datasets.

DocCGen suggests a future where AI assistants don’t just “guess” how to use your internal tools—they read the manual and follow it, character for character.