Have you ever tried to edit a scanned document or a PDF where the source file was lost? It is often a frustrating experience. You might want to move a paragraph, change a header level, or update a table value. In a word processor, this is trivial. But in a document image, these elements are just pixels.
Recent advances in Generative AI, particularly diffusion models, have revolutionized image creation. However, they struggle significantly with document editing. If you ask a standard image generator to “change the date in the header,” it might smudge the text, hallucinate new letters, or destroy the table alignment. Documents are not just collections of pixels; they are structured information—text, layout, styling, and hierarchy.
In this post, we are doing a deep dive into DocEdit-v2, a research paper that proposes a sophisticated framework for handling this exact problem. By leveraging Large Multimodal Models (LMMs) and a novel grounding architecture, the researchers have created a system that can understand a user’s intent, find the exact spot in the document, and execute precise structural edits.
The Core Challenge: Why is Document Editing Hard?
To understand why DocEdit-v2 is necessary, we first need to appreciate the difficulty of the task. Language-Guided Document Editing involves three main hurdles:
- Multimodal Grounding: The model must understand where in the image the user is referring to. If a user says “change the second bullet point,” the model needs to visually locate that specific bullet point among potentially dozens of others.
- Disambiguation: User requests are often vague. “Make it look better” or “Move the logo” requires the model to interpret specific attributes (position, style, content) that need changing.
- Structural Fidelity: The edit must look professional. Text must be legible, fonts must match, and the layout must not break. Pixel-level generation often fails here.
The authors of DocEdit-v2 hypothesize that the solution isn’t to generate new pixels directly, but to manipulate the underlying HTML and CSS representation of the document.
The DocEdit-v2 Framework
The DocEdit-v2 framework is an end-to-end pipeline designed to bridge the gap between a user’s natural language request and a structurally perfect document edit.

As shown in Figure 1 above, the process flows through three distinct stages:
- Doc2Command: Visual grounding and edit command generation.
- Command Reformulation: Translating technical commands into instructions for LMMs.
- Document Editing: Using LMMs (like GPT-4V or Gemini) to generate the final HTML structure.
Let’s break down each component.
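Before we do, here is a minimal sketch of how these three stages could be chained in code. Every name, type, and signature below is an illustrative assumption of mine, not the authors' released implementation; it only shows the shape of the control flow.

```python
# Minimal sketch of the three-stage control flow; all names and signatures
# here are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class GroundedCommand:
    action: str                      # e.g. "REPLACE"
    component: str                   # e.g. "TEXT"
    roi_mask: Any                    # predicted segmentation mask
    bbox: Tuple[int, int, int, int]  # box derived from the mask


def doc2command(document_image: Any, user_request: str) -> GroundedCommand:
    """Stage 1: ground the request visually and emit a structured command."""
    raise NotImplementedError


def reformulate(command: GroundedCommand, user_request: str) -> str:
    """Stage 2: rewrite the rigid command as a fluent LMM instruction."""
    raise NotImplementedError


def generate_edited_html(document_image: Any, instruction: str,
                         bbox: Tuple[int, int, int, int]) -> str:
    """Stage 3: prompt a multimodal model to regenerate the HTML/CSS."""
    raise NotImplementedError


def edit_document(document_image: Any, user_request: str) -> str:
    cmd = doc2command(document_image, user_request)
    instruction = reformulate(cmd, user_request)
    return generate_edited_html(document_image, instruction, cmd.bbox)
```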
1. Doc2Command: The “Brain” of the Operation
The first step is figuring out what to do and where to do it. The researchers introduced a novel component called Doc2Command. This is a multi-task model responsible for multimodal grounding (finding the Region of Interest, or RoI) and command generation (converting the request into a structured action).
The Architecture
The architecture of Doc2Command is fascinating because it treats the user request as a visual element.

As illustrated in Figure 2, the process works as follows:
- Visual Rendering: The user’s text request is actually rendered onto the document image itself (visually placed at the top).
- Image Encoder: This combined visual input is fed into a Vision Transformer (ViT) encoder. This allows the model to process the document layout and the text request simultaneously in the same visual space.
- Dual Output Streams:
  - Text Decoder: One branch generates a structured text command (e.g., ACTION_PARA: REPLACE, COMPONENT: TEXT).
  - Mask Transformer: The other branch generates a “segmentation map.” This is a pixel mask that highlights exactly which part of the document needs to be changed. (A rough sketch of this dual-head design follows below.)
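To make the dual-output idea concrete, here is a PyTorch-style sketch of a shared encoder feeding both a command decoder and a mask head. The module choices, layer counts, and dimensions are my assumptions for illustration (the mask transformer is simplified to a per-patch linear head); this is not the paper's actual architecture code.

```python
# Illustrative sketch of a Doc2Command-style multi-task model: a shared
# encoder over the (request + document) image feeding a text decoder and
# a mask head. Sizes and module choices are assumptions, not the paper's.
import torch.nn as nn


class Doc2CommandSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768):
        super().__init__()
        # Shared vision encoder over patch embeddings of the document image
        # with the user request rendered onto it.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True),
            num_layers=12,
        )
        # Branch 1: autoregressive text decoder that emits the structured
        # command tokens, e.g. "ACTION_PARA: REPLACE, COMPONENT: TEXT".
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True),
            num_layers=6,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Branch 2: mask head scoring each image patch as inside/outside the
        # region of interest (a coarse segmentation map).
        self.mask_head = nn.Linear(d_model, 1)

    def forward(self, patch_embeddings, command_token_embeddings):
        memory = self.encoder(patch_embeddings)            # (B, N_patches, D)
        dec = self.text_decoder(command_token_embeddings, memory)
        command_logits = self.lm_head(dec)                 # (B, T, vocab)
        mask_logits = self.mask_head(memory).squeeze(-1)   # (B, N_patches)
        return command_logits, mask_logits
```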
Visual Grounding in Action
The ability to segment the document accurately is what sets this method apart. Instead of just guessing coordinates, the model predicts a mask that covers the specific text or object.

In the example above (Figure 5b), notice how the predicted mask (highlighted in bright white) accurately covers the complex table structure referenced by the request. The model doesn’t just draw a box around the whole page; it isolates the specific element of interest.

Furthermore, as seen in Figure 5c, the model is context-aware. If there are multiple instances of similar text, it uses positional cues (like “from mid to left”) to ground the correct instance.
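Downstream, the grounded region has to be expressed as bounding-box coordinates that the editing stage can consume. Here is a tiny NumPy sketch of that mask-to-box conversion; it is my own illustration of the idea, not code from the paper.

```python
import numpy as np


def mask_to_bbox(mask: np.ndarray, threshold: float = 0.5):
    """Convert a 2-D RoI score mask into (x_min, y_min, x_max, y_max).

    Illustrative assumption about how a predicted mask could be turned
    into the box coordinates used later in the editing prompt.
    """
    ys, xs = np.where(mask > threshold)
    if len(xs) == 0:
        return None  # nothing grounded above the threshold
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Example: a toy 6x6 mask with a highlighted 2x3 region.
toy = np.zeros((6, 6))
toy[2:4, 1:4] = 0.9
print(mask_to_bbox(toy))  # (1, 2, 3, 3)
```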
2. Command Reformulation: Bridging the Gap
The Doc2Command module outputs a structured command, but these commands are often rigid and software-specific (derived from datasets designed for programmatic editing). They might look like MODIFY ACTION_PARA TEXT....
While precise, these commands are not the best “prompt” for a general-purpose Large Multimodal Model like GPT-4V. These models prefer natural, descriptive instructions.
To solve this, the researchers utilize Command Reformulation. They use an LLM to translate the rigid command from Doc2Command into a fluent, explicit instruction.

Figure 3 demonstrates this transformation.
- (a) Left: A raw command might simply say “MODIFY… FINAL_STATE.” It is cryptic.
- (b) Right: The reformulated command translates this into a clear directive: “Modify the textual content… specifically targeting components where emission-related terms appear.”
This step resolves ambiguity. If a user request is vague, the reformulation step uses the context to flesh out the details before the final editing takes place.
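In practice this can be as simple as a prompt template wrapped around an LLM call. The wording below and the `call_llm` helper are assumptions for illustration; the paper does not publish this exact prompt.

```python
# Sketch of command reformulation: an LLM rewrites the rigid Doc2Command
# output as a fluent editing instruction. Prompt wording and `call_llm`
# are assumptions, not the authors' prompt.

REFORMULATION_PROMPT = """You are helping edit a document.
Rewrite the following structured edit command as a single clear,
explicit natural-language instruction for a document editor.

User request: {user_request}
Structured command: {command}

Instruction:"""


def reformulate(command: str, user_request: str, call_llm) -> str:
    prompt = REFORMULATION_PROMPT.format(
        user_request=user_request, command=command
    )
    return call_llm(prompt).strip()


# Usage (with a stand-in LLM client):
# reformulate("ACTION_PARA: MODIFY, COMPONENT: TEXT, FINAL_STATE: 2024",
#             "change the date in the header", call_llm=my_llm)
```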
3. Generative Document Editing via HTML/CSS
This is the final and perhaps most crucial design choice. Instead of using a diffusion model to “inpaint” the new document (which risks blurry text), DocEdit-v2 asks the LMM to generate HTML and CSS.
HTML provides the hierarchy (headers, paragraphs, tables), and CSS provides the styling (fonts, colors, alignment).
The framework constructs a prompt for the LMM that includes:
- The Reformulated Instruction (what to do).
- The Visual Grounding (the bounding box coordinates found by Doc2Command).
- The Document Image.
The LMM then acts as a coder, rewriting the document’s structure to reflect the changes. This ensures the output is crisp, searchable, and perfectly aligned.
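A rough sketch of how that prompt might be assembled is shown below. The prompt structure and the `call_multimodal_llm` helper are assumptions of mine; the paper's exact prompt is not reproduced here.

```python
# Sketch of assembling the final editing prompt for a multimodal model.
# Prompt structure and `call_multimodal_llm` are illustrative assumptions.

EDIT_PROMPT = """You are an expert front-end developer editing a document
rendered from HTML/CSS. Apply the instruction below and return the full,
valid HTML document (with inline CSS) after the edit.

Instruction: {instruction}
Region of interest (pixel bounding box): {bbox}
"""


def generate_edited_html(document_image_bytes: bytes,
                         instruction: str,
                         bbox: tuple,
                         call_multimodal_llm) -> str:
    prompt = EDIT_PROMPT.format(instruction=instruction, bbox=bbox)
    # The image is passed alongside the text prompt so the model can read
    # the original content and replicate fonts, colors, and layout.
    return call_multimodal_llm(prompt=prompt, image=document_image_bytes)
```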
Experiments and Results
The researchers tested DocEdit-v2 on the DocEdit dataset, comparing it against strong baselines including standard GPT-4 and task-specific models like DocEditor.
Grounding and Detection Performance
First, how well does the system find the right spot?

Table 2 shows the results for the bounding box detection task. Doc2Command achieves a Top-1 Accuracy of 48.69%, outperforming the previous state-of-the-art (DocEditor) by over 12%. This is a substantial improvement in visual grounding, suggesting that the specialized encoder-decoder architecture is highly effective at localization.
End-to-End Editing Quality
The ultimate test is the quality of the final document. The researchers measured this using standard text metrics (ROUGE-L) and human evaluation metrics (Style Replication, Edit Correctness).

Table 3 displays the performance when using GPT-4V as the backbone.
- The Full Pipeline Wins: The bottom rows (“DocEdit-v2”) show the highest scores across almost all metrics.
- Ablation Studies: Notice the rows where “VG” (Visual Grounding) or “CR” (Command Reformulation) are marked with an “X”. Performance drops significantly. For instance, without Visual Grounding, the system struggles to know where to apply the edit, leading to lower “Edit Correctness.”
The same trend holds true when using Gemini as the base model, as shown below in Table 4:

Qualitative Examples
Numbers are great, but seeing is believing. Let’s look at a complex example involving moving elements.

In Figure 10, the user request reads “Moved ‘Bridge Loan and Bond’ from mid to left.”
- Grounding: The middle panel shows Doc2Command identifying the specific text block and the page number.
- Reformulation: The system interprets this as a “Move” action involving specific components.
- Result: The final panel shows the HTML output where the text block has been successfully relocated to the left column, maintaining the font style and surrounding layout.
New Metrics for a New Task
Because pixel-level metrics (like RMSE) capture neither document structure nor legibility, the researchers introduced two novel metrics:
- DOM Tree Edit Distance: Measures how much the HTML structure changed.
- CSS IoU (Intersection over Union): Measures how well the styling matches the ground truth.

Figure 4 shows that these new metrics correlate well with human judgment. Specifically, CSS IoU has a strong positive correlation (0.73) with “Style Replication,” confirming that if the CSS code matches, the document looks correct to human eyes.
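To give a feel for CSS IoU, here is a toy computation that treats each document's styling as a set of (selector, property, value) triples and takes their Jaccard overlap. This is my interpretation of the metric for illustration, not the authors' reference implementation; DOM Tree Edit Distance would analogously be computed with a tree edit distance over the parsed DOM.

```python
# Rough sketch of a CSS IoU computation; an interpretation of the metric,
# not the paper's reference code.


def css_rules_to_set(rules: dict) -> set:
    """`rules` maps a CSS selector to a dict of property -> value."""
    return {
        (selector, prop, value)
        for selector, decls in rules.items()
        for prop, value in decls.items()
    }


def css_iou(predicted: dict, ground_truth: dict) -> float:
    pred, gt = css_rules_to_set(predicted), css_rules_to_set(ground_truth)
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)


# Example: the prediction matches the font size but misses the color.
pred = {"h1": {"font-size": "24px", "color": "black"}}
gt = {"h1": {"font-size": "24px", "color": "navy"}}
print(css_iou(pred, gt))  # 0.333...
```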
Conclusion
DocEdit-v2 represents a significant step forward in intelligent document processing. By moving away from purely pixel-based editing and embracing structured representations (HTML/CSS), it solves the problem of keeping documents legible and professional during automated editing.
The framework’s success relies on the synergy of its components:
- Doc2Command provides the eyes, accurately spotting where changes are needed.
- Command Reformulation acts as the translator, converting vague intent into clear instructions.
- Multimodal LMMs act as the executor, writing the code to rebuild the document.
For students and researchers in the field, this paper highlights the power of hybrid approaches—combining specialized vision modules with generalist Large Language Models to tackle complex, multi-step tasks. Future work in this area will likely focus on even more complex visual elements, such as editing charts and diagrams directly within the document structure.