Sketches are the soul of visual art. Before an artist commits to a fully rendered painting, they start with a line drawing—a blueprint that captures the essential structure, layout, and proportions of the final piece. This process is intuitive and powerful because editing a sketch is far easier than making pixel-perfect adjustments to a finished color image.
Despite the recent explosion in generative AI—particularly diffusion models capable of producing stunning photorealistic images from text—the world of automated sketch generation has been surprisingly quiet. Existing tools often fall short of what artists truly need: precise control over the final output. It’s one thing to generate “a cat sitting on a mat,” but it’s another to specify that the cat should be in the top-left corner, facing right, and be a certain size.
This is the gap that a new research paper, CoProSketch, aims to fill. The researchers propose a novel framework that not only generates high-quality sketches from text but does so with remarkable controllability and a progressive workflow that welcomes human creativity into the process. Instead of a one-shot generation, CoProSketch produces a rough draft that you can edit yourself—before the model adds the final details.
As shown above, the core idea is to create a collaborative process: You provide a text prompt and a simple bounding box to define the layout. The model generates a rough sketch. If it’s not quite right, you can edit the lines and feed the modified sketch back into the model for refinement. This iterative, human-in-the-loop approach offers new possibilities for artists, designers, and creatives who want precise control over their outputs.
The Challenge: Why Diffusion Models Struggle with Sketches
At first glance, generating a sketch sounds easier than producing a photorealistic image—just black lines on white paper, right? But this binary nature is exactly what makes it difficult for standard diffusion models.
Diffusion models like Stable Diffusion excel at continuous, smooth data distributions—think soft color gradients in photographs. In contrast, a binarized sketch has abrupt jumps between pure white (pixel value 255) and pure black (pixel value 0). When fine-tuned directly on sketch data, these models tend to produce chaotic, blotchy results instead of clean, intentional lines.
The CoProSketch team identified this as a core technical hurdle. Their solution? Don’t represent sketches as binary images at all. Instead, they turned to a concept from 3D graphics: the Unsigned Distance Field (UDF).
The Secret Sauce: Representing Sketches as UDFs
An Unsigned Distance Field stores, for every pixel, its distance to the nearest edge or stroke. Instead of a harsh black/white bitmap, you get a smooth gradient: pixels close to a line have low values, while pixels farther away have high values.
This continuous representation is far easier for a diffusion model to learn. The researchers further improved the UDF with a transformation:
\[ f(u) = 1 - \exp\left(-\frac{u}{T}\right) \]

Here, \( u \) is the raw distance and \( T \) is a scaling parameter based on image size. This function boosts the contrast near strokes, giving the neural network clearer signals to learn from, and it resolves the core hurdle of training diffusion models effectively on sketch data.
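To make this concrete, here is a minimal sketch of how one might compute the transformed UDF from a binary sketch image, using SciPy's Euclidean distance transform. The function name, the stroke threshold, and the choice of \( T \) as a fraction of the image size are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sketch_to_udf(binary_sketch, T=None):
    """Turn a binary sketch (strokes dark, background white) into a
    transformed unsigned distance field with values in [0, 1)."""
    strokes = binary_sketch < 128            # True on stroke pixels
    # u: distance from each pixel to the nearest stroke pixel.
    u = distance_transform_edt(~strokes)
    if T is None:
        T = 0.05 * max(binary_sketch.shape)  # illustrative scaling choice
    # f(u) = 1 - exp(-u / T): ~0 on strokes, saturating toward 1 far away.
    return 1.0 - np.exp(-u / T)

# Example: a white canvas with one horizontal stroke.
sketch = np.full((64, 64), 255, dtype=np.uint8)
sketch[32, 8:56] = 0
udf = sketch_to_udf(sketch)
print(udf.min(), udf.max())  # 0.0 on the stroke, close to 1.0 in the corners
```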
The CoProSketch Pipeline: Step-by-Step
With UDFs as the foundation, the pipeline proceeds in two stages: first generating a rough sketch, then refining it into a detailed one.
1. SketchGenerator: The Heart of the System
The SketchGenerator is a fine-tuned Stable Diffusion XL (SDXL) model responsible for generating the UDF. Two key modifications enable control:
- Position Control: The user’s bounding box is converted into a mask, encoded via a Variational Autoencoder (VAE), and concatenated with the noisy UDF latent at the U-Net’s input. This tells the model exactly where to draw.
- Varying Levels of Detail: A “stage indicator” signals whether to produce a rough contour or a detailed sketch. The stage embedding is added to the time embedding, much as diffusion timesteps are, enabling control without major architectural changes. (A schematic of both mechanisms follows below.)
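To illustrate, here is a schematic PyTorch sketch of how these two signals could enter the denoiser. Everything here (module names, embedding dimension, the sinusoidal embedding helper) is an assumption for exposition; the actual system fine-tunes SDXL's U-Net.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Standard sinusoidal embedding, as used for diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class ControlSignals(nn.Module):
    """Schematic: how position and detail-level controls enter the U-Net."""

    def __init__(self, vae_encoder, embed_dim=1280, num_stages=2):
        super().__init__()
        self.vae_encoder = vae_encoder              # pretrained VAE encoder
        self.stage_embed = nn.Embedding(num_stages, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, noisy_udf_latent, timestep, bbox_mask, stage_id):
        # 1) Position control: VAE-encode the bounding-box mask and
        #    concatenate it with the noisy UDF latent along channels.
        mask_latent = self.vae_encoder(bbox_mask)
        unet_input = torch.cat([noisy_udf_latent, mask_latent], dim=1)

        # 2) Detail control: add the stage embedding (rough vs. detailed)
        #    to the time embedding, mirroring how timesteps are injected.
        cond_emb = timestep_embedding(timestep, self.embed_dim) \
                   + self.stage_embed(stage_id)
        return unet_input, cond_emb  # both are consumed by the U-Net
```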
2. UDF2Sketch: Turning Distance Fields into Lines
The SketchGenerator outputs a UDF, but we need a crisp final sketch. Simple thresholding or Marching Squares contour extraction yields jagged, unaesthetic results. Inspired by InformativeDrawing, the researchers train UDF2Sketch—a lightweight encoder-decoder network with ResNet blocks—using adversarial loss (to match target style), CLIP-based semantic loss (to preserve content), and cycle loss (to ensure consistency). This produces clean, expressive lines.
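The training objective combines these three signals. Below is a hedged sketch of how they might be wired together; the weights, the exact loss forms, and the `distance_field` helper (a differentiable stand-in for recomputing a UDF from a sketch) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def udf2sketch_loss(generator, discriminator, clip_embed, distance_field,
                    udf, target_sketch,
                    w_adv=1.0, w_clip=0.1, w_cycle=10.0):
    """Illustrative combination of the three UDF2Sketch training signals."""
    fake_sketch = generator(udf)

    # Adversarial term: push generated strokes toward the target style.
    logits = discriminator(fake_sketch)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # CLIP-based semantic term: preserve the content of the target sketch.
    sim = F.cosine_similarity(clip_embed(fake_sketch), clip_embed(target_sketch))
    semantic = (1.0 - sim).mean()

    # Cycle term: a distance field recomputed from the generated sketch
    # should be consistent with the input UDF.
    cycle = F.l1_loss(distance_field(fake_sketch), udf)

    return w_adv * adv + w_clip * semantic + w_cycle * cycle
```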
3. UDF2Mask: Creating Precise Refinement Masks
For the detailed stage, a more accurate mask is needed. UDF2Mask, based on MobileSAM, generates pixel-perfect instance masks from the rough UDF and bounding box. This improved control signal ensures refined sketches align perfectly with edited rough drafts. The training loss is:
\[ L = \lambda_f L_{focal} + \lambda_d L_{dice} \]
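Both terms are standard segmentation losses; a minimal implementation might look like the following, with the \( \lambda_f \) and \( \lambda_d \) values as placeholder assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy pixels to focus on hard ones."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1e-6):
    """Dice loss: one minus the soft overlap between prediction and mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum()
    return 1.0 - (2 * inter + eps) / (p.sum() + target.sum() + eps)

def udf2mask_loss(logits, target, lambda_f=20.0, lambda_d=1.0):
    # L = lambda_f * L_focal + lambda_d * L_dice (weights are assumptions).
    return lambda_f * focal_loss(logits, target) + lambda_d * dice_loss(logits, target)
```

Building the Sketch Data Engine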
A great model requires great data. With no large paired text-sketch dataset available, the authors built one from scratch using an automated Sketch Data Engine.
Pipeline Steps:
- Start with RGB Images: Sources included COCO2017, Anime Colorization, and salient object detection (SOD) datasets.
- Generate Detailed Prompts: Gemini AI provided rich image captions—critical for diffusion quality.
- Extract Masks: Outputs from a Salient Object Detection model and SAM2 were merged to produce accurate object masks, from which bounding boxes and contours were derived.
- Generate Detailed Sketches: The InformativeDrawing model converted RGB images to detailed sketches.
- Filter Results: Low-quality or tiny-mask samples were discarded, resulting in ~100,000 high-quality text-sketch pairs. (A minimal version of the tiny-mask filter is sketched below.)
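For a flavor of that last step, a tiny-mask filter could be as simple as this; the 2% area threshold is an assumed value, not one reported by the authors.

```python
import numpy as np

def keep_sample(mask, min_area_frac=0.02):
    """Discard samples whose object mask covers too little of the image."""
    return float((mask > 0).mean()) >= min_area_frac
```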
How Does CoProSketch Perform?
Qualitative Comparisons
CoProSketch produces clean sketches perfectly confined within the specified bounding boxes—a capability missing in DiffSketcher. Two-stage pipelines that rely on intermediate RGB images often fail to confine the sketch and introduce clutter.
Quantitative Metrics
CoProSketch scored highest on both Aesthetic Score (predicting human preference) and CLIP Score (semantic alignment between sketch and text).
User Study Results
To test beyond automated metrics, 33 participants rated outputs on aesthetics, semantic similarity, and positional control accuracy.
Participants overwhelmingly favored CoProSketch for semantic alignment and precise positional control, confirming its practical advantage for creative work.
Why Every Component Matters: Ablation Study
Removing key components confirmed their importance:
- No UDF Representation: Training directly on binarized sketches produced messy, low-consistency outputs.
- No UDF2Mask: Detailed sketches often exceeded bounding box boundaries.
- No UDF2Sketch: Traditional decoding methods gave fragmented, unaesthetic results.
Applications: Beyond Pretty Drawings
Intuitive Editing
In the example above, the task was to straighten a flamingo’s neck. Editing the final RGB image led to artifacts, while editing the rough sketch produced a perfect final image after regeneration.
Composing Complex Scenes
Complex multi-object scenes from a single prompt often fail due to occlusion and positioning issues. CoProSketch allows each object to be generated separately, then combined on a single canvas with precise mask-based layering.
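As an illustration, a minimal compositing step with Pillow might look like this, where each object's sketch, instance mask, and bounding box come from separate CoProSketch runs; the names and canvas size are assumptions for the example.

```python
from PIL import Image

def compose(canvas, sketch, mask, box):
    """Layer one object's sketch onto a shared canvas; the instance mask
    lets later (foreground) layers cleanly occlude earlier ones."""
    x0, y0, x1, y1 = box
    size = (x1 - x0, y1 - y0)
    out = canvas.copy()
    out.paste(sketch.resize(size), (x0, y0), mask.resize(size).convert("L"))
    return out

# Usage: start from a blank canvas and layer objects back to front.
canvas = Image.new("RGB", (1024, 1024), "white")
# canvas = compose(canvas, tree_sketch, tree_mask, (50, 200, 450, 900))
# canvas = compose(canvas, cat_sketch, cat_mask, (400, 600, 750, 950))
```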
Conclusion
CoProSketch is a major step forward for controllable, artist-centric generative AI. By solving the representation problem with UDFs and embracing a progressive, editable workflow, the authors created a system that balances quality, control, and creative flexibility.
Its most notable strength is putting the user back into the loop—transforming them from passive prompt-giver to active co-creator. While current aesthetic quality is bounded by the image-to-sketch model used for data generation, this framework is future-proof: as better methods emerge, they can slot directly into the pipeline.
If you care about precision, adaptability, and artistry in generative tools, CoProSketch offers a blueprint for the next generation of creative AI.