Sketches are the soul of visual art. Before an artist commits to a fully rendered painting, they start with a line drawing—a blueprint that captures the essential structure, layout, and proportions of the final piece. This process is intuitive and powerful because editing a sketch is far easier than making pixel-perfect adjustments to a finished color image.
Despite the recent explosion in generative AI—particularly diffusion models capable of producing stunning photorealistic images from text—the world of automated sketch generation has been surprisingly quiet. Existing tools often fall short of what artists truly need: precise control over the final output. It’s one thing to generate “a cat sitting on a mat,” but it’s another to specify that the cat should be in the top-left corner, facing right, and be a certain size.
This is the gap that a new research paper, CoProSketch, aims to fill. The researchers propose a novel framework that not only generates high-quality sketches from text but does so with remarkable controllability and a progressive workflow that welcomes human creativity into the process. Instead of a one-shot generation, CoProSketch produces a rough draft that you can edit yourself—before the model adds the final details.
As shown above, the core idea is to create a collaborative process: You provide a text prompt and a simple bounding box to define the layout. The model generates a rough sketch. If it’s not quite right, you can edit the lines and feed the modified sketch back into the model for refinement. This iterative, human-in-the-loop approach offers new possibilities for artists, designers, and creatives who want precise control over their outputs.
The Challenge: Why Diffusion Models Struggle with Sketches
At first glance, generating a sketch sounds easier than producing a photorealistic image—just black lines on white paper, right? But this binary nature is exactly what makes it difficult for standard diffusion models.
Diffusion models like Stable Diffusion excel at continuous, smooth data distributions—think soft color gradients in photographs. In contrast, a binarized sketch has abrupt jumps between pure white (pixel value 255) and pure black (pixel value 0). When fine-tuned directly on sketch data, these models tend to produce chaotic, blotchy results instead of clean, intentional lines.
The CoProSketch team identified this as a core technical hurdle. Their solution? Don’t represent sketches as binary images at all. Instead, they turned to a concept from 3D graphics: the Unsigned Distance Field (UDF).
The Secret Sauce: Representing Sketches as UDFs
An Unsigned Distance Field stores, for every pixel, its distance to the nearest edge or stroke. Instead of a harsh black/white bitmap, you get a smooth gradient: pixels close to a line have low values, while pixels farther away have high values.
This continuous representation is far easier for a diffusion model to learn. The researchers further improved the UDF with a transformation:
\[ f(u) = 1 - \exp\left(-\frac{u}{T}\right) \]

Here, \( u \) is the raw distance and \( T \) is a scaling parameter based on image size. This function boosts the contrast near strokes, giving the neural network clearer signals to learn from, and it resolves the core hurdle of training diffusion models effectively on sketch data.
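To make this concrete, here is a minimal sketch of how one might compute the transformed UDF from a binary sketch image, using SciPy's Euclidean distance transform. The function name, the stroke threshold, and the choice of \( T \) as a fraction of the image size are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sketch_to_udf(binary_sketch, T=None):
    """Turn a binary sketch (strokes dark, background white) into a
    transformed unsigned distance field with values in [0, 1)."""
    strokes = binary_sketch < 128            # True on stroke pixels
    # u: distance from each pixel to the nearest stroke pixel.
    u = distance_transform_edt(~strokes)
    if T is None:
        T = 0.05 * max(binary_sketch.shape)  # illustrative scaling choice
    # f(u) = 1 - exp(-u / T): ~0 on strokes, saturating toward 1 far away.
    return 1.0 - np.exp(-u / T)

# Example: a white canvas with one horizontal stroke.
sketch = np.full((64, 64), 255, dtype=np.uint8)
sketch[32, 8:56] = 0
udf = sketch_to_udf(sketch)
print(udf.min(), udf.max())  # 0.0 on the stroke, close to 1.0 in the corners
```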
The CoProSketch Pipeline: Step-by-Step
With UDFs as the foundation, the pipeline proceeds in two stages: first generating a rough sketch, then refining it into a detailed one.
1. SketchGenerator: The Heart of the System
The SketchGenerator is a fine-tuned Stable Diffusion XL (SDXL) model responsible for generating the UDF. Two key modifications enable control:
- Position Control: The user’s bounding box is converted into a mask, encoded via a Variational Autoencoder (VAE), and concatenated with the noisy UDF latent at the U-Net’s input. This tells the model exactly where to draw.
- Varying Levels of Detail: A “stage indicator” signals whether to produce a rough contour or a detailed sketch. The stage embedding is added to the time embedding, much as diffusion timesteps are, enabling control without major architectural changes. (A schematic of both mechanisms follows below.)
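To illustrate, here is a schematic PyTorch sketch of how these two signals could enter the denoiser. Everything here (module names, embedding dimension, the sinusoidal embedding helper) is an assumption for exposition; the actual system fine-tunes SDXL's U-Net.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Standard sinusoidal embedding, as used for diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class ControlSignals(nn.Module):
    """Schematic: how position and detail-level controls enter the U-Net."""

    def __init__(self, vae_encoder, embed_dim=1280, num_stages=2):
        super().__init__()
        self.vae_encoder = vae_encoder              # pretrained VAE encoder
        self.stage_embed = nn.Embedding(num_stages, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, noisy_udf_latent, timestep, bbox_mask, stage_id):
        # 1) Position control: VAE-encode the bounding-box mask and
        #    concatenate it with the noisy UDF latent along channels.
        mask_latent = self.vae_encoder(bbox_mask)
        unet_input = torch.cat([noisy_udf_latent, mask_latent], dim=1)

        # 2) Detail control: add the stage embedding (rough vs. detailed)
        #    to the time embedding, mirroring how timesteps are injected.
        cond_emb = timestep_embedding(timestep, self.embed_dim) \
                   + self.stage_embed(stage_id)
        return unet_input, cond_emb  # both are consumed by the U-Net
```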
2. UDF2Sketch: Turning Distance Fields into Lines
The SketchGenerator outputs a UDF, but we need a crisp final sketch. Simple thresholding or Marching Squares contour extraction yields jagged, unaesthetic results. Inspired by InformativeDrawing, the researchers train UDF2Sketch—a lightweight encoder-decoder network with ResNet blocks—using adversarial loss (to match target style), CLIP-based semantic loss (to preserve content), and cycle loss (to ensure consistency). This produces clean, expressive lines.
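The training objective combines these three signals. Below is a hedged sketch of how they might be wired together; the weights, the exact loss forms, and the `distance_field` helper (a differentiable stand-in for recomputing a UDF from a sketch) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def udf2sketch_loss(generator, discriminator, clip_embed, distance_field,
                    udf, target_sketch,
                    w_adv=1.0, w_clip=0.1, w_cycle=10.0):
    """Illustrative combination of the three UDF2Sketch training signals."""
    fake_sketch = generator(udf)

    # Adversarial term: push generated strokes toward the target style.
    logits = discriminator(fake_sketch)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    # CLIP-based semantic term: preserve the content of the target sketch.
    sim = F.cosine_similarity(clip_embed(fake_sketch), clip_embed(target_sketch))
    semantic = (1.0 - sim).mean()

    # Cycle term: a distance field recomputed from the generated sketch
    # should be consistent with the input UDF.
    cycle = F.l1_loss(distance_field(fake_sketch), udf)

    return w_adv * adv + w_clip * semantic + w_cycle * cycle
```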
3. UDF2Mask: Creating Precise Refinement Masks
For the detailed stage, a more accurate mask is needed. UDF2Mask, based on MobileSAM, generates pixel-perfect instance masks from the rough UDF and bounding box. This improved control signal ensures refined sketches align perfectly with edited rough drafts. The training loss is:
\[ L = \lambda_f L_{focal} + \lambda_d L_{dice} \]
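Both terms are standard segmentation losses; a minimal implementation might look like the following, with the \( \lambda_f \) and \( \lambda_d \) values as placeholder assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy pixels to focus on hard ones."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1e-6):
    """Dice loss: one minus the soft overlap between prediction and mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum()
    return 1.0 - (2 * inter + eps) / (p.sum() + target.sum() + eps)

def udf2mask_loss(logits, target, lambda_f=20.0, lambda_d=1.0):
    # L = lambda_f * L_focal + lambda_d * L_dice (weights are assumptions).
    return lambda_f * focal_loss(logits, target) + lambda_d * dice_loss(logits, target)
```

Building the Sketch Data Engine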
A great model requires great data. With no large paired text-sketch dataset available, the authors built one from scratch using an automated Sketch Data Engine.
Pipeline Steps:
- Start with RGB Images: Sources included COCO2017, Anime Colorization, and salient object detection (SOD) datasets.
- Generate Detailed Prompts: Gemini AI provided rich image captions—critical for diffusion quality.
- Extract Masks: Outputs from a Salient Object Detection model and SAM2 were merged to produce accurate object masks, from which bounding boxes and contours were derived.
- Generate Detailed Sketches: The InformativeDrawing model converted RGB images to detailed sketches.
- Filter Results: Low-quality or tiny-mask samples were discarded, resulting in ~100,000 high-quality text-sketch pairs. (A minimal version of the tiny-mask filter is sketched below.)
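For a flavor of that last step, a tiny-mask filter could be as simple as this; the 2% area threshold is an assumed value, not one reported by the authors.

```python
import numpy as np

def keep_sample(mask, min_area_frac=0.02):
    """Discard samples whose object mask covers too little of the image."""
    return float((mask > 0).mean()) >= min_area_frac
```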
How Does CoProSketch Perform?
Qualitative Comparisons
CoProSketch produces clean sketches perfectly confined within the specified bounding boxes—a capability missing in DiffSketcher. Two-stage pipelines that rely on intermediate RGB images often fail to confine the sketch and introduce clutter.
Quantitative Metrics
CoProSketch scored highest on both Aesthetic Score (predicting human preference) and CLIP Score (semantic alignment between sketch and text).
User Study Results
To test beyond automated metrics, 33 participants rated outputs on aesthetics, semantic similarity, and positional control accuracy.
Participants overwhelmingly favored CoProSketch for semantic alignment and precise positional control, confirming its practical advantage for creative work.
Why Every Component Matters: Ablation Study
Removing key components confirmed their importance:
- No UDF Representation: Training directly on binarized sketches produced messy, low-consistency outputs.
- No UDF2Mask: Detailed sketches often exceeded bounding box boundaries.
- No UDF2Sketch: Traditional decoding methods gave fragmented, unaesthetic results.
Applications: Beyond Pretty Drawings
Intuitive Editing
In the example above, the task was to straighten a flamingo’s neck. Editing the final RGB image led to artifacts, while editing the rough sketch produced a perfect final image after regeneration.
Composing Complex Scenes
Complex multi-object scenes from a single prompt often fail due to occlusion and positioning issues. CoProSketch allows each object to be generated separately, then combined on a single canvas with precise mask-based layering.
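As an illustration, a minimal compositing step with Pillow might look like this, where each object's sketch, instance mask, and bounding box come from separate CoProSketch runs; the names and canvas size are assumptions for the example.

```python
from PIL import Image

def compose(canvas, sketch, mask, box):
    """Layer one object's sketch onto a shared canvas; the instance mask
    lets later (foreground) layers cleanly occlude earlier ones."""
    x0, y0, x1, y1 = box
    size = (x1 - x0, y1 - y0)
    out = canvas.copy()
    out.paste(sketch.resize(size), (x0, y0), mask.resize(size).convert("L"))
    return out

# Usage: start from a blank canvas and layer objects back to front.
canvas = Image.new("RGB", (1024, 1024), "white")
# canvas = compose(canvas, tree_sketch, tree_mask, (50, 200, 450, 900))
# canvas = compose(canvas, cat_sketch, cat_mask, (400, 600, 750, 950))
```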
Conclusion
CoProSketch is a major step forward for controllable, artist-centric generative AI. By solving the representation problem with UDFs and embracing a progressive, editable workflow, the authors created a system that balances quality, control, and creative flexibility.
Its most notable strength is putting the user back into the loop—transforming them from passive prompt-giver to active co-creator. While current aesthetic quality is bounded by the image-to-sketch model used for data generation, this framework is future-proof: as better methods emerge, they can slot directly into the pipeline.
If you care about precision, adaptability, and artistry in generative tools, CoProSketch offers a blueprint for the next generation of creative AI.