The transition from a black-and-white sketch to a fully colored character is one of the most labor-intensive steps in animation and comic production. For decades, artists have manually filled in colors, ensuring that a character’s hair, eyes, and clothing remain consistent across thousands of frames. While automated tools have attempted to speed up this process, they often stumble when presented with a simple reality of animation: characters move.
When a character turns their head, appears in close-up, or changes pose, the geometry of the line art can differ drastically from the reference sheet. Traditional colorization algorithms frequently fail here, painting outside the lines or confusing the color of a shirt with the color of the skin.
Enter MangaNinja, a new research paper proposing a method that leverages the power of diffusion models to solve this consistency problem. By introducing a novel “patch shuffling” training technique and a user-guided point control system, MangaNinja bridges the gap between the raw generative power of AI and the precise requirements of professional animation.

The Core Problem: Correspondence
To understand why MangaNinja is significant, we first need to look at the limitations of previous methods. Reference-based line art colorization aims to take a target sketch (the line art) and a colored reference image, then transfer the colors from the reference to the target.
Existing approaches generally fall into two categories:
- Paint-Bucket Approaches: These rely on traditional computer vision techniques to match regions. However, if the shape of the line art differs significantly from the reference (e.g., a profile view vs. a front view), these models struggle to find the right correspondence.
- Generative Diffusion Approaches: Models like Stable Diffusion are excellent at generating high-quality textures. However, they can be difficult to control. A standard diffusion model might color the line art beautifully, but it might arbitrarily change the character’s eye color or clothing pattern because it “hallucinated” a new style rather than strictly following the reference.
The researchers identified that the main bottleneck was semantic mismatch. When the reference image and the line art are too different spatially, the model loses track of which part of the reference corresponds to which part of the sketch. Furthermore, previous methods lacked a way for users to manually correct these errors without repainting the image themselves.
The MangaNinja Architecture
MangaNinja addresses these issues using a dual-branch architecture based on the U-Net, a common structure in image segmentation and diffusion models.
1. The Dual-Branch Design
The system processes two inputs simultaneously. As illustrated in the diagram below, the architecture consists of a Reference U-Net and a Denoising U-Net.

The process works as follows:
- Input: The model takes a colored reference frame and a target line art frame.
- Reference U-Net: This network encodes the reference image into high-level features (latent representations).
- Denoising U-Net: This is the main generator. It takes the line art and a noisy canvas and progressively denoises it into the final colored image.
- Cross-Attention: Crucially, the features from the Reference U-Net are injected into the Denoising U-Net. This tells the generator what to paint based on the reference.
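To make this flow concrete, here is a minimal PyTorch sketch (not the authors' code) of how features from a reference branch can be injected into a denoising branch via cross-attention. The module name, feature shapes, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReferenceInjection(nn.Module):
    """Toy cross-attention block: the denoising branch queries the reference branch."""

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, denoise_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # Query = features of the target being denoised; Key/Value = reference features.
        attended, _ = self.attn(query=denoise_feats, key=ref_feats, value=ref_feats)
        return denoise_feats + attended  # residual injection into the denoising path

# Illustrative shapes: (batch, tokens, channels) for flattened U-Net feature maps.
denoise_feats = torch.randn(1, 64 * 64, 320)  # from the Denoising U-Net
ref_feats = torch.randn(1, 64 * 64, 320)      # from the Reference U-Net
out = ReferenceInjection()(denoise_feats, ref_feats)
print(out.shape)  # torch.Size([1, 4096, 320])
```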
The mathematical mechanism for this injection is standard cross-attention. The “Query” comes from the target image (what we are painting), while the “Key” and “Value” come from the reference image (the source of the colors).
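Written out, with \(Q\) projected from the denoising features and \(K, V\) projected from the reference features, this is the familiar attention formula:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\]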

2. Progressive Patch Shuffling: Escaping the “Comfort Zone”
The researchers made a fascinating observation during development: the model was lazy.
When trained on video frames (where the reference and target are often very similar), the model learned to simply copy the global structure of the reference image onto the target. It wasn’t actually learning “this texture is hair” or “this texture is skin”; it was just overlaying the reference image. This meant that when the target pose was very different from the reference, the model failed.
To fix this, the team introduced Progressive Patch Shuffling.
During training, instead of feeding the model the clean reference image, they chop it into small patches and shuffle them randomly. This destroys the global structure of the image. The model can no longer rely on the position of the pixels (e.g., “the face is always in the top center”). Instead, it is forced to learn local semantic matching. It has to learn that a specific texture patch corresponds to “hair” regardless of where it is located on the canvas.
They adopt a “coarse-to-fine” strategy, starting with a coarse \(2 \times 2\) grid of large patches and progressively moving to a fine \(32 \times 32\) grid of tiny patches. This forces the model out of its comfort zone, resulting in robust matching capabilities even when the reference and target poses are totally different.
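Here is a minimal sketch of the shuffling operation itself (an illustration, not the paper's implementation), where `grid_size` controls how coarse or fine the shuffle is; the schedule at the bottom mirrors the coarse-to-fine idea described above.

```python
import torch

def patch_shuffle(image: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Split an image into a grid of patches and shuffle them randomly.

    image: tensor of shape (C, H, W); H and W must be divisible by grid_size.
    grid_size: 2 -> four large patches, 32 -> 1024 tiny patches.
    """
    c, h, w = image.shape
    ph, pw = h // grid_size, w // grid_size
    # (C, grid, ph, grid, pw) -> (grid*grid, C, ph, pw)
    patches = (image
               .reshape(c, grid_size, ph, grid_size, pw)
               .permute(1, 3, 0, 2, 4)
               .reshape(grid_size * grid_size, c, ph, pw))
    patches = patches[torch.randperm(patches.shape[0])]  # destroy the global layout
    # Reassemble the shuffled patches back into a full image.
    return (patches
            .reshape(grid_size, grid_size, c, ph, pw)
            .permute(2, 0, 3, 1, 4)
            .reshape(c, h, w))

# Progressive schedule: coarser grids early in training, finer grids later.
reference = torch.rand(3, 256, 256)
for grid in (2, 4, 8, 16, 32):
    shuffled = patch_shuffle(reference, grid)
```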
3. Point-Driven Control
Even with a smart model, ambiguity exists. For example, a reference image might only show a character’s face, but the line art shows their full body. How should the model color the pants? Or, what if the model confuses a bag strap for a part of the shirt?
MangaNinja introduces an interactive Point-Driven Control scheme. Users can click a point on the reference and a corresponding point on the line art to explicitly tell the model: “This point here corresponds to that point there.”

This is implemented using a PointNet module. The user’s clicked points are encoded into embeddings and injected into the attention mechanism alongside the image features.
The attention equation is modified to include these point embeddings (\(E_{tar}\) and \(E_{ref}\)), adding them to the Query and Key vectors:
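Based on that description, the biased attention can be sketched (with notation simplified) as:

\[
\mathrm{Attention} = \mathrm{softmax}\!\left(\frac{(Q + E_{tar})\,(K + E_{ref})^{\top}}{\sqrt{d}}\right) V
\]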

This biases the attention map. When the Denoising U-Net is trying to decide how to color a specific pixel, the point embedding pulls the attention toward the user-designated location in the reference image.
To ensure the model listens to these points, the researchers use Condition Dropping during training. They occasionally hide the line art structure from the model, forcing it to reconstruct the colorization relying only on the sparse point hints.
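As a rough illustration of what condition dropping can look like in a training loop (the drop probability below is an assumed value, not one taken from the paper):

```python
import torch

def maybe_drop_lineart(lineart: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """With some probability, replace the line-art condition with a blank tensor
    so the model must rely on the sparse point hints alone."""
    if torch.rand(()) < drop_prob:
        return torch.zeros_like(lineart)
    return lineart

# Inside a training step (illustrative shapes):
lineart = torch.rand(1, 1, 512, 512)        # target sketch
lineart_cond = maybe_drop_lineart(lineart)  # sometimes hidden during training
```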
4. Controlling the Influence
At inference time (when the user is actually using the tool), MangaNinja uses Multi-Classifier-Free Guidance. This allows the user to tweak two specific parameters:
- \(\omega_{ref}\): How strongly the model should stick to the reference image generally.
- \(\omega_{points}\): How strongly the model should obey the specific clicked points.

If complex manual correction is needed, the user can crank up \(\omega_{points}\) to override the model’s automatic decisions.
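The paper's exact formulation is not reproduced here, but a common way to compose two guidance scales in multi-condition classifier-free guidance looks like the following, where \(\epsilon(\cdot,\cdot)\) denotes the noise prediction conditioned on the reference image \(c_{ref}\) and the clicked points \(c_{points}\):

\[
\hat{\epsilon} = \epsilon(\varnothing, \varnothing)
+ \omega_{ref}\,\bigl[\epsilon(c_{ref}, \varnothing) - \epsilon(\varnothing, \varnothing)\bigr]
+ \omega_{points}\,\bigl[\epsilon(c_{ref}, c_{points}) - \epsilon(c_{ref}, \varnothing)\bigr]
\]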
Experimental Results
The researchers trained MangaNinja on Sakuga-42M, a massive dataset of 42 million anime keyframes. They filtered this down to high-quality video clips from which they could extract natural pairs of a colored “reference frame” and a “target frame” from the same scene.
Qualitative Comparison
How does it look in practice? The results show a distinct improvement over previous state-of-the-art methods like BasicPBC (a non-generative method) and AnyDoor (a generative object customization method).

In the comparison above, notice the character in the second row (the elf-like figure).
- BasicPBC creates artifacts because it relies on pixel-locality; if the lines don’t match up, the paint bucket spills.
- AnyDoor and IP-Adapter capture the general vibe but lose specific details—notice the loss of the correct hair shading or clothing patterns.
- MangaNinja (Ours) maintains the sharp, precise shading typical of anime while correctly mapping the colors to the new pose.
Quantitative Analysis
To measure this mathematically, the authors constructed a benchmark of 200 diverse anime character pairs. They used metrics like PSNR (Peak Signal-to-Noise Ratio) and LPIPS (a perceptual distance) for image quality, and CLIP/DINO scores to measure semantic similarity (how well the model understood the content).
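For orientation, PSNR is computed from the mean squared error against the ground truth, where MAX is the maximum pixel value (e.g., 255 for 8-bit images); higher is better:

\[
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)
\]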

MangaNinja achieved state-of-the-art results across the board. The low LPIPS score (lower is better) indicates that the perceptual difference between the generated image and the ground truth is very small.
Ablation Studies
Was the “Patch Shuffling” actually necessary? The ablation study confirms it was critical.

Comparing Row I (Base Model) and Row III (Base + Patch Shuffle), we see a massive jump in performance metrics. This validates the hypothesis that breaking the global structure during training forces the model to become a better semantic matcher.
Handling Challenging Scenarios
The true power of MangaNinja is revealed in “edge cases”—situations that usually break automated tools.
1. Missing Details and Extreme Poses
Often, a reference image is just a headshot, but the animator needs to color a full-body action shot. With point guidance, the user can direct the model to color the pants or shoes from the shirt's palette, or simply let the model's learned priors fill in the gaps plausibly while locking in the face details with points.
2. Multi-Reference Colorization
What if you have one reference image for the character’s face and a different reference image for their new outfit? MangaNinja can accept multiple reference inputs. The user simply places points on the face reference connecting to the line art’s face, and points on the outfit reference connecting to the line art’s body. The model seamlessly blends them.

3. Cross-Character Colorization
In a fun twist, the model is robust enough to apply the color scheme of one character to the line art of a completely different character. Because the model understands “hair is hair” and “eyes are eyes” (thanks to patch shuffling), it can transfer the texture of character A onto the geometry of character B.

Conclusion
MangaNinja represents a significant step forward in AI-assisted animation. By moving beyond simple style transfer and focusing on precise correspondence, it solves the specific headaches that animators face daily.
The combination of the Dual-Branch U-Net for feature extraction, Progressive Patch Shuffling for robust semantic learning, and Point-Driven Control for human-in-the-loop correction creates a tool that is both powerful and precise. For the anime and comic industries, tools like this could drastically reduce the time spent on repetitive coloring tasks, allowing artists to focus more on storyboarding, composition, and creative direction.