The rise of generative AI has been nothing short of explosive. Models like Stable Diffusion, Midjourney, and DALL·E can conjure breathtaking images from simple text prompts, democratizing artistic creation in ways we’ve never seen before. But this revolution comes with a controversial side: these powerful models are often trained on vast internet-sourced datasets without the explicit consent of original artists. This practice has sparked fierce debates about copyright, ownership, and the nature of creativity.

Who owns an AI-generated image in the style of a living artist? Is it inspiration—or imitation?

The current debate relies on a key assumption: to learn an artistic style, a model must be trained on massive amounts of existing art. But what if that assumption is wrong?

Researchers from MIT, Northeastern University, and ShanghaiTech University challenge this premise with a deceptively simple question: Can a generative model learn to paint like an artist without ever having seen a painting before?

Their answer is yes, through a system they call Blank Canvas Diffusion—a model trained exclusively on photographs—and a lightweight Art Style Adapter that learns styles from just a handful of “opt-in” examples.

This work not only delivers a technical breakthrough but reframes the conversation around ethical AI. It points toward consent-based systems while revealing how difficult it may truly be to prevent style imitation. Let’s dive in.

A schematic showing the concept of Blank Canvas Diffusion. A model trained on no paintings (“blank canvas”) can be combined with an opt-in art adapter—trained on just a few examples—to generate new images in that specific style.

The Problem with a Data-Hungry World

Modern text-to-image models—particularly denoising diffusion models—are fundamentally data-driven. They learn by detecting patterns across billions of image-text pairs. Large-scale datasets like LAION-5B are snapshots of the public internet, mixing photographs and countless copyrighted artworks.

This raises serious legal and ethical challenges. Artists object to their work being used for training without permission, leading to lawsuits and calls for regulation. And because these models can replicate styles with uncanny fidelity, the boundary between inspiration and imitation grows hazy.

Some mitigation ideas have emerged:

  • Concept Erasure – Removing specific concepts (including artistic styles) from a model after training.
  • Opt-Out Systems – Tools that allow artists to exclude their work from future datasets.
  • Curated Data – Using fully licensed or public-domain datasets, such as CommonCanvas trained only on Creative Commons images.

While valuable, these methods start with models already steeped in art. The “Opt-In Art” project pushes curation to its logical extreme: building a fully art-agnostic model—a camera-informed blank slate with no visual knowledge of paintings, drawings, or illustrations.

Building from a Blank Slate: Blank Canvas Diffusion

The effort began with building a model entirely free of exposure to non-photographic art, which required both a meticulously filtered dataset and careful architectural choices.

The Blank Canvas Dataset

The starting point was the SA-1B dataset—a large collection of camera-captured images for object segmentation. Despite its photographic focus, art seeps into everyday photos: museum interiors, murals, product logos, ornate architecture.

To eliminate these influences, the team created a two-stage filter:

  1. Text-Based Filtering – Scan image captions, removing any with art-related keywords like “painting,” “drawing,” “illustration,” “cubism,” “logo,” etc. This alone removed 4.7% of the dataset.
  2. Image-Based Filtering – Use CLIP to measure each image's visual similarity to art concepts. By inspecting samples across similarity scores, the team set a threshold that removed further art-like images, cutting another 16.7% (a sketch of both filter stages follows this list).
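
As a rough illustration of how such a two-stage filter could be implemented, here is a minimal Python sketch assuming an open-source CLIP checkpoint (openai/clip-vit-base-patch32 via Hugging Face transformers); the keyword list, art prompts, and similarity threshold are illustrative placeholders, not the paper's actual values.

```python
# Hypothetical sketch of the two-stage art filter; keyword list, prompts, and
# threshold are illustrative, not the values used by the authors.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ART_KEYWORDS = ["painting", "drawing", "illustration", "cubism", "logo"]  # partial list
ART_PROMPTS = ["a painting", "a drawing", "an illustration", "a cartoon", "a logo"]
SIMILARITY_THRESHOLD = 0.25  # assumed cutoff; the paper tunes it by inspecting samples

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def passes_text_filter(caption: str) -> bool:
    """Stage 1: drop any caption that mentions an art-related keyword."""
    caption = caption.lower()
    return not any(keyword in caption for keyword in ART_KEYWORDS)

@torch.no_grad()
def passes_image_filter(image: Image.Image) -> bool:
    """Stage 2: drop images whose CLIP similarity to any art concept is too high."""
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=ART_PROMPTS, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**image_inputs)
    txt_emb = model.get_text_features(**text_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    max_similarity = (img_emb @ txt_emb.T).max().item()
    return max_similarity < SIMILARITY_THRESHOLD

def keep_pair(image: Image.Image, caption: str) -> bool:
    """An image-text pair stays in the Blank Canvas dataset only if it passes both stages."""
    return passes_text_filter(caption) and passes_image_filter(image)
```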

Examples of removed images (left) and retained images (right). Filtering removed paintings, drawings, and fine art, but kept real-world photography—even if it had some decorative qualities.

After filtering, the Blank Canvas Dataset contained over 9.1 million purely photographic image-text pairs. Manual checks confirmed the filter's effectiveness: in a 10,000-image sample, artistic content dropped from 315 instances to just 71, mostly sculptures or architecture, which the team chose to keep.

Table showing the dramatic reduction in artworks after filtering. Paintings, drawings, and illustrations drop to nearly zero.

An Art-Agnostic Architecture

Dataset curation was only half the solution. The team also had to prevent art knowledge from leaking in through pre-trained components:

  • VAE & U-Net – Trained from scratch on the Blank Canvas dataset.
  • Text Encoder – Replaced CLIP (which was trained on both images and text) with BERT, which is trained only on text. BERT knows the conceptual meaning of a word like "painting" but carries no visual associations (a conditioning sketch follows this list).
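
As a rough sketch of what the second point implies, assuming the Hugging Face transformers and diffusers libraries, the conditioning path might look like the following; the model size, sequence length, and U-Net configuration are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch: conditioning a from-scratch diffusion U-Net on BERT text
# features instead of CLIP text features. Dimensions and names are assumptions.
import torch
from transformers import BertModel, BertTokenizer
from diffusers import UNet2DConditionModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")  # trained on text only, never on images

# A U-Net trained from scratch; its cross-attention width must match BERT's hidden size (768).
unet = UNet2DConditionModel(
    sample_size=64,
    in_channels=4,
    out_channels=4,
    cross_attention_dim=768,
)

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Turn a caption into per-token embeddings used as cross-attention conditioning."""
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state  # shape (1, 77, 768)

# One denoising call: noisy latents + timestep + BERT text states -> predicted noise.
latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([10])
noise_pred = unet(latents, timestep,
                  encoder_hidden_states=encode_prompt("a photo of a dog")).sample
```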

With this, the base model was genuinely art-naïve. Asked to recreate famous works like Mona Lisa or Starry Night, it produced either abstract noise or photographic interpretations—showing no stylistic recall.

Stable Diffusion 1.4 can recreate famous artworks from prompts, but Blank Canvas Diffusion produces abstract/non-stylistic results, proving its lack of prior art exposure.

Teaching the Blank Canvas to Paint: The Art Style Adapter

Now for the twist: how do you teach style to an art-naïve model?

The Art Style Adapter is a lightweight module trained on just a few “opt-in” works by an artist (as few as 9, up to ~50 in experiments). These adapters are based on LoRA—Low-Rank Adaptation, a technique for efficiently fine-tuning large models.

Training steps:

  1. Collect Style Data – A small set of paintings, each captioned for its content, with a special trigger appended to every prompt, like "…in the style of V* art."
  2. Fine-Tune with Two Losses – A style loss and a content loss together ensure stylistic precision without corrupting photographic generation (a sketch of the adapter setup follows this list; the losses are defined below).
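
A minimal sketch of how step 1 and the adapter itself might be wired up, assuming the adapter is a LoRA injected into the U-Net's attention projections via diffusers' peft integration; the rank, target modules, and trigger-token handling are illustrative rather than the paper's exact configuration.

```python
# Hypothetical sketch: a frozen base U-Net, a copy carrying a low-rank adapter,
# and the trigger-token prompt construction. Rank and targets are assumptions.
import copy
from diffusers import UNet2DConditionModel
from peft import LoraConfig

base_unet = UNet2DConditionModel(sample_size=64, in_channels=4,
                                 out_channels=4, cross_attention_dim=768)
base_unet.requires_grad_(False)  # the art-naive backbone stays frozen

adapter_unet = copy.deepcopy(base_unet)
adapter_unet.add_adapter(LoraConfig(
    r=8,                    # low-rank dimension (illustrative)
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
))  # only the injected LoRA weights are trainable

STYLE_TRIGGER = "in the style of V* art"

def style_prompt(caption: str) -> str:
    """Step 1: append the special trigger so the style is tied to an explicit tag."""
    return f"{caption} {STYLE_TRIGGER}"

def content_prompt(caption: str) -> str:
    """The plain caption, used for the content loss (no style trigger)."""
    return caption
```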

Diagram of the Art Adapter training pipeline. Style examples guide the model to match both the content and visual character of the target style.

Style Loss ensures generated images match the target style whenever the style trigger is present in the prompt. Here θ denotes the frozen base-model weights, θ′ the adapter weights, X_t the noised image at timestep t, C* the style-tagged prompt, and ε the ground-truth noise:

\[ \mathcal{L}_{\mathbf{S}}(\theta') = \|\epsilon_{\theta \cup \theta'}(X_t, C^*, t) - \epsilon\|^2 \]

Content Loss keeps the adapted model's behaviour close to the frozen base model whenever the style trigger is absent, with C the plain, untagged prompt:

\[ \mathcal{L}_{\mathbf{C}}(\theta') = \|\epsilon_{\theta \cup \theta'}(X_t, C, t) - \epsilon_{\theta}(X_t, C, t)\|^2 \]

They combine them as:

\[ \mathcal{L}(\theta') = \mathcal{L}_{\mathbf{S}}(\theta') + w \cdot \mathcal{L}_{\mathbf{C}}(\theta') \]

This means the model applies the learned style only when explicitly triggered, preserving its natural-image skills otherwise.
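
To make the objective concrete, here is a hedged sketch of a single training step, reusing base_unet, adapter_unet, encode_prompt, style_prompt, and content_prompt from the sketches above together with a standard diffusers noise scheduler; the weight w and the loss reduction are illustrative.

```python
# Hypothetical training step implementing L_S + w * L_C for the Art Style Adapter.
# base_unet, adapter_unet, encode_prompt, style_prompt, and content_prompt come
# from the earlier sketches; w is an assumed hyperparameter value.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
w = 1.0  # content-loss weight (assumed)

def training_step(latents: torch.Tensor, caption: str) -> torch.Tensor:
    noise = torch.randn_like(latents)                                  # epsilon
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)       # X_t

    cond_style = encode_prompt(style_prompt(caption))                  # C*
    cond_plain = encode_prompt(content_prompt(caption))                # C

    # Style loss: with the trigger, the adapted model should predict the true noise.
    pred_style = adapter_unet(noisy_latents, t, encoder_hidden_states=cond_style).sample
    loss_style = F.mse_loss(pred_style, noise)

    # Content loss: without the trigger, the adapted model should match the frozen base model.
    pred_plain = adapter_unet(noisy_latents, t, encoder_hidden_states=cond_plain).sample
    with torch.no_grad():
        pred_base = base_unet(noisy_latents, t, encoder_hidden_states=cond_plain).sample
    loss_content = F.mse_loss(pred_plain, pred_base)

    return loss_style + w * loss_content
```

Because all style-specific change lives in the low-rank adapter while the base weights stay frozen, the adapter can be trained cheaply on a handful of opt-in examples and distributed separately from the base model.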

Experiments and Results

Base Model Performance

Even without art training, Blank Canvas Diffusion produces high-quality photographic images. On general benchmarks like COCO, its scores are slightly lower than those of models trained on hundreds of millions of images, but strong enough to make it a solid backbone.

Performance comparison: Blank Canvas vs other models. While it lags slightly on COCO due to domain mismatch, performance is competitive within its training domain.

Artistic Style Generation

The team trained adapters for 17 artists from WikiArt, enabling both text-prompted art generation and image stylization.

For example, with an adapter for André Derain's Fauvist style, Blank Canvas Diffusion produced vivid, colorful compositions comparable to those of Stable Diffusion 1.4, despite its base model never having seen a painting during pretraining.

Top row: Blank Canvas Diffusion + Art Adapter; Bottom row: Stable Diffusion 1.4 generations.

For Vincent van Gogh stylization, it matched or exceeded specialized baselines in balancing style and content.

Comparison of Van Gogh stylization across multiple baselines, including StyleAligned, Plug-and-Play, InstructPix2Pix, and CycleGAN.

Human Preference Studies

In large-scale Mechanical Turk studies, participants saw three reference images of a style and picked which of two generated outputs matched it better.

Results:

  • 76.2% preference for Blank Canvas + Adapter over Stable Diffusion 1.4 (text prompts) in artistic generation.
  • Strong preference (67%) over StyleAligned when both methods were applied to the Blank Canvas backbone.
  • Consistently outperformed training-free style transfer methods in the art-naïve setting.

User study bar chart (left) showing preference rates. Scatter plot (right) shows strong style–content balance for Blank Canvas + Adapter.

This underscores the paper’s core claim: pre-training on art is not necessary for strong style learning.

Data Attribution Insights

Where does the model find creative substance?

Attribution analysis shows that style cues come from the small opt-in art dataset, while content cues come from photographs in the Blank Canvas dataset.

In a Picasso-style buffet scene, cubist forms come from the opt-in art samples; object shapes and table layout from photo training.

Attribution examples for Picasso, Matisse, and Lichtenstein generations: art dataset gives style; photo dataset gives real-world content.

In a Matisse-style beach bar scene, the model reused natural imagery of bamboo structures while adapting colors and forms to Matisse's style. In effect, it "repaints" the photographic world through learned artistic rules.

Conclusion: A Double-Edged Brushstroke

The Opt-In Art framework shows that high-quality artistic imagery can be learned without mass art pretraining—just a blank canvas model and a small opt-in sample.

On one hand, this enables ethically grounded AI: artists could license adapters for their style, ensuring consent. On the other, it complicates protection—styles can be learned from a few public examples, making “opt-out” insufficient.

Policy debates must expand beyond training data inclusion to address adaptation, attribution, and fair use. Without such considerations, even responsible base models could be tuned to replicate specific styles from minimal material.

This study is both a technical blueprint and an ethical wake-up call. It paints a future where generative AI can respect consent, but also challenges us to rethink how to safeguard creative identity.

The canvas may start blank—but what we choose to fill it with will define the landscape ahead.