Introduction
In the rapidly evolving world of Generative AI, we have witnessed a fragmentation of tools. If you want to generate an image from scratch, you might use Stable Diffusion or Midjourney. If you want to change the style of an existing photo, you might look for a style-transfer adapter. If you want to insert a specific product into a background, you might need a specialized object-insertion model like AnyDoor.
While these specialized “experts” perform well in their silos, they constrain real creative workflows. Creative processes are fluid; we might want to generate a scene, then remove an object, then change the lighting, and finally insert a character from another photo, all in one go.
Enter UniReal, a new framework presented by researchers from The University of Hong Kong and Adobe. UniReal proposes a paradigm shift: instead of building separate models for separate tasks, why not build a single, universal framework that treats every image task—whether it’s creation, editing, or composition—as a form of discontinuous video generation?
By treating input and output images as frames in a video, UniReal learns to model the “dynamics” of the real world—shadows, reflections, and physical interactions—allowing it to handle a massive variety of tasks with a single model.

In this post, we will deconstruct the UniReal paper. We will explore how treating images as video frames solves consistency problems, how the architecture manages complex multi-image inputs, and how the researchers ingeniously used video data to train a master editor.
The Core Problem: The Specialist Trap
To understand why UniReal is significant, we first need to look at the current landscape. Most diffusion-based methods are “specialists.”
- Text-to-Image (T2I): Great at creating new images but terrible at keeping specific details from a reference image.
- Instructive Editing: Models like InstructPix2Pix can follow commands like “make it rain,” but they often struggle with complex structural changes or inserting specific objects.
- Customization: Methods like DreamBooth require fine-tuning a model for every new subject, which is computationally heavy and slow.
The issue is that these tasks share fundamental requirements. They all need to preserve consistency (keeping the dog looking like that dog) while introducing variation (moving the dog to the beach).
The authors of UniReal observed that video generation models naturally solve this balance. A video model generates frame \(t+1\) based on frame \(t\). It must keep the subject consistent while accounting for movement (variation). UniReal asks: What if we treat an edit—like removing a backpack or changing a background—as simply moving from one video frame to another?
The UniReal Method
UniReal is a unified framework designed to address almost any image-level task. It achieves this through a combination of a video-inspired architecture and a sophisticated prompting system.
1. Discontinuous Video Generation
The central thesis of UniReal is that image editing is essentially “discontinuous video generation.”
- Continuous Video: Frame 1 is a man walking. Frame 2 is the man slightly forward. The change is small and temporal.
- Discontinuous Video (UniReal’s view): Frame 1 is a man with a backpack. Frame 2 is the man without the backpack. The “motion” here is the edit itself.
By adopting a video generation architecture, UniReal can treat an arbitrary number of input images (source, references, condition maps) and output images as a single sequence of frames. It uses the full attention found in video transformers to look at all “frames” simultaneously, allowing it to model the relationship between a reference object and the target background in depth.
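To make this framing concrete, here is a minimal sketch (not the authors' code) that packs an edit pair as a two-frame clip; the tensor shapes and the `pack_as_clip` name are illustrative assumptions.

```python
import torch

def pack_as_clip(source_img: torch.Tensor, edited_img: torch.Tensor) -> torch.Tensor:
    """Treat a 'before' and 'after' image as two frames of a discontinuous video.

    Both inputs are (C, H, W) tensors. Stacking them along a new frame axis
    gives a clip of shape (T=2, C, H, W); the 'motion' between frame 0 and
    frame 1 is the edit itself (e.g. the backpack disappearing).
    """
    return torch.stack([source_img, edited_img], dim=0)

clip = pack_as_clip(torch.rand(3, 256, 256), torch.rand(3, 256, 256))
print(clip.shape)  # torch.Size([2, 3, 256, 256])
```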
2. Architecture: Diffusion Transformer with Full Attention
Let’s look under the hood. UniReal does not use the standard U-Net architecture found in older diffusion models. Instead, it uses a Diffusion Transformer (DiT), similar to the architecture behind Sora or Stable Diffusion 3.

As shown in Figure 2 above, the pipeline works as follows (a minimal code sketch follows the list):
- VAE Encoding: All input images (the image to be edited, reference objects, or condition maps) are compressed into a latent space using a VAE encoder.
- Patchification: These latent images and the random noise (for generation) are chopped into small patches, becoming “visual tokens.”
- Token Concatenation: This is the clever part. The model concatenates visual tokens from the input images, the noise tokens for the output, and the text tokens from the prompt into one massive 1D sequence.
- Full Attention Transformer: A transformer processes this long sequence. Because it uses “full attention,” every token can “see” every other token. The noise tokens (where the result will appear) can directly attend to the pixels of the reference dog or the background scene, ensuring high fidelity.
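The PyTorch sketch below mirrors this token-packing flow under stated assumptions: the VAE latents and text embeddings are replaced by random tensors, the patch size and the single `nn.TransformerEncoderLayer` are illustrative stand-ins, and the real DiT additionally uses positional/index embeddings, timestep conditioning, and many stacked blocks.

```python
import torch
import torch.nn as nn

def patchify(latent: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Chop a latent image (B, C, H, W) into visual tokens (B, N, C * patch * patch)."""
    B, C, H, W = latent.shape
    x = latent.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

# Hypothetical sizes: three input images plus one noise latent, all 4x32x32 in VAE space.
B, C, H, W, D = 1, 4, 32, 32, 256
proj = nn.Linear(C * 2 * 2, D)                   # patch embedding
text_tokens = torch.randn(B, 77, D)              # stand-in for the encoded text prompt

latents = [torch.randn(B, C, H, W) for _ in range(3)]   # stand-ins for VAE-encoded inputs
noise = torch.randn(B, C, H, W)                         # the noise "frame" to be denoised

visual_tokens = [proj(patchify(x)) for x in latents + [noise]]
sequence = torch.cat([text_tokens] + visual_tokens, dim=1)   # one long 1-D token sequence

# Full self-attention over the whole sequence: every token can attend to every other token,
# so the noise tokens see the reference objects, the canvas, and the text directly.
block = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
out = block(sequence)
print(out.shape)   # torch.Size([1, 1101, 256]) = (B, 77 + 4 * 256, D)
```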
3. Hierarchical Prompts: Solving Ambiguity
One of the biggest challenges in building a universal model is ambiguity.
Imagine giving a model a picture of a dog and the text “A dog.”
- If the task is editing, the model should keep the dog and perhaps clean up the image.
- If the task is customization, the model should take that dog’s identity and put it in a new pose.
- If the task is control, the image might be a depth map.
To solve this, UniReal introduces Hierarchical Prompts. Rather than relying only on the user’s text, it breaks instructions down into three layers (see the sketch after the list):
- Base Prompt: The user’s actual instruction (e.g., “Put this dog on the grass”).
- Context Prompt: Global settings describing the task type (e.g., “Realistic style,” “With reference object,” “Static scenario”).
- Image Prompt: Labels assigned to each input image to define its role:
  - Canvas: The background or image to edit on.
  - Asset: The object to insert or reference.
  - Control: A depth map, edge map, or mask.
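To see how the three levels could combine in practice, here is a toy sketch that flattens them into one conditioning string. The role names (canvas, asset, control) and the example context prompts come from the paper, but `build_prompt` and the exact template are assumptions, not UniReal’s internal format.

```python
from dataclasses import dataclass
from typing import List

ROLES = {"canvas", "asset", "control"}   # image-level roles described in the paper

@dataclass
class ImageInput:
    name: str   # e.g. "IMG1" (hypothetical label)
    role: str   # "canvas" | "asset" | "control"

def build_prompt(base: str, context: List[str], images: List[ImageInput]) -> str:
    """Flatten base, context, and image prompts into a single conditioning string."""
    assert all(img.role in ROLES for img in images)
    image_part = ", ".join(f"{img.name} is a {img.role} image" for img in images)
    return f"{base}. Context: {'; '.join(context)}. Images: {image_part}."

prompt = build_prompt(
    base="Put this dog on the grass",
    context=["realistic style", "with reference object", "static scenario"],
    images=[ImageInput("IMG1", "canvas"), ImageInput("IMG2", "asset")],
)
print(prompt)
```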

Figure 8 illustrates why this matters. In the top row, changing the image prompt from [Canvas, Asset] to [Canvas, Control] completely changes how the second image is used. In the bottom row, changing the Context Prompt from “Synthetic” to “Realistic” dramatically shifts the artistic style of the output. This hierarchy gives the model precise control over how to use its inputs.
Data Construction: Learning from the World
A model is only as good as its data. The problem with image editing is that large-scale, high-quality “before and after” datasets are rare. Hand-drawing masks or Photoshopping millions of images is impossible.
UniReal leverages videos as a scalable source of supervision. Real-world videos naturally contain the “dynamics” the model needs to learn.

The researchers built an automated pipeline (Figure 3) to harvest training data from raw videos (a toy version of the first step is sketched after the list):
- Video Frame2Frame: They take two random frames from a video. The difference between them (camera movement, subject rotation, lighting change) serves as a training example for “editing.” An LLM generates a caption describing the change.
- Video Multi-object: Using segmentation models (like SAM 2), they cut out objects from one frame and ask the model to generate a different frame containing those objects. This teaches customization and composition.
- Video Control: They extract depth maps or edge maps from video frames to create training pairs for controllable generation.
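The Frame2Frame idea is simple enough to sketch. The toy function below samples a source/target pair with a minimum temporal gap and asks a captioner for an instruction; `caption_change` and `min_gap` are placeholder assumptions, and the real pipeline presumably adds filtering and quality checks.

```python
import random
from typing import Callable, Sequence, Tuple

def make_frame2frame_pair(
    frames: Sequence,            # decoded frames of one video (e.g. PIL images)
    caption_change: Callable,    # stand-in for the model that captions the change
    min_gap: int = 8,            # keep frames far enough apart to show a real change
) -> Tuple[object, object, str]:
    """Sample a (source frame, target frame, instruction) editing triplet."""
    assert len(frames) > min_gap, "video too short for the requested gap"
    i = random.randrange(0, len(frames) - min_gap)
    j = random.randrange(i + min_gap, len(frames))
    src, tgt = frames[i], frames[j]
    instruction = caption_change(src, tgt)   # e.g. "the man turns to face the camera"
    return src, tgt, instruction
```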
By mixing this scalable video data with existing datasets (like InstructPix2Pix), UniReal learns robust physical interactions (shadows, perspective) that are often missing from synthetic editing datasets.
Table 1 breaks down the data sources. Notice the heavy reliance on the newly constructed video datasets (bottom block), comprising millions of samples.

Experiments and Capabilities
UniReal demonstrates state-of-the-art performance across several difficult tasks. Let’s look at the results.
1. Instructive Image Editing
This task involves changing an image based on a text command (e.g., “Add an elephant”).

In Figure 4, compare UniReal (far right) to competitors like InstructPix2Pix or OmniGen.
- Row 1 (Elephant): UniReal integrates the elephant into the water realistically, including reflections and submerged parts. Other models often paste the elephant on top like a sticker.
- Row 2 (Remove Toy): UniReal completely removes the yellow duck and convincingly fills in the rocks that were hidden behind it.
- Row 3 (Ants): The prompt “Small ants lift up the car” is complex. UniReal actually renders ants interacting with the car, whereas other models struggle to visualize the interaction.
2. Quantitative Evaluation
The researchers backed up these visuals with hard numbers, using metrics such as CLIP (to measure how well the image matches the text) and DINO (to measure visual similarity to the source or reference image).

As shown in Table 2, UniReal consistently scores higher on CLIP_out (how well the output matches the target caption) and CLIP_dir (how well the change in the image follows the change described in the text) compared to specialized editors like UltraEdit or generalists like OmniGen.
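As a rough illustration of how a directional CLIP score can be computed (not necessarily the exact protocol behind Table 2, whose backbone and normalization choices are the authors’), here is a sketch using the Hugging Face CLIP implementation; `clip_directional` and its arguments are assumed names.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(img):
    inputs = processor(images=img, return_tensors="pt")
    return model.get_image_features(**inputs)

def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)

def clip_directional(src_img, edited_img, src_caption, tgt_caption) -> float:
    """Cosine similarity between the image-embedding change and the text-embedding change."""
    with torch.no_grad():
        d_img = embed_image(edited_img) - embed_image(src_img)
        d_txt = embed_text(tgt_caption) - embed_text(src_caption)
    return torch.nn.functional.cosine_similarity(d_img, d_txt).item()
```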
3. Image Customization & Object Insertion
This is arguably the hardest task in generative AI: taking a specific object (not just “a dog,” but “this specific toy dog”) and putting it in a new context.

In Figure 5, look at the first row. The task is to put the blue “Tuetan” can onto a table of fruit.
- UniReal preserves the text on the can and its metallic texture while lighting it correctly for the new scene.
- OmniGen and others often distort the logo or lose the shape of the can.
The bottom row shows Multi-Subject Composition (fighting plushies). UniReal manages to keep the identity of both toys distinct, while other models often blend their features together.
4. Comparison with AnyDoor
AnyDoor is a popular model specifically designed for object insertion. However, it usually requires user-provided masks to tell it where to put the object.

Figure 7 highlights a major advantage of UniReal: it doesn’t need masks. It infers the location from the text.
- Row 1 (Dog): The model puts the dog in the pool. Notice the refraction in the water. AnyDoor struggles to blend the dog’s fur with the water surface.
- Row 3 (Person): UniReal matches the lighting and color tone of the background perfectly, making the person look like they were originally in the photo.
Why Video Data is the “Secret Sauce”
One of the most interesting findings in the paper is the ablation study regarding training data. The researchers trained a version of UniReal only on the Video Frame2Frame dataset—meaning it never saw explicit “editing” instructions, just pairs of video frames.

As Figure 9 shows, the model trained only on video (third column) can already perform tasks like adding a dog or changing a car’s color. This supports the hypothesis: learning the natural variations between video frames teaches a model how to edit images. The full model (last column) is sharper and follows instructions more faithfully, but the core capability comes from the video data.
Emergent Capabilities and Future Potential
Because UniReal was trained on such a diverse set of tasks and data, it exhibits “emergent capabilities”—skills it wasn’t explicitly trained for but can perform by combining its knowledge.

Figure 10 (Right Block) showcases these novel abilities:
- Multi-object Insertion: Adding multiple distinct items (toys, backpacks) into a scene seamlessly.
- Local Reference Editing: Transferring a hairstyle from a reference image to a target person without changing their face.
- Layer-aware Editing: Placing an object “behind” existing objects (like the elephant appearing behind the fence) without needing a depth map input.
Conclusion
UniReal represents a significant step toward a “General Vision Intelligence.” By moving away from specialized, task-specific pipelines and embracing a unified, video-inspired framework, it solves multiple problems at once.
The key takeaways for students and researchers are:
- Unified Architectures Win: A single powerful transformer (DiT) with full attention can outperform specialized U-Nets if the data is structured correctly.
- Video is a Super-Signal: Static images are limited. Videos contain the physics, lighting, and 3D understanding required for realistic editing.
- Prompt Engineering is Architectural: The Hierarchical Prompt system shows that how we ask the model to do something is just as important as the model itself.
UniReal suggests a future where we won’t need a “Background Remover” and a “Style Transfer-er.” We will simply have a visual engine that understands the dynamics of the world, ready to edit reality frame by frame.