Introduction
In the rapidly evolving world of Generative AI, we have witnessed a fragmentation of tools. If you want to generate an image from scratch, you might use Stable Diffusion or Midjourney. If you want to change the style of an existing photo, you might look for a style-transfer adapter. If you want to insert a specific product into a background, you might need a specialized object-insertion model like AnyDoor.
While these specialized “experts” perform well in their silos, they constrain real creative workflows. Creative processes are fluid; we might want to generate a scene, then remove an object, then change the lighting, and finally insert a character from another photo, all in one go.
Enter UniReal, a new framework presented by researchers from The University of Hong Kong and Adobe. UniReal proposes a paradigm shift: instead of building separate models for separate tasks, why not build a single, universal framework that treats every image task—whether it’s creation, editing, or composition—as a form of discontinuous video generation?
By treating input and output images as frames in a video, UniReal learns to model the “dynamics” of the real world—shadows, reflections, and physical interactions—allowing it to handle a massive variety of tasks with a single model.

In this post, we will deconstruct the UniReal paper. We will explore how treating images as video frames solves consistency problems, how the architecture manages complex multi-image inputs, and how the researchers ingeniously used video data to train a master editor.
The Core Problem: The Specialist Trap
To understand why UniReal is significant, we first need to look at the current landscape. Most diffusion-based methods are “specialists.”
- Text-to-Image (T2I): Great at creating new images but terrible at keeping specific details from a reference image.
- Instructive Editing: Models like InstructPix2Pix can follow commands like “make it rain,” but they often struggle with complex structural changes or inserting specific objects.
- Customization: Methods like DreamBooth require fine-tuning a model for every new subject, which is computationally heavy and slow.
The issue is that these tasks share fundamental requirements. They all need to preserve consistency (keeping the dog looking like that dog) while introducing variation (moving the dog to the beach).
The authors of UniReal observed that video generation models naturally solve this balance. A video model generates frame \(t+1\) based on frame \(t\). It must keep the subject consistent while accounting for movement (variation). UniReal asks: What if we treat an edit—like removing a backpack or changing a background—as simply moving from one video frame to another?
The UniReal Method
UniReal is a unified framework designed to address almost any image-level task. It achieves this through a combination of a video-inspired architecture and a sophisticated prompting system.
1. Discontinuous Video Generation
The central thesis of UniReal is that image editing is essentially “discontinuous video generation.”
- Continuous Video: Frame 1 is a man walking. Frame 2 is the man slightly forward. The change is small and temporal.
- Discontinuous Video (UniReal’s view): Frame 1 is a man with a backpack. Frame 2 is the man without the backpack. The “motion” here is the edit itself.
By adopting a video generation architecture, UniReal can treat an arbitrary number of input images (source, references, condition maps) and output images as a single sequence of frames. It uses the full attention found in video transformers to look at all “frames” simultaneously, allowing it to model the relationship between a reference object and the target background in depth.
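To make this framing concrete, here is a minimal sketch (not the authors' code) that packs an edit pair as a two-frame clip; the tensor shapes and the `pack_as_clip` name are illustrative assumptions.

```python
import torch

def pack_as_clip(source_img: torch.Tensor, edited_img: torch.Tensor) -> torch.Tensor:
    """Treat a 'before' and 'after' image as two frames of a discontinuous video.

    Both inputs are (C, H, W) tensors. Stacking them along a new frame axis
    gives a clip of shape (T=2, C, H, W); the 'motion' between frame 0 and
    frame 1 is the edit itself (e.g. the backpack disappearing).
    """
    return torch.stack([source_img, edited_img], dim=0)

clip = pack_as_clip(torch.rand(3, 256, 256), torch.rand(3, 256, 256))
print(clip.shape)  # torch.Size([2, 3, 256, 256])
```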
2. Architecture: Diffusion Transformer with Full Attention
Let’s look under the hood. UniReal does not use the standard U-Net architecture found in older diffusion models. Instead, it uses a Diffusion Transformer (DiT), similar to the architecture behind Sora or Stable Diffusion 3.

As shown in Figure 2 above, the pipeline works as follows (a minimal code sketch follows the list):
- VAE Encoding: All input images (the image to be edited, reference objects, or condition maps) are compressed into a latent space using a VAE encoder.
- Patchification: These latent images and the random noise (for generation) are chopped into small patches, becoming “visual tokens.”
- Token Concatenation: This is the clever part. The model concatenates visual tokens from the input images, the noise tokens for the output, and the text tokens from the prompt into one massive 1D sequence.
- Full Attention Transformer: A transformer processes this long sequence. Because it uses “full attention,” every token can “see” every other token. The noise tokens (where the result will appear) can directly attend to the pixels of the reference dog or the background scene, ensuring high fidelity.
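The PyTorch sketch below mirrors this token-packing flow under stated assumptions: the VAE latents and text embeddings are replaced by random tensors, the patch size and the single `nn.TransformerEncoderLayer` are illustrative stand-ins, and the real DiT additionally uses positional/index embeddings, timestep conditioning, and many stacked blocks.

```python
import torch
import torch.nn as nn

def patchify(latent: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Chop a latent image (B, C, H, W) into visual tokens (B, N, C * patch * patch)."""
    B, C, H, W = latent.shape
    x = latent.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

# Hypothetical sizes: three input images plus one noise latent, all 4x32x32 in VAE space.
B, C, H, W, D = 1, 4, 32, 32, 256
proj = nn.Linear(C * 2 * 2, D)                   # patch embedding
text_tokens = torch.randn(B, 77, D)              # stand-in for the encoded text prompt

latents = [torch.randn(B, C, H, W) for _ in range(3)]   # stand-ins for VAE-encoded inputs
noise = torch.randn(B, C, H, W)                         # the noise "frame" to be denoised

visual_tokens = [proj(patchify(x)) for x in latents + [noise]]
sequence = torch.cat([text_tokens] + visual_tokens, dim=1)   # one long 1-D token sequence

# Full self-attention over the whole sequence: every token can attend to every other token,
# so the noise tokens see the reference objects, the canvas, and the text directly.
block = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
out = block(sequence)
print(out.shape)   # torch.Size([1, 1101, 256]) = (B, 77 + 4 * 256, D)
```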
3. Hierarchical Prompts: Solving Ambiguity
One of the biggest challenges in building a universal model is ambiguity.
Imagine giving a model a picture of a dog and the text “A dog.”
- If the task is editing, the model should keep the dog and perhaps clean up the image.
- If the task is customization, the model should take that dog’s identity and put it in a new pose.
- If the task is control, the image might be a depth map.
To solve this, UniReal introduces Hierarchical Prompts. Rather than relying only on the user’s text, it breaks instructions down into three layers (see the sketch after the list):
- Base Prompt: The user’s actual instruction (e.g., “Put this dog on the grass”).
- Context Prompt: Global settings describing the task type (e.g., “Realistic style,” “With reference object,” “Static scenario”).
- Image Prompt: Labels assigned to each input image to define its role:
  - Canvas: The background or image to edit on.
  - Asset: The object to insert or reference.
  - Control: A depth map, edge map, or mask.
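To see how the three levels could combine in practice, here is a toy sketch that flattens them into one conditioning string. The role names (canvas, asset, control) and the example context prompts come from the paper, but `build_prompt` and the exact template are assumptions, not UniReal’s internal format.

```python
from dataclasses import dataclass
from typing import List

ROLES = {"canvas", "asset", "control"}   # image-level roles described in the paper

@dataclass
class ImageInput:
    name: str   # e.g. "IMG1" (hypothetical label)
    role: str   # "canvas" | "asset" | "control"

def build_prompt(base: str, context: List[str], images: List[ImageInput]) -> str:
    """Flatten base, context, and image prompts into a single conditioning string."""
    assert all(img.role in ROLES for img in images)
    image_part = ", ".join(f"{img.name} is a {img.role} image" for img in images)
    return f"{base}. Context: {'; '.join(context)}. Images: {image_part}."

prompt = build_prompt(
    base="Put this dog on the grass",
    context=["realistic style", "with reference object", "static scenario"],
    images=[ImageInput("IMG1", "canvas"), ImageInput("IMG2", "asset")],
)
print(prompt)
```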

Figure 8 illustrates why this matters. In the top row, changing the image prompt from [Canvas, Asset] to [Canvas, Control] completely changes how the second image is used. In the bottom row, changing the Context Prompt from “Synthetic” to “Realistic” dramatically shifts the artistic style of the output. This hierarchy gives the model precise control over how to use its inputs.
Data Construction: Learning from the World
A model is only as good as its data. The problem with image editing is that large-scale, high-quality “before and after” datasets are rare. Hand-drawing masks or Photoshopping millions of images is impossible.
UniReal leverages videos as a scalable source of supervision. Real-world videos naturally contain the “dynamics” the model needs to learn.

The researchers built an automated pipeline (Figure 3) to harvest training data from raw videos (a toy version of the first step is sketched after the list):
- Video Frame2Frame: They take two random frames from a video. The difference between them (camera movement, subject rotation, lighting change) serves as a training example for “editing.” An LLM generates a caption describing the change.
- Video Multi-object: Using segmentation models (like SAM 2), they cut out objects from one frame and ask the model to generate a different frame containing those objects. This teaches customization and composition.
- Video Control: They extract depth maps or edge maps from video frames to create training pairs for controllable generation.
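The Frame2Frame idea is simple enough to sketch. The toy function below samples a source/target pair with a minimum temporal gap and asks a captioner for an instruction; `caption_change` and `min_gap` are placeholder assumptions, and the real pipeline presumably adds filtering and quality checks.

```python
import random
from typing import Callable, Sequence, Tuple

def make_frame2frame_pair(
    frames: Sequence,            # decoded frames of one video (e.g. PIL images)
    caption_change: Callable,    # stand-in for the model that captions the change
    min_gap: int = 8,            # keep frames far enough apart to show a real change
) -> Tuple[object, object, str]:
    """Sample a (source frame, target frame, instruction) editing triplet."""
    assert len(frames) > min_gap, "video too short for the requested gap"
    i = random.randrange(0, len(frames) - min_gap)
    j = random.randrange(i + min_gap, len(frames))
    src, tgt = frames[i], frames[j]
    instruction = caption_change(src, tgt)   # e.g. "the man turns to face the camera"
    return src, tgt, instruction
```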
By mixing this scalable video data with existing datasets (like InstructPix2Pix), UniReal learns robust physical interactions (shadows, perspective) that are often missing from synthetic editing datasets.
Table 1 breaks down the data sources. Notice the heavy reliance on the newly constructed video datasets (bottom block), comprising millions of samples.

Experiments and Capabilities
UniReal demonstrates state-of-the-art performance across several difficult tasks. Let’s look at the results.
1. Instructive Image Editing
This task involves changing an image based on a text command (e.g., “Add an elephant”).

In Figure 4, compare UniReal (far right) to competitors like InstructPix2Pix or OmniGen.
- Row 1 (Elephant): UniReal integrates the elephant into the water realistically, including reflections and submerged parts. Other models often paste the elephant on top like a sticker.
- Row 2 (Remove Toy): UniReal completely removes the yellow duck and convincingly fills in the rocks that were hidden behind it.
- Row 3 (Ants): The prompt “Small ants lift up the car” is complex. UniReal actually renders ants interacting with the car, whereas other models struggle to visualize the interaction.
2. Quantitative Evaluation
The researchers backed up these visuals with hard numbers, using metrics such as CLIP (to measure how well the image matches the text) and DINO (to measure visual similarity to the source or reference image).

As shown in Table 2, UniReal consistently scores higher on CLIP_out (how well the output matches the target caption) and CLIP_dir (how well the change in the image follows the change described in the text) compared to specialized editors like UltraEdit or generalists like OmniGen.
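As a rough illustration of how a directional CLIP score can be computed (not necessarily the exact protocol behind Table 2, whose backbone and normalization choices are the authors’), here is a sketch using the Hugging Face CLIP implementation; `clip_directional` and its arguments are assumed names.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(img):
    inputs = processor(images=img, return_tensors="pt")
    return model.get_image_features(**inputs)

def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)

def clip_directional(src_img, edited_img, src_caption, tgt_caption) -> float:
    """Cosine similarity between the image-embedding change and the text-embedding change."""
    with torch.no_grad():
        d_img = embed_image(edited_img) - embed_image(src_img)
        d_txt = embed_text(tgt_caption) - embed_text(src_caption)
    return torch.nn.functional.cosine_similarity(d_img, d_txt).item()
```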
3. Image Customization & Object Insertion
This is arguably the hardest task in generative AI: taking a specific object (not just “a dog,” but “this specific toy dog”) and putting it in a new context.

In Figure 5, look at the first row. The task is to put the blue “Tuetan” can onto a table of fruit.
- UniReal preserves the text on the can and its metallic texture while lighting it correctly for the new scene.
- OmniGen and others often distort the logo or lose the shape of the can.
The bottom row shows Multi-Subject Composition (fighting plushies). UniReal manages to keep the identity of both toys distinct, while other models often blend their features together.
4. Comparison with AnyDoor
AnyDoor is a popular model specifically designed for object insertion. However, it usually requires user-provided masks to tell it where to put the object.

Figure 7 highlights a major advantage of UniReal: it doesn’t need masks. It infers the location from the text.
- Row 1 (Dog): The model puts the dog in the pool. Notice the refraction in the water. AnyDoor struggles to blend the dog’s fur with the water surface.
- Row 3 (Person): UniReal matches the lighting and color tone of the background perfectly, making the person look like they were originally in the photo.
Why Video Data is the “Secret Sauce”
One of the most interesting findings in the paper is the ablation study regarding training data. The researchers trained a version of UniReal only on the Video Frame2Frame dataset—meaning it never saw explicit “editing” instructions, just pairs of video frames.

As Figure 9 shows, the model trained only on video (third column) can already perform tasks like adding a dog or changing a car’s color. This supports the hypothesis: learning the natural variations between video frames teaches a model how to edit images. The full model (last column) is sharper and follows instructions more faithfully, but the core capability comes from the video data.
Emergent Capabilities and Future Potential
Because UniReal was trained on such a diverse set of tasks and data, it exhibits “emergent capabilities”—skills it wasn’t explicitly trained for but can perform by combining its knowledge.

Figure 10 (Right Block) showcases these novel abilities:
- Multi-object Insertion: Adding multiple distinct items (toys, backpacks) into a scene seamlessly.
- Local Reference Editing: Transferring a hairstyle from a reference image to a target person without changing their face.
- Layer-aware Editing: Placing an object “behind” existing objects (like the elephant appearing behind the fence) without needing a depth map input.
Conclusion
UniReal represents a significant step toward a “General Vision Intelligence.” By moving away from specialized, task-specific pipelines and embracing a unified, video-inspired framework, it solves multiple problems at once.
The key takeaways for students and researchers are:
- Unified Architectures Win: A single powerful transformer (DiT) with full attention can outperform specialized U-Nets if the data is structured correctly.
- Video is a Super-Signal: Static images are limited. Videos contain the physics, lighting, and 3D understanding required for realistic editing.
- Prompt Engineering is Architectural: The Hierarchical Prompt system shows that how we ask the model to do something is just as important as the model itself.
UniReal suggests a future where we won’t need a “Background Remover” and a “Style Transfer-er.” We will simply have a visual engine that understands the dynamics of the world, ready to edit reality frame by frame.