Have you ever tried to get an AI image generator to create something exactly as you pictured it? Maybe you wanted to capture the distinctive art style of a niche painter, the rough texture of a vintage fabric, or the precise golden-hour lighting of a photograph you love. You type out a detailed prompt, but the words just don’t quite capture the nuance. You think, “If only I could show it what I mean.”

That gap between words and visuals is a fundamental limitation of current generative AI models. They can turn text into striking images, but language alone is a blunt instrument for describing fine visual details. Meanwhile, models that can learn from images—so-called subject-driven generators—typically copy concrete elements, like a specific person or object, into a new scene. They often fail to carry over abstract elements such as mood, pose, material, or artistic style.

A recent research paper, “DreamOmni2: Multimodal Instruction-Based Editing and Generation,” tackles this problem directly. The researchers introduce DreamOmni2, a unified framework that understands instructions from both text and reference images. This “multimodal” approach lets users guide AI image editing and generation with far greater precision, going beyond object swaps to abstract attributes like texture, lighting, composition, and overall style.

A gallery of images showcasing the diverse capabilities of DreamOmni2, from swapping objects and outfits to matching artistic styles and poses.

Figure 1: DreamOmni2 can handle a broad range of creative tasks, enabling users to combine concepts from multiple images to edit or generate new ones.

This article explores how DreamOmni2 works—how it creates its multimodal dataset, how the framework learns to handle multiple reference images, and how it performs compared to leading open-source and commercial models.


The Limits of Words and Pictures Alone

To understand DreamOmni2’s solution, let’s first revisit the two dominant techniques it improves upon:

  1. Instruction-based Editing – Models like InstructPix2Pix can transform a photo using text commands such as “turn the apple into an orange.” This works well for simple edits, but struggles with complex visual descriptions—how do you precisely describe a dress’s intricate pattern or a painter’s brushstroke quality using only words?

  2. Subject-driven Generation – Models such as DreamBooth or IP-Adapter learn subjects from one or more photographs, allowing you to create new images featuring that subject (like your dog on the moon). However, these methods focus on tangible subjects rather than abstract traits like style, pose, or lighting conditions.

DreamOmni2 bridges this gap with two new multimodal tasks, sketched in code after this list:

  • Multimodal Instruction-based Editing: Modify an image using both text and visual references. Example: “Make the bag in the first image have the same leather texture as the jacket in the second.”
  • Multimodal Instruction-based Generation: Create a new image guided by a prompt and multiple reference images. Example: “Generate a person with the pose from image 1, wearing the outfit from image 2, under the lighting of image 3.”
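
To make the two request formats concrete, here is a minimal sketch of how such inputs could be represented. The dataclasses and field names are illustrative assumptions, not part of DreamOmni2’s actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalEditRequest:
    """One editing request: a source image plus references and an instruction."""
    source_image: str            # the image being edited
    reference_images: List[str]  # images supplying textures, styles, poses, ...
    instruction: str             # text that ties the references together


@dataclass
class MultimodalGenerationRequest:
    """One generation request: no source image, everything comes from references."""
    reference_images: List[str]
    instruction: str


edit = MultimodalEditRequest(
    source_image="bag.png",
    reference_images=["leather_jacket.png"],
    instruction="Make the bag in the first image have the same leather "
                "texture as the jacket in the second.",
)

gen = MultimodalGenerationRequest(
    reference_images=["pose.png", "outfit.png", "lighting.png"],
    instruction="Generate a person with the pose from image 1, wearing the "
                "outfit from image 2, under the lighting of image 3.",
)
```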

But to train a model to do this, the researchers faced a major challenge: how do you build a dataset containing source images, reference images, textual instructions, and target results—when no such datasets exist?


Building the DreamOmni2 Data Factory

The team designed a clever three-stage data synthesis pipeline to construct the enormous, high-quality dataset needed for multimodal editing and generation.

A diagram showing the three-stage data construction pipeline: creating data for an extraction model, then using it to generate multimodal editing data, and finally creating multimodal generation data.

Figure 2: DreamOmni2’s data factory operates in three stages. Each builds upon the previous one to create complex multimodal training examples.

Stage 1: Extracting Concrete and Abstract Concepts with Feature Mixing

First, they needed to train a model to “extract” elements such as objects or abstract attributes (texture, style, pose) from an image.

Previous approaches used a diptych method—generating two images side-by-side—but this reduced resolution and caused artifacts along the dividing line. DreamOmni2 introduces a feature mixing scheme, mixing attention features between two generation branches. This process encourages the model to produce paired images that share a selected attribute while remaining distinct—for instance, a source image of a cat and a target image of a dog both rendered in “Van Gogh style.”

Feature mixing produces full-resolution, clean image pairs that accurately convey shared attributes. These paired samples train an extraction model to recognize and reproduce both concrete objects and abstract properties.
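
One simple way to mix attention features between two generation branches is to let each branch attend over the concatenation of both branches’ keys and values. The PyTorch sketch below illustrates that idea only, under this assumption; it is not the paper’s official implementation.

```python
import torch
import torch.nn.functional as F


def mixed_attention(q_a, k_a, v_a, q_b, k_b, v_b):
    """Joint attention across two generation branches.

    Each branch attends over the concatenation of both branches' keys and
    values, so the two denoising processes share features and the resulting
    image pair tends to share the selected attribute (for example, the same
    painting style). Tensors are shaped (batch, heads, tokens, head_dim).
    """
    k_joint = torch.cat([k_a, k_b], dim=2)  # keys from both branches
    v_joint = torch.cat([v_a, v_b], dim=2)  # values from both branches
    out_a = F.scaled_dot_product_attention(q_a, k_joint, v_joint)
    out_b = F.scaled_dot_product_attention(q_b, k_joint, v_joint)
    return out_a, out_b


# Toy shapes: 1 sample, 8 heads, 64 image tokens, 64-dim heads per branch.
q_a, k_a, v_a = (torch.randn(1, 8, 64, 64) for _ in range(3))
q_b, k_b, v_b = (torch.randn(1, 8, 64, 64) for _ in range(3))
out_a, out_b = mixed_attention(q_a, k_a, v_a, q_b, k_b, v_b)
```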

Stage 2: Synthesizing Multimodal Editing Data

Once trained, the extraction model powers the creation of multimodal editing data:

  1. Start with a Target Image: Choose a final image, like “a plush toy dog on a beach.”
  2. Create a Reference Image: Use the extraction model to isolate the attribute—e.g., extract a “plush material” texture.
  3. Create a Source Image: Apply an instruction-based editing model to alter the concept (turn the plush dog into a metal one).
  4. Generate the Instruction: Use a large language model (LLM) to write a natural instruction describing the edit, such as “Make the dog in the image have the same plush material as in the reference.”

Each tuple—source image, reference image, text instruction, and target image—becomes a single training example.
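
A schematic of this loop, with hypothetical interfaces standing in for the extraction model, the instruction-based editor, and the LLM, might look like this:

```python
# Schematic of the Stage 2 loop. `extraction_model`, `editing_model`, and
# `llm` are hypothetical stand-ins for the models described above, with
# made-up method names; they are not real APIs.

def synthesize_editing_example(target_image, attribute,
                               extraction_model, editing_model, llm):
    # 1. The chosen target image is the ground-truth result of the edit.
    # 2. Isolate the attribute (e.g. "plush material") as a reference image.
    reference_image = extraction_model.extract(target_image, attribute)
    # 3. Alter that attribute to obtain the source image
    #    (e.g. turn the plush dog into a metal one).
    source_image = editing_model.edit(
        target_image, prompt=f"change the {attribute}"
    )
    # 4. Have an LLM phrase the edit as a natural multimodal instruction.
    instruction = llm.write_instruction(source_image, reference_image,
                                        target_image)
    # The tuple below is one training example.
    return {
        "source": source_image,
        "reference": reference_image,
        "instruction": instruction,
        "target": target_image,
    }
```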

Stage 3: Synthesizing Multimodal Generation Data

The third stage extends this setup to generation. The extraction model pulls multiple reference images, each capturing a different attribute (an object, a texture, a lighting condition), from the images produced in the earlier stages. For instance, given references of a “metal dog,” “beach,” and “plush texture,” the model learns to generate a “plush dog on a beach.”
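
Continuing the sketch above with the same hypothetical interfaces, a Stage 3 example could be assembled roughly as follows:

```python
# Stage 3, continuing with the same hypothetical interfaces: several
# attributes are pulled out into separate reference images and paired with
# an LLM-written prompt that combines them.

def synthesize_generation_example(target_image, attributes,
                                  extraction_model, llm):
    # One reference image per attribute, e.g. "dog", "beach", "plush texture".
    references = [extraction_model.extract(target_image, attr)
                  for attr in attributes]
    # The LLM writes a prompt such as "Generate the dog from image 1 on the
    # beach from image 2, with the plush texture from image 3."
    instruction = llm.write_generation_prompt(references, target_image)
    return {
        "references": references,
        "instruction": instruction,
        "target": target_image,
    }
```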

The resulting dataset covers both concrete objects (people, pets, furniture, accessories) and abstract attributes, divided into local attributes (hairstyle, material, facial expression) and global attributes (style, lighting, color).

Two pie charts showing the distribution of data for editing and generation tasks, broken down by concrete objects and local/global attributes, alongside example images.

Figure 3: DreamOmni2’s dataset spans a diverse mix of concrete objects and abstract attributes, creating a versatile training base for both editing and generation.


The DreamOmni2 Framework

Creating the dataset solved one problem, but enabling the model to process multiple image inputs required rethinking the architecture. Standard Diffusion Transformer (DiT) models struggle to distinguish which part of an instruction refers to which image. DreamOmni2 introduces two critical innovations.

1. Index Encoding and Position Encoding Shift

When users refer to “image 1” or “image 2” in prompts, the model needs clear indexing. DreamOmni2 introduces Index Encoding, adding a tag to identify each image’s position in the input set. This helps the model interpret complex multimodal instructions accurately.

However, indexing alone does not prevent “copy-paste” artifacts, where content from one reference leaks into the output because the model confuses overlapping spatial positions. DreamOmni2 therefore offsets the position encoding of each reference image, ensuring the model treats them as distinct visual spaces. Together, the two encodings keep the inputs from blending into one another and preserve clean, context-aware composition.
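
A rough sketch of how these two encodings could be applied to reference-image tokens is shown below. It is a deliberately simplified 1-D illustration with made-up class and parameter names, not the official implementation, which operates inside a Diffusion Transformer.

```python
import torch
import torch.nn as nn


class ReferenceImagePacker(nn.Module):
    """Tags each reference image's tokens with an index embedding and shifts
    its positions into a disjoint range, so "image 1" and "image 2" never
    share the same coordinates."""

    def __init__(self, dim, max_images=4, tokens_per_image=1024):
        super().__init__()
        self.index_embed = nn.Embedding(max_images, dim)  # index encoding
        self.tokens_per_image = tokens_per_image          # size of the shift

    def forward(self, image_token_seqs):
        packed_tokens, packed_positions = [], []
        for i, tokens in enumerate(image_token_seqs):     # tokens: (n, dim)
            n = tokens.shape[0]
            # Index encoding: mark every token as belonging to image i.
            tokens = tokens + self.index_embed(torch.tensor(i))
            # Position encoding shift: image i gets its own position range.
            positions = torch.arange(n) + i * self.tokens_per_image
            packed_tokens.append(tokens)
            packed_positions.append(positions)
        return torch.cat(packed_tokens), torch.cat(packed_positions)


# Example: two reference images of 16 tokens each, 64-dimensional features.
packer = ReferenceImagePacker(dim=64)
tokens, positions = packer([torch.randn(16, 64), torch.randn(16, 64)])
```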

2. Joint Training with a Vision-Language Model (VLM)

Real-world user instructions can be ambiguous or inconsistent. To make the system robust, the authors jointly train DreamOmni2 with a Vision-Language Model (VLM), specifically Qwen2.5-VL. The VLM interprets messy, natural-language prompts and reformats them into precise inputs the generation model can understand.

This joint training drastically improves the system’s comprehension of complex, multimodal user intents, yielding more accurate edits and generations.
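
At inference time, this setup suggests a simple two-step flow: the VLM rewrites the user’s free-form request, and the generation model then conditions on the cleaned-up instruction together with the reference images. The wrapper objects in the sketch below are hypothetical, not real APIs.

```python
# Inference-time flow implied by the joint training: the VLM first rewrites
# the user's free-form request, then the generation model conditions on the
# cleaned-up instruction plus the reference images. `vlm` and `dreamomni2`
# are hypothetical wrappers with made-up methods, not real APIs.

def run_multimodal_edit(vlm, dreamomni2, user_prompt,
                        source_image, reference_images):
    # Messy input, e.g. "make my bag look kinda like that jacket's leather".
    clean_instruction = vlm.rewrite(
        prompt=user_prompt,
        images=[source_image, *reference_images],
    )
    # e.g. "Make the bag in image 1 have the leather texture of image 2."
    return dreamomni2.edit(
        source=source_image,
        references=reference_images,
        instruction=clean_instruction,
    )
```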


Evaluating DreamOmni2

Existing benchmarks like DreamBooth and OmniContext don’t test the new multimodal scenarios DreamOmni2 supports. The researchers therefore created the DreamOmni2 Benchmark, comprising real-world images and test cases covering both editing and generation of abstract and concrete concepts.

A table comparing the DreamOmni2 benchmark to existing benchmarks like DreamBooth and OmniContext, showing that only DreamOmni2 covers editing, generation, multiple references, and abstract attributes.

Table 1: DreamOmni2 fills a key gap in evaluation benchmarks, uniquely supporting multimodal editing and generation with multiple references and abstract attributes.


Multimodal Instruction-Based Editing Performance

DreamOmni2 was tested against several models, including open-source systems (DreamO, OmniGen2, Qwen-Image-Edit) and closed-source commercial ones (GPT-4o, Nano Banana).

A visual comparison of different models performing the same editing tasks. DreamOmni2’s results are consistently more accurate and faithful to the prompt.

Figure 4: DreamOmni2 demonstrates clean, accurate edits and strong adherence to multimodal instructions compared to competing methods.

A table showing quantitative results for multimodal editing, where DreamOmni2 achieves the highest scores in human evaluations and is highly competitive with commercial models.

Table 2: Quantitative comparisons for editing show DreamOmni2 outperforming all open-source models and matching or exceeding commercial systems in human evaluations.

Human evaluators found DreamOmni2 produced the most precise edits, outperforming even GPT-4o and Nano Banana. Other models often introduced unintended alterations or color tints—issues easily caught by humans but missed by automated scoring systems.


Multimodal Instruction-Based Generation Performance

The story is similar for generation tasks. DreamOmni2 outperformed open-source models and even matched the visual quality of GPT-4o while exceeding Nano Banana’s consistency.

A visual comparison for image generation tasks. DreamOmni2 produces results that are more coherent and consistent with the multiple reference images.

Figure 5: In generation tasks, DreamOmni2 successfully merges elements from multiple images, maintaining stylistic coherence and precision.

A table showing quantitative results for multimodal generation. DreamOmni2 again leads the pack, especially in human evaluations.

Table 3: DreamOmni2 achieves generation quality on par with GPT-4o and surpasses other models across every evaluation metric.


Why These Innovations Matter: Ablation Studies

To validate the importance of each architectural element, the team conducted systematic ablation tests, removing one component at a time.

  • Joint VLM Training: As shown below, jointly training the VLM and generation model (Scheme 4) achieves far better results than running them separately, proving the benefit of tight coupling.
  • Encoding Schemes: Including both Index Encoding and Position Encoding Shift yields the best multimodal handling performance.

A table from the ablation study showing that the combination of index encoding and position encoding shift is critical for the best performance.

Table 5: Ablation results confirm that both index and position encoding are essential for optimal performance when processing multiple reference images.


Conclusion: Toward Truly Intuitive AI Creation

DreamOmni2 marks a significant leap in making generative AI a more expressive and intuitive tool. By integrating both text and image as instructions, it moves beyond vague word-based interaction and enables precise visual guidance.

The model’s dual innovations—a robust three-stage data synthesis pipeline and an architecture built for multi-image understanding—allow it to handle concepts ranging from concrete subjects to subtle abstract qualities like lighting or texture.

With DreamOmni2, we move closer to an era of creativity where you can simply tell an AI, “Make it look like this,” and it truly understands what you mean.