Have you ever tried to get an AI image generator to create something exactly as you pictured it? Maybe you wanted to capture the distinctive art style of a niche painter, the rough texture of a vintage fabric, or the precise golden-hour lighting of a photograph you love. You type out a detailed prompt, but the words just don’t quite capture the nuance. You think, “If only I could show it what I mean.”
That gap between words and visuals is a fundamental limitation of current generative AI models. They can turn text into striking images, but language alone is a blunt instrument for describing fine visual details. Meanwhile, models that can learn from images—so-called subject-driven generators—typically copy concrete elements, like a specific person or object, into a new scene. They often fail to carry over abstract elements such as mood, pose, material, or artistic style.
A recent research paper, “DreamOmni2: Multimodal Instruction-Based Editing and Generation,” tackles this problem directly. The researchers introduce DreamOmni2, a unified framework that understands instructions from both text and reference images. This “multimodal” approach lets users guide AI image editing and generation with extraordinary precision—beyond swapping objects—to include abstract attributes like texture, lighting, composition, and overall style.
Figure 1: DreamOmni2 can handle a broad range of creative tasks, enabling users to combine concepts from multiple images to edit or generate new ones.
This article explores how DreamOmni2 works—how it creates its multimodal dataset, how the framework learns to handle multiple reference images, and how it performs compared to leading open-source and commercial models.
The Limits of Words and Pictures Alone
To understand what DreamOmni2 adds, let’s first revisit the two dominant techniques it improves upon:
Instruction-based Editing – Models like InstructPix2Pix transform a photo using text commands such as “turn the apple into an orange” (a minimal usage sketch follows this list). This works well for simple edits but struggles with complex visual descriptions—how do you precisely describe a dress’s intricate pattern or a painter’s brushstroke quality using only words?
Subject-driven Generation – Models such as DreamBooth or IP-Adapter learn subjects from one or more photographs, allowing you to create new images featuring that subject (like your dog on the moon). However, these methods focus on tangible subjects rather than abstract traits like style, pose, or lighting conditions.
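For the first technique, a minimal usage sketch with the Hugging Face diffusers implementation of InstructPix2Pix looks like this; the model ID and parameter values are typical defaults, so check the documentation for your diffusers version:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load a public InstructPix2Pix checkpoint (a GPU is assumed for reasonable speed).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("apple.jpg").convert("RGB")

# A purely textual edit: there is no way to point at a reference texture or style.
edited = pipe(
    "turn the apple into an orange",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay faithful to the input photo
).images[0]
edited.save("orange.jpg")
```

The only lever here is the text string, which is exactly the gap DreamOmni2 aims to close.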
DreamOmni2 bridges this gap with two new multimodal tasks:
- Multimodal Instruction-based Editing: Modify an image using both text and visual references. Example: “Make the bag in the first image have the same leather texture as the jacket in the second.”
- Multimodal Instruction-based Generation: Create a new image guided by a prompt and multiple reference images. Example: “Generate a person with the pose from image 1, wearing the outfit from image 2, under the lighting of image 3.”
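To make the two task formats concrete, here is a minimal sketch of how such a request might be represented in code; the class and field names are hypothetical illustrations, not part of DreamOmni2’s actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MultimodalInstruction:
    """Hypothetical container for one DreamOmni2-style request.

    `source_image` is present for editing tasks and absent for pure generation;
    `reference_images` are indexed so the text can refer to "image 1", "image 2", etc.
    """
    text: str                                                   # natural-language instruction
    reference_images: List[str] = field(default_factory=list)   # paths or URLs
    source_image: Optional[str] = None                          # image to edit (editing only)

# Editing: borrow a texture from a reference image.
edit_request = MultimodalInstruction(
    text="Make the bag in the source image have the same leather "
         "texture as the jacket in image 1.",
    reference_images=["jacket.png"],
    source_image="bag.png",
)

# Generation: combine pose, outfit, and lighting from three references.
gen_request = MultimodalInstruction(
    text="Generate a person with the pose from image 1, wearing the "
         "outfit from image 2, under the lighting of image 3.",
    reference_images=["pose.png", "outfit.png", "lighting.png"],
)
```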
But to train a model to do this, the researchers faced a major challenge: how do you build a dataset containing source images, reference images, textual instructions, and target results—when no such datasets exist?
Building the DreamOmni2 Data Factory
The team designed a clever three-stage data synthesis pipeline to construct the enormous, high-quality dataset needed for multimodal editing and generation.
Figure 2: DreamOmni2’s data factory operates in three stages. Each builds upon the previous one to create complex multimodal training examples.
Stage 1: Extracting Concrete and Abstract Concepts with Feature Mixing
First, they needed to train a model to “extract” elements such as objects or abstract attributes (texture, style, pose) from an image.
Previous approaches used a diptych method—generating two images side-by-side—but this reduced resolution and caused artifacts along the dividing line. DreamOmni2 introduces a feature mixing scheme, mixing attention features between two generation branches. This process encourages the model to produce paired images that share a selected attribute while remaining distinct—for instance, a source image of a cat and a target image of a dog both rendered in “Van Gogh style.”
Feature mixing produces full-resolution, clean image pairs that accurately convey shared attributes. These paired samples train an extraction model to recognize and reproduce both concrete objects and abstract properties.
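As a rough illustration of the idea, the sketch below mixes attention keys and values across two generation branches so a shared attribute can propagate between them; the actual mixing rule used in the paper may differ from this minimal version.

```python
import torch

def mixed_attention(q_a, k_a, v_a, q_b, k_b, v_b):
    """Cross-branch feature mixing in self-attention (illustrative only).

    Two generation branches, A and B, each attend over the keys and values of
    *both* branches, so a shared attribute (say, "Van Gogh style") can propagate
    between the pair while each image keeps its own content.
    All tensors have shape (batch, heads, tokens, head_dim).
    """
    k_shared = torch.cat([k_a, k_b], dim=2)  # pool keys from both branches
    v_shared = torch.cat([v_a, v_b], dim=2)  # pool values from both branches
    out_a = torch.nn.functional.scaled_dot_product_attention(q_a, k_shared, v_shared)
    out_b = torch.nn.functional.scaled_dot_product_attention(q_b, k_shared, v_shared)
    return out_a, out_b
```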
Stage 2: Synthesizing Multimodal Editing Data
Once trained, the extraction model powers the creation of multimodal editing data:
- Start with a Target Image: Choose a final image, like “a plush toy dog on a beach.”
- Create a Reference Image: Use the extraction model to isolate the attribute—e.g., extract a “plush material” texture.
- Create a Source Image: Apply an instruction-based editing model to alter the concept (turn the plush dog into a metal one).
- Generate the Instruction: Use a large language model (LLM) to write a natural instruction describing the edit, such as “Make the dog in the image have the same plush material as in the reference.”
Each tuple—source image, reference image, text instruction, and target image—becomes a single training example.
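Put together, Stage 2 might look roughly like the following sketch, where `extraction_model`, `editing_model`, and `llm` are placeholder interfaces standing in for the components described above, not the paper’s actual APIs.

```python
def build_editing_sample(target_image, attribute, extraction_model,
                         editing_model, llm):
    """Return one (source, reference, instruction, target) training tuple."""
    # 1. Isolate the attribute (e.g. "plush material") into a reference image.
    reference_image = extraction_model.extract(target_image, attribute)

    # 2. Alter that attribute in the target to obtain the source image,
    #    e.g. turning the plush dog into a metal one.
    source_image = editing_model.edit(
        target_image,
        instruction=f"replace the {attribute} with a different one",
    )

    # 3. Ask a language model to phrase the edit as a natural instruction,
    #    e.g. "Make the dog have the same plush material as in the reference."
    instruction = llm.write_instruction(source_image, reference_image, attribute)

    return source_image, reference_image, instruction, target_image
```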
Stage 3: Synthesizing Multimodal Generation Data
The third stage extends this setup to generation. Starting from the Stage 2 examples, the extraction model produces multiple reference images, each capturing a different attribute—an object, a scene, a texture, a lighting condition. For instance, given references of a “metal dog,” “beach,” and “plush texture,” the model learns to generate a “plush dog on a beach.”
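Stage 3 can be sketched in the same spirit, again with hypothetical interfaces: attributes are extracted from the Stage 2 images into separate references, and a language model writes the prompt that combines them into the target.

```python
def build_generation_sample(stage2_sample, attributes, extraction_model, llm):
    """Turn one Stage 2 tuple into a multi-reference generation example."""
    source_image, reference_image, _, target_image = stage2_sample

    # e.g. "metal dog" extracted from the source, "beach" from the target,
    # plus the existing "plush texture" reference.
    references = [
        extraction_model.extract(source_image, attributes["object"]),
        extraction_model.extract(target_image, attributes["scene"]),
        reference_image,
    ]
    # e.g. "Generate the dog from image 1 on the beach from image 2,
    #       made of the plush material from image 3."
    prompt = llm.write_prompt(references, target_image)
    return references, prompt, target_image
```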
The resulting dataset covers both concrete objects (people, pets, furniture, accessories) and abstract attributes, divided into local attributes (hairstyle, material, facial expression) and global attributes (style, lighting, color).
Figure 3: DreamOmni2’s dataset spans a diverse mix of concrete objects and abstract attributes, creating a versatile training base for both editing and generation.
The DreamOmni2 Framework
Creating the dataset solved one problem, but enabling the model to process multiple image inputs required rethinking the architecture. Standard Diffusion Transformer (DiT) models struggle to distinguish which part of an instruction refers to which image. DreamOmni2 introduces two critical innovations.
1. Index Encoding and Position Encoding Shift
When users refer to “image 1” or “image 2” in prompts, the model needs clear indexing. DreamOmni2 introduces Index Encoding, adding a tag to identify each image’s position in the input set. This helps the model interpret complex multimodal instructions accurately.
However, this alone can lead to “copy-paste” artifacts between images—where the model confuses spatial positions. DreamOmni2 offsets the position encoding for each reference image, ensuring the model treats them as distinct visual spaces. Together, these encodings eliminate cross-image blending and preserve clean, context-aware composition.
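A compact sketch of how these two mechanisms might be wired up in PyTorch is shown below; the embedding table, the stride value, and the way the shifted positions feed into the DiT’s positional encoding are all assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class ReferenceImageEncoder(nn.Module):
    """Minimal sketch of index encoding plus a position-encoding shift.

    Each reference image's tokens receive (1) a learned index embedding that
    marks them as "image k", and (2) positional indices offset by a large
    per-image stride so that different references never share positions.
    """
    def __init__(self, dim, max_refs=8, pos_stride=4096):
        super().__init__()
        self.index_embed = nn.Embedding(max_refs, dim)  # "image 1", "image 2", ...
        self.pos_stride = pos_stride

    def forward(self, ref_tokens):
        # ref_tokens: list of tensors, each (num_tokens, dim), one per reference image
        tokens, positions = [], []
        for k, t in enumerate(ref_tokens):
            idx = torch.tensor(k, device=t.device)
            t = t + self.index_embed(idx)                              # index encoding
            pos = torch.arange(t.shape[0], device=t.device) \
                  + (k + 1) * self.pos_stride                          # shifted positions
            tokens.append(t)
            positions.append(pos)
        return torch.cat(tokens, dim=0), torch.cat(positions, dim=0)
```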
2. Joint Training with a Vision-Language Model (VLM)
Real-world user instructions can be ambiguous or inconsistent. To make the system robust, the authors jointly train DreamOmni2 with a Vision-Language Model (VLM), specifically Qwen2.5-VL. The VLM interprets messy, natural-language prompts and reformats them into precise inputs the generation model can understand.
This joint training drastically improves the system’s comprehension of complex, multimodal user intents, yielding more accurate edits and generations.
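At inference time, the flow might look like the following sketch; the function names and the rewrite template are illustrative assumptions, not the paper’s actual interface to Qwen2.5-VL or the generation model.

```python
# Hypothetical two-step inference flow: the VLM normalizes the user's messy
# instruction before the generation model sees it.

REWRITE_TEMPLATE = (
    "Rewrite the user's request as a precise editing instruction. "
    "Refer to the reference images as 'image 1', 'image 2', and so on. "
    "User request: {raw_prompt}"
)

def run_pipeline(raw_prompt, source_image, reference_images, vlm, generator):
    # 1. The VLM looks at all images plus the raw prompt and emits a
    #    normalized instruction in the format the generator was trained on.
    clean_prompt = vlm.generate(
        images=[source_image, *reference_images],
        text=REWRITE_TEMPLATE.format(raw_prompt=raw_prompt),
    )
    # 2. The generation model performs the edit with the cleaned-up prompt.
    return generator.edit(source_image, reference_images, clean_prompt)
```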
Evaluating DreamOmni2
Existing benchmarks like DreamBooth and OmniContext don’t test the new multimodal scenarios DreamOmni2 supports. The researchers therefore created the DreamOmni2 Benchmark, comprising real-world images and test cases covering both editing and generation of abstract and concrete concepts.
Table 1: DreamOmni2 fills a key gap in evaluation benchmarks, uniquely supporting multimodal editing and generation with multiple references and abstract attributes.
Multimodal Instruction-Based Editing Performance
DreamOmni2 was tested against several models, including open-source systems (DreamO, OmniGen2, Qwen-Image-Edit) and closed-source commercial ones (GPT-4o, Nano Banana).
Figure 4: DreamOmni2 demonstrates clean, accurate edits and strong adherence to multimodal instructions compared to competing methods.
Table 2: Quantitative comparisons for editing show DreamOmni2 outperforming all open-source models and matching or exceeding commercial systems in human evaluations.
Human evaluators found DreamOmni2 produced the most precise edits, outperforming even GPT-4o and Nano Banana. Other models often introduced unintended alterations or color tints—issues easily caught by humans but missed by automated scoring systems.
Multimodal Instruction-Based Generation Performance
The story is similar for generation tasks. DreamOmni2 outperformed open-source models and even matched the visual quality of GPT-4o while exceeding Nano Banana’s consistency.
Figure 5: In generation tasks, DreamOmni2 successfully merges elements from multiple images, maintaining stylistic coherence and precision.
Table 3: DreamOmni2 achieves generation quality on par with GPT-4o and surpasses other models across every evaluation metric.
Why These Innovations Matter: Ablation Studies
To validate the importance of each architectural element, the team conducted systematic ablation tests, removing one component at a time.
- Joint VLM Training: As shown below, jointly training the VLM and generation model (Scheme 4) achieves far better results than running them separately, proving the benefit of tight coupling.
- Encoding Schemes: Including both Index Encoding and Position Encoding Shift yields the best multimodal handling performance.
Table 5: Ablation results confirm that both index and position encoding are essential for optimal performance when processing multiple reference images.
Conclusion: Toward Truly Intuitive AI Creation
DreamOmni2 marks a significant leap in making generative AI a more expressive and intuitive tool. By integrating both text and image as instructions, it moves beyond vague word-based interaction and enables precise visual guidance.
The model’s dual innovations—a robust three-stage data synthesis pipeline and an architecture built for multi-image understanding—allow it to handle concepts ranging from concrete subjects to subtle abstract qualities like lighting or texture.
With DreamOmni2, we move closer to an era of creativity where you can simply tell an AI, “Make it look like this,” and it truly understands what you mean.