The Devil is in the Data: How CoMM Is Fixing Multimodal AI Generation

If you’ve ever tried to get an AI to write a coherent picture book or a step-by-step tutorial with consistent illustrations, you’ve likely noticed a problem. While modern Multimodal Large Language Models (MLLMs) are great at describing a single image or generating a single picture from text, they often struggle to tell a continuous story. The characters change appearance between panels, the logic skips steps, or the text and images just don’t seem to talk to each other.

Why does this happen? The answer usually lies in the “Garbage In, Garbage Out” principle. Most models are trained on massive datasets scraped from the web that are noisy, disjointed, and lack narrative flow.

In this post, we are diving into a research paper that tackles this exact problem. The researchers introduce CoMM, a high-quality dataset designed to teach AI models how to be coherent, consistent, and logically sound when generating interleaved image-text content.

The Problem: Quantity Over Quality

To understand why CoMM is necessary, we first need to look at the limitations of existing datasets like MMC4 and OBELICS. These datasets are massive, containing billions of image-text pairs scraped from the web. However, “big” doesn’t always mean “good.”

The researchers identified three critical flaws in current datasets:

  1. Narrative Incoherence: The text doesn’t flow logically from one step to the next.
  2. Entity Inconsistency: Visual elements change randomly (e.g., a “blue sofa” in step 1 becomes a “red chair” in step 2).
  3. Data Scarcity in Image Modality: Most documents in existing datasets contain only one or two images, which isn’t enough to teach a model how to handle long-form visual sequences.

Figure 2: Distribution of image and sentence counts per document for the three datasets.

Look at the distribution of images per document in Figure 2 above. The plots for MMC4 and OBELICS (top left and top right) are heavily skewed toward the bottom-left corner, meaning most documents contain very few images (a median of 1 or 2).

In contrast, the CoMM dataset (bottom graph) shows a much healthier distribution. It has a median of 4 images per document and a significant number of documents with many more images. This density is crucial for teaching models how to maintain context over a longer sequence.
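
If you want to sanity-check this property on a corpus of your own, a minimal sketch in Python might look like the following. The dict-based document format here is purely illustrative; MMC4, OBELICS, and CoMM each use their own storage formats in practice.

```python
from statistics import median

# Illustrative in-memory format: each document carries a list of image paths.
documents = [
    {"id": "doc-1", "images": ["a.jpg"]},
    {"id": "doc-2", "images": ["b.jpg", "c.jpg", "d.jpg", "e.jpg"]},
    {"id": "doc-3", "images": ["f.jpg", "g.jpg", "h.jpg"]},
]

# Count images per document and summarize the distribution.
counts = [len(doc["images"]) for doc in documents]
print("median images per document:", median(counts))
print("share of documents with >= 3 images:", sum(c >= 3 for c in counts) / len(counts))
```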

Enter CoMM: Constructing a Better Dataset

So, how do you build a dataset that fixes these problems? The researchers didn’t just scrape more data; they curated and filtered it with extreme precision.

1. Sourcing High-Quality Raw Data

Instead of scraping random websites, the team focused on sources known for structured narratives, specifically:

  • Instructional Content: Websites like WikiHow, where coherence is mandatory (Step 1 must lead to Step 2).
  • Visual Storytelling: Platforms dedicated to stories, ensuring narrative flow.

Figure 5 below illustrates the diversity of topics covered in CoMM, ranging from gardening and cooking to technology and relationships.

Figure 5: Topic visualization of the CoMM dataset, showing diverse categories like Gardening, Cooking, and Technology.

2. The Multi-Perspective Filter Strategy

Raw data is rarely perfect. To polish it, the researchers developed a three-stage filtering pipeline using advanced AI models.

A. Text Sequence Filter

They utilized Large Language Models (LLMs) like Llama-3 to read the text of a document and score it on development and coherence. If the text was disjointed or nonsensical, the document was discarded.
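
As a rough sketch of what this stage could look like in code: the prompt wording, the 1-to-5 scale, the threshold, and the `ask_llm` helper below are all our own illustrative assumptions, standing in for whichever LLM (e.g., Llama-3) does the grading.

```python
import json

# Illustrative grading prompt; the paper's exact rubric may differ.
SCORING_PROMPT = """You are grading a multi-step document.
Rate it from 1 (worst) to 5 (best) on two axes:
- development: do the steps progress logically from one to the next?
- coherence: does the text read as one connected narrative?
Reply with JSON only, e.g. {{"development": 4, "coherence": 5}}.

Document:
{document}"""

def passes_text_filter(document_text: str, ask_llm, threshold: int = 4) -> bool:
    """Keep the document only if the LLM grades it highly on both axes.

    `ask_llm(prompt) -> str` is a hypothetical helper wrapping whatever
    LLM you use for grading.
    """
    reply = ask_llm(SCORING_PROMPT.format(document=document_text))
    scores = json.loads(reply)
    return scores["development"] >= threshold and scores["coherence"] >= threshold
```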

B. Image Sequence Filter

This is where the math gets interesting. The goal is to ensure images are visually consistent (they look like they belong together) but also show progress (they aren’t just duplicates).

To achieve this, they defined a metric using CLIP (a vision-language model) embeddings.

Equation 1: The Image Sequence Filter Metric.

Let’s break down this equation:

  • The First Term (Positive): Calculates the similarity between consecutive images (\(x_i\) and \(x_{i-1}\)). This rewards smooth transitions.
  • The Second Term (Negative): Calculates the similarity between all pairs of images. This penalizes the sequence if all images look exactly the same.

By maximizing this score, the filter selects image sequences that are visually coherent but distinct enough to tell a story.
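
To make that concrete, here is a minimal sketch of such a score computed from pre-extracted, L2-normalized CLIP image embeddings. The equal weighting of the two terms is our own assumption; the paper’s Equation 1 may weight or normalize them differently.

```python
import numpy as np

def image_sequence_score(embeddings: np.ndarray) -> float:
    """Score an image sequence from its CLIP embeddings (shape [n, d], unit norm).

    The first term rewards similarity between consecutive images (smooth
    transitions); the second penalizes the average similarity over all
    distinct pairs (near-duplicate sequences). Higher is better.
    """
    n = embeddings.shape[0]
    sim = embeddings @ embeddings.T  # cosine similarity matrix, shape [n, n]

    consecutive = np.mean([sim[i, i - 1] for i in range(1, n)])
    all_pairs = (sim.sum() - np.trace(sim)) / (n * (n - 1))

    return float(consecutive - all_pairs)

# Toy usage with random unit vectors standing in for real CLIP embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(image_sequence_score(emb))
```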

C. Image-Text Alignment Filter

Finally, they used MLLMs to ensure the images actually match the text descriptions, filtering out documents where the visuals were irrelevant to the instructions.
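
For intuition, here is one simplified proxy for that check using CLIP image-text similarity instead of a full MLLM judge. The model choice, the one-image-per-step pairing, and the 0.2 threshold are our own assumptions, not the paper’s setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP-based proxy for the alignment filter: score each (image, step text)
# pair by cosine similarity and drop documents whose average falls too low.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image_paths, step_texts):
    # Assumes one image per step text, with texts short enough for CLIP's
    # 77-token limit.
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=step_texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # Similarity of each image with its own step's text (element-wise pairing).
    return (img * txt).sum(dim=-1).mean().item()

def passes_alignment_filter(image_paths, step_texts, threshold=0.2):
    return alignment_score(image_paths, step_texts) >= threshold
```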

Training for Preferences: The DPO Edge

The researchers didn’t stop at just cleaning the data. They also created a Preference Dataset to further tune models with preference-based alignment, specifically Direct Preference Optimization (DPO).

In this setup, the “positive” sample is the original, high-quality document. To create “negative” samples (examples of what not to do), they cleverly manipulated the data:

  • Shuffled Text: Keeping images fixed but mixing up the text order.
  • Shuffled Images: Keeping text fixed but mixing up the image order.
  • Shuffled Steps: Randomizing the order of the entire image-text pairs.

This teaches the model that order and alignment matter—Step 1 must come before Step 2, and the image must match the text next to it.
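
A minimal sketch of how such preference pairs might be constructed, assuming each document is a list of (text, image) steps (a representation we choose here for illustration, not the paper’s exact data format):

```python
import random

def make_preference_pairs(steps, seed=0):
    """Build one positive and three negative variants of an interleaved document.

    `steps` is a list of (text, image) tuples in their original order. The
    three perturbations mirror the idea described above; details are
    illustrative.
    """
    rng = random.Random(seed)
    texts = [t for t, _ in steps]
    images = [img for _, img in steps]

    shuffled_texts = texts[:]
    rng.shuffle(shuffled_texts)          # mix up text order, keep images fixed
    shuffled_images = images[:]
    rng.shuffle(shuffled_images)         # mix up image order, keep texts fixed
    shuffled_steps = steps[:]
    rng.shuffle(shuffled_steps)          # randomize whole image-text pairs

    return {
        "chosen": steps,                 # original, high-quality order
        "rejected_shuffled_text": list(zip(shuffled_texts, images)),
        "rejected_shuffled_images": list(zip(texts, shuffled_images)),
        "rejected_shuffled_steps": shuffled_steps,
    }
```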

The Results: Does It Work?

The evaluation of CoMM was comprehensive, covering dataset quality, downstream task performance, and new generation benchmarks.

1. Dataset Quality Comparison

The researchers used GPT-4o and Llama-3 to “grade” the datasets on quality metrics.

Table 1: Quality comparison of interleaved image-text datasets.

As seen in Table 1, CoMM significantly outperforms MMC4 and OBELICS across all metrics:

  • Development (DLP): The narrative flow is stronger.
  • Completeness (CPL): The documents feel more finished.
  • Image-Text Alignment (ITA): The pictures actually match the words.
  • Image Sequence Quality (ImgS): The visuals are far more consistent.

2. Boosting Downstream Tasks

One of the best ways to test a dataset is to train a model on it and see if it gets smarter at other tasks. The researchers trained a baseline model (OpenFlamingo) using CoMM and compared it to versions trained on MMC4 and OBELICS.

Table 2: Performance comparison on downstream few-shot tasks.

Table 2 highlights the results on tasks like Visual Question Answering (VQA) and Image Captioning (COCO). The model trained on CoMM (bottom rows) consistently beats the baselines, especially in “few-shot” settings where the model has to learn from just a handful of examples. This proves that CoMM improves the model’s in-context learning capabilities.

3. Qualitative Analysis: The “Eye Test”

Numbers are great, but in multimodal AI, seeing is believing. Let’s look at how models trained on different datasets handle a generation request.

Figure 3: Visualization of interleaved image-text content generation comparisons.

In Figure 3, we see a comparison between models trained on MMC4 (left panels) and CoMM (right panels).

  • The Failure: When asked to generate instructions for “Cinnamon Apple Chips” or “Plastic Bottle Planters,” the MMC4-trained models often fail to produce relevant images or stop generating text entirely.
  • The Success: The CoMM-trained models generate step-by-step instructions with relevant, consistent images that actually look like a helpful guide.

We can also look at visual storytelling specifically.

Figure 10: Comparison of storytelling visualization between SEED-Llama trained on MMC4 vs CoMM.

In Figure 10, the difference in narrative style is stark. The CoMM model (right) generates a coherent story about “The Girl Who Traveled the World,” complete with distinct chapters and illustrative images. The MMC4 model (left) produces a disjointed product description that lacks narrative structure.

New Benchmarks for a New Era

Because existing benchmarks weren’t designed for this level of interleaved generation, the researchers proposed four new tasks to standardize future evaluations:

  1. Image-to-Text Sequence Generation: Given a sequence of images, write the story.
  2. Text-to-Image Sequence Generation: Given a story, generate the illustrations.
  3. Continuation: Given the first half of a document, finish it.
  4. Question-based Generation: Generate a full tutorial based on a user query (e.g., “How do I bake a cake?”).

They benchmarked top models like MiniGPT-5, SEED-Llama, and Emu2 on these tasks.

Table 3: Comparison of performance among different models on the four generation tasks.

Table 3 provides the baseline scores for these new tasks. Interestingly, different models excel at different things—MiniGPT-5 is good at text metrics (ROUGE), while Emu2 often scores higher on image quality (FID). This highlights that we still have work to do to build a truly “all-around” multimodal generator.
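
To ground those metric names, here is a hedged sketch of how one might compute them with off-the-shelf libraries (`rouge_score` for ROUGE and `torchmetrics` for FID); the paper’s exact evaluation pipeline and settings may differ.

```python
import torch
from rouge_score import rouge_scorer
from torchmetrics.image.fid import FrechetInceptionDistance

# ROUGE-L between a reference step description and a generated one.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score("slice the apples thinly", "thinly slice the apples")
print(scores["rougeL"].fmeasure)

# FID between batches of real and generated images (uint8, N x 3 x H x W).
# Random tensors stand in for decoded dataset images and model outputs.
fid = FrechetInceptionDistance(feature=64)
real = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (100, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())
```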

Finally, the researchers demonstrated the power of their Preference Dataset using Direct Preference Optimization (DPO).

Table 4: Performance results of SEED-Llama trained by DPO.

As shown in Table 4, applying DPO (the rows labeled “+ DPO”) significantly boosts performance across almost all metrics compared to the standard model. This confirms that teaching the model what not to do (via negative samples) is just as important as showing it what to do.

Conclusion

The CoMM dataset represents a shift in how we approach multimodal AI training. It moves away from the “bigger is better” mentality and embraces “cleaner and more coherent is better.”

By focusing on instructional and storytelling content, and applying rigorous filtering to ensure logical and visual consistency, CoMM provides a blueprint for the next generation of MLLMs. These models won’t just be able to recognize a cat in a photo; they’ll be able to tell you a story about the cat, show you how to care for it step-by-step, and keep the cat looking like the same cat from start to finish.

For students and researchers entering this field, the takeaway is clear: Data quality is the bottleneck. Improvements in model architecture are important, but without coherent data like CoMM, even the best algorithms will struggle to tell a convincing story.