Introduction
Imagine you ask an AI to write a tutorial on “How to bake sourdough bread.” You don’t just want a wall of text; you want step-by-step instructions interleaved with photos of the dough rising, the scoring pattern, and the final golden loaf. Or perhaps you want an AI to generate a children’s book where the text and illustrations flow together naturally on every page.
This capability is known as interleaved text-and-image generation. While we have mastered text generation (thanks to LLMs like GPT-4) and made massive strides in image generation (Stable Diffusion, DALL-E), combining them into a single, coherent narrative stream is a frontier challenge.
However, there is a major bottleneck in this field: Evaluation.
How do you grade a multimodal story? If you use standard text metrics, you ignore the images. If you use image metrics, you ignore the story coherence. Existing benchmarks are often too simple, focusing on single-image outputs or rigid formats.
In this deep dive, we explore a new research paper titled “Holistic Evaluation for Interleaved Text-and-Image Generation.” The researchers introduce a comprehensive benchmark (InterleavedBench) and a novel evaluation framework (InterleavedEval) that uses GPT-4o as a sophisticated judge. Their work sheds light on the current state of multimodal AI and provides a roadmap for building models that can truly “show and tell.”

The Problem with Current Multimodal Evaluation
Before diving into the solution, we need to understand why evaluating these models is so difficult.
The “Apples and Oranges” Problem
In traditional NLP, we use metrics like BLEU or ROUGE to compare a machine’s output, say a translation or a summary, against a human-written reference. In Computer Vision, we use FID (Fréchet Inception Distance) to judge how realistic generated images are.
But interleaved generation is messy. The output might have text, then an image, then more text, then two images. The model needs to decide where to place images and what those images should depict based on the narrative flow. Standard metrics are “unimodal”—they only look at one side of the coin.
Limitations of Existing Benchmarks
Most existing benchmarks focus on Text-to-Image (T2I) generation. The input is a prompt (“A cat sitting on a mat”), and the output is a single image.
Interleaved generation is different. It requires:
- Arbitrary sequences: Inputs and outputs can be any mix of text and images.
- Context awareness: An image generated in step 3 must be consistent with the character introduced in step 1.
- Instruction following: The model must adhere to complex, multi-step prompts.
As shown below, previous benchmarks (left) focused on simple composition. The new InterleavedBench (right) demands a cohesive multi-step workflow, such as a cooking tutorial where visual context changes step-by-step.

Introducing InterleavedBench
To properly test modern Large Multimodal Models (LMMs), the researchers curated InterleavedBench. This is the first benchmark designed specifically for holistic evaluation of arbitrary sequences of text and images.
Dataset Composition
The benchmark consists of 815 high-quality instances divided into two subsets (a sketch of what a single instance might look like follows the list):
- Context-Based Subset: The model is given a sequence of text and images (the “context”) and must continue the sequence. This tests the model’s ability to maintain consistency.
- Context-Free Subset: The model receives only a text instruction and must generate the entire interleaved article (text + images) from scratch. This tests creativity and planning.
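To make the two subsets concrete, here is a minimal sketch of what a single benchmark instance might look like. The field names and example values are illustrative assumptions, not the dataset’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedInstance:
    """Hypothetical schema for one InterleavedBench instance (field names are assumptions)."""
    instruction: str                                          # the task prompt
    context_text: list[str] = field(default_factory=list)     # empty for the context-free subset
    context_images: list[str] = field(default_factory=list)   # image paths; empty if context-free
    use_case: str = "visual_storytelling"                     # one of the 10 use cases

# Context-based example: the model must continue a partially told story.
story = InterleavedInstance(
    instruction="Continue the story with two more steps, each accompanied by an image.",
    context_text=["Alice packed her bag for the mountain trip."],
    context_images=["alice_packing.png"],
)

# Context-free example: only a text instruction is given.
tutorial = InterleavedInstance(
    instruction="Write an illustrated tutorial on baking sourdough bread.",
)
```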
Diverse Use Cases
One of the benchmark’s strengths is its diversity. It doesn’t just look at one type of task; it covers 10 diverse use cases ranging from educational content to marketing materials.

As illustrated above, the tasks mimic real-world applications:
- Multimodal Script Generation: Creating “WikiHow” style guides.
- Visual Story Completion: Continuing a narrative about a family vacation.
- Marketing Material: Creating ads that mix copy and product shots.
- Report Generation: Summarizing data with text and charts.
The distribution of these tasks ensures that models are tested on various domains, from factual reporting to creative storytelling.

How It Compares
The table below highlights the gap InterleavedBench fills. While other benchmarks like MagicBrush or DreamBench focus on editing or single images, InterleavedBench is unique in requiring multiple output images interleaved with text based on detailed instructions.

The Metric: InterleavedEval
Having a dataset is only half the battle. You also need a way to score the results.
Human evaluation is the gold standard, but it is slow and expensive. Traditional automated metrics (like computing vector similarity) often fail to capture nuances like “is this image helpful for this specific paragraph?”
The researchers propose InterleavedEval, a reference-free metric powered by GPT-4o. The idea is to use a state-of-the-art LMM as a judge. The evaluator is given the input instruction, the model’s output, and specific criteria, and is asked to provide a score (1-5) and an explanation.
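Conceptually, each score is just one call to the judge model per evaluation aspect. The sketch below, using the OpenAI Python client, shows one way such a call could look; the prompt wording, message layout, and score parsing are assumptions for illustration, not the paper’s released implementation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are evaluating an interleaved text-and-image output.
Instruction given to the model:
{instruction}

Evaluation aspect: {aspect}. Criteria: {criteria}

Rate the output on a 1-5 scale and briefly explain your rating.
Answer exactly as: "Score: <1-5>. Explanation: <your reasoning>"
"""

def judge(instruction: str, text_blocks: list[str], image_urls: list[str],
          aspect: str, criteria: str) -> tuple[int, str]:
    """Ask GPT-4o to grade one aspect of one interleaved output (illustrative sketch)."""
    content = [{"type": "text", "text": JUDGE_TEMPLATE.format(
        instruction=instruction, aspect=aspect, criteria=criteria)}]
    # Pass the model's own output (text segments and images) to the judge for inspection.
    for block in text_blocks:
        content.append({"type": "text", "text": block})
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})

    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    ).choices[0].message.content

    match = re.search(r"Score:\s*([1-5])", reply)
    score = int(match.group(1)) if match else 0  # 0 marks an unparseable reply
    return score, reply
```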
The Five Pillars of Evaluation
To make the evaluation “holistic,” the researchers broke down quality into five distinct dimensions, listed below and distilled into a small rubric sketch after the list. This prevents a model with great text but terrible images (or vice versa) from getting a misleadingly high score.

- Text Quality: Is the text clear, grammatical, and hallucination-free?
- Perceptual Quality: Do the images look real? Are there artifacts or distortions?
- Image Coherence: Do the images look like they belong together? If “Alice” is in image 1, does the person in image 3 look like Alice? (This is notoriously hard for AI).
- Text-Image Coherence (TIC): Does the image actually illustrate the text it accompanies?
- Helpfulness: Does the overall content actually solve the user’s problem or follow the instruction?
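These five aspects translate naturally into a small rubric that can be plugged into the judge sketch above. The criteria wording below paraphrases the descriptions in this post and is not the paper’s exact prompt text.

```python
ASPECTS = {
    "Text Quality": "Is the text clear, grammatical, and free of hallucinations?",
    "Perceptual Quality": "Do the images look realistic, without artifacts or distortions?",
    "Image Coherence": "Are the images consistent with one another (same characters, style, scene)?",
    "Text-Image Coherence": "Does each image actually illustrate the text it accompanies?",
    "Helpfulness": "Does the overall output follow the instruction and solve the user's problem?",
}

# Toy output to grade (placeholders; in practice these come from the model under test).
instruction = "Write an illustrated 3-step guide to repotting a plant."
output_text_blocks = ["Step 1: Water the plant the day before.", "Step 2: ...", "Step 3: ..."]
output_image_urls = ["https://example.com/step1.png"]

scores = {
    aspect: judge(instruction, output_text_blocks, output_image_urls, aspect, criteria)[0]
    for aspect, criteria in ASPECTS.items()
}
print(scores)  # e.g. {"Text Quality": 4, "Perceptual Quality": 3, ...}
```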
Experiments and Results
The researchers tested several leading models using this new framework. They categorized models into two types:
- Integrated Models: Single neural networks designed to handle both text and images (e.g., GILL, EMU-2, MiniGPT-5).
- Pipeline Models: Systems that chain a strong LLM (like GPT-4o or Gemini) with a separate Image Generator (like DALL-E 3 or SDXL).
Key Findings
The results were revealing. Despite the hype surrounding “native” multimodal models, the pipeline approach currently dominates.

As seen in the table above:
- GPT-4o + DALL-E 3 achieved the highest scores across almost every category.
- Gemini 1.5 + SDXL followed closely behind.
- Integrated models (GILL, EMU-2) struggled significantly, particularly with “Helpfulness” and “Text Quality.”
Why do Pipeline Models Win?
The qualitative analysis suggests that pipeline models succeed because they separate the planning from the painting.
- The LLM generates the text narrative and writes descriptions (captions) for where images should go.
- The Image Generator creates visuals based on those specific captions.
Integrated models often try to do everything at once and get “confused,” leading to irrelevant text or images that don’t match the story.
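A minimal version of this “plan, then paint” pipeline might look like the sketch below. The JSON contract between the two stages and the prompt wording are assumptions for illustration, not the exact setup the paper evaluated.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def plan_article(instruction: str) -> list[dict]:
    """Step 1: the LLM writes the narrative and a caption for every image slot."""
    plan = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                f"{instruction}\n\n"
                'Return JSON: {"segments": [{"text": "...", "image_caption": "..."}]}. '
                "Use an empty image_caption for text-only segments."
            ),
        }],
    ).choices[0].message.content
    return json.loads(plan)["segments"]

def render(segments: list[dict]) -> list[dict]:
    """Step 2: a separate image generator turns each caption into an actual picture."""
    for seg in segments:
        if seg.get("image_caption"):
            img = client.images.generate(model="dall-e-3", prompt=seg["image_caption"], n=1)
            seg["image_url"] = img.data[0].url
    return segments

article = render(plan_article("Write an illustrated tutorial on baking sourdough bread."))
```

Because every caption is written with full knowledge of the narrative, the images stay on topic; however, nothing forces them to stay visually consistent with each other, which is exactly the weakness discussed next.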
The “Image Coherence” Bottleneck
Even the best models struggled with Image Coherence. Notice in the results table that Image Coherence scores are generally lower than Perceptual Quality.
It is easy for DALL-E 3 to make a beautiful image (Perceptual Quality). It is very hard for it to ensure the character in generated image #2 has the exact same clothes and facial features as the character in generated image #1. This lack of “visual memory” remains a major open problem in the field.
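One crude way to see this gap in your own outputs (an illustrative heuristic, not the paper’s metric) is to embed consecutive generated images with CLIP and compare them; a character or style that drifts between scenes tends to show up as lower cosine similarity.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pairwise_consistency(image_paths: list[str]) -> list[float]:
    """Cosine similarity between consecutive images: a rough proxy for visual coherence."""
    images = [Image.open(p) for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return [(emb[i] @ emb[i + 1]).item() for i in range(len(emb) - 1)]

# Low scores between consecutive story images often mean a character or style drifted.
print(pairwise_consistency(["scene1.png", "scene2.png", "scene3.png"]))
```

Note that CLIP captures coarse semantic similarity rather than fine-grained identity, which is part of why judging coherence well still calls for a stronger evaluator such as an LMM judge.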
Visual Case Studies
Let’s look at the actual outputs to understand these scores.

In the figure above:
- Row 1 (GILL): The prompt asks about a doe protecting her fawn. GILL generates text about a totally different topic (machine learning concepts), showing a complete failure of instruction following.
- Row 2 (EMU-2): The prompt is about removing banana stains. EMU-2 repeats “soak the fabric” but provides low-quality images.
- Row 4 (GPT-4o + DALL-E 3): The model generates a coherent story about a “Hidden Library.” The text is engaging, and the images are high quality, though the artistic style shifts slightly between images.
Performance by Task
The researchers also broke down performance by specific use cases.

The radar charts reveal that Sequential Image Editing (modifying an image step by step) is the hardest task for almost all models. It demands strict adherence to the structure of the previous image, which is difficult for pipeline models that generate each image from scratch.
Is the Metric Reliable? (Meta-Evaluation)
How do we know InterleavedEval (using GPT-4o) is actually accurate? The researchers compared the automated scores against human ratings.

The table above shows the Spearman correlation with human judgment.
- InterleavedEval with GPT-4o (bolded in the table): Shows the highest correlation across almost all categories, particularly Text Quality (0.72) and Helpfulness (0.57).
- Traditional Metrics (BERTScore, CLIPScore): Show very low or even near-zero correlation with human perception of quality.
This validates that using an LMM as a judge is a far superior method for evaluating these complex, open-ended tasks than relying on mathematical similarity scores.
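The meta-evaluation itself is simple to reproduce in spirit: collect a human rating and an automatic score for the same set of outputs and compute their rank correlation. A minimal sketch with SciPy, using toy numbers rather than the paper’s data:

```python
from scipy.stats import spearmanr

# Toy example: human 1-5 ratings vs. two automatic scores for the same eight outputs.
human_scores    = [5, 4, 4, 2, 3, 1, 5, 2]
judge_scores    = [4.8, 4.1, 3.9, 2.5, 3.2, 1.4, 4.6, 2.0]          # e.g. an LMM judge
baseline_scores = [0.71, 0.69, 0.70, 0.72, 0.68, 0.70, 0.71, 0.69]  # e.g. a similarity metric

rho_judge, _ = spearmanr(human_scores, judge_scores)
rho_baseline, _ = spearmanr(human_scores, baseline_scores)
print(f"LMM judge vs. human:         rho = {rho_judge:.2f}")     # high: rankings largely agree
print(f"Similarity metric vs. human: rho = {rho_baseline:.2f}")  # low: barely tracks quality
```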
Conclusion and Future Implications
The “Holistic Evaluation for Interleaved Text-and-Image Generation” paper makes a critical contribution to the multimodal AI landscape. By establishing InterleavedBench, the authors have provided the community with a rigorous testing ground that reflects real-world needs—tutorials, stories, and reports—rather than simple captioning tasks.
Furthermore, InterleavedEval proves that we can automate the grading of these complex outputs using advanced LLMs, achieving results that align closely with human intuition.
Key Takeaways for Students:
- Pipelines are currently King: If you need to build an interleaved generation application today, chaining a smart LLM with a smart Image Generator is better than using a single end-to-end model.
- Coherence is the Challenge: The next big breakthrough will likely be in “Image Coherence”—giving generative models a working memory so visual elements persist across a story.
- Metrics Matter: You cannot improve what you cannot measure. Moving away from BLEU/FID toward holistic, aspect-based evaluation is essential for progress in Generative AI.
This research highlights that while AI can now “write and draw” simultaneously, teaching it to tell a consistent, helpful, and visually cohesive story is a challenge that is just beginning to be solved.