Introduction
We are living in the golden age of generative AI. With the advent of diffusion models, we can conjure vivid worlds from a single sentence. But as the technology matures, the focus is shifting from simple generation (creating an image from scratch) to editing (modifying an existing image).
Imagine you have a photograph of a living room and you want to “add a vase to the table” or “change the dog into a cat.” It sounds simple, but evaluating whether an AI model has done this job well is notoriously difficult.
Why? Because image editing is subjective. If you ask a model to “make her smile,” there are infinite ways to execute that command. Furthermore, a good edit isn’t just about following the instruction; it’s about what you don’t change. You want the person’s identity, the background lighting, and the surrounding context to remain exactly the same.

As shown in Figure 1, both edited images technically satisfy the prompt “Make her smile.” However, the image on the right changes the woman’s shirt, hair texture, and background details. To a human observer, the left image is clearly the superior edit because it maintains consistency.
Until recently, the only reliable way to measure this was to ask humans. But human evaluation is slow, expensive, and unscalable.
In this post, we will dive deep into HATIE (Human-Aligned Text-guided Image Editing), a new benchmark proposed by researchers at Seoul National University. They have developed a comprehensive framework that automates the evaluation of image editing models, using a massive dataset and a suite of metrics that align surprisingly well with human perception.
The Problem: The Lack of a “Golden” Standard
In traditional machine learning tasks, like classifying an image as a “cat” or a “dog,” we have a clear ground truth: the prediction is either right or wrong.
Text-guided image editing is different. There is no single “correct” pixel-perfect output for the prompt “change the background to a beach.” This lack of a golden ground truth forces researchers to rely on proxy metrics.
Commonly, researchers use CLIP (Contrastive Language-Image Pre-training) to measure how well the image matches the text. However, CLIP only measures semantic similarity; it doesn’t care if the person’s face was distorted or if the background was accidentally recolored.
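To make this concrete, here is a minimal sketch of the kind of CLIP-based alignment score such metrics rely on, written against the Hugging Face `transformers` CLIP interface. The checkpoint name and the scoring function are illustrative choices, not HATIE’s exact implementation.

```python
# Minimal sketch of a CLIP-based text-image alignment score.
# Model choice and usage are illustrative, not HATIE's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the image and the caption in CLIP embedding space."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# A high score means the edited image matches the prompt semantically --
# but, as noted above, it says nothing about what else was changed.
score = clip_alignment(Image.open("edited.png"), "a red car parked on a beach")
```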
Previous attempts to create benchmarks have been limited in scale or scope. As seen in the comparison table below, prior benchmarks typically contained only a few hundred images or lacked fully automated evaluation pipelines.

HATIE aims to solve this by scaling up significantly—providing over 18,000 images and nearly 50,000 editing queries—and, crucially, by automating the “human” element of evaluation.
The HATIE Framework
The core contribution of this research is a holistic framework that mimics how a human evaluates an image. A human doesn’t just look at one aspect; they look at the whole picture. Did the object change correctly? Did the background stay the same? Is the image realistic?
The researchers designed HATIE to replicate this multi-faceted assessment.

As illustrated in Figure 2, the workflow is cyclical. It starts with a curated dataset and specific editing queries. These are fed into an editing model. The output is then rigorously analyzed across five specific criteria, which are finally aggregated into a total score based on weights learned from actual human feedback.
Let’s break down the three main components of this engine: the Dataset, the Query Generation, and the Evaluation Metrics.
1. The Dataset
You cannot evaluate editing without high-quality source images. The researchers built on the GQA dataset, originally created for Visual Question Answering. GQA is unique because it doesn’t just provide images; it also provides scene graphs.
A scene graph is a structured representation of the image: “Object A (person) is holding Object B (cup) standing in front of Object C (wall).”
This rich metadata is essential. If you want to automate editing queries, you need to know what is actually in the image. The researchers filtered this dataset to ensure quality, removing objects that were too small, occluded, or ambiguous.
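As a rough illustration, here is what a GQA-style scene graph might look like and how a simple size filter could prune unusable objects. The field names and the 1% area threshold are assumptions made for this sketch, not the paper’s actual filtering rules.

```python
# Hypothetical, GQA-flavoured scene-graph snippet and a simple quality filter.
# Field names and the size threshold are illustrative assumptions.
scene_graph = {
    "image_id": "2370799",
    "width": 640, "height": 480,
    "objects": {
        "obj1": {"name": "person", "x": 120, "y": 80,  "w": 140, "h": 300,
                 "attributes": ["standing"], "relations": [("holding", "obj2")]},
        "obj2": {"name": "cup",    "x": 200, "y": 180, "w": 18,  "h": 24,
                 "attributes": ["white"],    "relations": []},
    },
}

MIN_AREA_FRAC = 0.01  # drop objects covering less than 1% of the image (assumed threshold)

def usable_objects(graph: dict) -> list[str]:
    """Keep only objects large enough to be edited and evaluated reliably."""
    img_area = graph["width"] * graph["height"]
    return [oid for oid, obj in graph["objects"].items()
            if obj["w"] * obj["h"] / img_area >= MIN_AREA_FRAC]

print(usable_objects(scene_graph))  # the tiny cup is filtered out, the person is kept
```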

The resulting dataset is highly balanced, covering 76 object classes ranging from people and animals to vehicles and household items (Figure 3). This ensures that a model isn’t just tested on how well it edits “dogs,” but how well it handles a diverse range of real-world scenarios.
2. Automating “Feasible” Queries
One of the cleverest parts of HATIE is how it generates the text prompts (queries) for editing. Randomly generating prompts can lead to nonsense, like asking a model to “put a car on the kitchen table.”
The researchers categorized editing tasks into two main buckets:
- Object-Centric: Modifications to specific items (Addition, Removal, Replacement, Attribute Change, Resizing).
- Non-Object-Centric: Global changes (Background Change, Style Change).
To ensure feasibility, they used the statistical data from the scene graphs. For an Object Addition task, the system looks at what objects usually appear together. If the scene is a living room, the system suggests adding a “cushion” to the “sofa” because those objects frequently co-occur in the data.
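A minimal sketch of this co-occurrence filter is below. The counts and cutoff are invented for illustration; HATIE derives its statistics from the full GQA scene-graph corpus.

```python
# Sketch of feasibility filtering for Object Addition via co-occurrence counts.
from collections import Counter

# Hypothetical counts of how often an object class appears in scenes that
# already contain a "sofa".
cooccurrence_with_sofa = Counter({
    "cushion": 1830, "lamp": 920, "coffee table": 760, "dog": 310, "car": 4,
})

MIN_COUNT = 100  # assumed cutoff for "plausible in this scene"

def feasible_additions(counts: Counter, min_count: int = MIN_COUNT) -> list[str]:
    """Objects that co-occur often enough with the anchor object to make a sensible query."""
    return [name for name, c in counts.most_common() if c >= min_count]

print(feasible_additions(cooccurrence_with_sofa))
# ['cushion', 'lamp', 'coffee table', 'dog'] -- "car" is rejected as implausible
```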

They also utilized Large Language Models (LLMs) like Llama 3 and GPT-4 to turn these structured intents into natural language captions. This allows HATIE to support both Description-based models (which need a “before” caption and an “after” caption) and Instruction-based models (which just take a command like “swap the cat for a dog”).
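The snippet below is only a toy illustration of the two query formats; the real pipeline uses LLMs to produce far more natural and varied phrasings than these templates.

```python
# Illustrative templates only: HATIE uses LLMs (e.g. Llama 3 / GPT-4) to produce
# natural-language queries, but the two output formats look roughly like this.
def build_queries(source_caption: str, obj: str, new_obj: str) -> dict:
    """Turn one structured replacement intent into both query formats."""
    return {
        # Description-based models (e.g. Prompt-to-Prompt) need a before/after caption pair.
        "description": {
            "source": source_caption,
            "target": source_caption.replace(obj, new_obj),
        },
        # Instruction-based models (e.g. InstructPix2Pix) take a single command.
        "instruction": f"Replace the {obj} with a {new_obj}.",
    }

print(build_queries("a cat sleeping on a sofa", "cat", "dog"))
```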
3. The 5 Pillars of Evaluation
This is the heart of the paper. How do we mathematically quantify “good editing”? The researchers propose that a good edit must satisfy three broad criteria: Quality, Fidelity (did it do what I asked?), and Consistency (did it leave everything else alone?).
They break this down further into 5 distinct scores.
I. Image Quality (IQ)
First, the image must look real. If the edit introduces artifacts or noise, it fails. The researchers measure this using Fréchet Inception Distance (FID), a standard metric in generative AI that compares the distribution of generated images to real images.

They normalize this score so that 1.0 represents perfect quality.
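The paper’s exact normalization isn’t reproduced here, but any monotone mapping from FID (lower is better) to a [0, 1] score (higher is better) captures the idea. The exponential mapping and scale constant below are assumptions.

```python
# One simple, assumed way to map FID onto a [0, 1] quality score with 1.0 = "perfect".
import math

def image_quality_score(fid: float, scale: float = 50.0) -> float:
    """Map FID (lower is better) to a quality score in (0, 1], higher is better."""
    return math.exp(-fid / scale)

print(image_quality_score(0.0))   # 1.0  -> indistinguishable from real images
print(image_quality_score(25.0))  # ~0.61
```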
II. Object Fidelity (OF)
Did the model successfully edit the target object? If the prompt was “change the car to red,” is the car actually red?
To measure this, HATIE uses a combination of:
- CLIP Alignment: Checks if the visual content of the object matches the text description.
- Detection Confidence: Can an object detection model still find the object? (If you tried to turn a car red but turned it into a red blob, detection would fail).
- Size Scoring: For resizing tasks, did the object actually get bigger or smaller?
These sub-scores are combined using weights (\(w\)) that we will discuss later; a minimal sketch of the combination is shown below.
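Here is that weighted combination as a sketch, assuming each sub-score already lies in [0, 1]. The weight values are placeholders; HATIE fits them to human preference data, as discussed later.

```python
# Minimal sketch of blending sub-scores into one Object Fidelity score.
# The weight values are placeholders, not the paper's fitted weights.
def object_fidelity(clip_align: float, det_conf: float, size_score: float,
                    w: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of the three sub-scores, each assumed to lie in [0, 1]."""
    return w[0] * clip_align + w[1] * det_conf + w[2] * size_score

print(object_fidelity(clip_align=0.71, det_conf=0.88, size_score=1.0))  # 0.819
```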

III. Background Fidelity (BF)
This applies when the user explicitly asks to change the background (e.g., “change the background to a library”). It uses CLIP alignment on the background region of the image to ensure the new setting matches the text prompt.
IV. Object Consistency (OC)
This is critical. If you “change the car to red,” the car’s shape, model, and orientation should remain identical. Only the color should change.
HATIE measures this by comparing the object in the original image and the edited image using three powerful perceptual metrics (a sketch follows the list):
- LPIPS: Measures perceptual similarity (how humans see differences).
- DINO: A vision transformer that is great at understanding semantic structure.
- L2 Distance: Pixel-level differences.
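Here is a hedged sketch of how those three measurements could be computed on matched object crops, using the `lpips` package and the DINO ViT-S/16 model from `torch.hub`. The model choices, preprocessing, and crop size are assumptions, not the paper’s exact configuration.

```python
# Sketch of the three consistency measurements on matched object crops from the
# original and edited images. Model choices and preprocessing are illustrative.
import torch
import lpips  # pip install lpips
from torchvision import transforms

lpips_fn = lpips.LPIPS(net="alex")
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
imagenet_norm = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

def consistency_terms(orig: torch.Tensor, edit: torch.Tensor) -> dict:
    """orig/edit: (1, 3, 224, 224) crops of the same object, scaled to [-1, 1]."""
    with torch.no_grad():
        d_lpips = lpips_fn(orig, edit).item()                    # perceptual distance
        f_o = dino(imagenet_norm((orig + 1) / 2))                # DINO features of each crop
        f_e = dino(imagenet_norm((edit + 1) / 2))
        d_dino = 1 - torch.cosine_similarity(f_o, f_e).item()    # semantic-structure distance
        d_l2 = torch.mean((orig - edit) ** 2).item()             # raw pixel distance
    return {"lpips": d_lpips, "dino": d_dino, "l2": d_l2}
# All three distances stay near zero when only the requested attribute changed.
```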

V. Background Consistency (BC)
If the edit is focused on an object, the background must not change. HATIE masks out the object and compares the background of the original and edited images using the same LPIPS, DINO, and L2 metrics.
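A minimal sketch of the masking idea, assuming a boolean object mask is available. For brevity it shows only the pixel-level (L2) comparison; the full pipeline applies the same LPIPS/DINO/L2 trio as above.

```python
# Minimal sketch of Background Consistency: mask the edited object out and
# compare only the remaining pixels.
import numpy as np

def background_l2(orig: np.ndarray, edit: np.ndarray, obj_mask: np.ndarray) -> float:
    """orig/edit: (H, W, 3) float arrays in [0, 1]; obj_mask: (H, W) bool, True on the object."""
    bg = ~obj_mask                       # keep only background pixels
    diff = (orig[bg] - edit[bg]) ** 2
    return float(diff.mean())            # 0.0 means the background is pixel-identical
```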
Task-Specific Workflows
Crucially, not all edits use all metrics. For an Object Removal task, measuring “Object Consistency” makes no sense (the object should be gone!). The framework dynamically adjusts the evaluation workflow based on the task type.
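Conceptually, this routing is just a lookup from edit type to the scores that make sense for it. The mapping below is illustrative only, not a transcription of the paper’s actual workflow.

```python
# Illustrative mapping from edit type to the scores that are actually computed.
# The exact routing in HATIE is richer (see Figure 5); the point is simply that
# metrics which are undefined for a given task are skipped.
EVAL_PLAN = {
    "object_removal":    ["IQ", "BC"],              # the object should be gone entirely
    "object_addition":   ["IQ", "OF", "BC"],        # no original object to stay consistent with
    "attribute_change":  ["IQ", "OF", "OC", "BC"],  # change one property, preserve the rest
    "background_change": ["IQ", "BF", "OC"],        # the background is supposed to change
}

def metrics_for(task: str) -> list[str]:
    return EVAL_PLAN[task]

print(metrics_for("object_removal"))  # Object Consistency is never computed here
```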

Figure 5 visualizes this complexity. Notice how different tasks (like Object Addition vs. Attribute Change) trigger different chains of detection, cropping, and measurement.
Aligning with Human Perception
We have all these math equations, but how do we know they matter? The researchers didn’t just guess the weights for these formulas. They conducted a massive User Study.
They showed thousands of image pairs to human participants and asked them to judge the edits based on specific criteria (e.g., “Which image better preserves the background?”).
They then used the data from these human choices to “tune” the weights in their equations. They optimized the weights so that the HATIE score would predict the human choice as often as possible.
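One standard way to implement this kind of pairwise tuning is logistic regression on sub-score differences, sketched below with synthetic data. This is an assumption about the mechanics; the paper’s optimization details may differ.

```python
# Sketch of weight fitting from pairwise human preferences. For each pair of
# edits (A, B) of the same query, the feature vector is the difference of their
# sub-scores and the label records which one the annotator preferred.
# The data below is synthetic; HATIE fits its weights on large-scale user-study responses.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [clip_align, det_conf, size_score] differences (A minus B).
diffs = np.array([
    [ 0.20,  0.05,  0.00],   # A follows the prompt clearly better ...
    [-0.15, -0.10,  0.00],   # ... while B does here ...
    [ 0.05,  0.04,  0.10],
    [-0.30, -0.03, -0.05],
])
prefers_A = np.array([1, 0, 1, 0])  # 1 if the human picked A

clf = LogisticRegression(fit_intercept=False).fit(diffs, prefers_A)
weights = clf.coef_[0] / clf.coef_[0].sum()   # normalize so the weights sum to 1
print(weights)  # these become the w's used in the weighted sums above
```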

The result? A set of automated metrics that correlate highly with human judgment.

As shown in Figure 8, the correlation (r-value) between HATIE scores and Human User scores is remarkably high, particularly for Object Consistency (0.98) and Object Fidelity (0.94). This validates that HATIE is a reliable proxy for human evaluation.
Experimental Results
The researchers utilized HATIE to benchmark several state-of-the-art models, including Prompt-to-Prompt (P2P), Imagic, and InstructPix2Pix.
The Fidelity-Consistency Trade-off
One of the most interesting findings from the benchmark is the inherent tension between making a change (Fidelity) and keeping the image the same (Consistency).
Most editing models have a “strength” parameter (often denoted as \(\tau\) or \(s_T\)) that controls how aggressive the edit is.

Figure 6 illustrates this trade-off perfectly for the Prompt-to-Prompt model (a toy parameter sweep is sketched after the list below).
- As \(\tau\) increases (weaker edit), the Consistency scores (green lines) go up—the image looks more like the original.
- However, the Fidelity scores (blue lines) drop—the model fails to make the requested change.
- The Total Score (red line) peaks in the middle, identifying the “sweet spot” where the model balances the edit with preservation.
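The following toy sweep mimics how such a sweet spot can be located: the fidelity and consistency curves are synthetic stand-ins for the quantities plotted in Figure 6, not outputs of a real editing model.

```python
# Toy sweep over an edit-strength parameter to locate the fidelity/consistency sweet spot.
import numpy as np

taus = np.linspace(0.0, 1.0, 11)
fidelity    = 1.0 - taus**2          # strong edits follow the prompt well, weak ones don't
consistency = taus**0.5              # weak edits preserve the original image better

total = 0.5 * fidelity + 0.5 * consistency   # equal weights, purely for illustration
best = taus[int(np.argmax(total))]
print(f"best tau = {best:.1f}, total score = {total.max():.3f}")  # peaks in the middle
```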
Visualizing the Metrics
Let’s look at what these scores represent visually using InstructPix2Pix.

In Figure 7, look at the top row (Motorcycle). As the edit strength (\(s_T\)) increases from 2.5 to 12.5:
- The red motorcycle successfully turns cream-colored.
- Object Fidelity rises from 0.623 to 0.676.
However, look at the third row (Person on cart -> Ancient Ruins). As the strength increases, the background successfully changes to ruins, but the person and horse (Object Consistency) start to degrade and distort. The metrics capture this degradation precisely, with Object Consistency dropping from 0.985 to 0.624.
The Leaderboard
So, which model is the best? The researchers tested both description-based models (requiring input/output text) and instruction-based models.

According to Table 4:
- MagicBrush performs exceptionally well among instruction-based models, achieving the highest Total Score (0.7329) and Image Quality.
- Prompt-to-Prompt (P2P) is a strong all-rounder among description-based models, balancing fidelity and consistency well.
- Some models, like Imagic, excel at preserving consistency (high Background Consistency scores) but sometimes struggle to implement strong changes (lower Fidelity).
The benchmark also breaks down performance by edit type.

The radar charts in Figure 9 (labeled Figure IX in the deck) reveal specific strengths. For example, Imagic (blue line, left chart) is very strong at Style Changes but weaker at Object Replacement. MagicBrush (green line, right chart) dominates almost every category for instruction-based models.
Visualizing Success and Failure
Finally, it is helpful to look at qualitative examples to see how the metrics align with visual output.

In the qualitative comparison figure (labeled Figure VII in the deck), we can see how different models handle the same prompt.
- Row 1 (Sports Ball -> Knife): The model IP2P creates a very realistic knife, reflected in a high Object Fidelity score (0.869).
- Row 3 (Background -> Library): Notice how difficult this is. Changing the water background to a library while keeping the people swimming is visually jarring. Most models struggle here, resulting in low consistency scores. InstDiff manages to keep the original content best (lowest change), but fails to generate a convincing library, illustrating the difficulty of the task.
Conclusion
Evaluating generative AI is no longer just about “vibes.” As these tools enter professional workflows, we need rigorous, quantifiable ways to measure their performance.
HATIE represents a significant leap forward in this domain. By combining a large-scale, high-quality dataset with a multi-faceted evaluation pipeline that mirrors human perception, it moves us away from subjective guessing and toward objective benchmarking.
Key takeaways from this paper include:
- Complexity: Image editing cannot be scored by a single number. We must separately measure Fidelity (the change) and Consistency (the preservation).
- Trade-offs: Current models still struggle to maximize both fidelity and consistency simultaneously; there is usually a cost to one when improving the other.
- Human Alignment: Automated metrics are only useful if they agree with human perception. HATIE’s weight-tuning approach ensures this alignment.
For students and researchers entering the field of computer vision, HATIE provides a robust standard. It allows for fair comparison between new methods and offers deep insights into where current models fail—paving the road for the next generation of intelligent image editors.