We often hear that AI can “see.” Computer vision models can identify a dog, a car, or a person in an image with superhuman accuracy. Generative models can create photorealistic scenes from scratch. But there is a subtle, artistic layer to photography that goes beyond just identifying objects: Composition.
Composition is the art of arranging visual elements within a frame to create coherence and aesthetic appeal. It is why a photo taken by a professional looks “right,” while the same scene shot by an amateur might look cluttered or unbalanced.
Can Artificial Intelligence understand this abstract concept? Can an AI look at an image and tell you if it uses the “Rule of Thirds” or a “Diagonal” layout?
A recent CVPR paper titled “Can Machines Understand Composition?” tackles this exact question. The researchers from Beijing University of Posts and Telecommunications realized that while AI is getting smarter, its grasp of artistic composition is surprisingly weak. To fix this, they introduced a massive new dataset (PICD) and a rigorous benchmark suite.
In this deep dive, we will explore how they taught machines to categorize composition, the limitations of current top-tier models (including Multimodal LLMs), and what this means for the future of computational photography.
The Problem: AI Confuses “What” with “How”
Before we analyze the solution, we must understand the gap in current research. There are two main types of AI models used in this field:
- Specialized Models: These are neural networks designed for specific tasks like image cropping or aesthetic scoring. They turn an image into a mathematical vector (an embedding) that represents its style.
- Multimodal Large Language Models (MLLMs): These are the heavy hitters like GPT-4V or LLaVA that can discuss images in natural language.
The researchers identified a critical flaw in both: these models struggle to separate Semantics (what is in the picture) from Composition (how it is arranged). For example, if you train a model on “centered composition” and most of its training images happen to be centered flowers, the model may incorrectly learn that “flower” equals “centered.”
This issue stems largely from data. Previous datasets for composition were small, had noisy labels, or lacked diversity in scenes.

As shown in Table 1 above, prior datasets like KUPCP or CADB had limited categories and scale. The new dataset introduced in this paper, PICD, scales up to nearly 37,000 images and drastically improves label quality and scene diversity.
The Solution: The Photographic Image Composition Dataset (PICD)
To teach a machine composition, you first have to define it rigorously. The authors turned to classic art theory, specifically Kandinsky’s principles, to build a taxonomy based on two dimensions: Element Type and Arrangement.
1. The Anatomy of Composition
The researchers broke down composition into a matrix.
- The Elements (The “What”):
- Points: Small, distinct objects that grab attention.
- Lines: Slender objects or connected points that guide the eye.
- Shapes: Larger areas with boundaries.
- The Arrangements (The “How”):
- Common rules like Rule of Thirds, Centered, Diagonal, Horizontal, Vertical, Triangle, Curves (C, O, S), and more.
By crossing these elements with these arrangements, the researchers created a structured label system comprising 24 distinct composition categories.
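To make the structure concrete, here is a minimal Python sketch of how such an element-by-arrangement taxonomy could be enumerated. The element and arrangement names are paraphrased from the description above, and the plain Cartesian product is only a starting point; PICD’s official label set is the curated list of 24 categories shown in Figure 2.

```python
from itertools import product

# Simplified sketch of a PICD-style taxonomy: element types crossed with
# arrangement rules. The names below are illustrative only; the official
# 24 category codes come from the paper's Figure 2.
ELEMENTS = ["point", "line", "shape"]
ARRANGEMENTS = [
    "rule_of_thirds", "centered", "diagonal", "horizontal",
    "vertical", "triangle", "c_curve", "o_curve", "s_curve",
]

# Not every element/arrangement pair is a valid category in the paper,
# so a real label set would prune this product down to 24 entries.
candidate_labels = [f"{e}-{a}" for e, a in product(ELEMENTS, ARRANGEMENTS)]
print(len(candidate_labels), "candidate element/arrangement pairs")
```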

Figure 2 illustrates this “periodic table” of composition. The green column on the left lists the arrangements (e.g., Rule of Thirds, Diagonal), while the columns across the top indicate whether the element is a point, a line, or a shape.
For example:
- Category 1 (P-RoT): A single point placed according to the Rule of Thirds.
- Category 19 (LS-S-Cur): A line or shape forming an S-Curve (like a winding road).
This granular approach allows the dataset to be incredibly precise. It doesn’t just say “this is a good photo”; it explains the geometric logic behind it.
2. What Do These Categories Look Like?
To visualize this, look at the sample images below. This diversity helps models learn that a “Diagonal” composition isn’t just about bridges or fences—it can be a line of food, a shadow, or a human limb.

3. Ensuring Quality and Diversity
Building PICD wasn’t just about scraping the web. The team used a multi-stage pipeline:
- Collection: Aggregated images from Unsplash, Flickr, and existing datasets (COCO, OpenImages).
- Scripting: Used object detection and line detection algorithms to automatically filter candidate images (e.g., “Find images with one small object in the center”).
- Expert Voting: This is the gold standard. Five photography experts reviewed the images. An image only made it into the dataset if at least three experts agreed on its label.
Crucially, they monitored Scene Diversity. If the “Centered” category was getting too many photos of dogs, the system would force the inclusion of other scenes like landscapes or architecture.
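As a rough illustration of the expert-voting step, the sketch below implements the “at least three of five experts agree” rule described above. The helper name and the category codes in the examples are hypothetical, not the authors’ actual annotation tooling.

```python
from collections import Counter

def accept_by_expert_vote(votes, min_agreement=3):
    """Keep an image only if enough experts agree on one label.

    `votes` is a list of category labels, one per expert (five in the paper).
    Returns the winning label, or None if no label reaches the threshold.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

# Three of five experts agree, so the image is accepted with label "P-RoT".
print(accept_by_expert_vote(["P-RoT", "P-RoT", "P-RoT", "P-Cen", "L-Diag"]))
# No label reaches three votes, so the image is rejected (None).
print(accept_by_expert_vote(["P-RoT", "P-Cen", "L-Diag", "P-RoT", "P-Cen"]))
```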

Figure 3 highlights this achievement. The blue bars represent the number of images, while the sienna (orange) line tracks the number of scene types. PICD (far right) maintains high scene diversity across almost all categories compared to previous datasets, ensuring models don’t overfit to specific objects.
The Benchmark: How to Test a Machine’s “Eye”
With the dataset in hand, the authors proposed a comprehensive benchmark to test both Specialized Models and MLLMs. They designed four distinct tasks.
Task I: Composition Triplet Distinction
This is the “One of these is not like the others” test. The model is given three images:
- Image A (Category X)
- Image B (Category X)
- Image C (Category Y)
The model must identify Image C as the outlier. This tests if the model can cluster images by composition regardless of content.
Task II: Robustness to Semantic Interference
This is the trickiest test.
- Image A: A Dog in the Center.
- Image B: A Cat in the Center.
- Image C: A Dog using the Rule of Thirds.
A model relying on semantics might group A and C together because they are both dogs. A model that understands composition will correctly group A and B because they are both centered.
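For specialized models, a triplet test like Tasks I and II can be run directly on embeddings: encode the three images, then flag as the outlier the one least similar to the other two. The sketch below is a generic version of that idea, not the paper’s evaluation code.

```python
import numpy as np

def predict_outlier(embeddings):
    """Given three composition embeddings (one per image), return the index
    of the predicted outlier: the image least similar to the other two."""
    e = np.stack([v / np.linalg.norm(v) for v in embeddings])  # L2-normalize
    sim = e @ e.T                                              # cosine similarities
    # For each image, sum similarity to the other two (exclude self-similarity).
    support = sim.sum(axis=1) - np.diag(sim)
    return int(np.argmin(support))

# Toy example: images 0 and 1 share a compositional structure, image 2 differs.
triplet = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
print(predict_outlier(triplet))  # -> 2
```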
The Metric: CDA
To score these tasks, the researchers proposed a new metric called Composition Discrimination Accuracy (CDA).

While the equation looks formal, the concept is simple: CDA is the percentage of triplets in which the model correctly identifies the negative sample (the outlier). \(N\) is the number of triplets, and the indicator function returns 1 if the predicted outlier \(\widehat{neg}_i\) matches the ground-truth outlier \(neg_i\), and 0 otherwise:

\[
\mathrm{CDA} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left(\widehat{neg}_i = neg_i\right)
\]
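In code, the metric reduces to a simple accuracy over triplets. The following is a minimal sketch of that computation; the function name and toy inputs are illustrative.

```python
def composition_discrimination_accuracy(predicted, ground_truth):
    """Fraction of triplets whose outlier (negative sample) is identified correctly."""
    assert len(predicted) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# Toy example: 3 of 4 triplets answered correctly -> CDA = 0.75.
print(composition_discrimination_accuracy([2, 0, 1, 1], [2, 0, 1, 2]))
```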
Evaluation for MLLMs
Since MLLMs (like GPT-4V) can “talk,” they were given slightly different tasks, including counting elements and recognizing arrangement types via multiple-choice questions.

As seen in Figure 4, the MLLMs are asked direct questions like “Which of these three images has a composition that differs?” or “How many composition elements are there?”
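The snippet below sketches how such multiple-choice queries might be assembled and a model’s free-form reply reduced to a single option. The prompt wording and the parsing logic are assumptions made for illustration, not the benchmark’s exact protocol.

```python
# Illustrative prompt templates in the spirit of the MLLM tasks shown in
# Figure 4. The wording and parsing below are assumptions for this sketch.
TRIPLET_PROMPT = (
    "You are shown three images (1, 2, 3). "
    "Which image has a composition that differs from the other two? "
    "Answer with a single digit."
)
ARRANGEMENT_PROMPT = (
    "Which arrangement best describes this image's composition? "
    "Options: (A) rule of thirds, (B) centered, (C) diagonal, (D) horizontal. "
    "Answer with a single letter."
)

def parse_choice(reply, valid):
    """Pull the first valid option character out of a free-form model reply."""
    for ch in reply.strip().upper():
        if ch in valid:
            return ch
    return None

print(parse_choice("I think the answer is (C) diagonal.", {"A", "B", "C", "D"}))  # -> C
```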
Results: The Struggle is Real
The results of the benchmark were revealing. Despite the hype surrounding AI vision, understanding composition remains a significant hurdle.
1. Specialized Models Struggle with Diversity
The researchers tested various architectures (cropping models, aesthetic assessment models) on the PICD benchmark.

Figure 5 shows the performance (CDA) across different datasets. Notice how the scores drop for PICD (the purple bars) compared to easier datasets like KUPCP. This confirms that previous datasets were too easy or biased, giving us a false sense of security about AI capabilities. On the rigorous PICD dataset, most models hover around 0.40–0.48 accuracy, which is far from perfect.
The data in Table 2 digs deeper:

Look at the Task II (Semantic Interference) column. The scores are consistently lower than Task I. This proves that semantic interference is real. When the subject matter (e.g., a cat vs. a car) changes, the models get confused and lose track of the compositional structure.
2. A Discovery: CDA is a Valid Proxy
One technical win for the paper is the validation of their new metric, CDA. Typically, researchers use “mean Average Precision” (mAP) for retrieval tasks, but calculating mAP is computationally expensive and complex.

Figure 6 shows a strong positive correlation between the lightweight CDA metric and the heavy mAP metric. This means future researchers can use the simpler CDA metric to quickly tune their models, saving time and compute resources.
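To see what this validation looks like in practice, one could correlate per-model CDA and mAP scores directly. The snippet below does so with placeholder numbers (not the paper’s results), using `scipy.stats.pearsonr`.

```python
from scipy.stats import pearsonr

# Placeholder scores for a handful of models -- NOT the paper's numbers --
# just to show how the CDA/mAP agreement in Figure 6 could be checked.
cda_scores = [0.41, 0.44, 0.46, 0.48, 0.43]
map_scores = [0.30, 0.33, 0.36, 0.39, 0.32]

r, p_value = pearsonr(cda_scores, map_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```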
3. MLLMs Are Not Ready for Art Critics
Perhaps the most surprising result came from the Multimodal Large Language Models. You might expect these massive “brains” to handle composition easily. They did not.

Table 3 reveals the performance of models like LLaVA, InternVL, and Qwen-VL.
- Task I (Distinction): Most models scored near 0.33, which is essentially random guessing for a 3-choice question.
- Task III (Counting): They struggled to count compositional elements accurately.
- Task IV (Arrangement): They failed to reliably name the layout (e.g., Vertical vs. Horizontal).
This suggests that while MLLMs are great at describing objects, they lack a fundamental understanding of spatial relationships and geometric arrangements. They can see the “pixels,” but they don’t see the “picture.”
Conclusion: Bridging the Semantic Gap
The “Can Machines Understand Composition?” paper serves as a reality check for computer vision. It highlights a specific blind spot in current AI: the inability to decouple the content of an image from its structure.
The introduction of PICD is a major step forward. By providing a large-scale, diverse, and meticulously labeled dataset, the researchers have given the community the map it needs to navigate this territory.
Key Takeaways:
- Context Matters: Current AI models are easily distracted by what an object is, ignoring where it is placed.
- New Standard: PICD replaces smaller, noisier datasets, offering a robust ground truth for future research.
- Future Work: There is a need for new network architectures that explicitly model geometric relationships (points, lines, topology) rather than just learning patterns from pixels.
For students and researchers, this opens an exciting avenue. We don’t just need models that generate art; we need models that understand the principles of art. Only then can machines truly assist photographers in capturing that perfect shot.