Introduction
“Beauty is in the eye of the beholder.” It is a phrase we have heard a thousand times, implying that aesthetic judgment is inherently subjective. Yet, in the world of Computer Vision and Artificial Intelligence, we have spent years teaching machines to understand beauty by averaging the opinions of the masses. This approach, known as Generic Aesthetics Assessment (GAA), works well for determining if a photo is generally “high quality”—is it in focus? Is the lighting good? Is the composition standard?
But what happens when we move beyond technical quality to personal preference? One person might prefer the rugged, stark composition of a brutalist building, while another prefers the soft, chaotic colors of an impressionist garden. An AI trained on the “average” opinion often fails to predict these individual preferences. The task of modeling individual taste is called Personalized Aesthetics Assessment (PAA).
The prevailing methods for PAA have a fundamental flaw: they try to build a personalized house on a generic foundation. They pre-train models on massive datasets of “average” opinions and then try to tweak them slightly for the individual. A recent paper, “Rethinking Personalized Aesthetics Assessment: Employing Physique Aesthetics Assessment as An Exemplification,” argues that this foundation is cracked. The researchers propose a completely new paradigm—PAA+—that reimagines how machines learn subjective taste, using the complex and highly subjective domain of human physique aesthetics as a testing ground.
In this deep dive, we will explore why the old methods fail mathematically, how the PAA+ paradigm works, and the novel architecture designed to understand the 3D nuance of the human body.
The Problem: The “Voting Paradox” in AI
To understand why current AI struggles with personal taste, we have to look at how these models are trained. The prevailing PAA paradigm typically follows two stages:
- Pre-training: A model is trained on a large dataset where many people vote on images. The model learns the “average” score.
- Fine-tuning: The model is adjusted slightly using a small set of images rated by a specific user.
The researchers identified a critical issue in Stage 1, rooted in social choice theory: the Voting Paradox (or Condorcet’s paradox).
Collective Preferences vs. Individual Rationality
In a generic dataset, we aggregate the preferences of many individuals. However, collective preferences often fail to satisfy “transitivity.” Transitivity is a logical rule: if you like A better than B, and B better than C, you must like A better than C.

As illustrated in Figure 1, imagine three annotators ranking three images (A, B, and C).
- Annotator 1 prefers A > B > C.
- Annotator 2 prefers B > C > A.
- Annotator 3 prefers C > A > B.
When we aggregate these distinct, rational individual views into a “collective” view (the GAA model), we get a cycle: The group prefers A over B, B over C, and C over A. This results in A > B > C > A.
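To make the paradox concrete, here is a minimal Python sketch (our illustration, not code from the paper) that aggregates the three rankings above by majority vote and exposes the cycle:

```python
from itertools import combinations

# Each annotator's ranking, best to worst (the example from Figure 1).
rankings = [
    ["A", "B", "C"],  # Annotator 1: A > B > C
    ["B", "C", "A"],  # Annotator 2: B > C > A
    ["C", "A", "B"],  # Annotator 3: C > A > B
]

def prefers(ranking, x, y):
    """True if this annotator ranks x above y."""
    return ranking.index(x) < ranking.index(y)

# Majority vote on every pair of images.
for x, y in combinations("ABC", 2):
    votes_x = sum(prefers(r, x, y) for r in rankings)
    winner, loser = (x, y) if votes_x > len(rankings) / 2 else (y, x)
    print(f"Majority prefers {winner} over {loser}")

# Output:
#   Majority prefers A over B
#   Majority prefers C over A
#   Majority prefers B over C
# Together: A > B > C > A -- an intransitive cycle, even though each
# individual annotator's ranking is perfectly rational.
```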
This cycle is irrational. When a generic AI model is pre-trained on this contradictory data, it learns a “confused” representation of aesthetics. Using this confused model as a starting point (backbone) for learning your specific, rational taste is counter-productive. This is the first major challenge the paper addresses: Is a Generic Aesthetics Model a superior choice for pre-training? The answer is no.
Other Limitations in the Status Quo
Beyond the voting paradox, the authors highlight two other significant gaps:
- Static Surveys: Current methods might ask a user their age or gender to help “personalize” the results. However, preferences evolve. Static surveys done once at the beginning cannot capture changing tastes or specific preferences about the object itself (e.g., “I like high-contrast lighting” vs. “I am a 25-year-old male”).
- Wasted Feedback: In the prevailing paradigm, once the model is deployed, user feedback is rarely used to update the model in real-time. The learning stops just when it should be getting started.
The Core Solution: The PAA+ Paradigm
To solve these issues, the authors propose PAA+, a three-stage paradigm designed to respect individual differences from the very beginning of the training process.

Figure 2 provides a clear comparison between the old and new approaches.
Stage 1: Personalized Pre-training
Instead of training one massive “Generic” model that suffers from the voting paradox, the PAA+ paradigm trains multiple expert models based on distinct personality types. By grouping data based on consistent aesthetic preferences (using MBTI personality types as a proxy), the models learn coherent, transitive aesthetic representations. This eliminates the “confused” prior knowledge found in generic models.
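A minimal sketch of what this grouping stage could look like in practice is shown below; the record layout and the training stand-in are hypothetical, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical annotation records: (image_id, score, annotator_mbti).
annotations = [
    ("img_001", 7.5, "ISFJ"),
    ("img_001", 4.0, "ENTP"),
    ("img_002", 8.0, "ISFJ"),
    ("img_002", 6.5, "ENTP"),
]

# Stage 1 of PAA+: split annotations by personality type so each expert
# model is pre-trained on an internally consistent (transitive) set of
# preferences instead of a contradictory average.
by_personality = defaultdict(list)
for image_id, score, mbti in annotations:
    by_personality[mbti].append((image_id, score))

for mbti, samples in by_personality.items():
    # A real pipeline would launch the authors' training routine here;
    # we just report the split.
    print(f"Pre-train expert model for {mbti} on {len(samples)} samples")
```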
Stage 2: Fine-tuning
This stage remains similar to the traditional approach but is more effective because the starting point is better. A user selects (or is matched with) the pre-trained expert model that most closely resembles their personality. Then, the model is fine-tuned with the user’s specific data.
Stage 3: Continual Learning
This is a crucial addition. PAA+ introduces a loop where the model constantly refines itself based on user interactions. As the user rates predictions or provides feedback, the model updates, ensuring it adapts to drifting preferences over time.
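As a rough PyTorch-style sketch (all names illustrative; the paper’s actual update schedule may differ), the Stage 3 loop amounts to an online fine-tuning step per feedback event:

```python
import torch

def continual_update(model, user_stream, optimizer, loss_fn):
    """Stage 3 sketch: refine the model on each new piece of user
    feedback as it arrives. Names are illustrative, not the paper's code."""
    model.train()
    for image, user_score in user_stream:        # ongoing interactions
        prediction = model(image)
        loss = loss_fn(prediction, user_score)   # how wrong was the model?
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # adapt to drifting taste
    return model
```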
Exemplification: Physique Aesthetics Assessment
To prove this paradigm works, the researchers chose a challenging domain: Physique Aesthetics Assessment (PhysiqueAA).
Why physique? Because it is incredibly subjective. One person might admire the bulk of a bodybuilder, while another admires the lean lines of a long-distance runner. Furthermore, physique is not just about a 2D image; it involves 3D geometry, posture, and health, making it a complex computer vision problem.
The PhysiqueFrame Architecture
The researchers developed a specific framework called PhysiqueFrame to handle this task.

As shown in Figure 3, the framework consists of two primary networks working in tandem:
- PANet (Physique Analysis Network): Extracts objective features from the image.
- PENet (Preference Extraction Network): Understands the user’s subjective desires.
Let’s break these down.
1. PANet: Seeing the Body in 3D
Standard image assessment models look at pixels. However, judging a physique requires understanding body shape and bone structure. PANet uses two modules to achieve this:
A. Mesh Perceiving Module (MPM)
This module extracts a 3D mesh of the human body from the 2D image. It creates a mathematical representation of the body’s surface. However, 3D mesh data can be irregular and sparse. To handle this, the authors utilize a normalization technique based on affine geometry.
The transformation of local points on the mesh is computed with an affine normalization (the equation itself is not reproduced in this excerpt). In effect, the module normalizes local point clouds so that the geometry it captures is stable under transformations such as rotation and scaling. This allows the network to “understand” the volume and shape of the person in the image, regardless of their position in the frame.
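The paper’s exact formula is not reproduced here, but a common recipe for affine normalization of local point clouds is to center them and whiten them with the local covariance. The NumPy sketch below illustrates that generic recipe under this assumption; it is not the authors’ equation:

```python
import numpy as np

def normalize_local_points(points: np.ndarray) -> np.ndarray:
    """Center a local point cloud and whiten it with an affine map.

    `points` is an (N, 3) array of mesh vertices around a query point.
    This is an illustrative stand-in for the paper's affine-geometry
    normalization, not the authors' exact equation.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid                 # remove translation
    cov = centered.T @ centered / len(points)    # 3x3 local covariance
    # Eigen-decomposition gives an affine frame; mapping into it
    # removes the effect of rotation and anisotropic scaling.
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitened = centered @ eigvecs / np.sqrt(eigvals + 1e-8)
    return whitened
```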
B. Posture Analyzing Module (PAM)
Aesthetics is also about grace and pose. The PAM treats the human skeleton as a graph, where joints are nodes and bones are edges.

As detailed in Figure 5, the system uses a Keypoints Decoder. It combines visual features from the image (extracted via a Swin Transformer) with the geometric locations of the joints. These are fed into a Graph Convolutional Network (GCN), which lets the model analyze the “flow” of a pose, essential for judging things like dance or athletic form.
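To ground the idea, here is a toy PyTorch sketch of a single graph convolution over a skeleton; the 5-joint skeleton, feature sizes, and random inputs are illustrative stand-ins (real systems use COCO-style 17-keypoint skeletons and Swin features):

```python
import torch
import torch.nn as nn

# Skeleton as a graph: joints are nodes, bones are edges.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]   # toy 5-joint skeleton
num_joints = 5

# Symmetric normalized adjacency with self-loops, as in a standard GCN.
A = torch.eye(num_joints)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
deg_inv_sqrt = A.sum(dim=1).rsqrt().diag()
A_hat = deg_inv_sqrt @ A @ deg_inv_sqrt

class SkeletonGCNLayer(nn.Module):
    """One graph convolution over joint features: each joint mixes
    information with the joints it is connected to by a bone."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                   # x: (batch, joints, in_dim)
        return torch.relu(self.linear(A_hat @ x))

# Joint features = visual features concatenated with joint coordinates;
# random stand-ins here.
features = torch.randn(8, num_joints, 64)
out = SkeletonGCNLayer(64, 32)(features)    # -> (8, 5, 32)
```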
2. PENet: Understanding the User via LLMs
While PANet analyzes the image, PENet analyzes the user. It takes multimodal feedback—text descriptions (“I like slender builds”), survey results, or even audio—and processes them using a Large Language Model (LLaVA++).
Using a Chain of Thought (CoT) approach, the LLM decomposes complex user feedback into quantifiable scores on factors like “Style,” “Shape,” and “Expressiveness.” This turns abstract human preference into mathematical vectors that the visual model can use to adjust its predictions.
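A hedged sketch of that decomposition step is shown below; the prompt wording, factor parsing, and the `query_llm` stub are our illustration, while the real system calls LLaVA++:

```python
FACTORS = ["Style", "Shape", "Expressiveness"]

PROMPT_TEMPLATE = (
    "You are scoring a user's aesthetic preference.\n"
    'User feedback: "{feedback}"\n'
    "Think step by step, then rate how strongly the feedback expresses "
    "a preference on each factor ({factors}) from 0 to 10.\n"
    "Answer as: Style=<n>, Shape=<n>, Expressiveness=<n>."
)

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLaVA++ call; returns a canned
    answer so the parsing logic below is runnable."""
    return "Style=3, Shape=9, Expressiveness=2"

def preference_vector(feedback: str) -> list[float]:
    """Turn free-form feedback into a quantifiable score vector."""
    prompt = PROMPT_TEMPLATE.format(feedback=feedback,
                                    factors=", ".join(FACTORS))
    answer = query_llm(prompt)
    # Parse "Style=3, Shape=9, Expressiveness=2" -> [3.0, 9.0, 2.0]
    return [float(part.split("=")[1]) for part in answer.split(",")]

print(preference_vector("I like slender builds"))  # [3.0, 9.0, 2.0]
```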
The PhysiqueAA50K Dataset
To train such a specific system, the researchers needed data that didn’t exist. They created PhysiqueAA50K, the first large-scale dataset for personalized physique aesthetics.

Figure 4 gives an overview of this massive undertaking:
- Scale: Over 50,000 images covering diverse sports and activities (Yoga, Bodybuilding, Dance).
- Annotation: They didn’t just ask random people to vote; they used a Human-AI collaborative approach:
  - Pseudo-labels were first generated using AI.
  - 16 experts, each representing one of the 16 MBTI personality types, reviewed and corrected these labels.
- Dimensions: Each image is rated on Appearance, Health, and Posture.
This rigorous data collection ensures that the “Pre-training” stage of PAA+ has high-quality, distinct personality profiles to learn from, rather than a muddy average.
Experiments and Results
Did the new paradigm actually work? The results suggest a resounding yes.
1. Beating the “Average” (Stage 1 & 2 Validation)
The researchers compared models pre-trained on Generic data (GAA) vs. models pre-trained on Personality-Matched data (Ours). They tested this on three specific user profiles (ISFJ, ESFJ, ISTJ).

Table 1 shows the comparison. You can see the metrics:
- S (SRCC) & L (LCC): Spearman rank and linear (Pearson) correlation coefficients (higher is better).
- A (Accuracy): Binary classification accuracy.
- Sat: Satisfaction rate.
In almost every category, the Ours column outperforms the GAA column. For example, looking at the User-ISFJ Appearance score, the correlation jumps from 0.650 (GAA) to 0.699 (Ours). This proves that starting with a personality-aligned model is far superior to starting with a generic one.
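For reference, both correlation metrics are one-liners with SciPy; the scores below are made-up examples, not numbers from the paper:

```python
from scipy.stats import spearmanr, pearsonr

# Hypothetical predicted vs. ground-truth user scores.
predicted = [6.1, 7.8, 5.2, 8.9, 4.3]
user_gt   = [6.0, 8.0, 5.5, 9.0, 4.0]

srcc, _ = spearmanr(predicted, user_gt)   # S: rank-order agreement
lcc, _  = pearsonr(predicted, user_gt)    # L: linear agreement
print(f"SRCC={srcc:.3f}  LCC={lcc:.3f}")  # both near 1.0 for this toy data
```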
2. The Power of Personalized Surveys
The study also analyzed how much the “Survey” (the user’s stated preferences) influenced the result.

Figure 6 illustrates this qualitatively. The model was given three silhouettes (A, B, C).
- When the survey said “I prefer a slender physique,” Silhouette A (slender) got high scores (8.8).
- When the survey changed to “I prefer a curvy physique,” Silhouette A’s score dropped (6.5), and Silhouette C (curvy) rose significantly.
- This demonstrates the model’s ability to fundamentally shift its aesthetic criteria based on user input.
3. Continual Learning Works
One of the most exciting results came from Stage 3: the continuous feedback loop.

Figure 7 plots user satisfaction over time. For all three distinct users (ESFJ, ISTJ, ISFJ), satisfaction rates climbed steadily as the model underwent more “Update Epochs.” This confirms that the model isn’t static; it effectively learns from ongoing interactions.
Figure 8 visualizes this loop. In Epoch 1, the user says, “I prefer a slender, graceful physique.” In Epoch 2, the model adjusts the scores for subsequent images (represented by the red and green arrows), aligning future predictions with that feedback.
4. Where is the AI Looking? (Saliency Maps)
Finally, to prove that PhysiqueFrame is actually looking at the body and not just the background scenery, the researchers used GradCAM to visualize the model’s attention.

In Figure 9, compare “Ours” (top row) with other leading models like NIMA or MaxViT. The heatmaps for “Ours” are tightly focused on the limbs and torso of the subjects. Other models often get distracted by the floor or background elements. This focus is a direct result of the PANet architecture designed specifically for 3D body perception.
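For readers who want to produce similar visualizations, here is a minimal, generic Grad-CAM sketch in PyTorch using forward/backward hooks; it follows the standard Grad-CAM recipe, not the authors’ code:

```python
import torch

def grad_cam(model, layer, image, target_index=None):
    """Minimal Grad-CAM sketch: weight the target layer's activations
    by the gradient of the output score, then collapse channels into a
    heatmap. `image` is a (C, H, W) tensor; `layer` is a conv module."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(v=go[0]))

    score = model(image.unsqueeze(0))
    if target_index is not None:
        score = score[0, target_index]
    score.sum().backward()
    h1.remove(); h2.remove()

    a, g = acts["v"], grads["v"]                 # (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)   # per-channel importance
    cam = torch.relu((weights * a).sum(dim=1))   # (1, H, W) heatmap
    return cam / (cam.max() + 1e-8)              # normalize to [0, 1]
```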
Conclusion and Implications
The “PAA+” paradigm represents a significant shift in how we think about subjective AI. By acknowledging that “collective” preference is often a mathematical fallacy (the voting paradox), the researchers have charted a new course for personalization.
The key takeaways are:
- Don’t Start Average: Pre-training on distinct personality clusters yields better results than pre-training on a massive, generic average.
- Context Matters: Understanding aesthetics requires understanding the subject. For physique, this meant building 3D-aware modules (PANet) rather than relying on standard 2D image processors.
- Never Stop Learning: The integration of a continual learning stage ensures that the AI grows with the user.
While this paper focused on physique, the implications extend to fashion, interior design, art recommendation, and any other field where “good” is subjective. Future AI won’t just tell us what is popular; it will understand specifically what we find beautiful.