Introduction
Imagine walking into a messy living room. You don’t just see a “sofa,” a “cat,” and a “remote.” You instantly understand the web of connections: the cat is sleeping on the sofa, the remote is under the cushion, and the painting is hanging on the wall. This structured understanding of objects and their relationships is what computer vision researchers call a Scene Graph.
Scene Graph Generation (SGG) is a pivotal task in AI, bridging the gap between raw pixel data and high-level language description. It transforms an image into a structured graph where nodes are objects and edges are relationships (predicates). This structure is essential for downstream tasks like robotic navigation (“robot, pick up the cup on the table”) or assisting the visually impaired.
However, SGG models face a significant hurdle: the real world is messy and unevenly distributed. While training datasets are massive, they cannot cover every possible interaction between every possible object. A model might see thousands of examples of a “man riding a horse” but zero examples of a “woman carrying a towel.” This phenomenon, known as the underrepresentation issue, leads to AI that is great at stating the obvious but terrible at recognizing rare or novel relationships.
In this deep dive, we will explore a fascinating research paper by Yuxuan Wang and Xiaoyuan Liu from Nanyang Technological University. They propose a novel framework that integrates the vast, “common sense” knowledge of pre-trained Vision-Language Models (VLMs) into SGG tasks. Crucially, they solve a major problem that comes with this integration: bias.
We will break down how they use sophisticated mathematical estimation to “reverse engineer” the hidden biases of pre-trained models and combine them with task-specific models to achieve state-of-the-art results.
The Core Problem: Underrepresentation in SGG
To understand why this research is necessary, we must first understand the limitations of traditional SGG models. Most existing models are trained from scratch on datasets like Visual Genome. These datasets are plagued by a long-tail distribution.
In simple terms, a few relationship types (like “on,” “wearing,” “has”) appear constantly, while specific actions (like “painting,” “eating,” “carrying”) appear much less frequently. Even worse, the combinations of Subject-Predicate-Object (triplets) are exponentially diverse. A training set might contain “Man eating apple” but not “Boy eating pear,” even though the relationship is semantically identical.

As shown in Figure 1, there is a massive gap in prediction quality based on how well-represented a triplet is in the training data.
- Left Side (Well Represented): Common triplets like “Woman carrying Bag” have high confidence scores (0.87).
- Right Side (Less Represented/Unseen): Rare triplets like “Woman carrying Towel” have abysmal scores (0.05).
The model simply hasn’t seen enough examples to learn these rare connections effectively. This is where the researchers propose a shift in strategy: instead of relying solely on the SGG dataset, why not borrow the brain of a model that has “read” the entire internet?
Enter Vision-Language Models (VLMs)
Vision-Language Models (VLMs), such as ViLT or Oscar, are trained on colossal amounts of image-text pairs from the web. They possess a broad, generalized understanding of how visual concepts relate to language. The researchers hypothesized that these models possess the “common sense” required to handle the underrepresented triplets that stump standard SGG models.
The idea is to use a Zero-Shot VLM. “Zero-shot” means using the pre-trained model directly without fine-tuning it on the specific dataset. We can query the VLM using a prompt like:
“The {subject} is [MASK] the {object}.”
The VLM then fills in the [MASK] with a relationship word (predicate). Because the VLM has seen millions of concepts during pre-training, it should theoretically handle “Woman carrying towel” easily, even if the SGG dataset lacks it.
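As a concrete illustration, here is a minimal sketch of this cloze-style querying. The `mock_scores` dictionary is an invented stand-in for the scores a real VLM (such as ViLT or Oscar) would assign to the `[MASK]` slot:

```python
# Minimal sketch of zero-shot predicate querying via a masked prompt.
# The scores below are invented for illustration; a real system would
# read them off the VLM's [MASK]-token distribution.

def build_prompt(subject: str, obj: str) -> str:
    """Format the cloze-style prompt used to query the VLM."""
    return f"The {subject} is [MASK] the {obj}."

def rank_predicates(scores: dict[str, float]) -> list[str]:
    """Sort candidate predicates by their (mock) [MASK] scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)

prompt = build_prompt("woman", "towel")   # "The woman is [MASK] the towel."
# Toy scores a biased VLM might assign to the [MASK] slot:
mock_scores = {"near": 0.6, "carrying": 0.3, "on": 0.1}
best = rank_predicates(mock_scores)[0]    # a biased VLM favors "near"
```

Note that in this toy example the biased VLM already prefers "near" over "carrying" — exactly the predicate bias discussed next.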
The Catch: Predicate Bias
It sounds like a perfect solution, but there is a catch. Direct inference using a zero-shot VLM introduces severe Predicate Bias.
VLMs are not blank slates; they have their own biases derived from the massive, uncontrolled datasets they were pre-trained on. If the VLM saw the word “near” millions of times more than “carrying” during its pre-training, it will be statistically biased toward predicting “near,” regardless of what is in the image.
This creates a mismatch. The distribution of relationships in the SGG task (\(\pi_{sg}\)) is likely very different from the distribution in the VLM’s pre-training data (\(\pi_{pt}\)). To make matters worse, we don’t know \(\pi_{pt}\). The pre-training data is often proprietary or too massive to analyze, meaning the bias is a “black box.”
Methodology: The Proposed Architecture
The researchers propose a two-branch architecture that combines a specific SGG model with a general VLM, solving the bias issue mathematically.

As illustrated in Figure 2, the framework consists of two pathways:
- The Task-Specific Branch (\(f_{sg}\)): This is a VLM that has been fine-tuned specifically on the Scene Graph dataset. It knows the specific classes and requirements of the SGG task but suffers from the underrepresentation issue.
- The Zero-Shot Branch (\(f_{zs}\)): This is the frozen, pre-trained VLM. It has extensive general knowledge but suffers from unknown predicate bias.
The system takes an image, crops the regions of interest (the subject and object), and feeds them into both branches.
\[
\begin{cases}
\mathbf{o}_{\mathrm{zs}}^{k} = f_{\mathrm{zs}}(z_i, z_j, \mathbf{x}_{i,j}) \in \mathbb{R}^{K}, \\[2pt]
[\mathbf{o}_{\mathrm{sg}}^{0}, \mathbf{o}_{\mathrm{sg}}^{k}] = f_{\mathrm{sg}}(z_i, z_j, \mathbf{x}_{i,j}) \in \mathbb{R}^{K+1}.
\end{cases}
\]
Note that the Zero-Shot branch (\(f_{zs}\)) outputs logits for the \(K\) relationship classes, while the Fine-Tuned branch (\(f_{sg}\)) outputs \(K+1\) classes (including the “background” or “no relationship” class). Since the pre-trained model doesn’t understand the concept of a “background” class specific to this dataset, the system relies on the fine-tuned model for that specific determination.
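A minimal sketch of this shape difference, using random mock heads in place of the actual VLM branches (the feature dimension and `K` are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50  # illustrative number of predicate classes

def f_zs(feat):
    """Mock zero-shot branch: logits over the K relationship classes only.
    (The mock ignores its input; a real branch would run the frozen VLM.)"""
    return rng.standard_normal(K)

def f_sg(feat):
    """Mock fine-tuned branch: logits over K classes plus a background
    ("no relationship") class at index 0, which only this branch predicts."""
    return rng.standard_normal(K + 1)

feat = rng.standard_normal(768)   # stand-in for the (z_i, z_j, x_ij) inputs
o_zs = f_zs(feat)                 # shape (K,)
o_bg_and_sg = f_sg(feat)          # shape (K+1,)
```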
Innovation 1: LM Estimation for Predicate Debiasing
The heart of this paper is how the authors tackle the bias in the Zero-Shot branch. They employ a technique called Post-hoc Logits Adjustment.
The Math of De-biasing
Ideally, we want the model’s prediction to reflect the target distribution of the test set (\(P_{ta}\)), not the biased training distribution (\(P_{tr}\)). By Bayes’ rule, and assuming the class-conditional likelihood \(P(z_i, z_j, \mathbf{x}_{i,j} \mid r)\) is shared between the two environments, the posterior-to-prior ratio is the same under both distributions:
\[
\frac{P_{\mathrm{tr}}(r \mid z_i, z_j, \mathbf{x}_{i,j})}{P_{\mathrm{tr}}(r)} = \frac{P_{\mathrm{ta}}(r \mid z_i, z_j, \mathbf{x}_{i,j})}{P_{\mathrm{ta}}(r)}
\]
By rearranging this, we can adjust the output logits (the raw prediction scores before Softmax) to remove the training bias and inject the target prior. The adjustment formula looks like this:
\[
\hat{\mathbf{o}}^{k}(r) = \mathbf{o}^{k}(r) - \log P_{\mathrm{tr}}(r) + \log P_{\mathrm{ta}}(r)
\]
Here, \(\hat { \mathbf { o } } ^ { k }\) is the debiased logit. The equation effectively subtracts the log-probability of the training distribution (\(P_{tr}\)) and adds the log-probability of the target distribution (\(P_{ta}\)).
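A small numpy sketch of this post-hoc logit adjustment, using invented toy numbers for the logits and priors:

```python
import numpy as np

def adjust_logits(logits, p_train, p_target):
    """Post-hoc logit adjustment: subtract the log training prior,
    add the log target prior (o_hat = o - log P_tr + log P_ta)."""
    return logits - np.log(p_train) + np.log(p_target)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example with 3 predicates [near, carrying, on]; "near" dominated training.
logits = np.array([2.0, 0.5, 0.1])      # raw biased scores
p_train = np.array([0.8, 0.1, 0.1])     # skewed training prior
p_target = np.array([1/3, 1/3, 1/3])    # uniform target prior
debiased = softmax(adjust_logits(logits, p_train, p_target))
# Prediction flips from "near" (index 0) to "carrying" (index 1) after debiasing.
```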
The Missing Variable
For the Fine-Tuned branch, this is easy. We know \(P_{tr}\) because we have the SGG training set (\(\pi_{sg}\)).
For the Zero-Shot branch, however, we do not know \(P_{tr}\) (which is \(\pi_{pt}\), the pre-training distribution). This is the key roadblock the authors overcome.
Lagrange-Multiplier (LM) Estimation
Since \(\pi_{pt}\) is unknown, the authors estimate it using Constrained Optimization. They assume that if they can find a distribution \(\pi_{pt}\) that minimizes the error on a validation set (from the SGG dataset), that distribution effectively represents the bias of the pre-trained model.
They formulate this as an optimization problem: Find \(\pi_{pt}\) that minimizes the Cross-Entropy loss (\(R_{ce}\)) between the adjusted logits and the ground truth labels, subject to the constraint that \(\pi_{pt}\) must sum to 1 (making it a valid probability distribution).
\[
\begin{aligned}
\pi_{\mathrm{pt}} = \operatorname*{argmin}_{\pi_{\mathrm{pt}}}\; & R_{ce}\big(\mathbf{o}^{k} - \log \pi_{\mathrm{pt}} + \log \pi_{\mathrm{sg}},\; r\big), \\
\text{s.t.}\;\; & \pi_{\mathrm{pt}}(r) \geq 0 \;\;\text{for } r \in \mathcal{C}_r, \qquad \sum_{r \in \mathcal{C}_r} \pi_{\mathrm{pt}}(r) = 1.
\end{aligned}
\]
To solve this, they use the Lagrange Multiplier method, transforming the constrained problem into an unconstrained one:
\[
\pi_{\mathrm{pt}} = \operatorname*{argmin}_{\pi_{\mathrm{pt}}}\; \max_{\lambda_r \geq 0,\, v}\; R_{ce} - \sum_{r} \lambda_r\, \pi_{\mathrm{pt}}(r) + v\Big(1 - \sum_{r} \pi_{\mathrm{pt}}(r)\Big)
\]
This mathematical trick allows them to approximate the hidden pre-training distribution \(\pi_{pt}\) without ever seeing the original pre-training data.
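Here is a minimal sketch of the estimation idea. Rather than solving the Lagrangian explicitly, it enforces the simplex constraints by parameterizing \(\pi_{pt}\) as a softmax over free parameters (a common, equivalent trick, not necessarily the authors’ exact solver) and minimizes the validation cross-entropy by gradient descent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def estimate_pretrain_prior(logits, labels, pi_sg, steps=5000, lr=1.0):
    """Estimate the hidden pre-training prior pi_pt on validation data.
    Parameterizing pi_pt = softmax(theta) keeps it on the probability
    simplex (non-negative, sums to 1) by construction, playing the role
    of the paper's explicit constraints."""
    K = logits.shape[1]
    theta = np.zeros(K)
    y = np.eye(K)[labels]                      # one-hot ground-truth predicates
    for _ in range(steps):
        pi_pt = softmax(theta)
        adjusted = logits - np.log(pi_pt) + np.log(pi_sg)
        p = softmax(adjusted)
        theta -= lr * (y - p).mean(axis=0)     # analytic gradient of the CE loss
    return softmax(theta)

# Toy check: inject a known bias log(pi_true) into otherwise well-calibrated
# logits and see whether the estimate recovers pi_true.
pi_true = np.array([0.7, 0.2, 0.1])
pi_sg = np.full(3, 1/3)
base = 4.0 * np.eye(3)                         # one confident sample per class
logits = base + np.log(pi_true) - np.log(pi_sg)
pi_est = estimate_pretrain_prior(logits, np.array([0, 1, 2]), pi_sg)
```

On this synthetic data the estimate converges to the injected prior, mirroring the paper’s finding that the hidden bias can be recovered without access to the pre-training data.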
Visualizing the Estimated Bias
Does this estimation actually work? The authors plotted the distributions to verify.

Figure 3 shows the SGG training set distribution (Blue) compared to the estimated pre-training distributions for two VLMs, ViLT (Orange) and Oscar (Green). The differences are stark. For example, look at the predicate “wearing.” It is moderately frequent in the training set but dominates the estimated distribution for ViLT. If this bias weren’t corrected, ViLT would over-predict “wearing” constantly. This validates the necessity of the LM Estimation.
Innovation 2: Certainty-Aware Ensemble
Once both the Zero-Shot model (\(f_{zs}\)) and the Fine-Tuned model (\(f_{sg}\)) are debiased, the final step is to combine them.
The authors observed that the two models have different strengths. The Zero-Shot model is often better at broad, semantic relationships, while the Fine-Tuned model is better at dataset-specific quirks. To leverage this, they use a dynamic ensemble strategy.
They calculate a Confidence Score (conf) for each model based on the maximum probability it assigns to a non-background class.
\[
\mathrm{conf} = \max_{r \in \mathcal{C}_r} \hat{P}(r \mid z_i, z_j, \mathbf{x}_{i,j}),
\]
computed separately for each branch to obtain \(\mathrm{conf}_{sg}\) and \(\mathrm{conf}_{zs}\).
Using these scores, they compute a dynamic weight \(W_{cer}\). If the SGG model is much more confident than the Zero-Shot model, \(W_{cer}\) will be high, and the final prediction will rely mostly on the SGG model. If the SGG model is unsure (common with rare triplets), the weight shifts toward the Zero-Shot VLM.
\[
P_{\mathrm{ens}}(r \mid z_i, z_j, \mathbf{x}_{i,j}) = W_{\mathrm{cer}} \cdot \hat{P}_{\mathrm{sg}}(r \mid z_i, z_j, \mathbf{x}_{i,j}) + (1 - W_{\mathrm{cer}}) \cdot \hat{P}_{\mathrm{zs}}(r \mid z_i, z_j, \mathbf{x}_{i,j})
\]
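A small sketch of the certainty-aware ensemble. The exact form of \(W_{cer}\) is not reproduced here, so the normalized-confidence weight below is an illustrative assumption rather than the paper’s formula:

```python
import numpy as np

def confidence(probs):
    """Confidence = maximum probability over the relationship classes."""
    return probs.max()

def certainty_aware_ensemble(p_sg, p_zs):
    """Blend the two debiased branch predictions with a dynamic weight.
    NOTE: the normalized-confidence form of w_cer below is an illustrative
    choice, not necessarily the paper's exact definition."""
    conf_sg, conf_zs = confidence(p_sg), confidence(p_zs)
    w_cer = conf_sg / (conf_sg + conf_zs)
    return w_cer * p_sg + (1 - w_cer) * p_zs

# A confident fine-tuned branch vs. a less certain zero-shot branch:
p_sg = np.array([0.05, 0.9, 0.05])   # fine-tuned branch, confident
p_zs = np.array([0.2, 0.3, 0.5])     # zero-shot branch, diffuse
p_ens = certainty_aware_ensemble(p_sg, p_zs)
```

Because the fine-tuned branch is more confident here, the blend leans toward its prediction; on a rare triplet the weight would shift the other way.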
Experiments and Results
The researchers tested their method on the Visual Genome dataset, using two different VLMs (ViLT and Oscar) as the backbone. They compared their method against several state-of-the-art SGG baselines.
The Metrics
They focused on two main metrics:
- Recall@K: the standard metric — the fraction of ground-truth relationships recovered among the model’s top-K predictions.
- Mean Recall@K (mRecall): This calculates recall for each category separately and averages them. This is the critical metric for the “long-tail” problem because performing well on rare classes is necessary to get a high mRecall.
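A toy example makes the difference concrete: one missed rare class can halve mean Recall while barely denting overall Recall. Here `recalled[i]` marks whether instance `i`’s ground-truth predicate appeared in the model’s top-K predictions:

```python
from collections import defaultdict

def recall_and_mean_recall(labels, recalled):
    """labels[i]: ground-truth predicate class of instance i.
    recalled[i]: whether it was recovered in the top-K predictions.
    Returns (overall recall, class-averaged mean recall)."""
    overall = sum(recalled) / len(recalled)
    per_class = defaultdict(list)
    for lab, hit in zip(labels, recalled):
        per_class[lab].append(hit)
    mean_recall = sum(sum(h) / len(h) for h in per_class.values()) / len(per_class)
    return overall, mean_recall

# 9 instances of the head class "on" (all recalled), 1 rare "carrying" (missed):
labels = ["on"] * 9 + ["carrying"]
recalled = [True] * 9 + [False]
r, mr = recall_and_mean_recall(labels, recalled)  # r = 0.9, mr = 0.5
```

Overall Recall stays at 0.9 despite completely missing the rare class, while mean Recall drops to 0.5 — which is why mRecall is the metric that exposes long-tail failures.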
Quantitative Performance
The results show a clear victory for the proposed method.

In Table 2, we see the standard Recall results. The rows marked with “+ Ours” show the performance when the debiased Zero-Shot VLM is ensembled with the baseline.
- ViLT ft (Fine-Tuned): 34.9 mRecall@20.
- ViLT ft + Ours: 35.3 mRecall@20.
- Oscar ft: 36.7 mRecall@20.
- Oscar ft + Ours: 37.3 mRecall@20.
While these gains might look modest, in the highly competitive field of SGG, they are significant. However, the real story is in the Mean Recall (Table 1 below), which highlights performance on the long tail.

In Table 1, the gains are even more pronounced. For example, ViLT ft-la + Ours achieves a mRecall@100 of 46.5, significantly outperforming the baseline of 44.5. This proves that the method is particularly effective at pulling up the performance of those difficult, rare categories.
The “Unseen” Triplets
The most compelling evidence comes from looking at triplets that were never seen during training. This is the ultimate test of the “common sense” provided by the Zero-Shot VLM.

Table 3 breaks down accuracy for “All” triplets versus “Unseen” triplets. Look at the Ens. Gain (Ensemble Gain) column.
- For All triplets, the gain is moderate (~2-3%).
- For Unseen triplets, the gain is massive (~5-6%).
This confirms the core hypothesis: The debiased Zero-Shot model successfully transfers knowledge to the SGG task, specifically handling the rare and unseen scenarios that the fine-tuned model fails to grasp.
Conclusion
The integration of Vision-Language Models into specific tasks like Scene Graph Generation is a promising frontier, but it is not as simple as plugging one model into another. This paper identifies a critical hidden variable—predicate bias in pre-training—and solves it using an elegant mathematical approach: Lagrange-Multiplier Estimation.
By estimating the unknown bias of the pre-trained model and correcting it via logit adjustment, the authors allow the SGG system to utilize the vast, generalized knowledge of VLMs without being misled by their pre-training skew. The dynamic, certainty-aware ensemble ensures that the system gets the best of both worlds: the specificity of a fine-tuned model and the broad understanding of a generic one.
This “training-free” method (in terms of the Zero-Shot branch) offers a computationally efficient way to significantly boost performance, particularly for the rare, “long-tail” events that make the real world so difficult for AI to understand. It is a compelling example of how mathematical rigor can unlock the full potential of large foundation models.