Imagine you are looking at a photo of a picnic. There are three wicker baskets, fifty red apples, and two small teddy bears. If someone asks you to “count the bears,” you instantly focus on the two toys and ignore the sea of apples. This ability to filter visual information based on language is intuitive for humans. However, for Artificial Intelligence, this is a surprisingly difficult task.
In the world of computer vision, this task is known as Zero-Shot Object Counting. The goal is to build a model that can count instances of any object category specified by a text description, without ever having been explicitly trained on that specific category.
Historically, models have struggled with this. They often get confused by the “majority class.” In our picnic example, standard models might see the mass of apples and just count those, regardless of whether you asked for bears, baskets, or ants.
In this post, we are doing a deep dive into a fascinating paper titled “T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting.” This research proposes a new framework that moves away from standard CLIP-based methods and instead leverages the rich, pixel-level knowledge embedded in Text-to-Image Diffusion Models (like Stable Diffusion) to achieve state-of-the-art counting performance.
We will break down the problem, the architecture, the mathematics behind their novel loss functions, and the results that suggest a major leap forward in how machines understand text and images together.
The Problem: Text Insensitivity in Vision Models
To understand why T2ICount is necessary, we first need to look at the limitations of current approaches. Most recent zero-shot counting methods rely on CLIP (Contrastive Language-Image Pre-training). CLIP is fantastic at matching an image to a caption globally. However, counting is a local, pixel-level task.
CLIP tends to focus on the most prominent objects in an image. If you have an image dominated by background objects (the majority class), a CLIP-based counter often gets “distracted” by them. It exhibits poor text sensitivity—it sees the image, it sees the text, but it fails to strictly align the counting process to the specific object mentioned in the text.
Let’s look at a concrete example from the paper to illustrate this failure.

In Figure 1 above, look at the top-left image. The ground truth (GT) is 2.0 bears. The image contains a few bears and many apples.
- CLIP-Count (Top Right): It predicts 42.41 objects. It completely ignores the “bear” prompt and counts the apples because they are visually dominant.
- VLCounter (Bottom Left): It predicts 25.04. Better, but still heavily influenced by the apples.
- T2ICount (Bottom Right): The proposed method correctly focuses only on the bears, predicting 2.0.
The authors identify that this issue stems from the fact that CLIP operates on global semantics. To count accurately, we need a model that understands fine-grained, local visual details and how they relate to text.
The Solution: Leveraging Diffusion Priors
This is where Diffusion Models come in. Text-to-Image models like Stable Diffusion are trained to generate images pixel-by-pixel based on text prompts. To do this, they must possess a deep, localized understanding of how a specific word (like “bear”) translates to specific pixels in an image.
The researchers behind T2ICount hypothesized that the internal feature maps of a pre-trained diffusion model contain the rich “prior knowledge” needed for accurate counting.
The Challenge of Efficiency
Using a diffusion model for counting isn’t straightforward. Standard diffusion works by iteratively denoising an image over many steps (e.g., 50 or 100 steps). Running this full process just to count objects would be incredibly slow and computationally expensive.
To make this practical, the authors propose using a Single Denoising Step. They add noise to the input image and ask the U-Net (the core of the diffusion model) to predict the noise in just one step. They then extract the feature maps from this single step.
However, efficiency comes at a cost. As we will discuss in the methodology section, a single step often results in weak alignment between the text and the image. The core contribution of this paper is how they fix that weak alignment.
Methodology: The T2ICount Framework
Let’s unpack the architecture. T2ICount is designed to extract features from a diffusion model, refine them to ensure they listen to the text prompt, and then output a density map. A density map is a standard technique in object counting where the model predicts a heatmap; summing up the values in the heatmap gives the total count of objects.
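For intuition, here is a tiny PyTorch example of turning a density map into a count. The tensor values are made up purely for illustration:

```python
import torch

# A predicted density map: one value per pixel, where each object
# contributes roughly 1.0 of total "mass" spread over its area.
density_map = torch.zeros(1, 1, 64, 64)
density_map[0, 0, 10:14, 10:14] = 1.0 / 16   # one object
density_map[0, 0, 30:34, 40:44] = 1.0 / 16   # another object

# The predicted count is simply the sum over all pixels.
count = density_map.sum().item()
print(f"Predicted count: {count:.2f}")  # -> 2.00
```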
Here is the high-level overview of the system:

As shown in Figure 2, the process works as follows:
- Input: An image and a text prompt (e.g., “keyboard keys”).
- Encoders: The image is encoded into a latent space (using a VAE), and the text is encoded (using CLIP).
- Denoising U-Net: The system performs a single denoising step. It extracts multi-scale feature maps (\(F_1, F_2, F_3, F_4\)) from the U-Net’s decoder.
- HSCM: These features are passed through the Hierarchical Semantic Correction Module. This is the brain of the operation, refining features to match the text.
- Output: A counter head produces the final density map.
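To make that data flow concrete, here is a heavily simplified sketch of the forward pass. All component names (`vae`, `text_encoder`, `unet`, `hscm`, `counter_head`) are placeholders standing in for the pretrained or learned modules, not the paper's actual API:

```python
import torch

def count_objects(image, prompt, vae, text_encoder, unet, hscm, counter_head,
                  alpha_bar_1: float = 0.9995):
    """Illustrative T2ICount-style forward pass (a sketch, not the authors' implementation)."""
    # 1. Encode the inputs: image -> diffusion latent, text -> CLIP embedding.
    z0 = vae(image)
    c = text_encoder(prompt)

    # 2. Single denoising step: lightly noise the latent at t = 1 and run the U-Net once,
    #    keeping its multi-scale decoder feature maps F1..F4 instead of the noise prediction.
    eps = torch.randn_like(z0)
    zt = (alpha_bar_1 ** 0.5) * z0 + ((1 - alpha_bar_1) ** 0.5) * eps
    f1, f2, f3, f4 = unet(zt, timestep=1, text_cond=c)

    # 3. Hierarchical Semantic Correction: refine the features so they follow the text prompt.
    refined = hscm([f1, f2, f3, f4], c)

    # 4. The counter head outputs a density map; summing it yields the predicted count.
    density_map = counter_head(refined)
    return density_map.sum(dim=(-2, -1))
```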
1. The Foundation: Single-Step Denoising
The framework uses the standard formulation of a diffusion forward process. Given an initial compressed image representation \(z_0\), noise is added to create \(z_t\):

\[ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \]
In this equation:
- \(z_t\) is the noisy latent variable.
- \(\bar{\alpha}_t\) controls the noise schedule.
- \(\epsilon\) is the random noise.
The model tries to predict the noise \(\epsilon\) conditioned on the text. The authors use features from this attempt. However, because they only use one step (\(t=1\)) to keep it fast, the raw features are messy.
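As a small worked example of this equation, the snippet below noises a latent at a very early timestep. The specific \(\bar{\alpha}_t\) value is hypothetical, chosen only for illustration; real values come from the pretrained diffusion model's noise schedule:

```python
import torch

torch.manual_seed(0)

z0 = torch.randn(1, 4, 64, 64)          # latent from the VAE (Stable Diffusion uses 4 latent channels)
eps = torch.randn_like(z0)              # random Gaussian noise

alpha_bar_t = torch.tensor(0.9995)      # hypothetical noise-schedule value at t = 1
zt = alpha_bar_t.sqrt() * z0 + (1 - alpha_bar_t).sqrt() * eps

# At t = 1 the latent is only slightly perturbed; the paper notes that features
# extracted from this single step are only weakly aligned with the text.
print((zt - z0).abs().mean())           # small perturbation
```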
2. Visualization of the “One-Step” Problem
Why do we need a special correction module? Look at Figure 3 below (panels a-c).

Panels (a), (b), and (c) show the raw cross-attention maps from the diffusion model at different resolutions when using just a single step. Notice how blurry and inaccurate they are? They highlight the general area of objects, but they also highlight background noise and fail to distinctly separate the specific object class (like “pens” or “balls”) from the surroundings.
If we fed these features directly into a counter, the result would be inaccurate. This necessitates the Hierarchical Semantic Correction Module (HSCM).
3. Hierarchical Semantic Correction Module (HSCM)
The HSCM is designed to take those messy, multi-scale features and progressively clean them up. It works in a cascaded fashion, moving from smaller, high-level feature maps to larger, detailed ones.
At each stage, the module performs two key actions: Fusion and Correction.
Feature Fusion: First, features from the previous (smaller) level are upsampled and combined with the current level.

Here, \(F'_i\) is the fused feature map. It combines the upsampled features from the deeper layer (\(V_{i+1}\)) with the current layer (\(F_i\)).
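Here is a minimal sketch of what such a fusion step could look like. The paper's exact operator is not reproduced here, so the bilinear upsampling, channel concatenation, and 1x1 projection below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from torch import nn

class FuseLevel(nn.Module):
    """Fuse upsampled deeper features V_{i+1} with current-level features F_i (illustrative)."""
    def __init__(self, deep_channels: int, cur_channels: int):
        super().__init__()
        # Assumed design choice: concatenate, then project back to the current channel width.
        self.proj = nn.Conv2d(deep_channels + cur_channels, cur_channels, kernel_size=1)

    def forward(self, v_deep: torch.Tensor, f_cur: torch.Tensor) -> torch.Tensor:
        v_up = F.interpolate(v_deep, size=f_cur.shape[-2:], mode="bilinear", align_corners=False)
        return self.proj(torch.cat([v_up, f_cur], dim=1))   # F'_i

# Example shapes: V_{i+1} is smaller and deeper, F_i is larger and shallower.
v_next = torch.randn(1, 256, 16, 16)
f_cur = torch.randn(1, 128, 32, 32)
fused = FuseLevel(256, 128)(v_next, f_cur)
print(fused.shape)   # torch.Size([1, 128, 32, 32])
```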
Semantic Enhancement & Correction: This is the critical step. The model computes a Text-Image Similarity Map (\(S_i\)). This map tells the model: which pixels actually match the text description?

This simple dot product between the visual features \(V_i\) and the text embedding \(c'\) creates a heatmap of relevance.
The module then uses this similarity map to “correct” the features. It forces the model to pay more attention to the regions that have high similarity to the text.

As shown in the equation above, the features \(F'_i\) are enhanced by adding the previous features weighted by the similarity map (\(V_{i+1} \odot S_{i+1}\)). This acts like a spotlight, amplifying the signal of the objects we want to count and suppressing the background.
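Putting the two steps together, here is a hedged sketch of one correction stage: compute a text-image similarity map via a per-pixel dot product with the text embedding, then use it to gate the deeper features before adding them back. The cosine normalization, the sigmoid, and the assumption that channel widths match are illustrative choices, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def similarity_map(v: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Per-pixel similarity between visual features v (B, C, H, W) and a text embedding c (B, C)."""
    v_norm = F.normalize(v, dim=1)
    c_norm = F.normalize(c, dim=1)
    s = torch.einsum("bchw,bc->bhw", v_norm, c_norm)   # dot product at every pixel
    return torch.sigmoid(s).unsqueeze(1)               # (B, 1, H, W), values in (0, 1)

def correct(f_fused: torch.Tensor, v_deep: torch.Tensor, s_deep: torch.Tensor) -> torch.Tensor:
    """Enhance fused features with deeper features gated by their similarity map."""
    gated = v_deep * s_deep                            # V_{i+1} * S_{i+1}: spotlight on text-relevant pixels
    gated = F.interpolate(gated, size=f_fused.shape[-2:], mode="bilinear", align_corners=False)
    # Assumed: the gated features share f_fused's channel width, so they can be added directly.
    return f_fused + gated

# Toy usage
c = torch.randn(1, 128)                                # projected text embedding
v_deep = torch.randn(1, 128, 16, 16)
f_fused = torch.randn(1, 128, 32, 32)
s_deep = similarity_map(v_deep, c)
print(correct(f_fused, v_deep, s_deep).shape)          # torch.Size([1, 128, 32, 32])
```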
4. Representational Regional Coherence Loss (\(\mathcal{L}_{RRC}\))
How do we train this module? Standard counting datasets usually only provide point annotations (a single dot on each object). They don’t give us segmentation masks, so we don’t know exactly which pixels belong to the object and which belong to the background.
Without knowing the background, it’s hard to teach the model to distinguish “foreground” from “background.”
The authors solved this by going back to the Diffusion model. Remember Figure 3? While the single-step attention maps (a-c) were too messy for precise counting, the authors noticed they are actually quite good at finding the general foreground.
They create a Fused Cross-Attention Map (\(\mathcal{A}^{cross}\)) by combining maps from different layers:

Using this map and the ground-truth point annotations (\(D^{gt}\)), they generate a “Positive-Negative-Ambiguous” (PNA) map. This map automatically labels pixels into three categories:
- Positive (1): High density in ground truth (definitely an object).
- Negative (0): Low value in the cross-attention map (definitely background).
- Ambiguous (-1): Everything else (edges, uncertain areas).
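A hedged sketch of how such a PNA map could be built from the ground-truth density and the fused cross-attention map is shown below. Both thresholds and the labeling order are assumptions chosen for illustration; the paper's exact recipe may differ:

```python
import torch

def build_pna_map(gt_density: torch.Tensor, attn: torch.Tensor,
                  pos_thresh: float = 0.1, neg_thresh: float = 0.2) -> torch.Tensor:
    """Label each pixel as positive (1), negative (0), or ambiguous (-1).

    gt_density: (H, W) density map rendered from point annotations.
    attn:       (H, W) fused cross-attention map from the diffusion model, in [0, 1].
    """
    pna = torch.full_like(gt_density, -1.0)   # default: ambiguous
    pna[attn < neg_thresh] = 0.0               # low attention -> confidently background
    pna[gt_density > pos_thresh] = 1.0         # high GT density -> confidently an object
    return pna

# Toy usage with random maps (real inputs come from the annotations and the U-Net's attention).
gt = torch.rand(64, 64) * 0.3
attn = torch.rand(64, 64)
pna = build_pna_map(gt, attn)
print([(pna == v).sum().item() for v in (1.0, 0.0, -1.0)])   # pixel counts per label
```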

This PNA map acts as a “pseudo-mask” to supervise the training. The loss function, Representational Regional Coherence Loss (\(\mathcal{L}_{RRC}\)), forces the model’s predicted similarity map (\(S\)) to match this PNA map.

The loss is split into positive and negative components:
Positive Loss: Ensures positive regions have high similarity.

Negative Loss: Ensures background regions have zero similarity.

Crucially, the “Ambiguous” regions are ignored. This prevents the model from being penalized for uncertainty in the fuzzy boundaries between objects.
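The paper's exact loss terms are not reproduced here, so the sketch below uses a simple masked formulation as an assumption: push the predicted similarity toward 1 on positive pixels, toward 0 on negative pixels, and ignore ambiguous pixels entirely:

```python
import torch

def rrc_loss(similarity: torch.Tensor, pna: torch.Tensor) -> torch.Tensor:
    """Illustrative regional coherence loss over a predicted similarity map in [0, 1].

    similarity: (H, W) predicted text-image similarity map.
    pna:        (H, W) map with 1 = positive, 0 = negative, -1 = ambiguous.
    """
    pos = pna == 1.0
    neg = pna == 0.0

    # Positive term: similarity should be high where an object definitely is.
    loss_pos = (1.0 - similarity[pos]).mean() if pos.any() else similarity.new_zeros(())
    # Negative term: similarity should be near zero on confident background.
    loss_neg = similarity[neg].mean() if neg.any() else similarity.new_zeros(())
    # Ambiguous pixels contribute nothing, so boundary uncertainty is never penalized.
    return loss_pos + loss_neg

# Toy usage
sim = torch.rand(64, 64)
pna = torch.randint(-1, 2, (64, 64)).float()   # random -1/0/1 labels for demonstration
print(rrc_loss(sim, pna))
```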
Final Loss Function: The total training objective combines the standard counting regression loss (\(\mathcal{L}_{reg}\)) with this new coherence loss.

Experiments and Results
The authors evaluated T2ICount on the standard FSC-147 dataset, which contains 147 object categories. However, they identified a major flaw in this benchmark: Bias.
In FSC-147, the text prompt almost always refers to the majority class in the image. If there are 100 birds and 1 dog, the prompt is usually “birds.” This doesn’t test if the model is listening to the text; it only tests if the model can count the most obvious repeating pattern.
Introducing FSC-147-S
To fix this, the authors curated a new subset called FSC-147-S. In this dataset, they specifically annotated the minority classes. For example, in an image with many dominant objects, the prompt might ask for a less frequent object. This is a much harder, “contra-bias” test.
Quantitative Performance
1. Performance on Standard FSC-147
Even on the standard dataset, T2ICount outperforms existing methods.

In Table 1, you can see T2ICount achieves the lowest Mean Absolute Error (MAE) of 11.76 on the test set, significantly beating the previous best, CounTX (15.88).
2. Performance on the Harder FSC-147-S
The real strength of the model appears when tested on the new, unbiased dataset.

Table 2 shows a dramatic difference.
- CLIP-Count: MAE 48.42
- VLCounter: MAE 35.24
- T2ICount: MAE 4.69
Relative to the next-best method (VLCounter), this is an error reduction of more than 85%. While other models largely fail to count the minority class (most likely counting the majority class instead), T2ICount successfully adheres to the text description.
3. Generalization (CARPK Dataset)
To ensure the model isn't just memorizing FSC-147, the authors also tested it on CARPK, a completely different dataset of cars in parking lots.

Table 3 shows T2ICount generalizes well, achieving comparable or better performance than methods specifically designed or tuned for car counting, despite being a zero-shot model.
Qualitative Analysis
The visual results confirm the numbers. Let’s compare T2ICount against VLCounter in Figure 4.

- Strawberries (Top): VLCounter (21.6) undercounts significantly. T2ICount (40.2) is very close to the Ground Truth (38.0). The similarity map (right column) shows T2ICount clearly activating on the red berries.
- Bottles (Middle): This is a difficult scene with glare. T2ICount nails the count (2.0).
- Dogs (Bottom): VLCounter sees patterns in the windows and thinks they are dogs (prediction 10.4). T2ICount correctly identifies the single dog (prediction 1.2).
We can see further examples of the model’s precision in Figure 5.

The thermal-like heatmaps on the right are the Text-Image Similarity Maps. Notice how clean they are.
- In the Blueberries example (top left), the map highlights the dark berries and ignores the red strawberries next to them. This is exactly the “cross-modal understanding” the paper aims for.
- In the Sheep example (top right), the model picks out small white dots against a complex background.
Ablation Study
Does every part of the model matter? The authors performed an ablation study (removing parts of the model to see what breaks).

Table 4 reveals:
- Baseline (B): Just using diffusion features is okay for standard tasks but terrible for the hard FSC-147-S dataset (MAE 24.34).
- B + \(\mathcal{L}_{RRC}\): Adding the coherence loss improves the hard dataset score massively (MAE drops to 9.59). This proves the supervision from the attention maps is crucial.
- Full Model (+ HSCM): Adding the hierarchical correction brings the error down to 4.69. Both components are essential for high precision.
Conclusion and Implications
The “T2ICount” paper marks a significant step forward in the maturity of Zero-Shot Counting. It moves beyond the “naive” application of vision-language models like CLIP and demonstrates that the generative priors inside Diffusion Models are remarkably powerful for discriminative tasks like counting.
Key Takeaways:
- Diffusion > CLIP for Pixel Tasks: Diffusion models understand local pixel semantics better than CLIP, which is biased towards global summaries.
- Correction is Key: Single-step feature extraction is efficient but noisy. The Hierarchical Semantic Correction Module effectively bridges the gap between text and visual features.
- Smart Supervision: The \(\mathcal{L}_{RRC}\) loss cleverly uses the model’s own internal attention maps to create “free” background masks, solving the problem of weak supervision in point-annotated datasets.
- Benchmarking Matters: The introduction of FSC-147-S exposes a hidden bias in previous research, showing that true “text-guided” counting requires testing on minority classes.
For students and researchers, this paper serves as an excellent example of how to repurpose generative models (Stable Diffusion) for perception tasks. It suggests that the boundaries between “generating” an image and “understanding” an image are becoming increasingly blurred. As we look to the future, we can expect more discriminative tasks (segmentation, detection, counting) to be solved by unlocking the latent knowledge within generative foundation models.