Imagine looking at a photograph of a packed stadium or a bustling city square. Your task is to count every single person. In computer vision, this is the task of Crowd Counting, and it is critical for urban planning, safety monitoring, and traffic control.

Deep learning has made massive strides in this field. However, there is a bottleneck: data annotation. To train a model to count people, humans currently have to manually place a dot on the head of every single person in thousands of training images. In a dense crowd, a single image might contain thousands of people. The labor cost is astronomical.

This brings us to Semi-Supervised Learning (SSL)—a technique where we train a model using a small set of labeled data and a massive set of unlabeled data. It sounds like the perfect solution, but as researchers from the City University of Hong Kong discovered, standard point-based counting methods crash and burn when applied to SSL.

In this post, we will dive deep into their paper, “Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting.” We will explore why the popular Point-to-Point (P2P) method fails in semi-supervised settings, visualize the failure using a novel tool called PSAM, and understand their solution: a Point-to-Region (P2R) framework that not only stabilizes training but also speeds it up significantly.


1. The Landscape of Crowd Counting

Before understanding the innovation, we need to understand the status quo. Modern crowd counting generally falls into two categories:

  1. Density Map Regression: The model predicts a heat map where the sum of pixel values equals the total count.
  2. Point Detection: The model predicts exact coordinates (points) for each person.

Point detection is often preferred because it gives us precise localization—we know exactly where each person is, not just a vague density blob. A leading approach in this domain is P2PNet, which uses a Point-to-Point (P2P) matching strategy.
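To make the distinction concrete, here is a toy contrast between the two output representations; the array sizes and values are invented purely for illustration:

```python
import numpy as np

# Density-map regression: the count is the sum of the predicted map.
density_map = np.zeros((64, 64))
density_map[30:33, 40:43] = 1.0 / 9       # one person "smeared" over a 3x3 blob
count_density = density_map.sum()          # 1.0, but localization is fuzzy

# Point detection: the count is the number of predicted coordinates,
# and each person gets an exact (x, y) location.
points = np.array([[41.0, 31.0]])
count_points = len(points)

print(count_density, count_points)         # 1.0 1
```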

The P2P Workflow

In a fully supervised setting (where we have all the labels), P2P works by using the Hungarian Algorithm. It looks at the points predicted by the model and attempts to perform a one-to-one match with the ground truth points (the human annotations) to minimize the distance error.

If a predicted point matches a ground truth point, it is trained as a positive sample. If it doesn’t match, it is trained as background.
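To see what this matching looks like in practice, here is a minimal sketch using `scipy.optimize.linear_sum_assignment` (SciPy's Hungarian-algorithm solver). The points are made up, and the cost here is pure Euclidean distance, whereas the real P2P cost also factors in classification confidence:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative data: 5 predicted points vs. 3 ground-truth head annotations.
pred_points = np.array([[10.0, 12.0], [48.0, 50.0], [51.0, 49.0],
                        [90.0, 88.0], [200.0, 30.0]])
gt_points = np.array([[11.0, 11.0], [50.0, 50.0], [91.0, 90.0]])

# Pairwise Euclidean distances form the matching cost.
cost = np.linalg.norm(pred_points[:, None, :] - gt_points[None, :, :], axis=-1)

# Hungarian algorithm: one-to-one assignment minimizing total distance.
row_idx, col_idx = linear_sum_assignment(cost)

matched = dict(zip(row_idx, col_idx))
for i in range(len(pred_points)):
    if i in matched:
        print(f"pred {i} -> gt {matched[i]}: trained as positive")
    else:
        print(f"pred {i}: unmatched, trained as background")
```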

The Semi-Supervised Dream

The researchers aimed to adapt this point-based approach to a semi-supervised framework. The standard pipeline for SSL involves a Teacher-Student architecture.

Figure 1. The workflow of semi-supervised point-based counting methods. The teacher model generates pseudo labels by extracting the foreground pixels, while the student model takes the corresponding strongly augmented image as input to construct the computation graph. The training loss between the pseudo label and the student’s prediction involves two steps: the proposed P2R matching and the weighted cross-entropy computation.

As shown in Figure 1 above:

  1. The Teacher (pretrained on a little bit of labeled data) looks at an unlabeled image and makes a prediction.
  2. These predictions become Pseudo-Labels.
  3. The Student looks at a strongly augmented version of the same image and tries to match the Teacher’s pseudo-labels.
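A minimal sketch of one training step in this pipeline is shown below (PyTorch). The tiny model, the identity/noise augmentations, and the plain binary cross-entropy are stand-ins chosen to keep the example self-contained; the paper's actual loss involves matching and confidence weighting, which we get to next:

```python
import copy
import torch
import torch.nn as nn

# Stand-in counting model: a per-pixel "person" confidence map.
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 1), nn.Sigmoid())
teacher = copy.deepcopy(student)            # teacher starts as a copy
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

def ema_update(momentum=0.99):
    # Teacher weights track an exponential moving average of the student's
    # (a common choice in teacher-student SSL; the momentum is illustrative).
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1 - momentum)

unlabeled = torch.rand(1, 3, 64, 64)
weak = unlabeled                                         # weak aug: identity (toy)
strong = unlabeled + 0.1 * torch.randn_like(unlabeled)   # strong aug: noise (toy)

# 1. Teacher predicts on the weak view; thresholding yields pseudo-labels.
with torch.no_grad():
    pseudo = (teacher(weak) > 0.5).float()

# 2. Student predicts on the strong view of the same image.
pred = student(strong)

# 3. Train the student to match the teacher's pseudo-labels.
loss = nn.functional.binary_cross_entropy(pred, pseudo)
optimizer.zero_grad(); loss.backward(); optimizer.step()
ema_update()
print(loss.item())
```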

It seems straightforward. But when the researchers tried to plug the standard P2P matching into this pipeline, the training collapsed.


2. The Collapse: Why P2P Fails in Semi-Supervised Learning

To understand the failure, we have to look at the loss functions. In a supervised setting, the simplified P2P model consists of a feature extractor \(\mathcal{F}\) and a decoder \(\mathcal{D}\) that, given an input image \(\mathbf{I}\), outputs a set of points \(\mathcal{P}\):

Equation 2:

\[\mathcal{P} = \mathcal{D}\big(\mathcal{F}(\mathbf{I})\big)\]

The model is trained using a loss function \(\mathcal{L}_l\) that balances positive matches and negative background pixels. In simplified form, it is a cross-entropy over all pixels, where \(\hat{p}_i\) is the predicted confidence at pixel \(i\), \(y_i \in \{0, 1\}\) marks whether pixel \(i\) was matched to a ground-truth point, and \(\lambda\) balances the foreground and background terms:

Equation 4:

\[\mathcal{L}_l = -\sum_{i} \Big[\, y_i \log \hat{p}_i + \lambda\,(1 - y_i) \log\big(1 - \hat{p}_i\big) \Big]\]

However, in Semi-Supervised Learning, we deal with uncertainty. The Teacher model isn’t perfect. Therefore, we filter the Teacher’s pseudo-labels using a confidence threshold \(\tau\) (typically set to 0.5 or higher). We only trust the points the Teacher is “sure” about.

The set of pseudo-labels \(\mathcal{P}'_t\) keeps only the teacher predictions \(\mathcal{P}_t\) whose confidence score \(s_p\) clears the threshold:

Equation 5:

\[\mathcal{P}'_t = \big\{\, p \in \mathcal{P}_t \;\big|\; s_p > \tau \,\big\}\]

Here lies the trap. In P2P matching, we use the Hungarian algorithm to match student predictions to these filtered pseudo-labels. We generate a confidence mask \(\mathbf{Z}\) based on whether the teacher’s prediction was confident enough:

Equation 6:

\[\mathbf{Z}_i = \begin{cases} 1, & \text{if prediction } i \text{ is matched to a point in } \mathcal{P}'_t \\ 0, & \text{otherwise} \end{cases}\]
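In code, the filtering and the resulting mask might look like this NumPy sketch, with invented teacher outputs:

```python
import numpy as np

tau = 0.5  # confidence threshold

# Teacher outputs: candidate head points and their confidence scores.
teacher_points = np.array([[50.0, 50.0], [91.0, 90.0], [120.0, 40.0]])
teacher_scores = np.array([0.92, 0.81, 0.35])

# Equation 5: keep only the points the teacher is confident about.
keep = teacher_scores > tau
pseudo_labels = teacher_points[keep]

# Equation 6 (schematically): the mask is nonzero only at confident,
# matched foreground points -- everything else carries no weight.
Z = keep.astype(float)
print(pseudo_labels)   # [[50. 50.] [91. 90.]]
print(Z)               # [1. 1. 0.]
```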

When we compute the unsupervised loss \(\mathcal{L}_u\) for the unlabeled data, it is the same weighted cross-entropy, now gated by the confidence mask:

Equation 7:

\[\mathcal{L}_u = -\sum_{i} \mathbf{Z}_i \Big[\, y_i \log \hat{p}_i + \lambda\,(1 - y_i) \log\big(1 - \hat{p}_i\big) \Big]\]

The Critical Flaw: In P2P matching, if a pixel is not matched to a high-confidence pseudo-label, it is ignored. The confidence scores are only calculated for the foreground (the matched points). There is no mechanism to propagate confidence to the background.

Mathematically, the second term of the loss equation (which is supposed to supervise the background pixels) effectively becomes zero, because \(\mathbf{Z}_i\) is nonzero only where \(y_i = 1\):

Equation 8:

\[\lambda \sum_{i} \mathbf{Z}_i \,(1 - y_i) \log\big(1 - \hat{p}_i\big) = 0\]

Because the background loss term vanishes, the model receives no supervision for the background. It is only told “this specific pixel is a person,” but it is never told “the pixels next to it are NOT a person.”
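A toy computation makes the failure concrete. With the weights drawn from \(\mathbf{Z}\), every unmatched pixel gets weight zero, so the background term contributes nothing to the loss or its gradients (all numbers below are invented for illustration):

```python
import numpy as np

# Per-pixel predicted "person" probability on a tiny strip of 4 pixels.
probs = np.array([0.9, 0.8, 0.7, 0.6])   # the model over-predicts people
y = np.array([1, 0, 0, 0])               # only pixel 0 matched a pseudo-label
Z = y.astype(float)                       # P2P: confidence exists only at matches

eps = 1e-8
# Foreground term: matched pixels are pushed toward "person".
fg_loss = -(Z * y * np.log(probs + eps)).sum()
# Background term: weighted by Z * (1 - y), which is identically zero --
# this is Equation 8 in action.
bg_loss = -(Z * (1 - y) * np.log(1 - probs + eps)).sum()

print(fg_loss)   # > 0: the foreground is supervised
print(bg_loss)   # 0.0: the background receives no supervision at all
```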

As a result, the model starts to “hallucinate.” It predicts people everywhere in the local region surrounding a real person.


3. Visualizing the Failure: Point-Specific Activation Maps (PSAM)

The researchers didn’t just theoretically identify this issue; they visualized it. They developed a new visualization method called Point-Specific Activation Map (PSAM).

Unlike standard activation maps (like Grad-CAM) that show global attention for a classification task, PSAM is designed to show the receptive field and activation intensity for individual predicted points.

Figure 2. The generation process of PSAM.

The PSAM is computed by calculating the gradient of a specific point’s score with respect to the feature map. This tells us exactly which pixels in the image feature map contributed to predicting a person at that specific location.
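Conceptually the computation resembles Grad-CAM, but it is taken per predicted point rather than per class. Here is a minimal PyTorch sketch of the idea; the toy two-layer network and the choice of the most confident pixel as the query point are assumptions for illustration, not the paper’s exact recipe:

```python
import torch
import torch.nn as nn

# Toy feature extractor and per-pixel "person" score head (stand-ins).
features = nn.Conv2d(3, 16, 3, padding=1)
score_head = nn.Conv2d(16, 1, 1)

img = torch.randn(1, 3, 64, 64)
feat = features(img)
feat.retain_grad()                       # we need d(score) / d(feature map)
scores = torch.sigmoid(score_head(feat))[0, 0]

# Pick one predicted point (here: simply the most confident pixel).
y, x = divmod(scores.argmax().item(), scores.shape[1])

# Gradient of that single point's score w.r.t. the whole feature map.
scores[y, x].backward()

# PSAM-style map: aggregate gradient magnitude across channels to see which
# feature-map locations contributed to this one prediction.
psam = feat.grad[0].abs().sum(dim=0)
print(psam.shape, (y, x))
```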

The Observation

When comparing a model trained with only labeled data (Model-L) against a model trained with the broken semi-supervised P2P method (Model-U), the difference is stark.

Figure 3. Observations in PSAM. (a) The training process. (b) Comparing sorted values of PSAMs. (c) & (d) Visualizing the average of local PSAM, and (e) & (f) the aggregated PSAM to compare model-L and model-U from a global perspective.

Look at Figure 3 above:

  • (a) The validation error (MSE/MAE) spikes after epoch 100, which is exactly when the unlabeled data is introduced.
  • (c) vs (d): The local PSAM for the supervised model (c) is tight and focused. The semi-supervised model (d) is diffused and blurry.
  • (e) vs (f): Globally, the semi-supervised model (f) is “over-activated.”

The Diagnosis: Because the background loss was zero, the model never learned to suppress the features immediately surrounding a person. The features “leaked” into the background, causing the decoder to misinterpret the area around a person as more people.


4. The Solution: Point-to-Region (P2R) Matching

To fix this, the researchers proposed moving away from the rigid one-to-one matching of P2P. Instead, they introduced Point-to-Region (P2R) matching.

The core philosophy of P2R is simple: A pedestrian is not a single pixel; they occupy a region.

How P2R Works

Instead of using the Hungarian algorithm to find a single matching pixel, P2R defines a “dominated zone” or a local region for each ground truth (or pseudo-label) point.

Figure 4. Difference between P2P and P2R matching.

As illustrated in Figure 4:

  • (a) P2P Scheme: Focuses only on connecting specific points. If a connection isn’t made, the data is ignored.
  • (d) P2R Scheme: Segments the area around a point. It treats the relationship as a “region.”

Step 1: Region Definition

First, the algorithm identifies the nearest neighbors and defines a region mask \(\beta\). A pixel \(x_i\) is considered part of a region if it is within a certain radius \(\mu\) of a ground truth point \(p_j\):

Equation 21:

\[\beta_i = \begin{cases} 1, & \text{if } \min_j \|x_i - p_j\|_2 \le \mu \\ 0, & \text{otherwise} \end{cases}\]

Step 2: Matching Matrix

The matching matrix \(\mathbf{M}\) is no longer the result of bipartite graph matching. It is constructed by assigning each in-region pixel to its nearest ground truth point, effectively creating a Voronoi-like segmentation constrained to the local region:

Equation 18:

\[\mathbf{M}_{ij} = \begin{cases} 1, & \text{if } \beta_i = 1 \text{ and } j = \arg\min_k \|x_i - p_k\|_2 \\ 0, & \text{otherwise} \end{cases}\]

Step 3: Selecting the Representative Pixel

Within this defined region, we still want to find the best pixel to represent the person’s head. We calculate a cost matrix \(\mathbf{C}\) based on spatial distance and the model’s current confidence score:

Equation 22

The pixel with the minimum cost in that region is selected as the training target (the “one” in the one-hot label), while the others in the region help define the context.
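Putting the three steps together, a NumPy sketch of P2R matching might look like the following; the radius \(\mu\), the distance/confidence weighting in the cost, and the grid size are illustrative choices rather than the paper’s hyperparameters:

```python
import numpy as np

def p2r_match(gt_points, conf_map, mu=8.0, dist_weight=0.1):
    """Toy P2R matching on an H x W confidence map.
    Returns the region mask beta, the nearest-GT assignment, and the
    chosen (minimum-cost) pixel index for each ground-truth point."""
    H, W = conf_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pixels = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)

    # Distance from every pixel to every GT point: shape (H*W, K).
    d = np.linalg.norm(pixels[:, None, :] - gt_points[None, :, :], axis=-1)

    # Step 1: region mask -- pixels within radius mu of some GT point.
    nearest = d.argmin(axis=1)
    beta = d.min(axis=1) <= mu

    # Step 2: Voronoi-like assignment -- each in-region pixel belongs to
    # its nearest GT point (the beta & (nearest == k) test below).

    # Step 3: per region, pick the minimum-cost pixel as the one-hot
    # positive target; the cost mixes distance and current confidence.
    cost = dist_weight * d[np.arange(len(pixels)), nearest] - conf_map.reshape(-1)
    chosen = []
    for k in range(len(gt_points)):
        in_region = np.where(beta & (nearest == k))[0]
        chosen.append(int(in_region[cost[in_region].argmin()]))
    return beta.reshape(H, W), nearest.reshape(H, W), chosen

# Tiny example: a 32x32 map, two annotated heads, random confidences.
rng = np.random.default_rng(0)
beta, nearest, chosen = p2r_match(
    gt_points=np.array([[8.0, 8.0], [24.0, 20.0]]),
    conf_map=rng.random((32, 32)))
print(beta.sum(), chosen)   # region size and the two selected pixel indices
```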

Why P2R Solves the Semi-Supervised Problem

The magic of P2R lies in how it handles confidence propagation for the unlabeled data.

In the P2R scheme, we can define a confidence matrix \(\mathbf{Z}\) that includes the background:

Equation 24:

\[\mathbf{Z}_i = \zeta_i \, \beta_i + (1 - \beta_i)\]

This equation says:

  1. For the Foreground (the matched region): Trust the pseudo-label if the teacher’s confidence is high (\(\zeta\)).
  2. For the Background (pixels far away from any pseudo-point, represented by \(1-\beta\)): We can implicitly trust that these are not people.

In P2P, if a point was low-confidence, we ignored it and its surroundings. In P2R, even if we are unsure about the specific center point, we know the surrounding vast empty space is definitely background. This brings back the background supervision term that was missing in Equation 8.
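Continuing the toy example, one way to build such a per-pixel confidence weight is sketched below: in-region pixels are weighted by whether their pseudo-point cleared the threshold, and everything outside all regions keeps full weight as background. This is a schematic reading of Equation 24, not a verbatim reproduction:

```python
import numpy as np

def p2r_confidence(beta, nearest, teacher_scores, tau=0.5):
    """Toy per-pixel weights for the unsupervised P2R loss.
    beta: (H, W) bool in-region mask; nearest: (H, W) nearest pseudo-point
    index; teacher_scores: (K,) teacher confidence per pseudo-point."""
    # Foreground: weight in-region pixels by their pseudo-point's confidence.
    fg_weight = (teacher_scores > tau).astype(float)[nearest] * beta
    # Background: pixels outside every region are trusted as "not a person"
    # -- exactly the term that vanished under P2P matching.
    bg_weight = (~beta).astype(float)
    return fg_weight + bg_weight

beta = np.array([[True, True, False],
                 [True, True, False]])
nearest = np.array([[0, 0, 0],
                    [1, 1, 1]])
Z = p2r_confidence(beta, nearest, teacher_scores=np.array([0.9, 0.4]))
print(Z)
# [[1. 1. 1.]     row 0: confident region (0.9 > tau) plus background
#  [0. 0. 1.]]    row 1: low-confidence region is ignored, background kept
```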

The full computation flow of the unsupervised loss \(\mathcal{L}_u\) in P2R is visualized below:

Figure 8. Computation of \(\mathcal{L}_u\) in P2R.

Bonus: Efficiency

A major side benefit of P2R is speed. The P2P method relies on the Hungarian algorithm, which has a complexity of \(O(N^3)\). P2R relies on simple matrix operations (finding the minimum value in a region), which is highly parallelizable on GPUs.

As noted in the paper, for an image with ~800 points:

  • P2P matching: ~0.43 seconds
  • P2R matching: ~0.006 seconds (~68x faster)
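You can get a feel for the gap with a quick benchmark; the absolute timings will differ from the paper’s setup (and by machine), but the asymptotic gap between Hungarian matching and a vectorized nearest-point reduction is easy to reproduce:

```python
import time
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N = 800                                     # roughly the paper's test size
preds = rng.random((4 * N, 2)) * 1000       # dense candidate points
gts = rng.random((N, 2)) * 1000

cost = np.linalg.norm(preds[:, None, :] - gts[None, :, :], axis=-1)

t0 = time.perf_counter()
linear_sum_assignment(cost)                 # P2P-style: O(N^3) matching
t1 = time.perf_counter()
cost.argmin(axis=1)                         # P2R-style: nearest-point reduction
t2 = time.perf_counter()

print(f"Hungarian: {t1 - t0:.4f}s   nearest-point: {t2 - t1:.6f}s")
```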

5. Experimental Results

The researchers tested P2R on standard crowd counting benchmarks: ShanghaiTech A/B, UCF-QNRF, and JHU-Crowd++, under protocols with 5%, 10%, and 40% labeled data.

Semi-Supervised Performance

Table 1. Comparison with other recent methods on four benchmark datasets under different labeled protocols.

The results in Table 1 are compelling:

  • Dominance: P2R outperforms state-of-the-art methods (like OT-M and DAC) across almost all settings.
  • Label Efficiency: Remarkably, P2R trained on just 5% of labeled data often outperforms older fully-supervised methods and rivals other semi-supervised methods that use 10% data.
  • Fully Supervised: Even when 100% of labels are available (bottom rows), P2R outperforms the original P2PNet, proving that the “Region” concept is superior to strict “Point” matching even without the semi-supervised aspect.

Visual Confirmation

Does the model actually produce better density maps? Yes.

Figure 5. Qualitative comparison with other models.

In Figure 5, you can see that P2R (last column) produces sharper, more accurate density estimations compared to DAC and OT-M, especially in very dense areas.

We can also look at the specific predictions on challenging scenes from the UCF-QNRF dataset:

Figure 10. Visualization of P2R’s Prediction (#1).

Figure 11. Visualization of P2R’s Prediction (#2).

The model effectively handles varied scales—from people in the distance to large crowds in the foreground—and correctly ignores complex background structures.

PSAM Check: Did we fix the “Hallucinations”?

Recall the “over-activation” problem seen in the P2P semi-supervised model. Let’s look at the PSAM for the P2R-trained model (Model-U_P2R, shown in orange/red below).

Figure 7. The comparison of PSAMs among different training schemes.

In Figure 7(a), the P2R model (orange line) has a sharp drop-off in activation similar to the supervised model (green), unlike the broken P2P model (blue) which stays high. The heatmaps in (b) confirm this: the P2R activations are tight and focused on the actual pedestrians, effectively suppressing the surrounding noise.


6. Conclusion: From Points to Regions

The paper “Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting” identifies a subtle but devastating flaw in how point-based detection is applied to semi-supervised learning. By treating pedestrian detection as a strict point-matching problem, traditional losses fail to supervise the background when using pseudo-labels, leading to feature over-activation.

The P2R solution is elegant because it aligns better with the visual reality:

  1. Objects occupy space (regions), not just single pixels.
  2. Supervising the background (knowing where people aren’t) is just as important as supervising the foreground.

By defining a region for supervision, P2R allows confidence to propagate correctly, stabilizes semi-supervised training, and offers a massive speed boost by ditching the Hungarian algorithm. This method sets a new standard for training high-performance crowd counters with minimal human annotation.