Introduction
In the evolving landscape of intelligent surveillance, we are witnessing a convergence of two distinct worlds: the ground and the sky. Traditional security systems rely heavily on CCTV cameras fixed at eye level or slightly above. However, the rapid proliferation of Unmanned Aerial Vehicles (UAVs), or drones, has introduced a new vantage point. Combining the two offers comprehensive coverage, but it also creates a difficult matching problem known as Aerial-Ground Person Re-Identification (AGPReID).
The core problem is deceptively simple to state but incredibly hard to solve: How do you teach a computer that the person seen from a steep, top-down angle by a drone is the same person seen from a frontal view by a ground camera?

As illustrated in Figure 1, the visual appearance of the same individual changes drastically between these two views. Distinctive features visible from the ground (like a graphic on a t-shirt or facial features) might be invisible from the air, where only the top of the head, shoulders, and shoes are visible.
In this post, we take a deep dive into a research paper that proposes a novel solution to this problem: SeCap (Self-Calibrating and Adaptive Prompts). The researchers behind SeCap argue that traditional methods fail because they are too rigid. Instead, they propose a system that adapts its internal search criteria—or “prompts”—on the fly, depending on the specific viewpoint of the input image. We will explore how SeCap leverages the power of Vision Transformers (ViT) and prompt learning to bridge the gap between aerial and ground camera networks.
Background: The AGPReID Challenge
To understand why SeCap is necessary, we first need to look at why existing Person ReID methods struggle in this domain.
The Limitations of Traditional ReID
Standard Person Re-Identification focuses on matching images across different cameras that usually share similar viewpoints (e.g., matching a person walking from one hallway camera to another). When you introduce drones, the domain gap widens significantly.
Previous attempts to solve AGPReID generally fall into two categories:
- Attribute-based methods: These try to identify specific attributes (e.g., “red shirt,” “backpack”). However, attributes are often occluded in aerial views.
- View Decoupling: These methods try to mathematically separate “view” features (the camera angle) from “identity” features (the person). While promising, they often struggle with the sheer diversity of viewpoints. A drone at 20 meters looks different than a drone at 60 meters, and a CCTV at 3 meters looks different than one at 10 meters.
Existing decoupling methods often neglect local features. Because drones capture images from steep angles, body parts are often foreshortened or self-occluded. A system that only looks at the “global” picture might miss the subtle, view-invariant details required for a match.
The SeCap Framework
The SeCap method introduces a dynamic approach. Instead of using a static model to process every image, it uses adaptive prompts. In the context of Vision Transformers (ViT), a “prompt” is a learnable vector that gives the model a hint or a context about what to look for.
SeCap is built on an Encoder-Decoder Transformer architecture.
- Encoder: Extracts visual features and decouples the viewpoint.
- Decoder: Refines these features using adaptive prompts to focus on details that remain constant across different views.
Let’s visualize the complete architecture:

As shown in Figure 2(a), the input image is processed, split into view-invariant and view-specific features, and then passed through a decoding stage involving two critical innovations: the Prompt Re-calibration Module (PRM) and the Local Feature Refinement Module (LFRM).
Let’s break down the mathematical flow of the framework:

Here, VDT represents the View Decoupling Transformer (the encoder). The system produces Cls (Class/Identity) tokens and View tokens. By subtracting the View token from the Class token, the model attempts to isolate \(X_{inv}\) (the view-invariant features). These invariant features are then used to drive the decoder.
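Since the original equation image is not reproduced here, a minimal reconstruction of this flow, using notation inferred from the description above rather than copied from the paper, might read:

\[
[X_{cls},\ X_{view},\ X_{patch}] = \mathrm{VDT}(I), \qquad X_{inv} = X_{cls} - X_{view},
\]

with \(X_{inv}\) and the patch features \(X_{patch}\) then feeding the decoding stage.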
1. The Encoder: View Decoupling Transformer (VDT)
The backbone of SeCap is a Vision Transformer (ViT). However, unlike a standard ViT that just outputs a classification token, this encoder is designed to explicitly separate the “who” from the “where.”
It introduces a View Token. As the image passes through the layers of the transformer, the model is trained to push information about the camera angle into the View Token and information about the person’s identity into the Class Token. This hierarchical decoupling ensures that when we reach the decoder, we have a clean set of features to work with.
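To make the idea concrete, here is a minimal PyTorch-style sketch of an encoder that carries an extra view token alongside the class token. The class name SimpleVDT, the layer choices, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleVDT(nn.Module):
    """Illustrative sketch of a ViT-style encoder with an extra learnable view token."""
    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_patches=196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # identity token
        self.view_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # viewpoint token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_embeddings):
        # patch_embeddings: (B, num_patches, embed_dim), e.g. from a conv patchifier
        B = patch_embeddings.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        view = self.view_token.expand(B, -1, -1)
        x = torch.cat([cls, view, patch_embeddings], dim=1) + self.pos_embed
        x = self.blocks(x)
        x_cls, x_view, x_patch = x[:, 0], x[:, 1], x[:, 2:]
        x_inv = x_cls - x_view  # view-invariant features, as described above
        return x_cls, x_view, x_inv, x_patch
```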
2. The Prompt Re-calibration Module (PRM)
This is where the “Self-Calibrating” part of the name comes in. In standard prompt learning, prompts are fixed, learnable vectors. In SeCap, the prompts need to change based on the input image.
The PRM functions as a bridge. It takes a set of initialized prompts and “re-calibrates” them using the view-invariant features extracted by the encoder.

As described in the equation above:
- Cross-Attention (CA): The prompts attend to the view-invariant features (\(X_{inv}\)). This aligns the generic prompts with the specific content of the current image.
- Self-Attention (SA): The prompts interact with each other to integrate information.
- Feed-Forward Network (FFN): A final processing step to produce the re-calibrated prompts (\(P_{re}\)).
This process ensures that the prompts are not just looking for generic human features, but are specifically tuned to look for the identity features present in the current image, regardless of the view.
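A rough PyTorch sketch of this cross-attention → self-attention → FFN pipeline is shown below; the module name PromptRecalibration and its hyperparameters are placeholders of mine, under the assumption that prompts and features share the embedding dimension.

```python
import torch
import torch.nn as nn

class PromptRecalibration(nn.Module):
    """Sketch of a PRM-style block: prompts attend to the view-invariant
    features, then to each other, then pass through a feed-forward network."""
    def __init__(self, dim=768, num_heads=8, num_prompts=64):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))  # initialized prompts
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x_inv):
        # x_inv: (B, 1, dim) view-invariant feature from the encoder
        p = self.prompts.expand(x_inv.size(0), -1, -1)
        # CA: align the generic prompts with the content of the current image
        p = self.norm1(p + self.cross_attn(p, x_inv, x_inv)[0])
        # SA: let the prompts integrate information with each other
        p = self.norm2(p + self.self_attn(p, p, p)[0])
        # FFN: produce the re-calibrated prompts P_re
        return self.norm3(p + self.ffn(p))
```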
3. The Local Feature Refinement Module (LFRM)
Once the prompts are calibrated, they are used to dig into the local features of the image. Remember, in aerial views, we might lose global shape, so local details (like the texture of hair or the pattern on a shoe) become critical.
The LFRM uses a Two-Way Attention mechanism. It’s not enough for the prompt to look at the image; the image features must also update based on the prompt.

In this two-step process:
- Prompt-to-Image: The prompts gather information from the local image features (\(F_{local}\)).
- Image-to-Prompt: The image features update themselves based on the refined prompts.
This bi-directional flow allows the model to “refine” the local features, effectively highlighting the parts of the image that are relevant for identification while suppressing the noise caused by the camera angle.
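Sketched in the same PyTorch style as above, one such two-way refinement step could look like the following; again, this is an illustrative sketch rather than the paper's code.

```python
import torch.nn as nn

class TwoWayAttention(nn.Module):
    """Sketch of one LFRM-style step: prompts read from local features,
    then local features are updated from the refined prompts."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.prompt_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_prompt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_p = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)

    def forward(self, p_re, f_local):
        # p_re: (B, num_prompts, dim) re-calibrated prompts from the PRM
        # f_local: (B, num_patches, dim) local patch features from the encoder
        # Prompt-to-Image: prompts gather evidence from the local image features
        p = self.norm_p(p_re + self.prompt_to_image(p_re, f_local, f_local)[0])
        # Image-to-Prompt: local features are refined based on the updated prompts
        f_refined = self.norm_f(f_local + self.image_to_prompt(f_local, p, p)[0])
        return p, f_refined
```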

Finally, the system fuses these refined local features with the global output to create a comprehensive representation of the person.
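Continuing the notation of the sketches above, the fusion step could be as simple as pooling the refined local features and concatenating them with the global feature; the exact fusion used in the paper may differ.

```python
import torch

# f_refined: (B, num_patches, dim) from the LFRM sketch; x_cls: (B, dim) global feature.
f_local_pooled = f_refined.mean(dim=1)           # aggregate the refined local features
f_final = torch.cat([x_cls, f_local_pooled], 1)  # (B, 2 * dim) final person representation
```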

Optimization: Teaching the Model
How does SeCap learn to perform these complex tasks? The researchers employ a multi-faceted loss function.
1. View Classification Loss: To ensure the “View Token” actually captures viewpoint information, the model is trained to predict the camera view (e.g., aerial vs. ground).
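The equation itself appears as an image in the original post; a standard cross-entropy formulation over view labels, given here as a plausible reconstruction, reads:

\[
\mathcal{L}_{view} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,
\]

where \(\hat{y}\) is the view prediction computed from the View Token and \(y\) is the one-hot ground-truth label over the \(C\) camera views.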

2. Orthogonality Loss: This is a crucial regularization technique. We want the “Identity” features and the “View” features to be distinct. If they overlap, the model hasn’t truly decoupled them. This loss function forces the vector representations of the view and the identity to be orthogonal (mathematically perpendicular), minimizing their correlation.
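One common way to write such a constraint, given here as a plausible reconstruction rather than the paper's exact formula, is to penalize the cosine similarity between the two representations:

\[
\mathcal{L}_{orth} = \left\lvert \frac{X_{cls} \cdot X_{view}}{\lVert X_{cls} \rVert \, \lVert X_{view} \rVert} \right\rvert .
\]

Driving this term toward zero pushes the identity and view vectors toward perpendicularity.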

3. Overall Loss: The final training objective combines the standard ID classification and Triplet losses (common in ReID tasks) with the specific view and orthogonality constraints.

Here, \(\lambda\) (lambda) is a hyperparameter that balances the importance of the view-specific losses against the standard identification losses.
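Assembling the pieces, the overall objective plausibly takes a form along these lines (the exact terms and weighting in the paper may differ):

\[
\mathcal{L} = \mathcal{L}_{ID} + \mathcal{L}_{tri} + \lambda \left( \mathcal{L}_{view} + \mathcal{L}_{orth} \right).
\]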
New Benchmarks: LAGPeR and G2APS-ReID
One of the significant hurdles in AGPReID research has been the lack of high-quality, large-scale datasets. Existing datasets were either too small or synthetically generated. To address this, the authors contributed two new datasets.

LAGPeR (Large-scale Aerial-Ground Person Re-Identification)
This dataset was independently collected and annotated by the authors. It includes:
- 4,231 Identities
- 63,841 Images
- 21 Cameras (7 drones, 14 ground)
It covers various real-world conditions including different lighting (day/night), weather (sunny/rainy), and extensive occlusion.

G2APS-ReID
This dataset was reconstructed from a person search dataset (G2APS). The authors re-partitioned it to make it suitable for the ReID task, resulting in a massive collection of 200,864 images covering 2,788 identities.
The introduction of these datasets provides a more rigorous testing ground for cross-view algorithms, moving the field away from synthetic data toward real-world applicability. The evaluation protocol tests both Aerial-to-Ground (\(A \rightarrow G\)) and Ground-to-Aerial (\(G \rightarrow A\)) retrieval.

Experiments and Results
Does SeCap actually work? The experimental results suggest a resounding yes. The authors compared SeCap against state-of-the-art methods, including standard ViT baselines and other cross-view specific models like VDT and AG-ReID.
Quantitative Performance
On the new LAGPeR and G2APS-ReID datasets, SeCap consistently achieved the highest scores in both Rank-1 Accuracy (how often the correct person is the first result) and mAP (mean Average Precision).
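For readers unfamiliar with these metrics, here is a simplified sketch of how Rank-1 and mAP are typically computed from a query-gallery distance matrix; real ReID protocols additionally filter out gallery images taken by the same camera as the query.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Simplified Rank-1 and mAP from a (num_query, num_gallery) distance matrix.
    query_ids and gallery_ids are NumPy arrays of person identities."""
    rank1_hits, average_precisions = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])            # gallery indices sorted by distance
        matches = gallery_ids[order] == qid    # boolean relevance vector
        rank1_hits.append(matches[0])          # is the top result the right person?
        hit_positions = np.where(matches)[0]
        if hit_positions.size:                 # average precision for this query
            precisions = (np.arange(hit_positions.size) + 1) / (hit_positions + 1)
            average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```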

For example, on the LAGPeR dataset in the Aerial-to-Ground setting, SeCap achieved a Rank-1 accuracy of 41.79%, significantly outperforming the VDT method (40.15%) and the standard ViT baseline (38.67%). While these numbers might seem low compared to traditional single-view ReID (which often exceeds 90%), they represent a substantial leap in the difficult AGPReID domain.
The method also performed exceptionally well on existing datasets like AG-ReID.v1, demonstrating its robustness across different data sources.

Ablation Studies: What Matters Most?
The authors performed ablation studies to verify that every component was necessary. They tested the model by removing the VDT, the LFRM, and the PRM one by one.

The results show that while the View Decoupling Transformer (VDT) and Local Feature Refinement (LFRM) improve performance individually, the combination of all three—specifically adding the Prompt Re-calibration Module (PRM)—yields the best results (Row 6). This confirms that adaptively calibrating prompts based on the view-invariant features is key to the model’s success.
Visualizing the Success
Numbers are great, but visualizations help us intuit what the model is learning.
Feature Distribution (t-SNE): In the t-SNE plots below, we see the feature space of the baseline model vs. SeCap. In the baseline, the “Aerial” (circles) and “Ground” (crosses) distributions are somewhat separated, indicating the model is distracted by the viewpoint. In SeCap, the distributions are much more overlapped and clustered by identity. This means the model has successfully learned to ignore the camera angle and focus on the person.
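If you want to produce a similar plot for your own embeddings, a minimal scikit-learn sketch (not the authors' plotting code) looks like this:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, person_ids, is_aerial):
    """Project ReID features to 2-D; color by identity, marker by view."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    for marker, mask in (("o", is_aerial), ("x", ~is_aerial)):  # circles = aerial, crosses = ground
        plt.scatter(emb[mask, 0], emb[mask, 1], c=person_ids[mask],
                    cmap="tab20", marker=marker, s=15)
    plt.title("t-SNE of ReID features (color = identity, marker = view)")
    plt.show()
```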

Retrieval Results: Below is a comparison of retrieval results. The green boxes indicate correct matches, while red boxes are errors. SeCap retrieves correct matches even when the viewpoint changes dramatically, whereas the baseline often retrieves images that look similar in “view” (e.g., similar background or angle) but are the wrong person.

Attention Maps: Perhaps the most telling visualization is the attention map. The baseline model often focuses on the background or general clothing blobs. SeCap, however, focuses on discriminative parts like the head and shoulders—features that are visible from both ground and aerial views.

Parameter Analysis
The researchers also analyzed the impact of prompt length and the loss hyperparameter (\(\lambda\)).

The analysis reveals that the model is relatively robust to changes in prompt length (Figure 7), showing stable performance around a length of 64. The \(\lambda\) parameter (Figure 6), which controls the strength of the view-decoupling losses, shows a sweet spot—too low, and the model ignores viewpoint; too high, and it focuses too much on the view rather than the identity.
Conclusion
The SeCap paper presents a sophisticated solution to one of the hardest problems in modern surveillance: connecting the dots between aerial drones and ground cameras. By moving away from static feature extraction and embracing Self-Calibrating and Adaptive Prompts, the framework allows the model to dynamically adjust its focus based on the input image.
The key takeaways are:
- View Decoupling is essential: You must mathematically separate the “camera angle” from the “person” to succeed in cross-view tasks.
- Adaptability is king: Static models fail under the extreme variance of drone angles. Adaptive prompts bridge this gap.
- Local features matter: When the global shape is distorted by perspective, refining local features (like heads and shoulders) is crucial for accuracy.
- Data drives progress: The release of LAGPeR and G2APS-ReID will likely spur further innovation in this field by providing realistic, difficult benchmarks.
SeCap represents a significant step forward in creating unified, robust observation networks that can identify individuals regardless of where the camera is placed—be it on a wall or in the clouds.