Introduction
Imagine standing in a bustling airport security line. As your bag disappears into the X-ray tunnel, a security officer stares intently at a monitor, deciphering a complex, pseudo-colored jumble of overlapping shapes. Their job is to identify threats—guns, knives, explosives—hidden amidst cables, laptops, and clothes. This task requires immense concentration, and human fatigue or distraction can lead to critical errors.
While Artificial Intelligence has stepped in to assist with Computer-Aided Screening (CAS), current systems have a major limitation: they operate in a “closed-set” paradigm. This means they can only detect specific items they were explicitly trained on. If a threat is a novel variation—like a 3D-printed gun made of polymer or a dismantled explosive hidden inside a radio—traditional models often fail. Furthermore, general-purpose Large Multimodal Models (LMMs) like GPT-4, which excel at describing natural images, struggle significantly when presented with the translucent, overlapping nature of X-ray imagery.
In this deep dive, we explore a groundbreaking paper that proposes a solution: STING-BEE. The researchers introduce a new domain-aware visual AI assistant capable of not just detecting objects, but understanding complex scenes, answering questions, and localizing threats even when they are heavily concealed. To achieve this, they also constructed STCray, the first large-scale multimodal X-ray dataset designed to mimic real-world smuggling tactics.
The Problem with Current X-ray AI
To understand why STING-BEE is necessary, we must first look at why current “state-of-the-art” general AI models fail in this domain. Models like GPT-4 or Gemini are trained on billions of natural images (photographs). X-ray images, however, are fundamentally different. They represent density and material type through color (orange for organic, blue for metal) and transparency.
When a standard Vision-Language Model (VLM) looks at a cluttered X-ray scan, it often “hallucinates”: confidently describing objects that aren’t there, or mistaking the baggage scan for medical imaging.
Figure 1: A comparison of captions generated by GPT-4, Gemini-1.5 Pro, LLaVA-NeXT, and the proposed STING protocol. Notice how general models (left columns) fail to identify threats like the 3D-printed gun or mistake the baggage for medical imaging, whereas the proposed method (right column) accurately describes the scene.
As shown in Figure 1, while models like LLaVA-NeXT might guess the presence of electronics, they miss critical threats like a 3D-printed gun or a blade. This gap highlights the need for a specialized dataset and model tailored specifically for the nuances of security screening.
STCray: Building a Multimodal Foundation
The core contribution that enables STING-BEE is the creation of STCray, the first multimodal X-ray baggage security dataset. Unlike previous datasets that might only provide a label (e.g., “Gun: Present”), STCray provides 46,642 image-caption pairs.
This dataset was not built by simply scanning random bags. The researchers spent over 3,000 hours meticulously curating baggage to reflect real-world threats and strategic concealment.
Figure 2: An overview of the STCray dataset components and a comparison with existing public datasets. STCray stands out by supporting multimodal data, strategic concealment, and novel threats like 3D-printed firearms.
As Figure 2 illustrates, STCray is unique in its support for emerging novel threats and zero-shot task capabilities. It includes 21 threat categories, ranging from standard items like knives and pliers to sophisticated threats like Improvised Explosive Devices (IEDs) and 3D-printed firearms, which are notoriously difficult to detect due to their low density.
The Challenge of 3D-Printed Threats
One of the most significant inclusions in STCray is the 3D-printed gun. Traditional scanners rely heavily on the high density of metal to flag firearms. 3D-printed guns, often made of plastic or polymer, appear as faint orange outlines that blend easily with benign organic items like clothes or food.
Figure 3: The challenge of detecting 3D-printed firearms. Left: A photo of a ‘Maverick’ 3D-printed gun. Center: The gun hidden in a bag. Right: The X-ray scan where the gun appears as a faint, orange outline, making it incredibly difficult to distinguish from the background clutter.
Figure 3 demonstrates this difficulty perfectly. In the X-ray scan (right), the firearm is barely visible, ghost-like against the background. Training a model to spot these requires high-quality, specialized data.
The STING Protocol: A Recipe for Data Generation
You cannot simply hand an annotator a raw X-ray scan and expect a perfectly accurate free-form description; the visual clutter is too high. To solve this, the authors developed the Strategic Threat ConcealING (STING) protocol. This is a systematic method for creating data that varies along specific parameters:
- Clutter Level: Ranging from limited (few items) to extreme (densely packed bag).
- Concealment: How the threat is hidden. Is it behind a laptop? Wrapped in cables?
- Orientation: Is the object flat, tilted, or rotated?
Figure 4: The STING protocol workflow. It begins with selecting baggage and threat types, then applies specific clutter and concealment strategies to generate precise captions describing the scene.
By controlling these variables during the scanning process, the researchers could generate ground-truth captions that are strictly accurate. For example, instead of just “a bag with a gun,” the protocol generates: “An X-ray scan including a plier aligned at an inclined angle, in the corner of a suitcase, with the plier fully covered by a CD receiver and cables…”
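To make the protocol concrete, here is a minimal sketch of how such controlled scan parameters could be rendered into a ground-truth caption. The dataclass fields and template wording are illustrative assumptions, not the authors' actual annotation tooling.

```python
from dataclasses import dataclass


@dataclass
class ScanSetup:
    """Illustrative parameters controlled during a STING-style scan session."""
    threat: str        # e.g., "plier"
    orientation: str   # e.g., "inclined", "flat", "rotated"
    location: str      # e.g., "corner of a suitcase"
    concealment: str   # e.g., "fully covered by a CD receiver and cables"
    clutter_level: str # e.g., "limited", "moderate", "extreme"


def build_caption(s: ScanSetup) -> str:
    """Render the recorded scan parameters into a ground-truth caption string."""
    return (
        f"An X-ray scan including a {s.threat} aligned at an {s.orientation} angle, "
        f"in the {s.location}, with the {s.threat} {s.concealment}, "
        f"inside a bag with {s.clutter_level} clutter."
    )


print(build_caption(ScanSetup(
    threat="plier",
    orientation="inclined",
    location="corner of a suitcase",
    concealment="fully covered by a CD receiver and cables",
    clutter_level="extreme",
)))
```

Because the caption is derived directly from the parameters chosen before scanning, it is guaranteed to match what is physically inside the bag, which is the key advantage over free-form human annotation.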
Levels of Concealment
The protocol is rigorous about how items are hidden. Real-world smuggling doesn’t involve leaving a weapon sitting on top of a pile of clothes. It involves “strategic concealment.”
Figure 5: Illustrative examples of the 10 concealment levels utilized in the STCray dataset. The complexity increases from simple low-density coverage to multiple superimposed high-density materials.
Figure 5 visualizes this progression. At the lower levels, a threat might just be covered by a book (low density). At the highest levels, threats are superimposed with metallic grids, heavy machinery parts, or intentionally distracting clusters of cables. This forces the AI to learn to “see through” occlusions, rather than just recognizing a clear silhouette.
Figure 6: A 3D representation of the STING protocol showing the interplay between clutter levels, concealment sublevels, and object location/orientation.
STING-BEE: The Architecture
With the dataset in place, the authors developed STING-BEE (Strategic Threat Identification aNd Grounding for Baggage Enhanced Evaluation).
STING-BEE is built upon the LLaVA architecture, which combines a visual encoder (to “see” the image) with a Large Language Model (to “understand” and generate text). However, standard LLaVA is designed for chat. STING-BEE is fine-tuned for specific security tasks using special tokens.
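Conceptually, a LLaVA-style model wires a vision encoder into a language model through a small projection layer, so image patches become "tokens" the LLM can attend to. The sketch below shows that wiring in outline; the class and component names are placeholders assumed for illustration, not STING-BEE's actual implementation.

```python
import torch
import torch.nn as nn


class LlavaStyleVLM(nn.Module):
    """Conceptual sketch of a LLaVA-style assistant: patch features from a
    vision encoder are projected into the LLM's token-embedding space and
    prepended to the text prompt. Names are placeholders, not the authors' code."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a CLIP-style ViT
        self.projector = projector            # maps visual features to LLM embedding dim
        self.llm = llm                        # autoregressive language model

    def forward(self, xray: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(xray))
        # The image acts as a soft prefix to the textual instruction.
        return self.llm(torch.cat([visual_tokens, prompt_embeds], dim=1))
```

Fine-tuning on STCray's instruction data then teaches this generic pipeline the domain-specific vocabulary of X-ray security screening.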
Figure 7: The STING-BEE training and evaluation pipeline. It moves from data collection (STCray) to multi-modal instruction tuning, resulting in a model capable of VQA, Localization, and Grounding.
Task-Specific Tokens
To make the model a versatile assistant, the authors introduced specific instruction tokens:
- [refer]: Tells the model to output bounding box coordinates for a specific threat (e.g., “Where is the gun?”).
- [grounding]: Tells the model to describe the whole image and provide boxes for all threats mentioned.
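A minimal sketch of how these task tokens might be prepended to a user query is shown below; the prompt layout and helper function are assumptions for illustration, since only the token names themselves come from the paper.

```python
def build_prompt(task: str, question: str) -> str:
    """Prepend the task-specific token; the exact prompt layout here is an
    illustrative guess, not the authors' training format."""
    tokens = {"refer": "[refer]", "grounding": "[grounding]", "vqa": ""}
    return f"{tokens[task]} {question}".strip()


# Ask for the bounding box of one specific threat.
print(build_prompt("refer", "Where is the gun in this scan?"))
# Ask for a full description with boxes for every threat mentioned.
print(build_prompt("grounding", "Describe the baggage scan."))
```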
Synthetic Data Augmentation
To further robustify the model, the researchers used a CT scanner to capture 3D volumes of threats. They then mathematically projected these 3D volumes into 2D X-ray images from thousands of different angles.
Figure 8: 2D X-ray projections generated from 3D CT scans of threats. This augmentation technique allows the model to learn what a threat looks like from unusual angles that might be rare in the physical training data.
This augmentation (Figure 8) ensures that even if a gun is rotated in a way never before seen in the training set, STING-BEE has a high probability of recognizing its structural signature.
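A simplified way to picture this augmentation is a parallel-beam projection: rotate the CT attenuation volume, integrate along the beam axis, and map the resulting line integrals to image intensities. The sketch below assumes this simplified geometry (the paper's exact projection method may differ) and uses a toy volume in place of a real CT scan.

```python
import numpy as np
from scipy.ndimage import rotate


def project_volume(volume: np.ndarray, angle_deg: float) -> np.ndarray:
    """Approximate a 2D X-ray view of a 3D attenuation volume by rotating the
    volume and summing attenuation along one axis (parallel-beam assumption,
    used here only for illustration)."""
    rotated = rotate(volume, angle_deg, axes=(0, 2), reshape=False, order=1)
    line_integrals = rotated.sum(axis=0)
    # Beer-Lambert-style mapping from integrated attenuation to intensity.
    return np.exp(-line_integrals / (line_integrals.max() + 1e-8))


# Synthesize views of a toy "threat" block from many angles for augmentation.
toy_volume = np.zeros((64, 64, 64))
toy_volume[20:44, 28:36, 20:44] = 1.0
views = [project_volume(toy_volume, angle) for angle in range(0, 360, 15)]
```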
Experiments and Results
The researchers subjected STING-BEE to a battery of tests, comparing it against general VLMs (like LLaVA and MiniGPT) and specialized detection models.
1. Visual Question Answering (VQA)
The model was tested on its ability to answer complex questions about the X-ray scans, such as “Is the battery concealed by a metallic object?” or “How many threats are present?”.
Figure 9: Visual Question Answering (VQA) performance comparison. STING-BEE significantly outperforms generic models like LLaVA 1.5 and MiniGPT in overall accuracy.
As shown in Figure 9, STING-BEE achieved an overall accuracy of 52.81%, surpassing LLaVA 1.5 (41.94%) and MiniGPT (36.62%). It showed particular strength in “Complex Reasoning” and “Instance Identity,” suggesting it understands the context of the X-ray rather than just pixel patterns.
2. Cross-Domain Generalization
A major hurdle in medical and security imaging is “domain shift.” A model trained on a scanner from Manufacturer A often fails on images from Manufacturer B because of differences in color calibration or resolution.
STING-BEE was trained on STCray but tested on completely different public datasets (SIXray and PIDray).
Figure 10: Cross-domain Scene Comprehension Performance. STING-BEE demonstrates superior capability in adapting to unseen datasets compared to state-of-the-art VLMs.
The results in Figure 10 are telling. STING-BEE achieved an F1 score of 34.69, nearly double that of MiniGPT (18.45). This suggests that the STCray dataset and the STING protocol successfully taught the model generalized features of threats, rather than overfitting to a specific scanner’s quirks.
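For readers who want to reproduce this kind of comparison, a standard set-based F1 over predicted versus ground-truth threat labels looks like the sketch below. The paper's exact averaging protocol is not detailed here, so this is an illustrative metric rather than a replication recipe.

```python
def f1_score(predicted: set, ground_truth: set) -> float:
    """Per-image F1 between predicted and ground-truth threat label sets."""
    if not predicted and not ground_truth:
        return 1.0  # nothing to find, nothing predicted
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One correct label, one spurious label -> F1 ≈ 0.67
print(f1_score({"gun", "knife"}, {"knife"}))
```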
3. Visual Grounding and Localization
Can the model point to the threat? This is crucial for a human operator who needs to know where to look.
Figure 11: Qualitative examples of STING-BEE in action across different datasets. It demonstrates success in Scene Comprehension, Referring to locations, and Visual Grounding.
Figure 11 visualizes the model’s versatility. Whether asked to “find the gun” (Referring) or “describe the image focusing on threats” (Grounding), STING-BEE successfully draws bounding boxes around the contraband, even in scans from datasets it wasn’t trained on (Compass XP, PIDray).
Figure 12: Qualitative grounding examples. The red boxes indicate model predictions, which align closely with the blue ground-truth boxes. Note the successful detection of the 3D-printed gun (Example c).
The qualitative results in Figure 12 are impressive. The model correctly identifies a 3D-printed gun (c) and threats hidden amongst heavy cabling (e).
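Localization quality of this kind is typically scored with Intersection-over-Union (IoU) between the predicted and ground-truth boxes; the small helper below illustrates the standard computation (the evaluation thresholds used in the paper are not specified here).

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-Union between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


# Partial overlap between prediction and ground truth -> IoU ≈ 0.39
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))
```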
Limitations
No model is perfect. The authors frankly discuss instances where STING-BEE struggles.
Figure 13: Failure cases. STING-BEE sometimes struggles with heavy occlusion, identifying only a part of the object (like the tip of a knife in ’d’) rather than the whole item.
When objects are extremely obscured (Figure 13), the model might detect the presence of a threat but fail to localize it perfectly, drawing a box only around the visible tip of a knife rather than the whole weapon. Additionally, in bags with multiple diverse threats, it occasionally groups them together or confuses similar-looking tools (like wrenches vs. pliers). However, in a security context, flagging the existence of a threat is the primary priority, even if the bounding box isn’t pixel-perfect.
Conclusion
The STING-BEE paper represents a significant leap forward in aviation security technology. By moving away from simple object detection and embracing Vision-Language modeling, the researchers have created a system that can reason about clutter, concealment, and context.
The two pillars of this work—the STCray dataset, with its rigorous STING protocol for creating realistic, caption-rich training data, and the STING-BEE model, a domain-tuned VLM—provide a roadmap for the future of Computer-Aided Screening. As passenger volumes grow and threats evolve, AI assistants like STING-BEE will likely become standard partners for security officers, offering a second set of eyes that never gets tired, distracted, or confused by a 3D-printed gun hidden in a mess of cables.