Introduction
In the rapidly evolving world of Computer Vision, there is a prevailing tendency to solve complex problems by adding complexity to architectures. When the Vision Transformer (ViT) burst onto the scene, it revolutionized image classification. However, when researchers tried to apply it to more granular tasks like image segmentation—where the goal is to classify every pixel—the consensus was that the “plain” ViT wasn’t enough.
To bridge the gap, the field developed a standard recipe: take a ViT, attach a heavy “adapter” to extract multi-scale features (mimicking Convolutional Neural Networks), add a “pixel decoder” to fuse these features, and finally, cap it off with a complex “Transformer decoder” to generate masks. State-of-the-art models like Mask2Former follow this pattern, becoming powerful but architecturally convoluted and computationally heavy.
But what if we’ve been underestimating the Vision Transformer all along?
A recent research paper challenges this status quo. The authors pose a provocative question: Is your ViT secretly an image segmentation model already? They hypothesize that with the rise of massive pre-training (like DINOv2) and larger model sizes, the “inductive biases” provided by those extra adapters and decoders are no longer necessary.
The result of this inquiry is the Encoder-only Mask Transformer (EoMT). It is a model that strips away almost all the bells and whistles, relying on a plain ViT architecture, yet achieves performance competitive with state-of-the-art models while being up to 4x faster.

As shown in Figure 1 above, EoMT (the orange line) maintains high Panoptic Quality (PQ) even at very high frame rates (FPS), leaving traditional complex architectures (the blue line) in the dust. In this post, we will deconstruct how the researchers achieved this, diving deep into the architecture, the novel training strategy called “mask annealing,” and the implications for the future of efficient computer vision.
Background: The Segmentation Landscape
Before we dismantle the complex architectures, we need to understand what they are trying to achieve. Image segmentation generally comes in three flavors:
- Semantic Segmentation: Labeling every pixel with a class (e.g., “road,” “sky,” “car”). It treats all cars as one blob.
- Instance Segmentation: Detecting individual objects (e.g., “Car A,” “Car B”). It ignores background stuff like the sky.
- Panoptic Segmentation: The “holy grail” that combines both. You want to identify “Car A” distinct from “Car B,” but also label the sky and road.
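The results later in this post are reported in Panoptic Quality (PQ), the standard metric for this combined task. Predicted and ground-truth segments are matched into true positives (TP), with leftover predictions counted as false positives (FP) and missed ground-truth segments as false negatives (FN):

\[
\text{PQ} = \frac{\sum_{(p, g) \in \text{TP}} \text{IoU}(p, g)}{|\text{TP}| + \tfrac{1}{2}|\text{FP}| + \tfrac{1}{2}|\text{FN}|}
\]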
The Standard “Mask Transformer” Recipe
To solve panoptic segmentation using Transformers, modern approaches like Mask2Former use a query-based approach. The model learns a set of “object queries”—vectors that ask the image, “Is there a car here?” or “Is there grass here?”
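Concretely, each query ends up as an embedding vector: a small linear head predicts its class, and a dot product between the query and per-pixel features predicts its mask. A minimal PyTorch-style sketch of the idea (shapes and names are illustrative, not the actual Mask2Former code):

```python
import torch
import torch.nn as nn

Q, D, H, W = 100, 256, 64, 64   # illustrative: 100 queries, 256 channels, 64x64 feature map
num_classes = 133               # e.g. COCO panoptic "things" + "stuff"

queries = torch.randn(Q, D)            # refined object query embeddings
pixel_features = torch.randn(D, H, W)  # per-pixel image features

# Class prediction: a linear head per query, plus one "no object" class.
class_head = nn.Linear(D, num_classes + 1)
class_logits = class_head(queries)     # (Q, num_classes + 1)

# Mask prediction: dot product between each query and every pixel feature.
mask_logits = torch.einsum("qd,dhw->qhw", queries, pixel_features)  # (Q, H, W)
```

The interesting design question is where those query embeddings and pixel features come from, which is exactly what the rest of the recipe is about.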
However, a standard ViT processes an image as fixed-size patches (e.g., 16×16 pixels) at a single scale. Conventional wisdom holds that segmentation requires:
- Multi-scale features: Seeing the image at high resolution for edges and low resolution for context.
- Local processing: Convolutions to understand pixel neighborhoods.
To force a ViT to do this, researchers wrap it in layers of complexity:
- ViT-Adapter: A convolutional network running parallel to the ViT to inject multi-scale information.
- Pixel Decoder: A module to upsample and fuse these features.
- Transformer Decoder: A separate stack of transformer blocks where queries attend to the image features to predict masks.
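Put together, the standard recipe looks roughly like the following structural sketch (module names are illustrative placeholders, not the actual Mask2Former classes):

```python
import torch.nn as nn

class StandardMaskTransformer(nn.Module):
    """Structural sketch of the ViT-Adapter + Mask2Former-style recipe."""

    def __init__(self, vit, adapter, pixel_decoder, transformer_decoder,
                 num_queries=100, dim=256):
        super().__init__()
        self.vit = vit                                  # plain ViT backbone
        self.adapter = adapter                          # convolutional sidecar for multi-scale features
        self.pixel_decoder = pixel_decoder              # fuses and upsamples the feature pyramid
        self.transformer_decoder = transformer_decoder  # queries cross-attend to image features
        self.queries = nn.Embedding(num_queries, dim)   # learnable object queries

    def forward(self, image):
        vit_tokens = self.vit(image)
        multi_scale = self.adapter(image, vit_tokens)     # e.g. a 4-scale feature pyramid
        pixel_features = self.pixel_decoder(multi_scale)  # fused, high-resolution features
        refined = self.transformer_decoder(self.queries.weight, pixel_features)
        return refined, pixel_features                    # fed to class and mask heads as above
```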
The authors of EoMT argue that these components are essentially “training wheels.” They help when the model is small or untrained, but they become redundant when the backbone is powerful enough.
The Core Method: Stripping the Engine Down
The primary contribution of this paper is not adding a new module, but rigorously removing existing ones to find the minimal viable architecture. This process is a lesson in architectural ablation.
Step 1: Deconstructing the Complex Pipeline
The researchers started with the heavy ViT-Adapter + Mask2Former architecture and systematically removed components. Let’s visualize this evolution.

As illustrated in Figure A:
- (0) The Baseline: The full complex model with Adapter, Pixel Decoder, and Transformer Decoder.
- (1) Removing the Adapter: The convolutional sidecar is gone. The model now relies solely on the ViT for feature extraction. To compensate, they use simple transposed convolutions to create a feature pyramid from the ViT’s output.
- (2) Removing the Pixel Decoder: Instead of a complex fusion module, they just use simple upscaling/downscaling to resize features.
- (3) Removing Multi-scale Features: They realize that generating features at 4 different scales might be overkill. They switch to using the single-scale output of the ViT (upscaled only once for the final mask prediction).
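Steps (1)–(3) replace learned modules with plain resampling. For instance, the simple feature pyramid from step (1) can be built from the ViT's single-scale patch tokens with a handful of (transposed) convolutions; a minimal sketch with illustrative channel counts and strides:

```python
import torch
import torch.nn as nn

dim = 1024  # e.g. ViT-L embedding dimension

# A ViT-L/16 on a 640x640 image yields a 40x40 grid of patch tokens,
# reshaped here from (1, 1600, dim) to an image-like (1, dim, 40, 40).
patch_tokens = torch.randn(1, dim, 40, 40)

up_4x = nn.Sequential(nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
                      nn.GELU(),
                      nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2))
up_2x = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)
down_2x = nn.Conv2d(dim, dim, kernel_size=2, stride=2)

pyramid = [up_4x(patch_tokens),    # 160x160: fine details
           up_2x(patch_tokens),    # 80x80
           patch_tokens,           # 40x40: the ViT's native scale
           down_2x(patch_tokens)]  # 20x20: coarse context
```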
By step (4) in their experiments, they arrive at a setup where a Transformer Decoder is sitting directly on top of a plain ViT. But can we go further? Can we remove the decoder entirely?
Step 2: The EoMT Architecture
The final leap leads to the Encoder-only Mask Transformer (EoMT).
In a standard Transformer Decoder, object queries cross-attend to image features. But a ViT Encoder block already has an attention mechanism (Self-Attention). The researchers realized they could repurpose the final layers of the ViT to handle the queries.
Here is the simplified architecture:

How it works:
- Stage 1 (Standard ViT): The image passes through the first \(L_1\) blocks of the ViT just like normal. It processes patch tokens.
- Concatenation: A set of learnable “Queries” is introduced and simply concatenated to the patch tokens.
- Stage 2 (Joint Processing): For the final \(L_2\) blocks, the ViT processes both the image patches and the queries together.
- Prediction: The queries are plucked off the end to predict class labels and masks.
Mathematically, the ViT block updates the tokens \(X^i\) via Multi-Head Self-Attention (MHSA) and an MLP. The equations for the block remain the standard Transformer equations, with LayerNorm (LN) and residual connections:

\[
\hat{X}^i = X^i + \mathrm{MHSA}\big(\mathrm{LN}(X^i)\big), \qquad X^{i+1} = \hat{X}^i + \mathrm{MLP}\big(\mathrm{LN}(\hat{X}^i)\big)
\]
In EoMT, \(X^i\) simply contains both image tokens and query tokens. Through self-attention, the queries can “look at” the image patches to gather information, and the image patches can “look at” the queries. This unifies the encoding and decoding steps into a single, optimized stack.
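A rough sketch of this two-stage forward pass, assuming the plain ViT's blocks are available as an indexable list (the head details, layer count, and query count are illustrative, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class EoMTSketch(nn.Module):
    """Rough sketch: queries ride along through the final blocks of a plain ViT."""

    def __init__(self, blocks, dim=1024, num_queries=200, num_classes=133, l2=4):
        super().__init__()
        self.blocks = blocks                     # the ViT's own transformer blocks (indexable)
        self.l2 = l2                             # how many final blocks also process queries
        self.queries = nn.Embedding(num_queries, dim)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for the "no object" class
        self.mask_proj = nn.Linear(dim, dim)     # projects queries before the mask dot product

    def forward(self, patch_tokens):             # (B, N, D) tokens from the ViT's patch embedding
        B = patch_tokens.shape[0]

        # Stage 1: the first L1 blocks see only patch tokens, exactly like a normal ViT.
        x = patch_tokens
        for block in self.blocks[:-self.l2]:
            x = block(x)

        # Concatenate the learnable queries and run the final L2 blocks jointly.
        queries = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        nq = queries.shape[1]
        x = torch.cat([queries, x], dim=1)
        for block in self.blocks[-self.l2:]:
            x = block(x)                          # plain self-attention over queries + patches

        # Split the queries back off and predict classes and masks.
        q, patches = x[:, :nq], x[:, nq:]
        class_logits = self.class_head(q)                                       # (B, Q, C+1)
        mask_logits = torch.einsum("bqd,bnd->bqn", self.mask_proj(q), patches)  # (B, Q, N)
        return class_logits, mask_logits  # mask logits get reshaped to the patch grid and upsampled
```

Note that nothing here is new machinery: the queries are just extra tokens flowing through the ViT's own blocks.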
Step 3: The Problem with Masked Attention
There was one major hurdle. State-of-the-art models like Mask2Former rely heavily on Masked Attention.
In Masked Attention, a query is restricted to only attend to the specific region of the image it currently predicts as a mask. If a query thinks it is looking at a “dog,” it is forced to ignore the “background” pixels during attention. This acts as a strong prior that helps the model converge and improves accuracy.
The visualization below shows how masks (left) dictate the attention map (center grid), specifically the blue “Query-to-patch” region.

The Trade-off:
- Pro: Masked attention improves segmentation quality significantly.
- Con: It is slow. You have to generate an intermediate mask, threshold it, and modify the attention matrix at every single layer during inference. This kills the speed advantage of using a plain ViT.
If the researchers kept masked attention, EoMT would be accurate but not much faster than the baseline. If they removed it, the accuracy dropped.
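To make the cost concrete, here is a sketch of what applying masked attention inside one of the joint blocks involves: predict an intermediate mask from the current query tokens, threshold it, and turn it into an additive attention bias (shapes and the 0.5 threshold are illustrative):

```python
import torch

def masked_attention_bias(query_tokens, patch_tokens, threshold=0.5):
    """Additive attention bias built from an intermediate mask prediction.

    query_tokens: (B, Q, D), patch_tokens: (B, N, D).
    Returns (B, Q, N): 0 where a query may attend, -inf where it may not.
    """
    # Intermediate mask prediction via a query/patch dot product.
    mask_logits = torch.einsum("bqd,bnd->bqn", query_tokens, patch_tokens)
    keep = mask_logits.sigmoid() > threshold        # hard, thresholded mask

    # Safeguard: if a query's mask is empty, let it attend everywhere
    # instead of producing an all -inf row (NaN after softmax).
    empty = ~keep.any(dim=-1, keepdim=True)
    keep = keep | empty

    bias = torch.zeros_like(mask_logits)
    bias.masked_fill_(~keep, float("-inf"))         # block attention outside the mask
    return bias  # added to the query-to-patch attention logits before the softmax
```

Because the bias depends on the current mask prediction, it must be recomputed at every block that uses it, which is exactly the overhead that mask-free inference avoids.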
Step 4: Mask Annealing
The solution proposed is elegant: Mask Annealing.
The hypothesis is that masked attention is crucial for learning (helping the model figure out what objects are early on), but perhaps unnecessary for inference once the model is mature.
The researchers propose a training schedule where the probability of applying the mask starts at 100% and gradually decays to 0%.

As shown in Figure 4:
- Early Training (Yellow/Green lines): The blocks are heavily masked. The queries are guided to look only at their specific regions.
- Late Training: The masking probability (\(P_{mask}\)) drops. The model is forced to learn to segment without the crutch of the attention mask.
- Inference: The mask is turned off completely.
This strategy allows EoMT to have its cake and eat it too: the high accuracy of masked training with the blazing speed of mask-free inference.
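One way to implement such a schedule is a linear decay of \(P_{mask}\), sampled independently at each training step; the exact decay shape and per-block timing in the paper may differ, so treat this as a sketch:

```python
import random

def mask_probability(step, anneal_start, anneal_end):
    """P_mask: probability of applying masked attention at a given training step.

    Stays at 1.0 until anneal_start, then decays linearly to 0.0 by anneal_end.
    """
    if step <= anneal_start:
        return 1.0
    if step >= anneal_end:
        return 0.0
    return 1.0 - (step - anneal_start) / (anneal_end - anneal_start)

def use_masked_attention(step, anneal_start, anneal_end):
    """Sample whether this training step applies the attention mask."""
    return random.random() < mask_probability(step, anneal_start, anneal_end)

# Training: the bias from the previous sketch is only built when this returns True.
# Inference: P_mask is 0, so the mask (and its overhead) disappears entirely.
```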
Experiments & Results
The researchers validated EoMT on standard benchmarks like COCO (Panoptic/Instance) and ADE20K (Semantic). They compared it primarily against the ViT-Adapter + Mask2Former baseline.
1. Speed vs. Accuracy
The most striking result is the efficiency gain. Let’s look at the stepwise breakdown of performance.

Key Takeaways from Table 1:
- Row (0): The baseline achieves 57.1 PQ at 29 FPS.
- Row (5): EoMT achieves 56.0 PQ but jumps to 128 FPS.
- The Trade-off: We lose about 1.1 points in Panoptic Quality, but we gain a 4.4x speed increase.
- Why is it so fast? Notice step (5). Removing the masking (made possible by annealing) more than doubles the speed, from 61 FPS to 128 FPS. The plain ViT is extremely well optimized on modern hardware (GPUs), whereas custom operations like mask extraction create bottlenecks.
2. The Critical Role of Pre-training
The paper posits that the complexity of previous models was a band-aid for weak backbones. If the backbone is strong enough, the band-aid isn’t needed. This is proven by comparing different pre-training strategies.

Looking at Table 2:
- IN1K (Standard ImageNet): EoMT fails miserably (44.3 PQ vs 50.4 for baseline). It needs the complex decoder here.
- DINOv2: The gap closes significantly (56.0 vs 57.1).
- Conclusion: The success of EoMT is contingent on high-quality, large-scale pre-training like DINOv2 or EVA-02. These foundation models learn such rich dense features that the architecture doesn’t need to work as hard to extract them.
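To get a feel for how strong these dense features are, the DINOv2 backbones can be pulled directly from torch.hub; a quick sketch of extracting patch tokens from ViT-L/14 (this is just the pre-trained encoder, not the paper's full training recipe, and the output keys follow the DINOv2 repo's conventions):

```python
import torch

# DINOv2 ViT-L/14: a plain ViT encoder with strong self-supervised pre-training.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()

image = torch.randn(1, 3, 518, 518)  # resolution must be a multiple of the 14-pixel patch size
with torch.no_grad():
    features = backbone.forward_features(image)

patch_tokens = features["x_norm_patchtokens"]  # (1, 37*37, 1024) dense per-patch features
print(patch_tokens.shape)
```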
3. Performance Across Tasks
EoMT isn’t just for panoptic segmentation. It generalizes well.
Panoptic Segmentation (COCO):
In Table 4, looking at the ViT-L rows, EoMT runs at 128 FPS compared to the baseline’s 29 FPS, with competitive accuracy.
Out-of-Distribution (OOD) Generalization: One of the hidden benefits of removing custom components is that you rely entirely on the pre-trained backbone. Foundation models like DINOv2 are famous for their robustness.

In Figure B, the model encounters a giraffe (an object it might not have been explicitly trained to segment in this specific context).
- Panel 2 (Baseline): The complex model is “confidently wrong” in the background areas.
- Panel 3 (EoMT): The EoMT model produces a much cleaner confidence map, correctly highlighting the giraffe and suppressing the background. Because it relies directly on the DINOv2 features without an adapter distorting them, it inherits DINOv2’s robustness.
Table 8 confirms this quantitatively:

EoMT achieves nearly identical In-Distribution (ID) performance to the baseline but maintains high Out-of-Distribution (OOD) performance (77.2 vs 78.0), whereas older architectures based on Swin transformers drop significantly (69.4).
4. Visual Quality
Does the simplified model actually produce good masks?

Figure C compares the baseline (Row 3) with EoMT (Row 4).
- Look at the last column (the person in the kitchen). The baseline struggles slightly with the refrigerator/person boundary.
- EoMT produces sharp, coherent masks that are visually indistinguishable from (or sometimes better than) the heavier baseline, despite being significantly simpler.
Conclusion and Implications
The Encoder-only Mask Transformer (EoMT) offers a compelling narrative correction for Computer Vision. For years, the field assumed that applying Vision Transformers to segmentation required re-introducing the inductive biases of Convolutional Networks through adapters and complex decoders.
This paper makes a strong case that your ViT is secretly an image segmentation model.
By leveraging:
- Strong Pre-training (DINOv2),
- Architectural Simplicity (Plain ViT + Query concatenation), and
- Mask Annealing (Gradual removal of attention masks),
EoMT achieves a “sweet spot” in the accuracy/efficiency trade-off. It provides state-of-the-art speed (up to 128 FPS on ViT-L) with competitive accuracy.
The Broader Lesson: Rather than spending computational budget on architectural complexity (adapters, decoders), resources are better spent on scaling the backbone and improving pre-training. A simple, plain architecture is not only faster but also easier to optimize and more compatible with future hardware advances. In the era of Foundation Models, less truly is more.