Unlocking Mobile Vision: How CARE Transformers Balance Speed and Accuracy
In the rapidly evolving world of Computer Vision, the Vision Transformer (ViT) has been a revolutionary force. By adapting the self-attention mechanisms originally designed for Natural Language Processing (NLP), ViTs have achieved state-of-the-art results in image classification, object detection, and segmentation.
However, there is a catch. The very mechanism that makes Transformers so powerful—Self-Attention—is computationally expensive. Specifically, it has “quadratic complexity.” As the resolution of an image increases, the computational cost explodes. This makes standard Transformers notoriously difficult to deploy on resource-constrained devices like mobile phones, where battery life and latency are critical.
Researchers have been hunting for a “Mobile-Friendly” Transformer—one that retains the global understanding of a ViT but runs as fast as a lightweight Convolutional Neural Network (CNN).
In this post, we will take a deep dive into a fascinating solution presented in the paper “CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction.” We will explore how the authors managed to decouple feature learning to achieve a new state-of-the-art balance between efficiency and accuracy.
1. The Bottleneck: Why Mobile Vision is Hard
To understand the solution, we must first understand the problem.
The Cost of Self-Attention
In a traditional Transformer, every pixel (or token) attends to every other pixel. If you have an image with \(N\) tokens, the model calculates an \(N \times N\) attention map. If you double the number of tokens, the work quadruples. This is quadratic complexity (\(O(N^2)\)).
Mobile devices generally cannot handle this load efficiently.
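A quick back-of-the-envelope calculation makes this concrete (16-pixel patches, a common ViT choice; the resolutions are illustrative and not taken from the paper):

```python
# Back-of-the-envelope scaling of the attention map with input resolution
# (16-pixel patches, a common ViT choice; numbers are illustrative).
PATCH = 16

for side in (224, 448, 896):
    n_tokens = (side // PATCH) ** 2          # N
    attn_entries = n_tokens ** 2             # N x N attention map
    print(f"{side}x{side}: N = {n_tokens:4d} tokens -> {attn_entries:>9,} attention entries")

# 224x224: N =  196 tokens ->    38,416 attention entries
# 448x448: N =  784 tokens ->   614,656 attention entries   (16x)
# 896x896: N = 3136 tokens -> 9,834,496 attention entries   (256x)
```

Doubling the resolution on each side quadruples \(N\), so the size of the attention map, and the cost of computing it, grows sixteen-fold.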
The Rise of Linear Attention
To fix this, researchers developed Linear Attention. By using a mathematical “kernel trick,” they can change the order of matrix multiplication, reducing the complexity from quadratic (\(O(N^2)\)) to linear (\(O(N)\)). This is a huge win for speed.
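The trick is easiest to see without the softmax in the way. In the toy NumPy sketch below (identity kernel, normalization omitted, not the authors’ code), re-associating the matrix product gives an identical result at a fraction of the cost:

```python
import numpy as np

N, d = 4096, 64                       # tokens, channel dimension
Q, K, V = (np.random.randn(N, d) for _ in range(3))

# Without softmax, matrix multiplication is associative, so we may pick the order:
out_quadratic = (Q @ K.T) @ V         # builds an N x N map first
out_linear = Q @ (K.T @ V)            # builds only a d x d map

print(np.allclose(out_quadratic, out_linear))              # True: same result
print(f"quadratic order: {2 * N * N * d:,} multiplies")    # 2,147,483,648
print(f"linear order:    {2 * N * d * d:,} multiplies")    # 33,554,432
```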
However, Linear Attention comes with a drawback: High Entropy. Because it simplifies the relationship between tokens (removing the heavy Softmax function that sharpens focus), Linear Attention often struggles to distinguish between important and irrelevant background information. It gets “distracted” easily.
The Old Solution: Stacking
Previous attempts (like MLLA) tried to fix this by stacking a “Local Bias” layer (like a standard convolution) on top of a “Linear Attention” layer. The convolution handles local details (edges, textures), and the attention handles global context.
While this works, it is rigid. Every single feature channel has to go through the heavy convolution and the linear attention mechanism sequentially. This creates a computational bottleneck.
The CARE Transformer asks two critical questions:
- Is stacking really the best way to combine local and global information?
- Can we improve efficiency and accuracy simultaneously, rather than trading one for the other?
The answer lies in a new architecture: CARE (deCoupled duAl-interactive lineaR attEntion).
2. Background: The Math of Attention
Before looking at the new method, let’s briefly visualize the mathematical foundation.
Standard Self-Attention calculates the similarity between Queries (\(Q\)) and Keys (\(K\)), applies a Softmax function, and multiplies by Values (\(V\)).
$$\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \exp\!\left(\mathbf{Q}_i \mathbf{K}_j^{\top} / \sqrt{d}\right) \mathbf{V}_j}{\sum_{j=1}^{N} \exp\!\left(\mathbf{Q}_i \mathbf{K}_j^{\top} / \sqrt{d}\right)} \tag{1}$$
The exponential inside the sum ties every query to every key, so the full \(N \times N\) similarity map must be computed explicitly; this is what creates the heavy computational load.
Linear Attention removes the Softmax and uses a kernel function \(\varphi(\cdot)\) (often just the identity function) to linearize the cost.
$$\mathrm{LinAttn}(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\varphi(\mathbf{Q}_i) \left( \sum_{j=1}^{N} \varphi(\mathbf{K}_j)^{\top} \mathbf{V}_j \right)}{\varphi(\mathbf{Q}_i) \left( \sum_{j=1}^{N} \varphi(\mathbf{K}_j)^{\top} \right)} \tag{2}$$
While Equation 2 is much faster to compute, the lack of explicit similarity measurement (the softmax) is what leads to the “high entropy” problem mentioned earlier.
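For readers who prefer code, here is a minimal PyTorch sketch of both formulations. The kernel \(\varphi(x) = \mathrm{ELU}(x) + 1\) is a common choice in the linear-attention literature; the paper’s exact kernel and normalization may differ:

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Equation (1): the N x N similarity map is materialized explicitly."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (B, N, N)
    return F.softmax(scores, dim=-1) @ v                      # (B, N, d)

def linear_attention(q, k, v, eps=1e-6):
    """Equation (2) with phi = ELU + 1 (a common kernel; the paper's may differ)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                         # phi(Q), phi(K): strictly positive
    kv = k.transpose(-2, -1) @ v                              # (B, d, d)  -- the kernel trick
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)     # (B, N, 1)  normalizer
    return (q @ kv) / (z + eps)                               # (B, N, d)

x = torch.randn(1, 196, 64)                                   # 196 tokens, 64 channels
print(softmax_attention(x, x, x).shape, linear_attention(x, x, x).shape)
```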
The “Stacked” Approach typically looks like this:
$$\mathbf{Y} = \mathrm{LocalBias}_2\Big(\mathrm{LinAttn}\big(\mathrm{LocalBias}_1(\mathbf{X})\big)\Big)$$
Here, the input \(\mathbf{X}\) is processed by a local bias block, then linear attention, then another local bias block. It is a serial process that consumes significant resources.
3. The Core Innovation: CARE Mechanism
The authors of the CARE Transformer propose that we shouldn’t force all features through the same pipeline. Instead, we should Divide and Conquer.
The core of the paper rests on three pillars:
- Asymmetrical Feature Decoupling: Splitting the work.
- Dual Interaction: Mixing the results.
- Dynamic Memory Unit: Remembering the past.
Let’s break these down.
A. Asymmetrical Feature Decoupling
Instead of a serial stack, the CARE mechanism splits the input features along the channel dimension. Imagine your input \(\mathbf{X}\) has \(d\) channels (e.g., 64 channels).
The model splits \(\mathbf{X}\) into two parts:
- \(\bar{X}\) (Global Branch): Sent to the Linear Attention module.
- \(\tilde{X}\) (Local Branch): Sent to a Convolutional (Local Bias) module.
Crucially, this split is asymmetrical. The authors allocate fewer channels (\(d_1\)) to the expensive Linear Attention branch and more channels (\(d_2\)) to the efficient Convolution branch, such that \(d_1 < d_2\).
Why is this brilliant? Figure 2 of the paper contrasts the two designs:
- Left (a): The stacked approach forces the full depth (\(d\)) through both operations.
- Right (b): The decoupled approach runs them in parallel on smaller subsets of data.
Mathematically, the complexity of Linear Attention is roughly proportional to \(d^2\) (the square of the channel dimension). By reducing the number of channels sent to the attention module (\(d_1\)), we drastically reduce the computational cost.
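To see how much this saves, here is a quick worked example (the channel counts are chosen for illustration and are not taken from the paper). Suppose a block has \(d = 64\) channels and only \(d_1 = 16\) of them are routed through linear attention:

$$d^2 = 64^2 = 4096 \quad\longrightarrow\quad d_1^2 = 16^2 = 256$$

The \(d^2\)-dependent part of the attention cost shrinks by a factor of 16, while the remaining \(d_2 = 48\) channels only pay for a cheap convolution.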
The mathematical formulation for this split is:
$$\mathbf{X} = \big[\, \bar{\mathbf{X}},\ \tilde{\mathbf{X}} \,\big], \qquad \bar{\mathbf{X}} \in \mathbb{R}^{N \times d_1}, \quad \tilde{\mathbf{X}} \in \mathbb{R}^{N \times d_2}, \quad d_1 + d_2 = d, \quad d_1 < d_2$$
The complexity reduction is proven analytically. If we analyze the cost \(\Omega\), we see that decoupling reduces the heavy lifting required by the projections and multiplications in the attention mechanism.

Here, \(\lambda\) represents the scaling factors derived from the split. Because the relationship is quadratic, a small reduction in channel width leads to a large reduction in GMACs (giga multiply-accumulate operations).
By setting \(d_1 < d_2\) (Asymmetrical), the authors prove that the complexity is lower than if they had split them 50/50 (\(d_1 = d_2\)).
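As a concrete (if simplified) picture, here is a minimal PyTorch sketch of a decoupled block. The channel split, the ELU-based kernel, and the depth-wise convolution are my assumptions for illustration, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    """Asymmetrical feature decoupling (illustrative sketch, not the authors' code):
    d1 channels go through linear attention, the remaining d2 = d - d1 channels
    go through a cheap depth-wise convolution, and the two branches run in parallel."""
    def __init__(self, d: int, d1: int):
        super().__init__()
        assert d1 < d - d1, "asymmetrical split: fewer channels for attention"
        self.d1 = d1
        self.qkv = nn.Linear(d1, 3 * d1)                      # projections for the global branch
        self.local = nn.Conv2d(d - d1, d - d1, 3, padding=1, groups=d - d1)

    def forward(self, x):                                     # x: (B, d, H, W)
        b, _, h, w = x.shape
        x_global, x_local = torch.split(x, [self.d1, x.shape[1] - self.d1], dim=1)

        # Global branch: linear attention on only d1 channels (kernel trick, phi = ELU + 1).
        t = x_global.flatten(2).transpose(1, 2)               # (B, N, d1)
        q, k, v = self.qkv(t).chunk(3, dim=-1)
        q, k = nn.functional.elu(q) + 1, nn.functional.elu(k) + 1
        kv = k.transpose(-2, -1) @ v                          # (B, d1, d1) -- never (N, N)
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6
        x_global = ((q @ kv) / z).transpose(1, 2).reshape(b, self.d1, h, w)

        # Local branch: depth-wise convolution on the remaining d2 channels.
        x_local = self.local(x_local)
        return torch.cat([x_global, x_local], dim=1)          # recombined for the interaction step

x = torch.randn(1, 64, 14, 14)
print(DecoupledBlock(d=64, d1=16)(x).shape)                   # torch.Size([1, 64, 14, 14])
```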

B. Dual Interaction
Splitting features is great for speed, but if the “Global” part never talks to the “Local” part, the model will fail to understand the image holistically.
This is where the Dual Interaction Module comes in.
As shown in the paper’s Figure 3(c), the interaction happens in two stages (Inter1 and Inter2).
After processing the split features, the model doesn’t just concatenate them. It uses a specific interaction function involving convolutions to mix the information.
$$\mathrm{Inter}(\mathbf{x}, \mathbf{y}) = \mathrm{Conv}_{1 \times 1}\Big(\mathrm{DWConv}_{3 \times 3}\big(\mathrm{Conv}_{1 \times 1}\left(\mathrm{Norm}(\mathbf{x} \oplus \mathbf{y})\right)\big)\Big)$$
The interaction module:
- Concatenates the features (\(\mathbf{x} \oplus \mathbf{y}\)).
- Normalizes them.
- Uses a \(1 \times 1\) convolution to mix channels.
- Uses a \(3 \times 3\) depth-wise convolution to mix spatial information.
- Uses another \(1 \times 1\) convolution to project back.
This ensures that the “Long-Range” information learned by Linear Attention modulates and enriches the “Local” details learned by convolutions, and vice versa.
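Here is a minimal PyTorch sketch of such an interaction block, following the five steps listed above. The normalization type and exact channel widths are assumptions for illustration, not the paper’s configuration:

```python
import torch
import torch.nn as nn

class Interaction(nn.Module):
    """Sketch of the interaction block: concatenate -> normalize -> 1x1 conv
    -> 3x3 depth-wise conv -> 1x1 conv. Layer choices follow the description
    in the text, not the authors' exact configuration."""
    def __init__(self, d1: int, d2: int):
        super().__init__()
        d = d1 + d2
        self.norm = nn.BatchNorm2d(d)                       # normalization (exact type assumed)
        self.mix_channels = nn.Conv2d(d, d, kernel_size=1)  # 1x1: mix channels
        self.mix_spatial = nn.Conv2d(d, d, kernel_size=3, padding=1, groups=d)  # 3x3 depth-wise
        self.project = nn.Conv2d(d, d, kernel_size=1)       # 1x1: project back

    def forward(self, x, y):                                # x: (B, d1, H, W), y: (B, d2, H, W)
        z = torch.cat([x, y], dim=1)                        # concatenate the two branches
        z = self.norm(z)
        z = self.mix_channels(z)
        z = self.mix_spatial(z)
        return self.project(z)

x, y = torch.randn(1, 16, 14, 14), torch.randn(1, 48, 14, 14)
print(Interaction(16, 48)(x, y).shape)                      # torch.Size([1, 64, 14, 14])
```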
C. Dynamic Memory Unit
One limitation of standard feed-forward networks is that layer \(N\) only sees the output of layer \(N-1\). It doesn’t have a persistent “scratchpad” of information from much earlier layers.
CARE introduces a Dynamic Memory Unit (\(Z\)).
As formalized in the paper’s Equation 12 (and Figure 3b), the output of a block isn’t just the features \(\mathbf{X}\); it also updates a memory state \(\mathbf{Z}\). This memory unit is passed along the network pipeline, allowing the model to preserve critical information across different layers dynamically.
For the very first block in a stage, the memory is initialized by combining features and memory carried over from the previous stage.
This creates a richer flow of information, effectively allowing deep layers to access context that might otherwise have been diluted.
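The pattern is easy to sketch in PyTorch: every block consumes and returns a memory tensor alongside the features. The gated update rule and the zero initialization below are illustrative stand-ins, not the paper’s Equation 12 or its cross-stage initialization:

```python
import torch
import torch.nn as nn

class BlockWithMemory(nn.Module):
    """Sketch of the dynamic-memory idea: each block returns both updated features X
    and an updated memory state Z, which is threaded through the whole pipeline.
    The gated blend used here is illustrative, not the paper's exact update rule."""
    def __init__(self, d: int):
        super().__init__()
        self.features = nn.Conv2d(d, d, 3, padding=1, groups=d)   # stand-in for the block body
        self.gate = nn.Conv2d(2 * d, d, kernel_size=1)

    def forward(self, x, z):
        x = self.features(x) + z                   # memory enriches the current features
        g = torch.sigmoid(self.gate(torch.cat([x, z], dim=1)))
        z = g * x + (1 - g) * z                    # memory is updated, then passed onward
        return x, z

blocks = nn.ModuleList(BlockWithMemory(64) for _ in range(4))
x = torch.randn(1, 64, 14, 14)
z = torch.zeros_like(x)                            # first block: memory initialized (here: zeros)
for blk in blocks:
    x, z = blk(x, z)
print(x.shape, z.shape)
```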
4. Experimental Results
The theory sounds solid: reduce complexity by splitting channels, then mix them back together. But does it actually work?
The authors tested CARE on ImageNet-1K (classification), COCO (object detection), and ADE20K (segmentation).
Image Classification (ImageNet-1K)
The most striking result is the trade-off between Latency (speed) and Accuracy.
In the paper’s Figure 1, the CARE models are plotted as red stars.
- The X-axis is Latency (lower is better, moving left).
- The Y-axis is Accuracy (higher is better, moving up).
You can see that the CARE models (red stars) form a “Pareto frontier” in the top-left corner. They are consistently more accurate than other models (like MobileViT, EdgeViT, and MobileOne) at the same speed, or significantly faster at the same accuracy.
Let’s highlight a comparison from Table 1:
- CARE-S2 achieves 82.1% Top-1 Accuracy with only 1.9 GMACs.
- Compare this to MLLA-T (a previous linear attention model). MLLA-T hits 83.5% accuracy but requires 4.2 GMACs—more than double the computational cost.
- Compare it to EdgeViT-S, which uses 1.9 GMACs (the same cost). EdgeViT-S only reaches 81.0% accuracy, so CARE beats it by more than a full percentage point.
On an iPhone 13, CARE-S2 runs in 2.0ms, while MLLA-T takes 5.1ms. That is a massive difference for real-time mobile applications.
Object Detection and Semantic Segmentation
A good backbone must do more than just classify images; it needs to support dense prediction tasks like finding objects.

In the paper’s Table 2, we see similar trends:
- Object Detection: CARE-S1 achieves a box AP (Average Precision) of 41.5 with only 5.4 GMACs. It rivals Swin-T (42.2 AP), yet Swin-T is a heavyweight backbone requiring 24.2 GMACs, making CARE roughly 4.5x more efficient.
- Semantic Segmentation (ADE20K): CARE-S2 achieves 43.5 mIoU (mean Intersection over Union), which is extremely competitive with much larger models, while running significantly faster on GPUs (RTX 4090).
Why does it work? (Ablation Studies)
The authors performed “Ablation Studies”—removing parts of the model to see what breaks.
Is Asymmetrical Decoupling necessary? Yes.
- “w/ Sym” (Symmetrical): If you split channels 50/50, the model costs more compute (2.0 GMACs vs. 1.9 GMACs) with no accuracy gain.
- “w/ Sta” (Stacked): If you use the old stacked approach, accuracy drops to 81.4% and latency increases.
- “w/o Local”: If you remove the local bias entirely, accuracy plummets to 77.3%.
Is Dual Interaction necessary?
- Removing the second interaction block (`Inter2`) causes accuracy to drop from 78.4% to 76.5%.
- Removing the memory unit (`Mem`) drops accuracy significantly, proving that carrying information forward is crucial.
5. Conclusion and Implications
The CARE Transformer represents a significant step forward for mobile computer vision. By identifying that we don’t need to treat every feature channel equally, the authors created a “Divide and Conquer” strategy that fits perfectly with the constraints of mobile hardware.
Key Takeaways:
- Split the Load: You can send a small portion of data to a global attention mechanism and the rest to a local convolution without losing accuracy.
- Interaction is Key: Splitting is only effective if you have a robust mechanism (Dual Interaction) to mix the information back together.
- Memory Matters: Keeping a dynamic memory state allows lightweight models to “punch above their weight” by retaining context.
For students and practitioners, this paper illustrates an important lesson in architecture design: Efficiency often comes from specialized processing. Rather than throwing more compute at a problem, intelligent routing of information—deciding what needs global context and what needs local texture analysis—can yield better results at a fraction of the cost.
With models like CARE, the possibility of running high-fidelity, real-time visual understanding on your smartphone (without draining the battery in minutes) is becoming a reality.