If you have been following the trajectory of deep learning over the last decade, you are likely familiar with “The Bitter Lesson” by Rich Sutton. The core argument is simple: historically, the only technique that consistently matters in the long run is scaling computation. Human ingenuity—hand-crafting features or encoding domain knowledge—eventually gets crushed by massive models trained on massive compute.
But what if we could use human knowledge to improve how we scale computation?
This is the central question behind the research paper “Flopping for FLOPs: Leveraging Equivariance for Computational Efficiency.” The authors challenge the idea that incorporating geometric priors (like symmetry) effectively “taxes” the computational budget. Instead, they propose a method to build neural networks that respect symmetry (specifically mirror symmetry) while cutting the number of floating-point operations (FLOPs) per parameter in half.
In this post, we will dive deep into how this works. We will explore the mathematics of equivariance, how to block-diagonalize linear layers, and how to adapt modern architectures like Vision Transformers (ViT) and ConvNeXts to run faster by simply understanding that a butterfly looks the same when reflected in a mirror.
The Problem with Symmetry and Scale
In computer vision, we know that objects generally retain their identity regardless of orientation. A cat facing left is the same class as a cat facing right.

Standard neural networks, like a vanilla ResNet or ViT, don’t inherently “know” this. They have to learn it by seeing thousands of left-facing cats and thousands of right-facing cats. This is inefficient in terms of parameters: the weights need to store information for both orientations.
To solve this, researchers developed Equivariant Neural Networks. These networks hard-code symmetry into the architecture. They guarantee that if you transform the input (e.g., rotate or flip it), the internal features transform in a predictable way.
Historically, there has been a catch. While equivariant networks are parameter efficient (they need fewer weights to learn the same task), they are often compute heavy. In many implementations, enforcing symmetry requires weight sharing that results in dense computations. You might save memory on parameters, but you pay for it in FLOPs and wall-clock time.
The researchers behind “Flopping for FLOPs” asked: Can we design an equivariant network that actually reduces the computational burden?
The Core Solution: Flopping and Block Diagonals
The specific symmetry this paper tackles is horizontal mirroring, which the authors charmingly refer to as “flopping.”
The intuition is to stop thinking about features as generic numbers and start thinking about them as having geometric “types.” When you look at a feature map in a neural network, the authors propose splitting the channels into two distinct groups:
- Invariant Features (\(x_1\)): These features remain unchanged when the input image is flopped.
- (-1)-Equivariant Features (\(x_{-1}\)): These features flip their sign (multiply by -1) when the input image is flopped.
This seemingly simple distinction has profound implications for the linear layers (the dense matrix multiplications) that make up the bulk of computation in Transformers and MLPs.
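Concretely, the “flop action” on a split feature vector is just this (a toy illustration in PyTorch, not code from the paper):

```python
import torch

def flop(x_inv: torch.Tensor, x_neg: torch.Tensor):
    """How a horizontal flip of the input acts on the two feature types:
    invariant channels are untouched, (-1)-equivariant channels change sign."""
    return x_inv, -x_neg
```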
The Math of Efficiency
Let’s look at a standard linear layer. Usually, you have an input vector \(x\) and a weight matrix \(W\) to produce an output \(y\). If \(W\) is a dense matrix, every input connects to every output.
However, if we enforce equivariance, we are establishing rules. We want the invariant outputs (\(y_1\)) to depend only on interactions that preserve invariance.
If an invariant feature interacts with a (-1)-equivariant feature, the result is (-1)-equivariant (mathematically analogous to \(positive \times negative = negative\)). Therefore, a (-1)-equivariant input can only contribute to an invariant output through a weight that is zero.
This restriction forces the weight matrix to become block-diagonal.
\[ \begin{pmatrix} y_1 \\ y_{-1} \end{pmatrix} = \begin{pmatrix} W_{1,1} & 0 \\ 0 & W_{-1,-1} \end{pmatrix} \begin{pmatrix} x_1 \\ x_{-1} \end{pmatrix} \]
In the equation above:
- \(W_{1,1}\) maps invariant inputs to invariant outputs.
- \(W_{-1,-1}\) maps (-1)-equivariant inputs to (-1)-equivariant outputs.
- The off-diagonal blocks (which would map invariant to equivariant or vice versa) must be zero.
Why This Saves Compute
This is where the magic happens. Applying a dense \(N \times N\) weight matrix to a feature vector costs roughly \(N^2\) multiply-add operations.
By splitting the features evenly (half invariant, half equivariant) and forcing the off-diagonals to zero, we perform two smaller matrix multiplications of size \((N/2) \times (N/2)\).
\[ 2 \times \left( \frac{N}{2} \right)^2 = 2 \times \frac{N^2}{4} = \frac{N^2}{2} \]

We have just cut the number of FLOPs in half.
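To make this concrete, here is a minimal PyTorch sketch of a flop-equivariant linear layer built from two half-size weight matrices. The class name and layout are my own and may differ from the paper’s actual implementation.

```python
import torch
import torch.nn as nn

class FlopEquivariantLinear(nn.Module):
    """Block-diagonal linear layer, a sketch of the structure described above.

    The channel dimension is split in half: the first half holds invariant
    features, the second half holds (-1)-equivariant features. Each half gets
    its own (N/2 x N/2) weight matrix; the cross terms are structurally zero.
    """

    def __init__(self, dim: int):
        super().__init__()
        assert dim % 2 == 0, "channels are split evenly into the two types"
        half = dim // 2
        self.w_inv = nn.Linear(half, half)              # W_{1,1}, bias allowed
        self.w_neg = nn.Linear(half, half, bias=False)  # W_{-1,-1}; a bias would break the sign flip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_inv, x_neg = x.chunk(2, dim=-1)
        return torch.cat([self.w_inv(x_inv), self.w_neg(x_neg)], dim=-1)

# Equivariance check: flopping the input (negating the second half)
# negates exactly the second half of the output.
layer = FlopEquivariantLinear(8)
x = torch.randn(8)
x_flopped = torch.cat([x[:4], -x[4:]])
y, y_flopped = layer(x), layer(x_flopped)
assert torch.allclose(y_flopped, torch.cat([y[:4], -y[4:]]), atol=1e-6)

# FLOP count per token: two (N/2)^2 matmuls instead of one N^2 matmul.
N = 1024
print(N**2, "vs", 2 * (N // 2) ** 2)  # 1048576 vs 524288 -> a 2x reduction
```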
This result stems from Schur’s Lemma, a fundamental theorem in representation theory. It essentially says that equivariant linear maps between non-isomorphic irreducible representations (irreps) must be zero. By parametrizing the network in terms of these irreps (symmetric and antisymmetric features), the computational savings come naturally.
Building the Architecture
How do we implement this in practice? We can’t just apply this logic to a black box; we need to adapt the specific layers of modern vision architectures.

The schematic above illustrates the “Geometric Deep Learning Blueprint.” We start with an equivariant embedding, process features through equivariant blocks, and finally collapse them into invariant features for the classifier (since the final label “Cat” shouldn’t flip).
1. The Patch Embedding
Everything starts with the input image. To enter our equivariant pipeline, we need to convert the raw RGB pixels into our two feature types: invariant and (-1)-equivariant.
We do this using a modified Patch Embedding layer.

As shown in Figure 2, the authors enforce constraints on the convolutional filters:
- Symmetric Filters: Produce invariant features.
- Antisymmetric Filters: Produce (-1)-equivariant features (they flip sign when the underlying patch is flipped).
This creates the initial split of the feature stream.
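Below is a minimal sketch of how such a constrained patch embedding could look in PyTorch, with the filters (anti)symmetrized on the fly from unconstrained parameters. The module name, sizes, and initialization are illustrative, not the paper’s code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FloppingPatchEmbed(nn.Module):
    """Sketch of a patch embedding whose output channels split into the two types.

    Half of the filters are constrained to be mirror-symmetric (-> invariant
    channels), the other half mirror-antisymmetric (-> (-1)-equivariant channels).
    Under a horizontal flip of the image, the token grid is mirrored, the first
    half of the channels keep their values, and the second half flip sign.
    """

    def __init__(self, in_ch: int = 3, dim: int = 192, patch: int = 16):
        super().__init__()
        assert dim % 2 == 0
        self.patch = patch
        self.half = dim // 2
        self.weight = nn.Parameter(torch.randn(dim, in_ch, patch, patch) * 0.02)
        self.bias = nn.Parameter(torch.zeros(self.half))  # bias only on invariant channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        w_flip = torch.flip(w, dims=[-1])                          # mirror each kernel horizontally
        w_sym = 0.5 * (w[: self.half] + w_flip[: self.half])       # symmetric filters
        w_anti = 0.5 * (w[self.half :] - w_flip[self.half :])      # antisymmetric filters
        bias = torch.cat([self.bias, torch.zeros_like(self.bias)])
        return F.conv2d(x, torch.cat([w_sym, w_anti], dim=0), bias, stride=self.patch)

# tokens = FloppingPatchEmbed()(torch.randn(1, 3, 224, 224))  # -> (1, 192, 14, 14)
```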
2. Non-Linearities (GELU)
Linear layers are easy to block-diagonalize, but neural networks need non-linearities like GELU or ReLU to function. How do you apply a non-linearity to a feature that is supposed to flip signs? If you apply ReLU directly to a (-1)-equivariant feature, the output no longer flips sign when the input does (\(\mathrm{ReLU}(-x) \neq -\mathrm{ReLU}(x)\)), destroying the symmetry information.
The authors solve this by transforming the features. They temporarily mix the invariant (\(x_1\)) and equivariant (\(x_{-1}\)) parts to move into a “spatial” domain (\(s\) and \(t\)), apply the non-linearity there, and then transform back.

While this looks slightly more complex than a standard GELU(x), the computational cost is negligible compared to the savings gained in the linear layers.
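Here is one way the change of basis can be written down, as a minimal sketch assuming the standard \(s = x_1 + x_{-1}\), \(t = x_1 - x_{-1}\) basis change; the exact normalization constants are my choice and may differ from the paper’s.

```python
import torch
import torch.nn.functional as F

def flop_equivariant_gelu(x_inv: torch.Tensor, x_neg: torch.Tensor):
    """Apply GELU through a change of basis.

    s and t swap places under a flop, so applying the same pointwise
    non-linearity to both preserves equivariance; the inverse transform
    recovers the invariant / (-1)-equivariant split.
    """
    s = x_inv + x_neg
    t = x_inv - x_neg
    s, t = F.gelu(s), F.gelu(t)
    return 0.5 * (s + t), 0.5 * (s - t)

# Equivariance check: flopping the input flips only the second output.
x_inv, x_neg = torch.randn(8), torch.randn(8)
y_inv, y_neg = flop_equivariant_gelu(x_inv, x_neg)
z_inv, z_neg = flop_equivariant_gelu(x_inv, -x_neg)
assert torch.allclose(z_inv, y_inv) and torch.allclose(z_neg, -y_neg)
```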
3. Attention
For Vision Transformers (ViT), the Self-Attention mechanism is critical. The authors adapt this by splitting Queries (\(q\)) and Keys (\(k\)) into their symmetric and antisymmetric parts.
The dot product (which determines attention scores) becomes:
\[ a_{ij} = q_{i,1} \cdot k_{j,1} + q_{i,-1} \cdot k_{j,-1} \;\overset{\text{flop}}{\longrightarrow}\; q_{i,1} \cdot k_{j,1} + (-q_{i,-1}) \cdot (-k_{j,-1}) = a_{ij} \]
Notice the term \((-q_{i,-1}) \cdot (-k_{j,-1})\). Because both terms flip signs, the negatives cancel out, and the resulting attention score \(a_{ij}\) remains invariant. This ensures that the attention map describes the relationship between patches regardless of the image orientation.
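A quick numerical check of this cancellation (shapes and function names are illustrative; the softmax and value projection are omitted):

```python
import torch

def flop_invariant_attention_logits(q_inv, q_neg, k_inv, k_neg, scale=1.0):
    """Attention logits from split queries/keys: each type is contracted with
    its own counterpart, so a flop (which negates q_neg and k_neg) leaves the
    logits unchanged. (Under a flop, the token grid itself is also mirrored,
    which simply permutes rows and columns of the attention map.)"""
    return scale * (q_inv @ k_inv.transpose(-1, -2) + q_neg @ k_neg.transpose(-1, -2))

# Invariance check: negating the (-1)-equivariant parts does not change the scores.
q_inv, q_neg = torch.randn(4, 8), torch.randn(4, 8)   # 4 tokens, 8 dims per type
k_inv, k_neg = torch.randn(4, 8), torch.randn(4, 8)
a = flop_invariant_attention_logits(q_inv, q_neg, k_inv, k_neg)
a_flopped = flop_invariant_attention_logits(q_inv, -q_neg, k_inv, -k_neg)
assert torch.allclose(a, a_flopped)
```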
Adapting Modern Backbones
The authors didn’t just propose a theoretical module; they adapted three of the most popular computer vision architectures: ResMLP, ViT, and ConvNeXt.
ResMLP
ResMLP is an architecture that consists almost entirely of linear layers (MLPs). Since the authors’ method cuts linear layer compute by 50%, ResMLP benefits massively.
The residual block structure is modified to respect the feature splits:

Because ResMLP is so dense-computation heavy, the “Flopping” technique results in the most dramatic speedups here.
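As a sketch of what “respecting the feature split” looks like inside a block, here is an illustrative channel-mixing MLP with a residual connection. The token-mixing layer, affine normalizations, and other ResMLP details are omitted, and the names are mine rather than the paper’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FloppingChannelMLP(nn.Module):
    """Channel-mixing MLP with the invariant / (-1)-equivariant split
    respected end to end: block-diagonal linears plus the basis-change GELU."""

    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        half, hidden = dim // 2, dim * hidden_mult // 2
        self.up_inv = nn.Linear(half, hidden)               # W_{1,1} of the first layer
        self.up_neg = nn.Linear(half, hidden, bias=False)   # W_{-1,-1}, no bias
        self.down_inv = nn.Linear(hidden, half)
        self.down_neg = nn.Linear(hidden, half, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_inv, x_neg = x.chunk(2, dim=-1)
        h_inv, h_neg = self.up_inv(x_inv), self.up_neg(x_neg)
        # GELU through the change of basis described earlier.
        s, t = F.gelu(h_inv + h_neg), F.gelu(h_inv - h_neg)
        h_inv, h_neg = 0.5 * (s + t), 0.5 * (s - t)
        y = torch.cat([self.down_inv(h_inv), self.down_neg(h_neg)], dim=-1)
        return x + y  # residual connection keeps the split aligned
```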
ConvNeXt
ConvNeXt is a modernized convolutional network. The heavy lifting in ConvNeXt is done by depthwise convolutions followed by \(1 \times 1\) convolutions (which are mathematically equivalent to linear layers applied at each spatial position).
The \(1 \times 1\) convolutions get the block-diagonal treatment (50% savings). For the depthwise spatial convolutions, the authors alternate between symmetric and antisymmetric kernels.

While the authors didn’t write custom GPU kernels to fully exploit the symmetry in the depthwise layers (for example, symmetry-aware Winograd-style convolutions), the standard implementation still benefits significantly from the reduction in the pointwise layers.
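For the depthwise part, here is a sketch of how mirror-symmetric and antisymmetric kernels can be parametrized from unconstrained weights. The exact alternation scheme used in the paper is not reproduced here; the module name and flag are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FloppingDepthwiseConv(nn.Module):
    """Depthwise conv with mirror-(anti)symmetrized kernels.

    A symmetric kernel preserves a channel's type (invariant stays invariant,
    (-1)-equivariant stays (-1)-equivariant); an antisymmetric kernel swaps
    the type. `antisymmetric` selects which constraint this layer uses.
    """

    def __init__(self, channels: int, kernel_size: int = 7, antisymmetric: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, 1, kernel_size, kernel_size) * 0.02)
        self.antisymmetric = antisymmetric
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_flip = torch.flip(self.weight, dims=[-1])  # horizontal mirror of each kernel
        if self.antisymmetric:
            w = 0.5 * (self.weight - w_flip)
        else:
            w = 0.5 * (self.weight + w_flip)
        return F.conv2d(x, w, padding=self.pad, groups=x.shape[1])
```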
Experiments: Does it Work?
The theoretical reduction in FLOPs is clear, but does it translate to performance? Sometimes, restricting a network to be equivariant forces it to learn “worse” features, dropping accuracy.
The authors tested their equivariant models (\(\mathcal{E}\)) against standard baselines on ImageNet-1K.
FLOPs vs. Parameters
First, let’s look at the scaling properties.

In standard networks, FLOPs and parameters usually scale together linearly. The dashed line in Figure 4 represents the Equivariant models. You can see they occupy a unique space in the efficiency landscape, maintaining the parameter count of larger models while drastically reducing the compute requirements.
Accuracy vs. Compute
The most critical chart for any efficiency paper is the trade-off between Accuracy and Compute (FLOPs).

In Figure 5, look at the plot for ConvNeXt-ISO and ResMLP (bottom). The Equivariant models (black dotted line) consistently appear to the left of the baseline models for the same accuracy. This means they achieve comparable results with significantly less computation.
For the DeiT III ViTs (top chart), the results are also compelling. The equivariant versions (\(\mathcal{E}(ViT)\)) achieve accuracy very close to the baselines but at a fraction of the FLOPs. For example, \(\mathcal{E}(ViT-L)\) (Large) runs with roughly the compute budget of a standard ViT-B (Base) but achieves higher accuracy.
Throughput (Real Speed)
FLOPs are a theoretical measure. Real-world speed (images per second) depends on memory bandwidth and GPU kernel implementation.

Figure 6 shows the “Throughput Gain.”
- ResMLP (Orange): Shows massive gains (up to 60% faster). Since ResMLP is dominated by big matrix multiplications, the block-diagonalization pays off immediately.
- ViT and ConvNeXt: The gains increase as the model size grows. For small models (Small/Tiny), the overhead of managing the feature splits slightly outweighs the math benefits. But as models scale up (Large/Huge), the matrix multiplications dominate, and the equivariant models run significantly faster than their standard counterparts.
Key Takeaways
The “Bitter Lesson” suggests we should stop trying to be clever and just scale compute. This paper offers a “sweet” counter-argument: being clever about symmetry allows us to scale compute more effectively.
Here is the summary of what “Flopping for FLOPs” achieves:
- Block-Diagonal Efficiency: By decomposing features into invariant and (-1)-equivariant types, dense linear layers are split into two smaller blocks, halving the FLOPs.
- Universal Applicability: This isn’t just for niche architectures. It works on ViTs, MLPs, and ConvNets.
- Scalability: The efficiency gains get better as the models get bigger. This makes it a promising technique for the era of Foundation Models.
- No “Symmetry Tax”: Unlike previous equivariant approaches that increased compute density, this method reduces it while maintaining comparable accuracy.
By hard-coding the knowledge that “a flipped image is still the same image,” we don’t just save the model from having to learn that fact—we save the GPU from having to compute it.