Introduction

One of the great mysteries of modern artificial intelligence is the “black box” problem. We know that deep neural networks work—often surprisingly well—but we don’t always know how they represent the data they are processing. Does a model classify a bird because it sees wings, or because it hears a song, or because it detects a specific texture in the background?

To answer these questions, researchers have developed various tools to compare the internal representations of different models. A popular and intuitive method is called Model Stitching. The logic goes like this: if you can take the first half of Model A, stitch it to the second half of Model B, and the Frankenstein-monster combination still works, then Model A and Model B must be “thinking” in similar ways. They must be functionally aligned.

It is a compelling idea. It echoes the “Platonic Representation Hypothesis”: the idea that as models get better, they all converge toward a shared, true representation of reality.

But a recent research paper, “Functional Alignment Can Mislead: Examining Model Stitching,” throws a wrench into this machinery. The authors demonstrate that functional alignment is a deceptive metric. Through a series of clever experiments, they show that you can successfully stitch together models that attend to completely different features, models that operate on different modalities (audio vs. visual), and even representations of pure random noise.

In this deep dive, we will explore why functional alignment might be misleading us, and why a model that “fits” perfectly might be seeing a completely different world.

The Basics: What is Model Stitching?

Before we dismantle the concept, we need to understand how it works. Model stitching is a technique used to measure the similarity between two neural networks, let’s call them Model A (the Sender) and Model B (the Receiver).

Imagine you have two factories. Factory A builds cars, and Factory B builds cars. You want to know if their assembly lines differ. You decide to take a half-finished chassis from the middle of Factory A and shove it into the middle of Factory B’s line. If Factory B can finish the car successfully, you assume the two factories are doing roughly the same things up to that point.

In deep learning terms, we take the representations (activations) from an intermediate layer of the Sender, pass them through a simple linear transformation (the “stitch”), and feed them into the Receiver.

Diagram illustrating the stitching process between a Colour model and a Digit model.

As shown in Figure B.1 above, the process involves freezing the weights of the Sender and Receiver. We only train the “Stitch” layer (usually a \(1 \times 1\) convolution). If the stitched model achieves high accuracy, the representations are considered “compatible.”
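
To make this concrete, here is a minimal PyTorch sketch of the setup, assuming two torchvision ResNet-18s split after the second residual block (the paper’s exact architectures and split points may differ). Everything except the \(1 \times 1\) convolutional stitch stays frozen.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical setup: two ResNet-18s, split after the second residual block.
sender = resnet18(num_classes=10)    # stand-in for a model trained on the Colour task
receiver = resnet18(num_classes=10)  # stand-in for a model trained on the Digit task

# First half of the Sender: stem plus residual blocks 1-2 (128-channel output).
sender_half = nn.Sequential(
    sender.conv1, sender.bn1, sender.relu, sender.maxpool,
    sender.layer1, sender.layer2,
)
# Second half of the Receiver: residual blocks 3-4 plus the classifier head.
receiver_half = nn.Sequential(
    receiver.layer3, receiver.layer4,
    receiver.avgpool, nn.Flatten(), receiver.fc,
)

# The stitch: a single trainable 1x1 convolution joining the two halves.
stitch = nn.Conv2d(128, 128, kernel_size=1)

# Freeze Sender and Receiver; only the stitch layer is trained.
for p in list(sender_half.parameters()) + list(receiver_half.parameters()):
    p.requires_grad = False
sender_half.eval()
receiver_half.eval()  # also keeps BatchNorm statistics fixed

def stitched_forward(x):
    return receiver_half(stitch(sender_half(x)))

optimizer = torch.optim.Adam(stitch.parameters(), lr=1e-3)
```

Note that a \(1 \times 1\) convolution is just a per-location linear map across channels, so the stitch can only linearly re-mix the Sender’s features; it cannot compute new ones.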

The prevailing wisdom has been that if two models are compatible, they must capture similar semantic information. This paper challenges that assumption entirely.

Experiment 1: The Bias Trap

To prove that stitching can mislead us, the researchers first needed a controlled environment where they knew exactly what each model was looking for. They turned to a modified version of the classic MNIST dataset (handwritten digits), but with a twist: Colour.

They created several variations of the dataset to force models to learn specific “shortcuts” or biases:

  1. Correlated (Colour MNIST): The digit and the background color are perfectly correlated. For example, a “0” is always on a red background, a “1” is always on green, etc. A model could solve this by looking at the number or the color.
  2. Digit (Uncorrelated Colour): The background colors are random. The model must look at the shape of the number to solve the task.
  3. Colour (Uncorrelated Digits): The digits are random, but the target label matches the background color. The model must look at the color and ignore the shape.
  4. Colour-Only: Just patches of color. No shapes at all.

Colour swatches for the base colour_map. Digits are font-based labels here, not MNIST images.

The goal was to train different models on these different datasets and then see if they could be stitched together.
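
As a rough illustration of how such biased variants could be constructed, here is a minimal NumPy sketch. The colour palette, the compositing rule (white digit strokes on a coloured background), and the variant names are assumptions for illustration, not the paper’s exact recipe.

```python
import numpy as np

# Hypothetical 10-colour palette: one RGB background colour per class label.
COLOUR_MAP = np.array([
    [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [255, 0, 255],
    [0, 255, 255], [255, 128, 0], [128, 0, 255], [0, 128, 128], [128, 128, 0],
], dtype=np.uint8)

def colourise(images, digit_labels, variant, rng):
    """images: (N, 28, 28) uint8 MNIST digits; returns RGB images and labels."""
    n = len(images)
    if variant == "correlated":        # background colour index == digit label
        colour_idx, labels = digit_labels, digit_labels
    elif variant == "digit":           # random colours; the label is the digit
        colour_idx, labels = rng.integers(0, 10, n), digit_labels
    elif variant == "colour":          # digits are irrelevant; the label is the colour
        colour_idx = rng.integers(0, 10, n)
        labels = colour_idx
    elif variant == "colour_only":     # no digit strokes at all, just a coloured patch
        images = np.zeros_like(images)
        colour_idx = rng.integers(0, 10, n)
        labels = colour_idx
    background = COLOUR_MAP[colour_idx][:, None, None, :].astype(np.float32)  # (N,1,1,3)
    strokes = images[..., None].astype(np.float32) / 255.0                    # (N,28,28,1)
    rgb = (1.0 - strokes) * background + strokes * 255.0   # white digit on coloured ground
    return rgb.astype(np.uint8), np.asarray(labels)
```

Training the same architecture on the “digit” and “colour” variants then produces two models that solve their tasks using entirely different features.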

The Hypothesis vs. The Result

Intuitively, if you take a model trained only on Colour (which has learned to ignore shapes) and stitch it into a model trained only on Digits (which looks for shapes), the stitch should fail. The Sender is speaking the language of “Red and Green,” while the Receiver is listening for “Curves and Lines.”

However, the results contradicted this intuition.

Test accuracy and rank analysis for various stitching combinations.

Look at the graph in Figure 2b above. The researchers stitched various “Sender” models into a “Digit” Receiver.

  • The Blue Line: This is the Correlated model stitched in. It works perfectly.
  • The Green Line: This is the Colour-only model stitched into the Digit receiver.

Despite the fact that the Colour model knows nothing about shapes, and the Digit model expects shapes, the stitched model achieves near-perfect accuracy (close to 100%).

Why does this matter?

This proves that stitching cannot distinguish between different biases. You can have two models that use fundamentally different rules to make decisions—one looking at the background paint, the other looking at the written ink—and the stitching metric will tell you they are “aligned” and “similar.”

If we relied on stitching to verify if a model was safe or unbiased, we would be in trouble. A model relying on a dangerous racial or gender bias (a shortcut) could appear perfectly aligned with a model that relies on legitimate features.

Experiment 2: Stitching “Nothing”

A skeptic might argue: “Maybe the color model secretly learned some shape features?” or “Maybe the shape model secretly learned some color features?”

To rule this out, the researchers took the experiment to the extreme. They removed the “real” data entirely. They created a dataset called Clustered-Noise.

These weren’t images of dogs or numbers; they were pure static. Specifically, the researchers generated random vectors of noise clustered around specific points in feature space. If the “Sender” sends this noise to the “Receiver,” surely the Receiver will fail? The Receiver is a sophisticated neural network trained to recognize handwritten digits, not static.

Simulated Clustered-Noise data sample for ResNet-18 at Res Block 2.

The image above shows what this “data” looks like to the network—meaningless noise.
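
Here is a minimal sketch of what “clustered noise” could look like, assuming Gaussian blobs around randomly drawn centres; the dimensionality, radius, and cluster count are illustrative choices, not the paper’s.

```python
import numpy as np

def clustered_noise(n_classes=10, samples_per_class=1000, dim=512,
                    radius=0.1, seed=0):
    """Random vectors grouped into blobs: one Gaussian cluster per 'class'.

    The vectors carry no semantic content; they are just noise scattered
    around randomly chosen centres in feature space.
    """
    rng = np.random.default_rng(seed)
    centres = rng.normal(size=(n_classes, dim))   # one random centre per class
    data, labels = [], []
    for c in range(n_classes):
        data.append(centres[c] + radius * rng.normal(size=(samples_per_class, dim)))
        labels.append(np.full(samples_per_class, c))
    return np.concatenate(data), np.concatenate(labels)
```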

And yet, it stitched.

The researchers found they could stitch these representations of clustered random noise into a trained network and achieve high accuracy. This is the “emperor has no clothes” moment for functional alignment. If a representation of random noise is considered “similar” to a representation of handwritten digits, the metric of “similarity” effectively loses its meaning regarding semantic content.

What the stitch is actually doing is not “translating meaning.” It is simply mapping clusters. As long as the Sender separates the data into distinct blobs (even if those blobs are just noise types), and the Receiver expects distinct blobs, the linear stitch can learn to map Blob A to Blob B.

Scaling Up: Birds, Dogs, and Autoencoders

So far, we’ve discussed toy problems with simple digits. Does this hold up in the real world of massive, complex models?

The researchers scaled up their experiments to ResNet-50 models and complex datasets like ImageNet (identifying objects in photos) and spectrograms (visual representations of audio) for birdsong classification.

They attempted to stitch:

  1. ImageNet Models (Visual identification of objects).
  2. Birdsong Models (Audio identification of birds).
  3. Stylized ImageNet (Paintings/sketches, forcing shape bias).

Examples of inputs from the four different tasks considered in the paper.

The figure above visualizes the vast differences in these domains. A spectrogram of birdsong looks nothing like a photograph of a dog.

The Real-World Results

Despite these distinct modalities, the stitching was often successful.

Table showing results of stitching various models into and from ImageNet.

In the table above, look at the row “10-class ImageNet to Birdsong.” When stitching an ImageNet model (Sender) into a Birdsong classifier (Receiver), they achieved 88.4% accuracy with a linear stitch layer.

This is profound. The ImageNet model is processing pixels of dogs and trucks. The Birdsong model is built to process audio frequencies. Yet, they are “functionally aligned.” This implies that the internal geometry of how they separate classes is mathematically similar enough to be mapped, even though the content is entirely unrelated.

Generative Models (Autoencoders)

Finally, the authors looked at generative models—autoencoders designed to compress an image and then recreate it. They trained one autoencoder on Fashion-MNIST (clothing) and one on MNIST (digits).

They stitched the encoder of the clothing model to the decoder of the digit model.
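
As a rough sketch, assuming simple fully-connected autoencoders rather than the paper’s exact architecture, the stitched generative pipeline looks something like this:

```python
import torch.nn as nn

LATENT_DIM = 32  # illustrative bottleneck size

def make_autoencoder():
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                            nn.Linear(256, LATENT_DIM))
    decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                            nn.Linear(256, 784), nn.Sigmoid())
    return encoder, decoder

# Assume one autoencoder was trained on Fashion-MNIST and one on MNIST.
fashion_encoder, _ = make_autoencoder()
_, mnist_decoder = make_autoencoder()

# The stitch is a small trainable map between the two latent spaces.
stitch = nn.Linear(LATENT_DIM, LATENT_DIM)

# Freeze the pretrained halves; only the stitch is optimised.
for p in list(fashion_encoder.parameters()) + list(mnist_decoder.parameters()):
    p.requires_grad = False

def stitched_reconstruction(clothing_batch):        # (N, 1, 28, 28) Fashion-MNIST
    z = stitch(fashion_encoder(clothing_batch))     # translate latent codes
    return mnist_decoder(z).view(-1, 1, 28, 28)     # decodes as a handwritten digit
```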

Examples of reconstructions when stitching autoencoders trained on different datasets.

The result (Figure 4) is fascinatingly weird. The model takes an input of a shirt or a shoe (left side) and reconstructs a handwritten number. It successfully maps the “concept” of a specific shoe to the “concept” of a specific number.

While this allows for cool tricks, it confirms the paper’s thesis: the representations are not semantically the same. A shoe is not a number. But because the topology (the shape of the data clusters) is similar, they can be aligned.

Discussion: The “Cluster” Hypothesis

If these models aren’t learning the same things, why does stitching work so well?

The authors propose that functional alignment is largely checking for linear separability and clustering.

Deep learning models are essentially machines that pull data apart. They take a messy cloud of data and stretch/twist it until all the “Dogs” are in one corner and all the “Cats” are in another.

  • Model A separates “Dogs” and “Cats.”
  • Model B separates “Bird Songs” and “Cricket Chirps.”

If Model A creates two clean clusters, and Model B creates two clean clusters, a simple linear layer (the stitch) can easily map the “Dog” cluster to the “Bird Song” cluster.

This doesn’t mean dogs are bird songs. It just means both models are good at organizing their respective data into piles.
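
This intuition is easy to sanity-check with a toy experiment (not from the paper): build two unrelated sets of well-separated clusters, fit a single linear map from one to the other by least squares, and see how well the class structure carries over. All sizes below are illustrative.

```python
import numpy as np

n_classes, per_class, dim, radius = 10, 500, 64, 0.1

def blobs(seed):
    rng = np.random.default_rng(seed)
    centres = rng.normal(size=(n_classes, dim))
    X = np.concatenate([centres[c] + radius * rng.normal(size=(per_class, dim))
                        for c in range(n_classes)])
    y = np.repeat(np.arange(n_classes), per_class)
    return X, y, centres

# "Sender" clusters and "Receiver" clusters share nothing except class indices.
X_send, y, _ = blobs(seed=1)
_, _, recv_centres = blobs(seed=2)
targets = recv_centres[y]                 # ideal Receiver-side feature for each sample

# Fit the linear "stitch" by least squares: map Sender features onto Receiver features.
W, *_ = np.linalg.lstsq(X_send, targets, rcond=None)
mapped = X_send @ W

# Classify by nearest Receiver centre; well-separated blobs map almost perfectly.
dists = ((mapped[:, None, :] - recv_centres[None, :, :]) ** 2).sum(axis=-1)
print("stitch accuracy:", (dists.argmin(axis=1) == y).mean())   # close to 1.0
```

Make the blobs overlap by increasing the radius, and the mapped accuracy falls.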

To prove this, the researchers revisited the Clustered-Noise experiment. They increased the radius of the noise clusters, making them fuzzier and less distinct.

Table showing that varying the radius of noise clusters affects accuracy.

As shown in the table above, as the radius of the noise increases (making the clusters less separable), the stitching accuracy drops. This supports the theory: stitching measures how well-separated the data is, not what the data is.

Conclusion

The paper “Functional Alignment Can Mislead” serves as a crucial sanity check for the deep learning community. As we strive to understand large models (LLMs, Vision Transformers), we are desperate for metrics that tell us if models are “aligned” with human values or with each other.

The key takeaways are:

  1. Stitching measures compatibility, not semantic similarity. A successful stitch does not mean two models represent the world in the same way.
  2. Shortcuts are invisible to stitching. A model cheating by using background colors can look identical to a model actually recognizing shapes.
  3. Separability is King. Functional alignment seems to primarily detect if data is well-clustered, regardless of what that data represents (even if it’s random noise).

This doesn’t mean model stitching is useless. It is a powerful tool for modularity and transfer learning. However, interpreting it as a measure of “shared understanding” or “semantic equivalence” is dangerous. Two models can arrive at the same answer (high accuracy) using completely different, and potentially incompatible, logic. Just because the pieces fit doesn’t mean they belong to the same puzzle.