Pre-trained language models like BERT are the powerhouses of modern Natural Language Processing (NLP). They underpin state-of-the-art systems for question answering, text classification, and many other language tasks. But for all their power, they remain mysterious. We know they learn from vast amounts of text, but what do they actually learn about language? Do they capture syntax, semantics, and grammar in a form we can recognize?

To answer these questions, researchers use a technique called probing. The idea is simple: pick a specific linguistic property, like part-of-speech tagging, and see if you can predict it using the model’s internal representations. The standard approach is to freeze the language model and train a small classifier—usually a Multi-Layer Perceptron (MLP)—on top of its hidden states. If this “probe” performs well, we conclude that the model’s representations contain information about that linguistic property.
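
As a concrete illustration of this standard setup, here is a minimal sketch using PyTorch and Hugging Face Transformers; the hidden size, tag-set size, and layer choice are illustrative assumptions, not details from the paper.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Load and freeze the pre-trained model: the probe never updates it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False

# A small MLP probe mapping hidden states to (for example) POS tags.
num_tags = 17  # illustrative tag-set size
probe = nn.Sequential(
    nn.Linear(bert.config.hidden_size, 256),  # 256 is an arbitrary choice
    nn.ReLU(),
    nn.Linear(256, num_tags),
)

def probe_logits(sentence: str) -> torch.Tensor:
    """Return per-token tag logits; only `probe` has trainable parameters."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():                      # representations stay fixed
        hidden = bert(**inputs).last_hidden_state
    return probe(hidden)

# During training, only probe.parameters() would be passed to the optimizer.
```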

But this approach has a nagging problem. How do we know the probe is just finding pre-existing knowledge and not learning the task from scratch? A powerful MLP probe might be smart enough to solve the task on its own, telling us little about the underlying model. This creates a tricky trade-off: a simple probe might be too weak to find complex patterns, while a complex probe might be too sophisticated to be faithful.

A 2021 paper, “Low-Complexity Probing via Finding Subnetworks,” proposes a beautifully simple and effective solution. Instead of adding a new model on top, what if we could find a small piece of the original model that can already perform the task? Rather than building something on, the authors subtract. This article dives into this clever subnetwork probing approach, explaining how it works, why it’s more reliable, and what it reveals about the inner workings of models like BERT.


The Problem with Traditional Probing

Before we get to the new method, let’s first pin down the goal. A good probe should be faithful to the model we’re studying. That means it should have two key traits:

  1. High Accuracy on Trained Models: It should be strong enough to detect linguistic information if the model has learned it.
  2. Low Accuracy on Random Models: It should fail spectacularly if the model has no learned structure (such as an untrained model). A probe that performs well on random weights isn’t revealing anything about the model—it’s just learning the task from scratch.

The problem is that standard MLP probes often struggle with this balance. Even a one-layer MLP can add thousands or millions of parameters, making it powerful enough to learn independently of the model’s representations. This “complexity” muddies interpretation: are we measuring the model’s stored knowledge or the probe’s ability to learn? That’s the dilemma that subnetwork probing aims to solve.


The Core Idea: Don’t Add, Subtract

The authors flip the traditional probing paradigm on its head. Instead of asking, What can we build on top of these representations? they ask, Is there a smaller circuit within the existing network that already captures this property?

They propose finding a subnetwork—a version of the original model where most of the weights are set to zero—that performs the task of interest. The probe isn’t a new classifier; it’s a mask identifying which of the original weights to keep and which to discard.

Because this approach uses only the original parameters, it’s inherently low-complexity. It can’t “invent” solutions—it must locate them within the model itself. This makes the method far more faithful: the probe can’t cheat by learning the task anew; it can only reveal knowledge that’s already encoded.
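
To make the idea concrete, here is a minimal sketch, not the authors’ code, of what a weight-level mask looks like in PyTorch: each original weight matrix is multiplied elementwise by a 0/1 mask, and no new parameters are introduced.

```python
import torch
from torch import nn

class MaskedLinear(nn.Module):
    """Wrap an existing linear layer and zero out masked weights.

    `mask` has one 0/1 entry per original weight; the forward pass uses
    only the weights the mask keeps, and nothing new is learned on top.
    """

    def __init__(self, linear: nn.Linear, mask: torch.Tensor):
        super().__init__()
        self.linear = linear                  # frozen, pre-trained weights
        self.register_buffer("mask", mask)    # 0 = discard, 1 = keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        masked_weight = self.linear.weight * self.mask
        return nn.functional.linear(x, masked_weight, self.linear.bias)
```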


How to Find a Subnetwork

Searching over every possible on/off combination of weights isn’t feasible; the number of candidate subnetworks is astronomical. The authors instead introduce a clever optimization strategy built around a continuous relaxation of the binary mask.

Let’s denote the original model’s weights as \( \phi \) and the mask as \( Z \). The subnetwork’s weights are the elementwise product \( \phi \odot Z \), where each \( Z_i \) is ideally exactly 0 or 1. Since gradients can’t flow through discrete values, the authors approximate the binary mask with the Hard Concrete distribution, a continuous, differentiable distribution that behaves like a binary switch.

For each weight \( \phi_i \), there’s a learnable parameter \( \theta_i \) that determines the probability of keeping that weight. During training, these probabilities gradually sharpen so most mask values become close to either 0 or 1.

The Hard Concrete distribution provides a continuous, differentiable approximation of a binary mask, which makes gradient-based optimization of the weight selection possible.

This soft masking technique lets the system “learn” which parts of the model matter for the given linguistic task using standard backpropagation.
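
The sketch below is a simplified rendering of the Hard Concrete trick (following Louizos et al., 2018) rather than the paper’s exact code: a nearly binary mask value is sampled from a learnable parameter \( \theta_i \), written here as log_alpha. The stretch constants and temperature are commonly used defaults and are an assumption.

```python
import math
import torch

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a differentiable, nearly binary mask value for each weight.

    log_alpha: learnable per-weight parameters controlling keep probability.
    beta:      temperature; smaller values sharpen the distribution.
    gamma, zeta: stretch limits so the final clamp actually reaches 0 and 1.
    """
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)      # uniform noise
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_stretched = s * (zeta - gamma) + gamma                   # stretch past [0, 1]
    return s_stretched.clamp(0.0, 1.0)                         # clip back to [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Probability that each mask entry is non-zero (used for the sparsity penalty)."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta))
```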

Training the mask involves two competing forces:

  1. Minimize Task Loss: The subnetwork should perform the task of interest (such as POS tagging or parsing) correctly. Because masks are sampled from the Hard Concrete distribution, the task loss is computed as an expectation over sampled subnetworks.
  2. Encourage Sparsity: To ensure the subnetwork is truly small, a regularization term \( R(\theta) \) penalizes masks that keep too many weights, pushing mask values toward zero so the probe relies on as few active weights as necessary.

By balancing these objectives, training discovers subnetworks that are both accurate and sparse. The final probe consists solely of the learned mask—no additional classifier, no new parameters.
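
Putting the two forces together, one training step might look like the sketch below. This is an illustrative reconstruction under the assumptions above: `model`, `dataloader`, and `masked_forward` are hypothetical stand-ins for the frozen network and the machinery that multiplies its weights by the sampled masks, and the sparsity weight is an arbitrary example value.

```python
import torch
import torch.nn.functional as F

# One learnable mask parameter per frozen model weight. `model`, `dataloader`,
# and `masked_forward` are hypothetical stand-ins, not the paper's code.
log_alpha = {name: torch.zeros_like(p, requires_grad=True)
             for name, p in model.named_parameters()}
optimizer = torch.optim.Adam(log_alpha.values(), lr=0.1)
lambda_sparsity = 0.1   # weight on the sparsity penalty (an assumed value)

for batch in dataloader:
    # Sample a nearly binary mask for every weight tensor.
    masks = {name: sample_hard_concrete(a) for name, a in log_alpha.items()}
    logits = masked_forward(model, masks, batch)   # original weights * mask, nothing added
    task_loss = F.cross_entropy(logits, batch["labels"])
    # Expected number of kept weights, summed over all mask parameters.
    sparsity = sum(expected_l0(a).sum() for a in log_alpha.values())
    loss = task_loss + lambda_sparsity * sparsity  # accuracy vs. sparsity trade-off
    optimizer.zero_grad()
    loss.backward()                                # gradients flow only into log_alpha
    optimizer.step()
```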


Putting Subnetwork Probing to the Test

To evaluate their method, the authors compared the subnetwork probe with a standard single-layer MLP probe on a bert-base-uncased model. They tested both methods across three fundamental NLP tasks:

  • Part-of-Speech (POS) Tagging: Assigning grammatical roles to words.
  • Dependency Parsing: Identifying syntactic relationships between words.
  • Named Entity Recognition (NER): Detecting named entities such as people, organizations, and places.

The Sanity Check: Probing Random Models

A faithful probe should succeed when the model is trained and fail when the model is random. To verify this, the researchers tested three model configurations:

  1. Pre-trained model: The standard, trained BERT.
  2. Reset encoder: Transformer layers re-initialized randomly, but word embeddings kept (word-level information only).
  3. Reset all: A completely untrained, random model.
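
The two reset baselines are straightforward to construct. The snippet below is a rough sketch using Hugging Face Transformers; the simple normal re-initialization is an assumption rather than the paper’s exact recipe.

```python
import torch
from transformers import AutoModel

def reset_module(module: torch.nn.Module) -> None:
    """Re-initialize all weights in place with a simple random scheme (assumed)."""
    for p in module.parameters():
        torch.nn.init.normal_(p, mean=0.0, std=0.02)

model = AutoModel.from_pretrained("bert-base-uncased")

# "Reset encoder": scramble the transformer layers, keep the word embeddings.
reset_module(model.encoder)

# "Reset all": scramble everything, embeddings included.
# reset_module(model)
```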

Table 1 in the paper compares subnetwork, MLP-1, and fine-tuning probes on the pre-trained and reset models (with substantially better numbers bolded): subnetwork probes score highest on the pre-trained model and drop sharply on the reset models, a sign of stronger faithfulness.

The results are compelling:

  • On pre-trained models, subnetwork probes consistently outperform MLP probes, uncovering richer linguistic information.
  • On random or reset models, subnetwork accuracy plummets, while MLP accuracy remains relatively high—meaning the MLP can partly learn the task from scratch.

This contrast shows that the subnetwork probe reflects the model’s internal knowledge rather than its own learning ability—a hallmark of a faithful probe.


The Main Event: Accuracy vs. Complexity

Beyond raw accuracy, the real test is the accuracy–complexity tradeoff. Probe complexity is measured by the number of bits required to describe the probe’s parameters. Lower complexity means a smaller, simpler probe.

For subnetworks, complexity is varied by masking weights at different granularities (entire layers, neurons, or individual weights). For MLP probes, it’s varied by changing the rank of their hidden layer.
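
For a rough sense of scale (the per-matrix accounting here is an assumption about how the coarsest masking level is counted, not a figure from the paper): if each of BERT-base’s 12 transformer layers contributes 6 main weight matrices (query, key, value, attention output, and two feed-forward projections), then a mask with one bit per matrix costs \( 12 \times 6 = 72 \) bits, whereas a mask over individual weights needs one bit per parameter, on the order of \( 10^8 \) bits before any compression.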

The results, plotted as accuracy vs. complexity curves, are striking.

Figure 1 in the paper plots accuracy against complexity for POS tagging, dependency parsing, and NER. The subnetwork probe (blue line) sits above the MLP probe (red dashed line) at every complexity level, demonstrating Pareto dominance: for any given complexity, the subnetwork probe achieves higher accuracy. For POS tagging and dependency parsing, subnetworks reach high performance with as few as 72 bits, one mask bit per BERT weight matrix, while the MLP probe needs roughly 20,000 bits to achieve similar performance.

This demonstrates that subnetwork probing isn’t only faithful—it’s efficient.


Peeking Inside: What Do the Subnetworks Reveal?

Beyond performance metrics, subnetworks offer insight into where linguistic knowledge resides inside BERT. By examining which weights remain active after pruning, we can see which layers are most vital for specific tasks.
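
As a hypothetical illustration of this kind of analysis (the `masks` dictionary, keyed by Hugging Face-style parameter names, is an assumed convention, not the paper’s code), the snippet below counts the fraction of mask entries that survive in each transformer layer.

```python
import re
from collections import defaultdict

def layer_density(masks):
    """Fraction of kept weights per transformer layer.

    masks: dict mapping parameter names (e.g. "encoder.layer.3.attention...")
           to 0/1 mask tensors; the naming convention is an assumption.
    """
    kept, total = defaultdict(int), defaultdict(int)
    for name, mask in masks.items():
        match = re.search(r"encoder\.layer\.(\d+)\.", name)
        if match is None:
            continue                       # skip embeddings, pooler, etc.
        layer = int(match.group(1))
        kept[layer] += int(mask.sum().item())
        total[layer] += mask.numel()
    return {layer: kept[layer] / total[layer] for layer in sorted(total)}
```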

Figure 2 in the paper shows the layer-wise distribution of non-zero weights for each task: the pre-trained model exhibits clear task-specific localization, while the reset models show a uniform, low level of activation.

On the right, the “reset encoder” model shows evenly distributed active weights—there’s no learned structure to uncover. But on the left, where the model is pre-trained, a clear hierarchy emerges:

  • Lower layers (Part-of-Speech): Focused on word-level syntax.
  • Middle layers (Dependencies): Capture grammatical relationships.
  • Upper-middle layers (NER): Specialize in semantic roles and named entities.

This progression mirrors the traditional NLP pipeline: from low-level syntax to mid-level parsing to high-level semantics. What’s remarkable is that the subnetwork probe finds this hierarchy automatically through pruning—no extra classifier training required.


Conclusion: A More Faithful Way to Understand Models

The “Low-Complexity Probing via Finding Subnetworks” paper marks a meaningful advance in our quest to understand deep language models. By reframing probing as a subtractive rather than additive process, it achieves three major benefits:

  • More Faithful: It captures properties encoded by the model itself, not those learned by the probe.
  • More Efficient: It achieves higher accuracy for any given complexity, Pareto-dominating standard probes.
  • More Interpretable: It exposes where specific linguistic abilities reside within the model.

As models grow larger and more opaque, methods like subnetwork probing—scalpels rather than sledgehammers—will be key to turning black-box systems into interpretable ones. They promise not just better performance insight, but a deeper understanding of what these models truly know.