Imagine trying to read a book not by recognizing letters or words, but by looking at a continuous screenshot of the pages. This is essentially how Pixel-based Language Models work. Instead of breaking text down into a vocabulary of “tokens” (like subwords or characters) as models like BERT or GPT do, these models treat text as images.

Why would we do this? The standard approach of using subwords creates a “vocabulary bottleneck.” If you want a model to cover 100 languages, you need a massive vocabulary in which every language’s subwords compete for a limited budget of embedding slots. Pixel-based models bypass this entirely: if a script can be rendered on a screen, the model can process it.

But this raises a fascinating question: Does a model trained on pictures of text actually learn language, or is it just really good at pattern matching visual shapes?

In this post, we are breaking down a paper titled “Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models.” The researchers probe the inner workings of PIXEL, a vision transformer trained on text, to see where it sits on the spectrum between a vision model and a language model.

Background: The Players

To understand the study, we need to introduce three distinct models that the researchers compared:

  1. BERT: The classic “Linguist.” It uses subword tokenization and is optimized to understand syntax and semantics.
  2. ViT-MAE: The “Artist.” This is a standard Vision Transformer (ViT) designed to process images (like photos of dogs or cars). It doesn’t know what text is; it just sees pixels.
  3. PIXEL: The “Hybrid.” This uses the architecture of the Artist (ViT-MAE) but is trained on the data of the Linguist (rendered text from Wikipedia).

The Method: “Probing” the Brain

How do you know what a neural network “knows”? You use a technique called probing.

Deep learning models process data in layers: Layer 1 works on the raw input, and Layer 12 (the final layer in these base-sized models) produces the final representation. By freezing the model and attaching a small classifier to the output of a specific layer (say, Layer 5), we can test whether that layer’s representation contains the information needed to answer a specific question.
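
To make this concrete, here is a minimal probing sketch in Python (not the paper’s code). It uses BERT via the Hugging Face transformers library and a tiny, made-up sentence-length task; the toy sentences, the length-bin labels, and the mean pooling over tokens are all illustrative assumptions.

```python
# Minimal layer-probing sketch: freeze an encoder, extract one layer's hidden
# states, and train a small linear classifier ("probe") on top of them.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()  # frozen: we never update the encoder, only the probe

def layer_features(sentences, layer):
    """Mean-pool the hidden states of one frozen layer for each sentence."""
    feats = []
    with torch.no_grad():
        for s in sentences:
            inputs = tokenizer(s, return_tensors="pt", truncation=True)
            outputs = model(**inputs)
            # hidden_states = (embedding output, layer 1, ..., layer 12)
            h = outputs.hidden_states[layer]           # (1, seq_len, hidden_dim)
            feats.append(h.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

# Toy probe: does layer 5 encode (coarse) sentence length?
sentences = [
    "Cats sleep.",
    "Dogs bark loudly.",
    "The committee postponed the final vote until late next week.",
    "Researchers carefully probed every layer of the frozen language model.",
]
labels = [0, 0, 1, 1]  # 0 = short sentence, 1 = long sentence

X = layer_features(sentences, layer=5)
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("layer 5 probe accuracy (train):", probe.score(X, labels))
```

If the probe does well, the information is linearly recoverable from that layer; sweeping the `layer` argument from 1 to 12 gives the layer-by-layer curves you will see in the figures below.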

The researchers used a suite of tasks to probe these models, ranging from simple visual checks to complex grammar tests.

Table 1: Description of probing tasks used in this study.

As shown in Table 1, the tasks are split into:

  • Surface: Simple features like sentence length.
  • Syntactic: Grammar rules, like detecting if two words have been swapped.
  • Semantic: Meaning-based tasks, like identifying odd semantic fits or tense.
  • Visual: New tasks created for this paper to see if the model recognizes specific characters purely by shape.

RQ1: How Much Language Does PIXEL Know?

The core hypothesis is that PIXEL, being a vision model, starts by seeing shapes and eventually “learns” to read as the data moves up through its layers.

To test this, the authors compared the probing performance of PIXEL, BERT, and ViT-MAE across their 12 layers.

Figure 1: Linguistic probing results for layers 1-12 of PIXEL, BERT and ViT-MAE, along with the majority baseline.

Figure 1 tells a compelling story. Let’s break it down:

  1. The Baseline (Green Line): ViT-MAE (the vision model) stays flat near the bottom. It doesn’t understand linguistics because it was never trained on language data.
  2. The Expert (Blue Line): BERT starts strong. Even at Layer 1, it knows a lot about syntax and semantics because its input (tokens) already carries linguistic information.
  3. The Learner (Orange Line): PIXEL shows a monotonic rise. In the lower layers (1-4), it performs similarly to the vision model. It is just processing visual patches. But as the data moves to higher layers, PIXEL “wakes up” linguistically. It begins to understand syntax and semantics, closing the gap with BERT.

The “Surface” Feature Anomaly

Look closely at the “Surface” graphs in Figure 1 (top left). BERT actually gets worse at predicting sentence length or word content as it goes deeper. This is normal—BERT abstracts away from surface details to focus on meaning.

PIXEL, however, struggles with Word Content (WC) in the early layers. Why? Because PIXEL doesn’t see “words.” It sees patches of \(16 \times 16\) pixels.

Figure 2: Example of “cool” being rendered differently in different contexts for PIXEL. The red lines represent patch boundaries.

Figure 2 illustrates the “Patch Problem.” In the top example, the word “cool” is split across three patches. In the bottom example, the surrounding spacing shifts “cool” so that the patch boundaries cut it in different places.

To PIXEL, the word “cool” looks visually different every time it appears depending on its alignment. The model has to spend its early layers just figuring out that these two different visual patterns represent the same word. This confirms that PIXEL starts as a visual model and transforms into a language model through its layers.
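
To see the alignment issue concretely, here is a toy Pillow sketch (my own illustration, not the paper’s text renderer). It measures where a word lands in a rendered line and reports which 16-pixel-wide patches it would occupy; the example strings and the default bitmap font are assumptions.

```python
# A toy illustration of the "patch problem": the same word occupies different
# 16-pixel-wide patches depending on what precedes it.
from PIL import Image, ImageDraw, ImageFont

PATCH = 16
FONT = ImageFont.load_default()
measure = ImageDraw.Draw(Image.new("L", (1, 1)))  # throwaway canvas for measuring

def patches_covered(prefix, word):
    """Which patch indices does `word` span when rendered after `prefix`?"""
    start = measure.textlength(prefix, font=FONT)
    end = start + measure.textlength(word, font=FONT)
    return list(range(int(start) // PATCH, int(end) // PATCH + 1))

print(patches_covered("it is ", "cool"))   # e.g. [2, 3]
print(patches_covered("it was ", "cool"))  # shifted by one character: e.g. [2, 3, 4]
```

Shifting the prefix by a single character is enough to change which patches the word falls into, and that is exactly the ambiguity PIXEL’s early layers have to resolve.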

RQ2: Does It Forget the Visuals?

If PIXEL becomes a language model in the upper layers, does it lose its ability to “see”? To test this, the researchers created visual tasks (counting characters) and even tested the models on MNIST (handwritten digits).

Figure 3: Visual probing results for layers 1-12 of PIXEL, ViT-MAE and BERT.

In Figure 3, we see the results for counting characters.

  • BERT (Blue): As expected, BERT discards this surface-level information in its deeper layers.
  • ViT-MAE (Green): Being a pure vision model, it retains high visual accuracy.
  • PIXEL (Orange): PIXEL retains much more surface-level visual information than BERT, behaving almost like ViT-MAE even in higher layers.

This suggests that while PIXEL learns language, it doesn’t “abstract away” the visual details as aggressively as BERT abstracts away the tokens.
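
For context, here is one way such a character-counting probe could be constructed (an assumed setup, not the paper’s exact pipeline): generate sentences, label each with a coarse character-count bin, then render and probe them layer by layer as in the earlier sketch.

```python
# A sketch of a character-counting probe dataset (an assumed construction):
# each example is a sentence that will later be rendered to pixels, and the
# label is a coarse character-count bin.
import random
import string

random.seed(0)

def random_sentence(n_words):
    """Build a nonsense sentence out of lowercase 'words' of varying length."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 8)))
        for _ in range(n_words)
    )

examples = []
for _ in range(1000):
    sent = random_sentence(random.randint(3, 12))
    n_chars = len(sent) - sent.count(" ")           # characters, excluding spaces
    examples.append((sent, min(n_chars // 10, 5)))  # bins: 0-9, 10-19, ..., 50+

print(examples[:3])
```

Each (sentence, bin) pair would then be rendered and probed layer by layer, exactly as in the probing sketch above.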

The MNIST Test

Can PIXEL still recognize images? The researchers fed the models handwritten digits (MNIST).

Figure 4: MNIST probing results for layers 1-12 of PIXEL and ViT-MAE.

Figure 4 shows that while PIXEL (Orange) is decent, it consistently underperforms the pure vision model (ViT-MAE). This reveals a trade-off: PIXEL’s pre-training on text has specialized it. It is no longer a general-purpose vision model; its “visual cortex” has been tuned specifically for the shapes of letters and words, making it slightly worse at general image tasks like identifying handwritten numbers.
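
As a rough illustration of this kind of probe, the sketch below extracts frozen ViT-MAE features for a handful of MNIST digits and fits a linear classifier on one layer. It uses the publicly available facebook/vit-mae-base checkpoint; disabling the MAE masking, the layer choice, and the mean pooling are my assumptions, not the paper’s setup.

```python
# A hedged sketch of an MNIST layer probe: freeze ViT-MAE, pull hidden states
# from one layer, and fit a linear classifier on them.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision import datasets
from transformers import ViTImageProcessor, ViTMAEModel

processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained(
    "facebook/vit-mae-base",
    output_hidden_states=True,
    mask_ratio=0.0,  # keep all patches visible; MAE normally masks 75% of them
)
model.eval()

mnist = datasets.MNIST(root="data", train=True, download=True)  # PIL images + labels

def layer_features(images, layer):
    """Mean-pool the hidden states of one frozen layer for each image."""
    feats = []
    with torch.no_grad():
        for img in images:
            inputs = processor(images=img.convert("RGB"), return_tensors="pt")
            out = model(**inputs)
            feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

images = [mnist[i][0] for i in range(200)]
labels = [mnist[i][1] for i in range(200)]

X = layer_features(images, layer=6)
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("layer 6 probe accuracy (train):", probe.score(X, labels))
```

Running the same recipe on PIXEL instead of ViT-MAE is what produces the gap visible in Figure 4.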

RQ3: Can We Help PIXEL Read Faster?

We established that PIXEL spends much of its early-layer capacity just working out where one word ends and the next begins (the “cool” problem).

The researchers asked: What if we make word boundaries obvious in the input image?

They experimented with “Orthographic Constraints” or rendering strategies:

  1. PIXEL-base: Standard continuous text rendering (blocks of text).
  2. PIXEL-words: The text is rendered so that a pixel patch never overlaps a word boundary. White space is added to separate words visually.

They tested “small” versions of these models to see if this helped.

Figure 5: Selected linguistic probing results for layers 1-12 of small PIXEL variants. Base models are indicated with dotted lines.

Figure 5 shows the impact of these strategies on small models.

  • PIXEL-small (Orange Dashed): Fails to learn much linguistics. It stays near the baseline.
  • PIXEL-small-words (Purple Dash-Dot): This model performs significantly better.

By forcing the visual patches to align with linguistic units (words), PIXEL-small-words overcomes the visual ambiguity early on. It behaves more like BERT in the early layers because the input is already structured. This shows that helping the model see word boundaries lets it focus on semantics much earlier in the network.
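
Concretely, here is a toy approximation of the PIXEL-words idea (my own sketch, not the paper’s renderer): each word is drawn on its own strip and padded with blank pixels up to the next multiple of the 16-pixel patch width, so that no patch ever straddles two words.

```python
# Word-boundary-aware rendering, toy version: pad each word so the next one
# always starts on a fresh 16-pixel patch boundary.
from PIL import Image, ImageDraw, ImageFont

PATCH = 16
FONT = ImageFont.load_default()
measure = ImageDraw.Draw(Image.new("L", (1, 1)))  # throwaway canvas for measuring

def render_word_aligned(text, height=16):
    """Render each word on its own strip, rounded up to a multiple of PATCH."""
    strips = []
    for word in text.split():
        w = measure.textlength(word, font=FONT)
        width = (int(w) // PATCH + 1) * PATCH        # round up, leaving blank padding
        strip = Image.new("L", (width, height), color=255)
        ImageDraw.Draw(strip).text((0, 2), word, fill=0, font=FONT)
        strips.append(strip)
    canvas = Image.new("L", (sum(s.width for s in strips), height), color=255)
    x = 0
    for strip in strips:
        canvas.paste(strip, (x, 0))
        x += strip.width
    return canvas

img = render_word_aligned("it is cool today")
print(img.width // PATCH, "patches; each word starts on a patch boundary")
```

The obvious cost is extra blank space per sentence; that is the trade-off accepted in exchange for unambiguous word boundaries.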

Conclusion: The Spectrum of Understanding

This paper provides a map of where Pixel-based models sit in the AI landscape.

  1. PIXEL is a hybrid. It starts processing like a vision model (dealing with edges and shapes) and gradually builds up linguistic abstractions (syntax and semantics) in later layers.
  2. It retains the “image.” Unlike BERT, which discards surface details to focus on meaning, PIXEL keeps the visual information alive deep into the network.
  3. Rendering matters. If we want Pixel-based models to compete with BERT, we need to think about how we present the text. Aligning pixels with word boundaries acts as a massive shortcut for the model’s learning process.

Why does this matter? Pixel-based models are the key to truly universal language models. They don’t need custom tokenizers for every language on Earth. If we can optimize them—perhaps by using the rendering tricks highlighted in this paper—we could build models that understand any language that can be written down, without ever needing a vocabulary list.

This blog post explains the findings from “Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models” by Tatariya et al. (2024).