Language is a strange thing. If you speak English, the word “Love” is a familiar sequence of four letters. If you speak Greek, “αγάπη” carries the same emotional weight but looks completely different. If you speak Chinese, “爱” is a distinct logogram.

For humans, these are just symbols we associate with meaning. For modern Artificial Intelligence, however, these differences are a massive engineering headache.

Most pre-trained language models, from BERT to LLMs like GPT, rely on a fixed “vocabulary”: a giant lookup table where every word or sub-word is assigned a specific ID number. If a model encounters a word that is not in its vocabulary (an “Out-Of-Vocabulary,” or OOV, word), it struggles. To make a model multilingual, engineers usually have to bloat this vocabulary with tens of thousands of new tokens from different languages, requiring massive datasets and computational power.

But what if models didn’t need a vocabulary list? What if they could just “look” at the text as an image and understand it, regardless of the language?

This is the premise behind MTLS (Making Texts into Linguistic Symbols), a fascinating new paper by researchers from the Hefei University of Technology. They propose a method to strip away the rigid vocabulary of models like BERT and replace it with a visual, symbolic processing system. The result? A model trained on a small amount of English data that can suddenly process Chinese, Korean, and even Coptic—languages it has never effectively “seen” before.

In this post, we will tear down the vocabulary barrier and explore how MTLS works, the mathematics behind its “SSS Embedding,” and why this might be the future of efficient multilingual AI.

The Vocabulary Bottleneck

To understand why MTLS is necessary, we first need to look at how standard Natural Language Processing (NLP) handles text.

In a traditional pipeline (like BERT’s), text is processed via Tokenization. The sentence “I love AI” might be split into ['I', 'love', 'AI']. The model looks these tokens up in a dictionary: “I” becomes ID 1045, “love” becomes ID 2293, and so on. These IDs map to vectors (embeddings) that the model processes.
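To make the lookup concrete, here is a toy sketch; the vocabulary and IDs are illustrative rather than BERT’s actual tokenizer:

```python
# Toy fixed-vocabulary lookup; IDs are illustrative, not a real tokenizer's.
vocab = {"[UNK]": 0, "i": 1045, "love": 2293, "ai": 9932}

def tokenize(sentence):
    """Map lower-cased words to IDs; anything unknown falls back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

print(tokenize("I love AI"))  # [1045, 2293, 9932]
print(tokenize("αγάπη"))      # [0] -- out of vocabulary
```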

The problem arises when you cross borders. As shown in the figure below, if a model’s dictionary only contains English, the Greek word “αγάπη” is invisible to it. It triggers an OOV error.

Figure 1: A brief overview of MTLS. (a) illustrates the benefits of employing mapping relations between linguistic symbols and text. (b) demonstrates that the meta-symbol system can serve as a bridge between linguistic symbols and the embedding space.

Figure 1(a) illustrates this limitation. The traditional lookup method fails when the language changes. However, Figure 1(b) introduces the MTLS concept: a Meta-Symbol System.

The researchers argue that while symbols differ, the “semantic core” (the universal meaning) is shared. By treating words as Linguistic Symbols (pixel images rendered from text) rather than dictionary IDs, we can map different visual representations (Russian, Korean, Chinese, Arabic) into a shared embedding space without needing a predefined vocabulary list for every language on Earth.

The Solution: SSS Embedding

The core innovation of this paper is replacing the traditional embedding layer of a Pre-trained Language Model (PLM) with a new module called SSS Embedding.

SSS stands for:

  1. Symbolic Embedding
  2. Selective Embedding
  3. Spatial Embedding

The goal is to take a raw image of text and transform it into a vector that a standard model (like BERT) can understand, effectively “tricking” the model into processing languages it wasn’t trained on. Let’s break down the architecture.

Figure 2: An overview of our proposed MTLS. We refer to the combination of symbolic embedding, selective embedding, and spatial embedding as SSS embedding. Symbolic embedding is used to obtain the embeddings of linguistic symbols, selective embedding obtains symbolization-specific bias embeddings of linguistic symbols, and spatial embedding maps embeddings in the space of meta-symbolic systems to the embedding space of PLMs.

1. Symbolic Embedding: Reading Pixels

The first step is moving from text files to pixels. The system renders words into fixed-size images.

Instead of analyzing the whole image at once, MTLS uses a strategy similar to Vision Transformers (ViT). It chops the image of the word into small square “patches” (e.g., \(16 \times 16\) pixels).

These patches act like the new “tokens.” A Convolutional Neural Network (CNN) scans these patches to extract visual features—curves, lines, and strokes—creating a sequence of vectors. If you have a sequence of words, the first patch of every word is used as the primary symbolic representation for that word.
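Here is a minimal sketch of this stage using PyTorch and Pillow; the \(16 \times 16\) patch size comes from the description above, while the canvas size, the single patch-sized convolution, and the embedding dimension are assumptions made for illustration:

```python
import torch
import torch.nn as nn
from PIL import Image, ImageDraw

PATCH = 16  # patch size, as described above; other sizes are possible

def render_word(word, height=16, width=64):
    """Render a word onto a fixed-size grayscale canvas using Pillow's default font."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((0, 0), word, fill=0)
    pixels = torch.tensor(list(img.getdata()), dtype=torch.float32)
    return pixels.view(1, height, width) / 255.0

class SymbolicEmbedding(nn.Module):
    """Turn each 16x16 patch of the rendered word into one feature vector."""
    def __init__(self, dim=768):
        super().__init__()
        # A stride equal to the patch size means each patch yields exactly one vector.
        self.proj = nn.Conv2d(1, dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, image):                    # image: (1, H, W)
        feats = self.proj(image.unsqueeze(0))    # (1, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (1, num_patches, dim)

patches = SymbolicEmbedding()(render_word("love"))
print(patches.shape)  # torch.Size([1, 4, 768]) for a 16x64 canvas
```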

2. Selective Embedding: The Mixture of Experts

Not all writing systems are the same. The dense, complex strokes of a Chinese character require different processing attention than the linear flow of Latin script. A single neural network might struggle to generalize across such diverse visual styles.

To solve this, the authors introduce Selective Embedding using a Mixture-of-Experts (MoE) mechanism.

Think of this as a team of specialists. You have \(N\) experts (represented as matrices), and for every piece of text, a “Gate” decides which experts are best suited to handle it.

Equation describing the Expert and Gate functions.
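The placeholder above stands in for the paper’s formulas; a standard way to write a gated expert layer of this kind, assuming linear experts \(W_i\) and a softmax gate \(W_g\) (the paper’s exact parameterization may differ), is:

\[
E_i(x) = x W_i, \qquad G(x) = \operatorname{softmax}(x W_g)
\]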

Here, \(x\) is the symbolic embedding from the previous step. The Gate function calculates a probability distribution—essentially asking, “How confident is Expert A vs. Expert B at handling this specific symbol?”

The system doesn’t use all experts for every word (which would be slow). Instead, it picks the Top-K experts.

Equation describing the calculation of bias embedding using TopK experts.
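A sketch of the usual top-\(K\) mixture-of-experts formulation (not necessarily the paper’s exact notation):

\[
b(x) = \sum_{i \in \operatorname{TopK}(G(x),\, K)} G(x)_i \, E_i(x)
\]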

The outputs of these selected experts are summed up to create a bias embedding. This bias adds nuance to the raw visual features, allowing the model to adapt its representation based on the complexity or style of the script.
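In code, a simplified version of this selective layer might look like the following; the expert count, top-\(K\) value, and dimensions are assumptions, and a production mixture-of-experts would route tokens to the chosen experts rather than running all of them:

```python
import torch
import torch.nn as nn

class SelectiveEmbedding(nn.Module):
    """Sketch of a top-K mixture-of-experts bias layer."""
    def __init__(self, dim=768, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):                                  # x: (batch, seq, dim)
        scores = torch.softmax(self.gate(x), dim=-1)       # gate probabilities over experts
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        # For clarity we run every expert and zero out the non-selected ones;
        # an efficient implementation would dispatch tokens to the chosen experts only.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)      # (batch, seq, E, dim)
        mask = torch.zeros_like(scores).scatter(-1, topk_idx, topk_scores)  # keep top-K weights
        return (mask.unsqueeze(-1) * expert_out).sum(dim=-2)                # bias embedding

bias = SelectiveEmbedding()(torch.randn(2, 5, 768))  # (2, 5, 768) bias embeddings
```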

3. Spatial Embedding: Bridging the Gap

Now we have a rich visual representation of the text. But we have a problem: we want to plug this into a pre-trained model like BERT. BERT expects vectors that follow a very specific mathematical distribution (its “embedding space”). If we just shove our visual vectors into BERT, the model will output nonsense.

Spatial Embedding acts as the translator between the visual world and BERT’s semantic world. It uses a Transformer Encoder-Decoder structure to align these spaces using two specific loss functions during pre-training.

Step A: Distributional Similarity

First, the model ensures the “shape” of the data distribution matches. It compares the probability distribution of the new visual embeddings (\(P_h\)) against the original text embeddings (\(P_t\)) using Kullback-Leibler (KL) divergence.

Equation for Distributional Similarity Loss (KL Divergence).
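Written out, a KL-divergence objective of this kind takes the familiar form (the direction of the divergence and any weighting here are assumptions rather than the paper’s exact choice):

\[
\mathcal{L}_{\mathrm{DS}} = D_{\mathrm{KL}}\left(P_h \,\|\, P_t\right) = \sum_i P_h(i) \log \frac{P_h(i)}{P_t(i)}
\]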

This ensures that the general statistics of the visual embeddings look like the text embeddings BERT is used to.

Step B: Spatial Similarity

Next, the model uses Contrastive Learning. It forces the visual embedding of a word (e.g., an image of “Cat”) to be mathematically close to the original text ID embedding of “Cat”, while pushing it away from the embeddings of other words.

Equations for Spatial Similarity Loss using contrastive learning.
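An InfoNCE-style loss captures this push-and-pull; the version below is a standard formulation used as a stand-in for the paper’s exact equations, with \(h_i\) the visual embedding of word \(i\), \(t_i\) the original text embedding of the same word, \(\mathrm{sim}\) a cosine similarity, and \(\tau\) a temperature:

\[
\mathcal{L}_{\mathrm{SS}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(h_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, t_j)/\tau\right)}
\]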

The final training objective combines these two goals:

Total Loss equation.
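In its simplest form this is just the sum of the two terms (the paper may weight them differently):

\[
\mathcal{L} = \mathcal{L}_{\mathrm{DS}} + \mathcal{L}_{\mathrm{SS}}
\]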

By minimizing this total loss, MTLS learns to project visual symbols into the exact spot in vector space where the semantic meaning resides.
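For concreteness, here is a compact PyTorch sketch of both alignment losses; the softmax normalization of embeddings into distributions, the temperature, and the equal weighting are illustrative assumptions, not the paper’s exact recipe:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(h, t, tau=0.07):
    """h: visual (SSS) embeddings, t: original text embeddings; both (batch, dim).
    Row i of h and row i of t correspond to the same word."""
    # Distributional similarity: KL(P_h || P_t) over softmax-normalized embeddings.
    l_ds = F.kl_div(F.log_softmax(t, dim=-1), F.softmax(h, dim=-1), reduction="batchmean")

    # Spatial similarity: InfoNCE-style contrastive loss with in-batch negatives.
    sim = F.cosine_similarity(h.unsqueeze(1), t.unsqueeze(0), dim=-1) / tau  # (batch, batch)
    labels = torch.arange(h.size(0))
    l_ss = F.cross_entropy(sim, labels)

    return l_ds + l_ss

h, t = torch.randn(32, 768), torch.randn(32, 768)
print(pretraining_loss(h, t))  # scalar training loss
```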

Experiments: Doing More with Less

The experimental setup for this paper is surprisingly lean, which makes the results even more impressive.

  • Backbone Models: BERT and RoBERTa (monolingual versions).
  • Training Data: Only about 12,000 sentences of English data from the Universal Dependencies treebank.
  • No Multilingual Training: The model was not trained on Chinese, Arabic, or Korean text. It only saw English.

The researchers then tested the model’s ability to handle 20 different languages across varying language families (Indo-European, Sino-Tibetan, Afro-Asiatic, etc.).

Task 1: Part-of-Speech (POS) Tagging

In this task, the model must identify if a word is a noun, verb, adjective, etc. The researchers tested this in a “Zero-Shot” setting—meaning the model was fine-tuned on English POS tags and then immediately asked to tag words in other languages without any further training.

Table 1: Results of the POS tagging task. Comparison of mBERT, XLM-R, Standard BERT, and MTLS-BERT.

Key Takeaways from Table 1:

  1. Massive Jump over Baseline: Look at the “Zero-Shot” columns. Standard BERT scores a dismal 14.3% on Chinese (ZHO) and 15.6% on Arabic (ARA). This is expected; standard BERT doesn’t know these alphabets.
  2. MTLS Improvement: MTLS-BERT jumps to 28.9% on Chinese (ZHO) and reaches 17.5% on Coptic (COP).
  3. The Coptic Surprise: Look at the column COP (Coptic). The massive multilingual models, mBERT and XLM-R, score roughly 5%. They fail because Coptic is a low-resource language often missing from their massive training sets. MTLS, however, scores 17-18%. Because MTLS reads symbols visually, it generalizes better to unseen scripts than models relying on fixed vocabularies.

Task 2: Named Entity Recognition (NER)

NER involves finding names, locations, and organizations in text. This requires semantic understanding, not just grammatical syntax.

Table 2: Results of the NER task. Comparison of models across varying languages.

Key Takeaways from Table 2: Again, we see the “Zero-Shot” capabilities shine. Standard BERT gets 1.6% on Chinese. MTLS-BERT gets 2.2%. While the absolute numbers are low (NER is hard!), the relative improvement is distinct.

However, there is a trade-off. Notice the ENG (English) and VIE (Vietnamese) scores. MTLS performs slightly worse on Latin-script languages than the original BERT. This is the cost of replacing the highly optimized dictionary lookup with a visual estimation. The model gains breadth (more languages) but loses some precision in its native script.

Why does it work? (Ablation Study)

Is the complex SSS architecture actually necessary? The researchers disabled parts of the model to check.

Table 3: Ablation study results showing performance drops when removing components.

  • w/o PT (No Pre-training): Performance collapses. The mapping between visual and semantic space must be learned.
  • w/o SE (No Selective Embedding): Performance drops significantly. The “Mixture of Experts” is crucial for handling different writing styles.
  • w/o SSL (No Spatial Similarity): The model fails almost completely. Simply matching distributions isn’t enough; contrastive learning is required to lock the embeddings in place.

Efficiency and Parameter Analysis

One of the strongest arguments for MTLS is its parameter efficiency. Multilingual models are usually enormous because their embedding layers (the vocabulary list) must be massive to cover many languages and scripts.

Figure 3: Parameter comparison between SSS embedding in MTLS and embeddings of PLMs.

As Figure 3 shows, the embedding layer for XLM-R (a popular multilingual model) is nearly 200 Million parameters. That’s just the dictionary!

In contrast, the SSS Embedding (the blue bars) is significantly smaller, sitting around 85 Million parameters. It is larger than a monolingual BERT embedding, but much smaller than a multilingual one, while offering the theoretical ability to handle any language that can be rendered as an image.

Is Visual Text the Future?

The paper also explores what happens if you only use the Symbolic Embedding (just the CNN part) without the complex Selective or Spatial mappings.

Figure 4: Results of whether or not to use symbolic embedding in the multilingual POS tagging task.

Figure 4 shows that simply swapping text for images (BERT-SE) doesn’t work well on its own. The heavy lifting is done by the Spatial Embedding—the intelligent mapping of those images into a semantic space.

Conclusion

The “MTLS” paper presents a paradigm shift. For decades, NLP has been text-based. We assumed that to understand language, a computer must process discrete text characters. MTLS proves that a computer can “read” by looking at pixels, just as humans do.

By treating text as Linguistic Symbols, the authors:

  1. Eliminated the Out-Of-Vocabulary (OOV) problem.
  2. Enabled a model trained only on English to process Coptic, Chinese, and Arabic.
  3. Reduced the parameter bloat associated with multilingual vocabulary lists.

While there is still a performance gap compared to models trained on massive multilingual corpora, MTLS offers a promising path for low-resource languages. For languages that don’t have enough internet data to build a massive vocabulary, simply “looking” at the text might be the key to joining the AI revolution.

This post is based on the research paper “MTLS: Making Texts into Linguistic Symbols” by Wenlong Fei et al.