Do LLMs Think in a Universal Language? Decoding Concept Space Alignment
When you ask a multilingual Large Language Model (LLM) like Llama-2 or BLOOMZ to translate a sentence from English to French, or to reason about a concept in Japanese, what is actually happening under the hood?
Does the model maintain separate “brains” for each language, or has it developed a shared, universal “concept space” where the idea of a “dog” is stored in the same mathematical location, regardless of whether it is called “dog,” “chien,” or “inu”?
This is a fundamental question in Natural Language Processing (NLP). We know these models work, but we are still figuring out why they generalize so well across languages. A fascinating research paper titled “Concept Space Alignment in Multilingual LLMs” by Qiwei Peng and Anders Søgaard investigates this exact phenomenon.
In this post, we will tear down their research to understand how LLMs organize concepts, how we can measure the “shape” of these languages mathematically, and why the way we prompt a model might actually be breaking its internal logical structure.
The Core Problem: Implicit Alignment
Historically, to make a computer understand that English “cat” and Spanish “gato” are related, we had to explicitly train them on parallel data (sentences translated between languages). However, modern LLMs are often trained on massive piles of monolingual text—just terabytes of English, followed by terabytes of French, and so on.
Despite this, they show remarkable cross-lingual abilities. The researchers hypothesize that this is due to implicit vector space alignment. In simple terms, they suspect that the model’s internal compression forces it to organize concepts in different languages in geometrically similar ways. If the vector for “King” minus “Man” equals “Queen” in English, the same geometric relationship should ideally exist in French.
If this hypothesis holds true, we should be able to take the entire geometric “shape” of the English concept space and simply rotate it to match the French space perfectly using a linear transformation.
The Setup: Defining “Concepts”
To test this, the researchers didn’t just use random words. They used WordNet synsets. A synset is a set of words grouped together because they express the same underlying sense, which helps distinguish between the “bank” of a river and the “bank” where you keep money.
They collected 4,397 parallel concepts across 7 distinct languages:
- Indo-European: English (en), French (fr), Romanian (ro)
- Non-Indo-European: Basque (eu), Finnish (fi), Japanese (ja), Thai (th)
Figure 1: Examples of four parallel WordNet concepts, aligned across 7 languages.
As you can see in Figure 1, the goal is to align specific concepts like “failure” or “lizard” across these diverse languages. Note that they didn’t just pick easy languages; including Thai, Japanese, and Basque (a language isolate) provides a rigorous test for the models.
The dataset was split into training (seed dictionary) and testing sets, and further categorized into Abstract concepts (e.g., “happiness”) and Physical concepts (e.g., “lizard”).
Table 1: The statistics of the parallel concept dataset.
Methodology: Extracting the “Brain Waves” of an LLM
So, how do we measure what the model is thinking? The authors experimented with two distinct ways to extract Concept Embeddings from 10 different LLMs (including Llama-2, BLOOMZ, and mT0).
1. Vanilla Word Embeddings
This is the standard approach. They feed the word (concept) into the model and take the internal representation (vector) of the last token (or an average). This represents the model’s raw, static understanding of the word.
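Here is a minimal sketch of what this extraction could look like with the Hugging Face transformers library. The model name, layer choice, and last-token pooling are illustrative assumptions, not the authors’ exact code.

```python
# Minimal sketch of vanilla concept-embedding extraction (illustrative, not the
# authors' exact pipeline). Model choice and last-token pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def concept_embedding(word: str, layer: int = -1) -> torch.Tensor:
    """Return the hidden state of the word's last token as its concept vector."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden_states = outputs.hidden_states[layer]  # shape: (1, seq_len, hidden_dim)
    return hidden_states[0, -1]                   # last-token representation

dog_en = concept_embedding("dog")
dog_fr = concept_embedding("chien")
```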
2. Prompt-Based Embeddings
Since modern LLMs are fine-tuned to follow instructions, the researchers tried a more naturalistic approach. They wrapped the concept in a prompt template:
“Summarize concept [text] in one [lang] word:”
For example: “Summarize concept ‘animal’ in one Japanese word”. They then extracted the vector from the model’s hidden state during this process. This tests if the instruction-following behavior changes the geometry of the concepts.
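A sketch of the prompt-based variant, reusing `tokenizer` and `model` from the snippet above; pooling the final prompt token is an assumption about where the vector is read off.

```python
# Prompt-based extraction, reusing `tokenizer` and `model` from the sketch above.
# Pooling the final prompt token is an assumption about the extraction point.
import torch

def prompt_embedding(concept: str, lang: str, layer: int = -1) -> torch.Tensor:
    prompt = f"Summarize concept {concept} in one {lang} word:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # hidden state of the last prompt token

animal_ja = prompt_embedding("animal", lang="Japanese")
```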
The Mathematics of Alignment: Procrustes Analysis
Once they had the vectors for English concepts and the vectors for French concepts, they needed to see if they fit together. Imagine you have two maps of the world, but one is rotated and slightly shrunken. To prove they are the same map, you need to find the mathematical function that rotates and scales one to match the other.
The researchers used Procrustes Analysis, a statistical technique used to align shapes. They searched for a linear transformation matrix (\(W^*\)) that minimizes the distance between the source language space (\(X\)) and the target language space (\(Y\)).
\[
W^* = \arg\min_{W} \lVert XW - Y \rVert_F
\]
In this equation:
- \(X\) is the matrix of concept vectors in the source language.
- \(Y\) is the matrix of concept vectors in English.
- \(W\) is the rotation matrix we are trying to learn.
If the models have high-quality implicit alignment, this linear equation should be enough to map French concepts perfectly onto English concepts.
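One common way to fit such a map is orthogonal Procrustes, which has a closed-form solution via SVD. Below is a small sketch under the assumption that the rows of `X` and `Y` hold vectors for the same seed-dictionary concepts; it is not the authors’ exact pipeline.

```python
# Orthogonal Procrustes sketch: rows of X (source language) and Y (English) hold
# vectors for the same seed-dictionary concepts. Layout is an illustrative assumption.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Solve W* = argmin_W ||XW - Y||_F with W constrained to be orthogonal."""
    U, _, Vt = np.linalg.svd(X.T @ Y)  # SVD of the cross-covariance matrix
    return U @ Vt                      # the optimal rotation

# Usage: learn W on training concepts, then map held-out French vectors.
# W = procrustes_align(X_fr_train, X_en_train)
# mapped_test = X_fr_test @ W
```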
The Experiments and Key Results
The researchers treated this as a retrieval task. After rotating the French space onto the English space, they took a French concept vector, applied the transformation, and looked for its “nearest neighbor” in the English space.
If the alignment is perfect, the nearest neighbor to the transformed “chien” vector should be “dog.” They measured this using Precision@1 (P@1).
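Here is a sketch of that retrieval metric, assuming the test matrices are row-aligned so that the i-th source concept’s correct translation sits in the i-th English row.

```python
# Precision@1 sketch: rotate source vectors with W, then check whether the
# cosine-nearest English vector is the gold translation (row-aligned test sets
# are an assumption here).
import numpy as np

def precision_at_1(X_src: np.ndarray, Y_en: np.ndarray, W: np.ndarray) -> float:
    mapped = X_src @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    targets = Y_en / np.linalg.norm(Y_en, axis=1, keepdims=True)
    sims = mapped @ targets.T          # cosine similarities, shape (n_src, n_en)
    nearest = sims.argmax(axis=1)      # index of the closest English concept
    gold = np.arange(len(X_src))       # row i should retrieve row i
    return float((nearest == gold).mean())
```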
Result 1: Large Models exhibit High Linearity
The first major finding is that large multilingual models do contain highly linear concept structures.
Take a look at Figure 2 below. This chart is dense, so let’s break it down:
- The Bars (Ceiling): The Blue and Orange bars represent the “Ceiling” performance, the best possible alignment when the mapping is fit with full supervision (using the test data itself to train the alignment). It answers the question: “Is a linear mapping even possible?” The answer is a resounding yes, especially for the orange bars (Vanilla embeddings).
- The Lines (Real Performance): The Black dashed line represents the actual test performance after learning the alignment from a training set.
Figure 2: Performance (P@1) of different LLMs. Comparing performance before alignment (red), after alignment (black), and theoretical ceilings (bars).
The high ceilings indicate that monolingual concept spaces are nearly isomorphic (identical in shape) to the English concept space. This suggests that massive scale and compression force the model to discover a universal geometry for concepts, regardless of the language.
Result 2: Prompting Breaks the Geometry
This is one of the most surprising findings. Compare the Orange bars (Vanilla Embeddings) to the Blue bars (Prompt-based Embeddings) in Figure 2.
In almost every case (especially for Llama-2), the Orange bars are higher.
This means that Vanilla word embeddings are more linear than prompt-based embeddings. When you process a word through a prompt (an instruction), the model engages in complex, non-linear processing that distorts the pure “concept shape.” While prompting is useful for generating text, it seems to corrupt the implicit cross-lingual alignment that exists in the model’s raw representations.
However, notice the Black lines (After-alignment performance). Prompt-based embeddings often see a huge jump in performance after alignment compared to before alignment. While the geometry is less perfect, it is still highly alignable.
Result 3: The “Typology” Gap
The models are not magic; they struggle with languages that are very different from English.
- Group 1 (French, Romanian): Performance is excellent.
- Group 3 (Japanese, Thai): Performance drops significantly.
This confirms a long-standing issue in NLP: generalization works best for languages with similar typology (sentence structure and grammar).
Result 4: The Abstract Concept Paradox
Intuitively, you might think Physical concepts are easier to align. A “lizard” is a physical object that looks the same in France and Japan. An “abstract” concept like “failure” is culturally dependent and nuanced.
However, the data showed the exact opposite.
Table 2: Precision@1 results comparing Abstract vs. Physical concepts.
As shown in Table 2, Abstract concepts align better than Physical concepts across almost all models and languages. In Llama2-13B (French), Abstract concepts hit 63.48% precision, while Physical ones only hit 50.12%.
Why? The researchers investigated several hypotheses, such as word ambiguity (number of senses), but found the strongest correlation with Frequency.
Table 3: Frequency analysis of the dataset.
Abstract words like “love,” “time,” or “idea” appear incredibly frequently in the training data (Wikipedia, web crawls, etc.). Physical words like “lizard” or “steering wheel” are comparatively rare.
It turns out that frequency drives alignment. The models see abstract concepts in so many diverse contexts that they learn extremely robust, well-defined vector representations for them, creating a more stable geometry that aligns easily across languages.
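As a hedged illustration of how such a frequency analysis could be run, the toy snippet below correlates corpus frequency with retrieval success; the counts and correctness flags are placeholders, not the paper’s data.

```python
# Toy frequency analysis: correlate corpus frequency with retrieval success.
# The counts and correctness flags are placeholders, not the paper's data.
from scipy.stats import spearmanr

frequencies = [120_000, 95_000, 310, 42]   # hypothetical corpus counts per concept
retrieved   = [1, 1, 0, 0]                 # 1 = correct nearest neighbour at rank 1

rho, p_value = spearmanr(frequencies, retrieved)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```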
Consistency Across Scales
The researchers didn’t just stop at a few models. They verified these findings across different model sizes, from 1.2 billion parameters up to 70 billion.
Figure 3: Performance across a wide range of model families and sizes.
Figure 3 reinforces the narrative:
- Bigger is better: Larger models (like Llama2-70B) generally show higher ceilings and better alignment.
- Consistency: The trend of Abstract > Physical and Indo-European > Non-Indo-European holds true regardless of the model size.
Conclusion and Implications
This paper provides compelling evidence that multilingual LLMs are not just memorizing translations. They are constructing a shared, geometric concept space.
Here are the key takeaways for students and practitioners:
- Implicit Universalism: Sufficiently large models naturally converge on a similar “shape” for concepts across languages. We don’t necessarily need massive parallel corpora to achieve this; it emerges from parameter efficiency and compression.
- The Prompting Trade-off: While we love prompting for chat interfaces, this research suggests that prompting adds “noise” to the semantic representation of concepts. If you are using LLMs for embedding-based retrieval or clustering, vanilla embeddings might be structurally superior to prompt-based ones.
- Data Frequency Matters: The counter-intuitive finding that abstract concepts align better than physical ones is a stark reminder that what a model learns is defined by what it sees most. If we want models to understand the physical world better, we need to balance the frequency of physical concepts in training data.
As we move toward even larger and more capable models, understanding this “geometry of thought” will be crucial for building truly universal systems that can bridge the gap between all human languages.