Imagine you are looking at a spreadsheet. In one cell, you see the word “Fallen.”
Without looking at the rest of the table, “Fallen” is ambiguous. Is it the 1998 movie starring Denzel Washington? Is it the best-selling novel? Or is it the debut studio album by the rock band Evanescence?
As humans, we instantly scan the row to see “Evanescence” (Artist) and “2003” (Year), and we scan the column to see other entries like “The Fame” and “Let Go.” We immediately know: This is the album.
This process is called Table Entity Linking (TEL). It is the task of mapping a mention in a table (like “Fallen”) to a specific unique entry in a Knowledge Base (like Wikidata or Wikipedia). While this seems easy for humans, machines struggle with it. Why? Because most AI models are trained to read text linearly, from left to right. But tables are two-dimensional. They have rows and columns, and those two dimensions mean very different things.
In this post, we are diving deep into a research paper titled “RoCEL: Advancing Table Entity Linking through Distinctive Row and Column Contexts.” We will explore how the researchers built a model that finally treats rows and columns differently, achieving state-of-the-art results that even powerful Large Language Models (LLMs) struggle to match.
The Problem with Flat Reading
Most existing methods for Entity Linking (EL) treat a table like a scrambled sentence. They take the table cells, flatten them into a sequence of text, and feed them into models like BERT. While this captures some context, it destroys the structural integrity of the table.
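To make "flat reading" concrete, here is a minimal sketch in Python (the table contents and separators are illustrative, not taken from any particular baseline) of how a table gets linearized before being handed to a text encoder:

```python
# A toy table: one dict of header -> cell text per row.
table = [
    {"Album": "The Fame", "Artist": "Lady Gaga", "Year": "2008"},
    {"Album": "Fallen", "Artist": "Evanescence", "Year": "2003"},
    {"Album": "Let Go", "Artist": "Avril Lavigne", "Year": "2002"},
]

def flatten_table(rows):
    """Naively linearize a 2D table into one long string, discarding its structure."""
    return " ; ".join(
        " | ".join(f"{header}: {cell}" for header, cell in row.items())
        for row in rows
    )

print(flatten_table(table))
# Album: The Fame | Artist: Lady Gaga | Year: 2008 ; Album: Fallen | Artist: Evanescence | ...
```

Once the table is a single string, the encoder has no reliable way to tell which cells shared a row and which shared a column.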
Even models designed specifically for tables, like TURL, often use mechanisms that mix row and column information indiscriminately.
The authors of RoCEL argue that this is a mistake because rows and columns are semantically distinct:
- Row Context: Provides descriptive information. The cells in a row usually describe properties of, or objects related to, the mention (e.g., the release year of this album).
- Column Context: Provides categorical information. The cells in a column usually belong to the same type or category (e.g., a list of other albums).
To solve this, the researchers propose RoCEL (Row-Column differentiated table Entity Linking).

As shown in Figure 1 above, understanding “Fallen” requires integrating the row context (Release Year, Artist) and the column context (other Albums). RoCEL is designed to model these two dimensions separately before fusing them for a final decision.
Understanding Table Contexts
Before we look at the architecture, let’s formalize what a table looks like to a machine.
A table \(T\) consists of rows and columns. We are interested in a specific cell, the mention, located at row \(i\) and column \(j\), denoted as \(T_{ij}\).

Figure 2 illustrates the available data:
- Metadata (\(T_{\Theta}\)): The table caption or title (e.g., “Best-selling albums…”).
- Row (\(T_{i*}\)): All cells in the same horizontal line.
- Column (\(T_{*j}\)): All cells in the same vertical line.
- Headers (\(T_{H*}\)): The labels at the top of the columns.
The goal of the task is simple: Given a mention \(T_{ij}\), link it to the correct entity \(e\) in a Knowledge Base.
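As a minimal sketch of this setup (the class and method names below are my own shorthand, not notation from the paper), the available pieces can be organized like this:

```python
from dataclasses import dataclass

@dataclass
class Table:
    metadata: str            # T_Theta: the table caption or title
    headers: list[str]       # T_H*: the column headers
    cells: list[list[str]]   # cell text, indexed as cells[i][j]

    def row(self, i: int) -> list[str]:
        """T_i*: all cells in the same horizontal line as the mention."""
        return self.cells[i]

    def column(self, j: int) -> list[str]:
        """T_*j: all cells in the same vertical line as the mention."""
        return [r[j] for r in self.cells]

    def mention(self, i: int, j: int) -> str:
        """T_ij: the cell we want to link to a knowledge-base entity."""
        return self.cells[i][j]

albums = Table(
    metadata="Best-selling albums of the 2000s",
    headers=["Album", "Artist", "Year"],
    cells=[
        ["The Fame", "Lady Gaga", "2008"],
        ["Fallen", "Evanescence", "2003"],
        ["Let Go", "Avril Lavigne", "2002"],
    ],
)
print(albums.mention(1, 0), albums.row(1), albums.column(0))
```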
The RoCEL Architecture
The core innovation of RoCEL is its two-pronged approach. It doesn’t use a single “table encoder.” Instead, it uses a Row Context Encoder and a Column Context Encoder, each architected differently to match the semantic nature of that dimension.
Let’s look at the high-level architecture:

As you can see in Figure 3, the process is split into three stages:
- Row Context Encoding: Creates a vector representation (\(v^{mr}\)) based on the row.
- Column Context Encoding: Creates a vector representation (\(v^c\)) based on the column.
- Semantics Fusion & Entity Linking: Combines them to find the matching entity.
Let’s break down each stage.
1. Row Context Encoding: Capturing Dependencies
Think about a row in a database. “2003” relates to “Fallen” because it is its release year. There is a strong, implicit dependency between cells in a row. It reads almost like a sentence: “In 2003, the album Fallen by Evanescence was released in the United States.”
Because of this sequential, sentence-like dependency, the authors use BERT, a transformer model famous for handling natural language sequences.
First, they serialize the row into a text format. They pair each cell with its header to make the meaning clear (e.g., “Release Year: 2003”).

In this equation:
- \(p_{ik}\) represents a text piece for a specific cell.
- Note the special tokens [START] and [END], used to highlight the target mention \(T_{ij}\) within the sequence.
- Everything is concatenated with the table metadata \(T_{\Theta}\).
Once this sequence is built, it is fed into BERT. The model uses the self-attention mechanism to understand how “Fallen” relates to “Evanescence” and “2003.”

We take the output of the [CLS] token (a special token used in BERT to represent the whole sequence) as our Row-Contextualized Mention Embedding (\(v_{ij}^{mr}\)).
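As a rough sketch of this step with Hugging Face Transformers: the overall recipe ("header: cell" pieces, [START]/[END] around the mention, metadata prepended, [CLS] output as \(v_{ij}^{mr}\)) follows the description above, but the exact serialization template and separators are assumptions for illustration.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Register the mention markers so the tokenizer keeps them as single tokens.
tokenizer.add_special_tokens({"additional_special_tokens": ["[START]", "[END]"]})
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.resize_token_embeddings(len(tokenizer))

def serialize_row(metadata, headers, row_cells, mention_j):
    """Pair each cell with its header, wrap the mention in [START]/[END], prepend metadata."""
    pieces = []
    for k, (header, cell) in enumerate(zip(headers, row_cells)):
        piece = f"{header}: {cell}"
        if k == mention_j:
            piece = f"[START] {piece} [END]"
        pieces.append(piece)
    return metadata + " [SEP] " + " ; ".join(pieces)

@torch.no_grad()
def row_mention_embedding(sequence):
    """Return the [CLS] vector as the row-contextualized mention embedding v^mr."""
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
    return encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] is the first token

sequence = serialize_row(
    "Best-selling albums of the 2000s",
    ["Album", "Artist", "Year"],
    ["Fallen", "Evanescence", "2003"],
    mention_j=0,
)
print(sequence)
print(row_mention_embedding(sequence).shape)  # torch.Size([1, 768])
```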
2. Column Context Encoding: Capturing Categories
Now, consider the column. The column contains “The Fame,” “Fallen,” and “Let Go.” Unlike the row, there is no “sentence” here. The order doesn’t matter. “The Fame” isn’t the subject of “Fallen.” They are independent items that just happen to share a category (Albums).
Therefore, treating a column as a sequence (like a sentence) is semantically wrong. It introduces an artificial order that doesn’t exist.
Instead, RoCEL treats the column as an unordered set. The goal is to extract the “gist” or the “type” of the column from this set of entities. To do this, they use a Set-wise Encoder called FSPool (Feature-wise Sort Pooling).
The input to this encoder is the set of embeddings from all the mentions in that column (which we obtained from the Row Encoder step).

Here, \(v_{j}^{c}\) represents the Column Embedding. It condenses the information of all the albums in that column into a single vector representing the concept “Music Albums.”
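The sketch below captures the core idea of feature-wise sort pooling in simplified form: sort each feature independently across the set (which makes the result order-invariant), then take a learned weighted sum over the sorted positions. It is not the full FSPool implementation, which uses a continuous weight function to handle sets of arbitrary size, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SimpleFSPool(nn.Module):
    """Simplified feature-wise sort pooling over a set of embeddings."""
    def __init__(self, dim, max_set_size):
        super().__init__()
        # One learned weight per (feature, sorted position).
        self.weights = nn.Parameter(torch.randn(dim, max_set_size))

    def forward(self, x):
        # x: (set_size, dim) -- row-contextualized embeddings of one column's mentions.
        sorted_x, _ = torch.sort(x, dim=0, descending=True)  # sort each feature independently
        n = sorted_x.size(0)
        w = self.weights[:, :n]                              # (dim, n)
        return (sorted_x.t() * w).sum(dim=1)                 # (dim,) column embedding v^c

pool = SimpleFSPool(dim=768, max_set_size=32)
column_mentions = torch.randn(3, 768)  # e.g. "The Fame", "Fallen", "Let Go"
print(pool(column_mentions).shape)     # torch.Size([768])
```

Because each feature is sorted before pooling, shuffling the rows of `column_mentions` leaves the output unchanged, which is exactly the permutation invariance a set encoder needs.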
3. Warming Up the Column Encoder
Here lies a subtle engineering challenge. The Row Encoder uses BERT, which is pre-trained on massive amounts of text and already “knows” a lot. The Column Encoder (FSPool), however, is initialized randomly.
If we train the whole model from scratch, the noisy, random signals from the Column Encoder might confuse the model initially. To fix this, the researchers introduce a Warm-up Stage. Before training the main Entity Linking task, they pre-train the column encoder using auxiliary tasks.
They propose two ways to do this:
A. Supervised Column Typing: If we have labels for what the column is (e.g., “Music Albums”), we can train the encoder to predict this label.

B. Unsupervised Set Reconstruction: What if we don’t have labels? We can use an auto-encoder approach. We try to compress the set of column items into a single vector and then see if we can reconstruct the original set from that vector. If the reconstruction is accurate, the encoder has successfully captured the essential information of the column.

This warm-up phase ensures the Column Encoder is smart enough to contribute meaningful signals when the real training begins.
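As a rough sketch of both warm-up objectives (the layer sizes, the label id, the toy decoder, and the Chamfer-style reconstruction loss are all assumptions for illustration, not the paper's exact choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_types, set_size = 768, 50, 3            # illustrative sizes
column_mentions = torch.randn(set_size, dim)     # embeddings of one column's cells

# Stand-in for the set encoder's output v^c (in RoCEL this comes from FSPool).
v_c = column_mentions.mean(dim=0)

# A. Supervised column typing: predict the column's type label from v^c.
type_head = nn.Linear(dim, num_types)
typing_loss = F.cross_entropy(
    type_head(v_c).unsqueeze(0),
    torch.tensor([7]),                           # e.g. the label id for "Music Albums"
)

# B. Unsupervised set reconstruction: decode v^c back into a set of vectors and
# penalize its distance to the original set (a Chamfer-style loss is one option).
decoder = nn.Linear(dim, set_size * dim)         # toy decoder for a fixed-size set
reconstructed = decoder(v_c).view(set_size, dim)
dists = torch.cdist(reconstructed, column_mentions)  # pairwise distances between the sets
recon_loss = dists.min(dim=1).values.mean() + dists.min(dim=0).values.mean()

print(typing_loss.item(), recon_loss.item())
```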
4. Fusion and Linking
Finally, we have a row embedding (\(v_{ij}^{mr}\)) describing the specific entity and a column embedding (\(v_{j}^{c}\)) describing the category. We concatenate them and pass them through a Multi-Layer Perceptron (MLP) to mix the features.

This gives us the final mention embedding \(v_{ij}^{mrc}\).
To find the correct entity in the Knowledge Base, we calculate the similarity (dot product) between our mention embedding and the embeddings of all candidate entities (\(u_e\)) in the database.

The system is trained using Cross-Entropy loss to maximize the score of the correct entity against the others.
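A minimal sketch of the fusion-and-scoring step, assuming 768-dimensional embeddings and a two-layer MLP (both illustrative choices rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLinker(nn.Module):
    """Fuse row and column embeddings with an MLP, then score candidates by dot product."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, v_mr, v_c, candidate_embs):
        # v_mr: (batch, dim) row-contextualized mention embeddings
        # v_c:  (batch, dim) column embeddings
        # candidate_embs: (batch, num_candidates, dim) entity embeddings u_e
        v_mrc = self.mlp(torch.cat([v_mr, v_c], dim=-1))         # final mention embedding
        scores = torch.bmm(candidate_embs, v_mrc.unsqueeze(-1))  # dot product per candidate
        return scores.squeeze(-1)                                # (batch, num_candidates)

model = FusionLinker()
scores = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 10, 768))
gold = torch.zeros(4, dtype=torch.long)   # index of the correct entity for each mention
loss = F.cross_entropy(scores, gold)      # push the correct entity's score above the rest
print(scores.shape, loss.item())
```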

Experimental Results
Does distinguishing between rows and columns actually help? The researchers tested RoCEL against several state-of-the-art baselines on four benchmark datasets (TURL-Data, T2D, Wikilinks-R, and Wikilinks-L).
The baselines included:
- Cell-based: Using only the mention text (e.g., Wikidata Lookup).
- Text-based: Flattening the table into text (e.g., BERT, BLINK).
- Table-based: Using table structure but often mixing contexts (e.g., TURL).
The Main Takeaway: RoCEL outperforms them all.

As seen in the results table above, RoCEL (specifically the R-C and R-S variants, which use the warm-up strategies) achieves the highest accuracy across both in-domain and out-of-domain datasets. It beats TURL by a significant margin, showing that modeling row and column semantics separately is superior to a unified approach.
Ablation Studies: What matters more?
The researchers also broke down the model to see which parts were carrying the most weight. They removed different contexts one by one.

The results (Table 2) are revealing:
- Row context is king: Removing the row context caused the biggest drop in performance (\(86.9 \rightarrow 83.2\)). The description of the entity (e.g., the artist name) is the most critical clue.
- Columns matter too: Removing columns also hurt performance, confirming that knowing the category helps disambiguation.
- Metadata helps: Even the table caption provides useful context.
Can’t we just use ChatGPT?
In the era of Generative AI, a natural question is: “Why build a specialized model? Can’t Llama-3 or GPT-4 just figure this out?”
The researchers tested Llama-3-8B on this task. They provided the table context in various text formats (JSON, Markdown, etc.).

They tried feeding the LLM just the row, just the column, or the whole table (Multi-row).
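The paper's exact prompts are not reproduced here, but the sketch below illustrates what "Markdown" versus "JSON" serialization of a single row might look like when building such a prompt:

```python
import json

row = {"Album": "Fallen", "Artist": "Evanescence", "Year": "2003"}

# Markdown serialization of a single row (header line, separator line, value line).
markdown = (
    "| " + " | ".join(row.keys()) + " |\n"
    + "|" + "---|" * len(row) + "\n"
    + "| " + " | ".join(row.values()) + " |"
)

# JSON serialization of the same row.
as_json = json.dumps(row, indent=2)

prompt = (
    "Link the mention 'Fallen' in the following table row to a Wikipedia entity.\n\n"
    + markdown
)
print(prompt)
print(as_json)
```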

The results were surprisingly poor (around 52-58% accuracy, compared to RoCEL’s 86%+).
Why?
- Context Overload: LLMs struggle to parse the strict 2D structure of a table when it is serialized into a 1D prompt.
- Noise: When provided with “Multi-row” contexts (more table data), the LLM’s performance actually dropped compared to just seeing one row. This suggests LLMs get confused by the extra tabular noise rather than using it to infer column types.
While massive models like GPT-4 perform better, RoCEL achieves competitive or superior results with a fraction of the parameters (roughly 340M versus billions) and significantly faster inference, making it a far more practical solution for processing millions of database records.
Impact of Warm-up Tasks
Finally, the researchers validated their “Warm-up” strategy. Does pre-training the column encoder really help?

Figure 6 shows that models with warm-up (SR for Set Reconstruction, CT for Column Typing) consistently beat the model with No Warm-up (NW). Interestingly, Column Typing is very data-efficient—it provides a big boost even with just 1% of the training data.
They also checked if the encoder was actually learning “types.”

Figure 7 shows that as we increase the warm-up data, the model becomes highly effective at predicting the correct column type (e.g., identifying a column as “Films” or “Cities”), confirming that the encoder is working as intended.
Conclusion
The RoCEL paper teaches us a valuable lesson in AI architecture: Structure implies semantics.
By treating a table not just as a bag of words, but as a structured object in which horizontal lines represent descriptions and vertical lines represent categories, the researchers built a model that aligns much more closely with how humans interpret data.
Key takeaways for students:
- Context differentiation: Don’t treat all inputs the same. If the data has structure, build that structure into your model.
- Set vs. Sequence: Know when to use an RNN/Transformer (for sequences) and when to use a pooling mechanism (for sets).
- Auxiliary Tasks: If part of your network is hard to train (like the column encoder), create a warm-up task to get it ready before the main event.
As we move toward more complex data processing, specialized architectures like RoCEL show that while LLMs are powerful generalists, targeted, structure-aware models still hold the crown for precision tasks like Table Entity Linking.