When we read a novel, the process is straightforward: left to right, top to bottom, line by line. But consider how you read a receipt, a newspaper with multiple columns, or a complex form. You might scan the header, jump to a specific table, read down a column, and then skip to the total at the bottom.

This is the challenge of Visually-rich Documents (VrDs). In the field of Document AI, understanding the “Reading Order” is crucial. If a model reads a two-column document straight across the page (crossing the gutter), the sentences become nonsense.

For years, researchers have treated Reading Order Prediction (ROP) as a sequence generation problem—trying to force a complex 2D layout into a single 1D permutation (a list). In the paper “Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding,” researchers argue that this approach is fundamentally flawed. They propose a paradigm shift: modeling reading order not as a sequence, but as ordering relations (a graph).

In this post, we will deconstruct this paper, exploring why the old “sequence” method fails, how the new “relation-based” method works, and how it significantly boosts performance in downstream tasks like information extraction and Question Answering (QA).

The Problem: The Linear Sequence Trap

Current state-of-the-art models often define the reading order of a layout as a permutation of all its elements (words or text segments). The goal is to find a sequence \(S = (e_1, e_2, ..., e_N)\) that covers every element in the document.

However, complex documents rarely have just one correct reading order.

Figure 1: Motivation of reformulating layout reading order. In complex document layouts, multiple reading sequences are acceptable (displayed in the first three rows); thus the reading order information may be incomplete if represented by one single sequence. We propose to represent the relationship of immediate succession during reading among layout elements using a directed acyclic relation (displayed in the last row as a directed acyclic graph), ensuring that the complete layout reading order information is conveyed.

As shown in Figure 1, consider a document with a header and two columns. A human might read the Header, then Column A, then Column B. Another might read the Header, Column B, then Column A. Both are “correct” because the columns are independent. If we force a model to predict a single linear sequence, we introduce arbitrary biases (noise) that don’t reflect the document’s semantic structure.

Furthermore, forcing a linear sequence on a table (where you might read row-wise or column-wise) or a form with independent sections fails to capture the spatial logic of the document. The researchers argue that a single permutation cannot convey the complete reading order information.

The Solution: Reading Order as Ordering Relations

To solve this, the authors propose modeling reading order as Immediate Succession During Reading (ISDR).

Instead of asking, “What is the rank of this word in the whole document?”, they ask, “Which element immediately follows this element?”

This shifts the mathematical formulation from a Strict Total Order (a single line) to a Directed Acyclic Relation (DAR). In graph theory terms, the document becomes a Directed Acyclic Graph (DAG) where nodes are text segments and edges represent the flow of reading.
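To make the contrast concrete, here is a minimal sketch (illustrative, not from the paper) of the two representations for the header-plus-two-columns layout discussed above:

```python
# A header followed by two independent columns (A and B), two segments each.
segments = ["header", "A1", "A2", "B1", "B2"]

# Sequence view: one arbitrary total order. Reading column B before column A
# would be equally valid, so this encoding bakes in a bias the layout
# never implied.
sequence = ["header", "A1", "A2", "B1", "B2"]

# Relation view (ISDR): immediate-succession edges forming a DAG. The header
# feeds BOTH columns, and the two columns never connect to each other.
isdr_edges = {
    "header": ["A1", "B1"],  # either column may be read next
    "A1": ["A2"],
    "A2": [],
    "B1": ["B2"],
    "B2": [],
}
```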

Why Relations Matter

This approach handles non-linear layouts gracefully.

Figure 3: Several example layouts with non-linear reading order. Annotations are drawn as block-level for better visualization. (a) The complex layout includes multiple possible reading sequences (illustrated in Fig. 1); (b) The reading order of header, footer and watermark within the layout are separated from the main body; (c) The table within the layout can be read either vertically or horizontally; (d) Indirect reading order relationship is also important as relevant elements may be separated by other contents.

  • Design Layouts (Fig 3a): Multiple valid paths can exist simultaneously.
  • Independent Elements (Fig 3b): Headers, footers, and watermarks often don’t “follow” the body text—they exist in their own isolated reading space. A graph can leave them disconnected from the main flow.
  • Tables (Fig 3c): A cell in a table can logically lead to the cell on the right or the cell below. A graph allows a node to have multiple successors.

The authors also introduce Generalized Succession During Reading (GSDR), which is the transitive closure of ISDR. While ISDR looks at immediate neighbors, GSDR captures the global “before/after” relationship between any two elements.
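Since GSDR is defined as the transitive closure of ISDR, it can be derived mechanically from the immediate-succession edges. A small sketch, assuming the adjacency-list encoding from the example above:

```python
def transitive_closure(edges: dict[str, list[str]]) -> dict[str, set[str]]:
    """Derive GSDR ("comes anywhere before") from ISDR ("comes immediately
    before") by a depth-first traversal of the DAG."""
    closure: dict[str, set[str]] = {}

    def reachable(node: str) -> set[str]:
        if node not in closure:
            result: set[str] = set()
            for succ in edges.get(node, []):
                result.add(succ)
                result |= reachable(succ)
            closure[node] = result
        return closure[node]

    for node in edges:
        reachable(node)
    return closure

# With the isdr_edges from the earlier example:
# transitive_closure(isdr_edges)["header"] == {"A1", "A2", "B1", "B2"}
```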

The Core Method

The researchers introduce a two-stage approach:

  1. Prediction: A model that predicts the reading order relations (ROP).
  2. Enhancement: A pipeline that uses these relations to improve downstream tasks (RORE).

1. Reading Order Prediction (ROP) as Relation Extraction

The authors reformulate ROP as a relation extraction task. Given a document \(D\) with layout elements (words or segments) and their bounding boxes, the goal is to predict pairs \((i, j)\) where element \(j\) immediately follows element \(i\).

The Architecture

They use a baseline model inspired by the Global Pointer Network. It starts with a Pre-trained Text-and-Layout Model (PTLM), such as LayoutLMv3.

First, they extract layout-aware embeddings for every token in the document:

\[
(t_1, t_2, \dots, t_n) = \mathrm{PTLM}(x, b) \tag{1}
\]

Here, \(x\) represents the text tokens, \(b\) the bounding box coordinates, and \(t_i\) the resulting layout-aware embedding of the \(i\)-th token.

Since a layout element (like a text segment) might contain multiple tokens, they pool the token embeddings to get a single vector \(h_i\) for each element:

\[
h_i = \operatorname{Pooling}\big(\{\, t_j \mid \text{token } j \in \text{element } i \,\}\big) \tag{2}
\]
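As a sketch of this pooling step (mean pooling is one common choice; the exact operator is an implementation detail of the paper):

```python
import torch

def pool_segment_embeddings(token_emb: torch.Tensor,
                            segment_ids: torch.Tensor,
                            num_segments: int) -> torch.Tensor:
    """Mean-pool token embeddings (n_tokens, d) into one vector h_i per
    layout element, using each token's segment id (n_tokens,)."""
    d = token_emb.size(-1)
    sums = torch.zeros(num_segments, d).index_add_(0, segment_ids, token_emb)
    counts = torch.zeros(num_segments).index_add_(
        0, segment_ids, torch.ones_like(segment_ids, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(-1)
```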

The Global Pointer Network

To predict relations, the model scores every ordered pair of elements at once rather than classifying pairs one by one. The global pointer mechanism projects each element embedding into query (\(q\)) and key (\(k\)) representations and takes their dot product as the pair score:

\[
q_i = W_q h_i + b_q, \qquad k_j = W_k h_j + b_k, \qquad s_{ij} = q_i^{\top} k_j \tag{3}
\]

The score \(s_{ij}\) represents the likelihood that there is a reading link from \(i\) to \(j\).

The model is trained with a loss designed to handle class imbalance (since most element pairs are not connected). In the global-pointer formulation, it reads:

\[
\mathcal{L} = \log\Big(1 + \sum_{(i,j) \in \Omega_{\mathrm{neg}}} e^{\,s_{ij}}\Big) + \log\Big(1 + \sum_{(i,j) \in \Omega_{\mathrm{pos}}} e^{\,-s_{ij}}\Big) \tag{4}
\]

where \(\Omega_{\mathrm{pos}}\) and \(\Omega_{\mathrm{neg}}\) denote the sets of connected and unconnected element pairs, respectively.

During inference, they simply look for pairs with a positive score:

\[
\hat{R} = \{\, (i, j) \mid s_{ij} > 0 \,\} \tag{5}
\]
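Putting Equations 3 through 5 together, here is a minimal sketch of the scoring head, the loss, and the decoding step (assuming the standard global-pointer formulation; class and variable names are ours, not the paper's):

```python
import torch
import torch.nn as nn

class GlobalPointerROP(nn.Module):
    """Sketch of a global-pointer head that scores every ordered pair of
    layout elements for the "j immediately follows i" relation (Eq. 3)."""

    def __init__(self, hidden_dim: int, head_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, head_dim)
        self.k_proj = nn.Linear(hidden_dim, head_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_elements, hidden_dim) pooled element embeddings
        q, k = self.q_proj(h), self.k_proj(h)
        return q @ k.T  # s[i, j]: score of a reading link from i to j

def multilabel_ce(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Imbalance-friendly multilabel loss (Eq. 4): drives positive-pair
    scores above zero and negative-pair scores below zero."""
    s, y = scores.flatten(), labels.flatten().bool()
    zero = s.new_zeros(1)
    loss_neg = torch.logsumexp(torch.cat([zero, s[~y]]), dim=0)
    loss_pos = torch.logsumexp(torch.cat([zero, -s[y]]), dim=0)
    return loss_neg + loss_pos

# Inference (Eq. 5): keep every pair with a positive score.
# edges = (model(h) > 0).nonzero()
```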

2. The RORE Pipeline: Enhancing Downstream Tasks

The ultimate goal isn’t just to predict reading order; it’s to use that understanding to do actual work, like extracting total amounts from receipts or answering questions about a form.

The authors propose the Reading-Order-Relation-Enhancing (RORE) pipeline.

Figure 2: The reading-order-relation-enhancing pipeline (right, green) compared with the original pipeline (left, blue) for general document processing. “RM” denotes Malaysian Ringgit.

As seen in Figure 2, the traditional pipeline (blue) feeds OCR results directly into a task model. The RORE pipeline (green) adds an intermediate step: it runs the OCR results through the ROP model defined above, generates a “Reading Order Matrix,” and feeds that into the task model.
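In code, the change amounts to one extra stage between OCR and the task model. A hypothetical sketch (all names illustrative; `reading_order_matrix` is defined in the next subsection):

```python
def rore_pipeline(tokens, boxes, rop_model, task_model):
    """RORE in outline: OCR output -> predicted succession edges ->
    n x n reading order matrix -> relation-aware task model."""
    edges = rop_model(tokens, boxes)                  # the stage RORE adds
    rho = reading_order_matrix(edges, n=len(tokens))  # see next subsection
    return task_model(tokens, boxes, rho)
```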

The Reading Order Matrix

How is this relation information fed into a neural network? They convert the graph connections into an \(n \times n\) binary matrix (where \(n\) is the number of tokens).

Figure 4: Reading order relation information is represented as an \(n \times n\) binary matrix to be leveraged in downstream VrD tasks, where \(n\) is the number of input textual tokens.

If token A connects to token B, the corresponding cell in the matrix is 1.
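Constructing the matrix is straightforward once token-level succession edges are available (exactly how segment-level relations map onto token pairs is an implementation detail of the paper):

```python
import torch

def reading_order_matrix(edges: list[tuple[int, int]], n: int) -> torch.Tensor:
    """Build the n x n binary matrix rho from token-level succession edges:
    rho[i, j] = 1 iff token j immediately follows token i during reading."""
    rho = torch.zeros(n, n)
    for i, j in edges:
        rho[i, j] = 1.0
    return rho
```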

Relation-Aware Attention

Standard Transformers use a self-attention mechanism where tokens attend to each other based on semantic similarity:

\[
\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{(W_q x_i)^{\top} (W_k x_j)}{\sqrt{d}}\right) \tag{10}
\]

The authors inject the reading order knowledge through a Relation-Aware Attention Module, which adds the binary relation matrix (scaled by a learnable weight \(\lambda\)) directly to the attention scores:

\[
\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{(W_q x_i)^{\top} (W_k x_j)}{\sqrt{d}} + \lambda\, \rho_{ij}\right) \tag{12}
\]

Here, \(\rho_{ij}\) comes from the binary matrix. This effectively tells the model: “Pay extra attention to token \(j\) if it immediately follows token \(i\) in the visual layout.”
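A single-head sketch of this modification (Equation 12), assuming the bias is added to the pre-softmax logits as described:

```python
import math
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """Self-attention whose logits carry a reading order bias: each logit
    gets + lambda * rho[i, j] before the softmax (Eq. 12)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable weight lambda

    def forward(self, x: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
        # x: (n, dim) token states; rho: (n, n) binary reading order matrix
        logits = self.q(x) @ self.k(x).T / math.sqrt(x.size(-1))
        logits = logits + self.lam * rho  # inject reading order knowledge
        return torch.softmax(logits, dim=-1) @ self.v(x)
```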

Experimental Results

To evaluate this new paradigm, the authors created a new benchmark dataset called ROOR (Reading Order as Ordering Relations), based on the existing EC-FUNSD dataset, but re-annotated with graph-based relations rather than simple sequences.

1. Performance on Reading Order Prediction

First, does the relation-based model work better than the old sequence-based models?

Table 1: The performance of baseline models on reading order relation prediction. Human performance indicates the annotation consistency between two annotators.

Table 1 shows a massive improvement. “LR” and “TPP” are previous sequence-based methods. The proposed method (LayoutLMv3-base/large used as a relation extractor) nearly doubles the performance in segment-level F1 scores (from 42.96 to 82.38). This confirms that modeling reading order as a graph is far more accurate than forcing it into a line.

2. Enhancing Downstream Tasks (IE and QA)

The real test is whether this helps the model understand the document better. The authors tested this on Semantic Entity Recognition (SER) and Entity Linking (EL).

Table 2: Performance of LayoutLMv3 and GeoLayoutLM and their corresponding RORE methods on EC-FUNSD. Reproduced results are marked with * (see Appendix D).

Table 2 demonstrates that adding the RORE pipeline improves performance across the board. The improvement is particularly striking for Entity Linking (EL), which jumps by over 6% for the base model. This makes sense: Entity Linking relies heavily on understanding the logical flow and structure of a document (e.g., linking a “Total” label to the price value next to it), which is exactly what the reading order relations capture.

They also extended this to general benchmarks like FUNSD, CORD, and SROIE using pseudo-labels. This means they didn’t have ground-truth reading order for these datasets; they used their ROP model to guess the reading order and then trained the downstream model using those guesses.

Table 3: Performance of prevailing methods on three VrD-IE benchmarks. Best results are marked bold.

As shown in Table 3, even using generated (pseudo) reading order relations allows their method (RORE-GeoLayoutLM) to achieve state-of-the-art results (marked in bold). This is a significant finding: it means we can use a ROP model trained on one dataset to enhance document understanding on entirely different datasets without manual annotation.

Case Studies: Seeing the Difference

Visualizing the model’s predictions helps clarify why the graph approach is superior.

Figure 5: Case study of the proposed reading order prediction model. Each arrow represents a predicted relation link between segments.

In Figure 5, we see the model handling complex scenarios:

  • (a, b): It correctly links spatially separated text that belongs together while ignoring headers/footers.
  • (f, g): It handles tables correctly, understanding that cells relate to neighbors, without forcing a zigzag path across the whole page.

Finally, let’s look at how this impacts the final output.

Figure 6: Case study of baseline models and their corresponding reading-order-relation-enhanced variants for VrD-SER, VrD-EL and VrD-QA. Entities are marked with shading and distinguished by color. Entity links are marked with arrows.

Figure 6 shows specific corrections. In the visual QA example (bottom row), the baseline model confuses the price for “3 Posters” by picking a nearby number ($15,000). The RORE model, guided by the reading order relations (which likely link the line item row-wise to its correct price), correctly identifies the value ($27,000).

Conclusion and Implications

This paper highlights a subtle but critical flaw in how we have historically processed documents: the assumption of linearity. By accepting that document layouts are inherently non-linear and graph-like, the authors have:

  1. Redefined the Problem: Proposed “Immediate Succession During Reading” (ISDR) as a Directed Acyclic Relation.
  2. Improved Prediction: Built a relation-extraction model that far outperforms sequence-based predecessors.
  3. Boosted Application: Developed the RORE pipeline, proving that explicit reading order information—even when machine-generated—can significantly enhance performance in complex information extraction and QA tasks.

For students and practitioners in NLP and Computer Vision, this work suggests that “structure” is just as important as “content.” When building models for visually-rich data, we must provide the architecture with the right inductive biases—in this case, the knowledge that reading is a branching, relational process, not just a straight line.