Imagine you are a computer trying to read a 100-page financial report. To you, it isn’t a structured document with chapters, sections, and paragraphs. It is likely just a stream of text segments, perhaps garbled by Optical Character Recognition (OCR), where a single sentence might be split across three different lines.

How do you figure out that “Section 3.1” is a subheading of “Chapter 3,” or that the broken sentence on line 50 connects to line 51?

This problem is known as Document Logical Structuring. It is the process of transforming a sequence of text segments into a hierarchical tree structure. While humans do this intuitively, machines struggle, especially when documents are long and complex.

In this post, we will take a deep dive into SEG2ACT, a research paper that proposes a novel way to solve this. Instead of treating this as a standard classification problem, the researchers reframe it as an action generation task. By teaching a model to “act” out the structure-building process step-by-step, they achieve state-of-the-art results.

The Problem with Current Methods

Before we examine the solution, we need to understand the specific challenges of document logical structuring.

As illustrated in Figure 1 below, the goal is to take raw text segments (often from an OCR engine) and map them into a logical tree (headings and paragraphs).

Figure 1: The illustration of document logical structuring task, which aims to transform text segments into a hierarchical tree structure containing the document’s headings and paragraphs.

This task faces two major hurdles:

  1. Complexity and Noise: Real-world documents are multi-page and lengthy. OCR tools often break content into incomplete lines rather than semantic paragraphs. This makes tracking long-range dependencies—like realizing a subsection on page 10 belongs to a chapter starting on page 8—very difficult.
  2. Structural Diversity: A financial report looks nothing like a scientific paper. Designing a rule-based system that handles both is nearly impossible.

Traditional Deep Learning approaches often break this into a pipeline: first, extract features, then detect headings, then predict relationships between nodes. The downside? Error propagation. If the first step fails, the whole structure collapses. Furthermore, these methods usually look at text in isolation or effectively “forget” the global context of the document.

The SEG2ACT Approach

The core innovation of SEG2ACT is in its name: Segment to Action.

Instead of classifying each line as “Heading” or “Paragraph” in isolation, the authors propose an end-to-end generative framework. They treat the document structuring process as a sequential decision-making game. The model reads text segments and generates a sequence of actions that build the tree dynamically.

The Architecture

The framework consists of three main components working in a loop:

  1. Action Generation Model: A Generative Language Model (GLM) that predicts what to do with the current text.
  2. Global Context Stack: A memory mechanism that keeps track of where we are in the document hierarchy.
  3. Structure Update: The actual execution of the actions to build the tree.

Let’s look at the high-level workflow:

Figure 2: A generation step of SEG2ACT. The action generation model converts current text segments into actions to incrementally construct the document logical structure. A global context stack is maintained to enhance the model’s global awareness, while the generated actions are then employed to update the stack.

As shown in Figure 2, the model takes text segments as input. It outputs an “Action Sequence.” These actions update the logical tree and the Global Context Stack, which is then fed back into the model for the next step. This feedback loop ensures the model never loses track of the “big picture.”

1. Defining the Actions

To make this work, the researchers simplified the complex tree-building process into three fundamental actions. This allows the model to handle any document type using the same vocabulary.

  1. New Level-k Heading (+): This action tells the system to create a new heading node at depth \(k\). For example, +++ might represent a Level-3 heading (e.g., Section 1.1.1).
  2. New Paragraph (*): This creates a new paragraph node under the current active heading.
  3. Concatenation (=): This is crucial for handling OCR noise. It tells the system, “This segment isn’t a new node; it’s a continuation of the previous text.” It appends the current text to the last added node.

By predicting +, *, or =, the model can construct complex trees and stitch together broken sentences simultaneously.
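To make the action vocabulary concrete, here is a minimal parser for the three action strings described above. The function name and tuple representation are my own illustrative choices, not from the paper:

```python
def parse_action(action: str):
    """Map a SEG2ACT-style action string to (kind, level).

    '+' repeated k times -> new level-k heading
    '*'                  -> new paragraph
    '='                  -> concatenation with the previous node
    """
    if action and set(action) == {"+"}:
        return ("heading", len(action))
    if action == "*":
        return ("paragraph", None)
    if action == "=":
        return ("concat", None)
    raise ValueError(f"unknown action: {action!r}")

print(parse_action("+++"))  # ('heading', 3)
```

Because the heading level is encoded directly in the action string’s length, the same three-symbol vocabulary covers arbitrarily deep trees.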

2. The Global Context Stack

The most significant challenge in long documents is long-range dependency. When the model is on page 50, how does it know it’s still inside “Chapter 4”?

Standard models might try to encode the entire document history, which is computationally expensive and noisy. SEG2ACT solves this with a Global Context Stack.

The stack does not store the whole document history. Instead, it stores the active path in the hierarchy tree.

  • When a new Heading starts, the stack “pops” completed sections and “pushes” the new heading.
  • When a paragraph is added, it sits on top of the stack.
  • The stack effectively compresses the global information into a concise list of relevant parent nodes.
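One plausible way to implement these stack updates is sketched below. The paper does not prescribe an API; the representation of a paragraph as a sentinel level 99 (so it always sits on top and is replaced by the next node) is an illustrative convention:

```python
def update_stack(stack, action, text):
    """Apply one SEG2ACT-style action to the context stack.

    stack: list of (level, text) pairs; headings have level >= 1,
    and a paragraph is modeled as level 99 so it sits on top.
    """
    if set(action) == {"+"}:                 # new level-k heading
        k = len(action)
        while stack and stack[-1][0] >= k:   # pop deeper or sibling nodes
            stack.pop()
        stack.append((k, text))
    elif action == "*":                      # new paragraph
        while stack and stack[-1][0] == 99:  # replace the previous paragraph
            stack.pop()
        stack.append((99, text))
    elif action == "=":                      # continuation of the last node
        level, prev = stack.pop()
        stack.append((level, prev + " " + text))
    return stack

stack = []
update_stack(stack, "+", "Government Bonds")
update_stack(stack, "++", "Credit Quality Analysis")
update_stack(stack, "*", "The issuer maintains a")
update_stack(stack, "=", "strong liquidity position.")
# stack now holds only the active path: two headings plus the merged paragraph
```

Notice that no matter how long the document gets, the stack stays as short as the active path, which is what keeps the context compact.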

Table 1 demonstrates this beautifully:

Table 1: A demonstration example of the model template in a single prediction step. It utilizes the global context stack and multi-segment multi-action strategy.

In the table above, look at the “STACK” section. It shows the hierarchy leading up to the current moment:

  • Level 1: Government Bonds…
  • Level 2: Credit Quality Analysis…
  • Level 3: Use of Proceeds

This gives the model the context it needs to decide that the incoming text segment (“Payment Security Analysis”) should likely start a new section or continue the current one.

3. Multi-segment Multi-action Strategy

Processing a document one line at a time is slow and lacks local context. SEG2ACT employs a “Multi-segment Multi-action” strategy.

The model reads a window of inputs (\(w_I\) segments) and predicts a corresponding window of outputs (\(w_O\) actions).

  • Input: Line A, Line B, Line C
  • Output: Action A, Action B, Action C

This allows the model to “peek ahead.” If Line A is a heading, Line B is a sentence, and Line C is the rest of that sentence, seeing all three helps the model correctly predict New Heading, New Paragraph, and Concatenation.
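The windowed loop can be sketched as follows. The real system uses a fine-tuned generative language model; `predict_actions` here is a stand-in stub, and the function names are illustrative:

```python
def structure_document(segments, predict_actions, w_i=3, w_o=3):
    """Slide over the document, reading w_i segments per step and
    committing up to w_o predicted actions (a sketch of the
    multi-segment multi-action strategy)."""
    actions, stack = [], []
    i = 0
    while i < len(segments):
        window = segments[i : i + w_i]
        step_actions = predict_actions(stack, window)[:w_o]
        actions.extend(step_actions)
        i += len(step_actions)  # advance by the actions actually committed
    return actions

# Toy predictor: treat ALL-CAPS lines as level-1 headings and
# everything else as paragraphs (purely for demonstration).
toy = lambda stack, window: ["+" if s.isupper() else "*" for s in window]
print(structure_document(["INTRO", "First line.", "Second line."], toy))
# ['+', '*', '*']
```

Committing several actions per model call is also what drives the inference speedups reported later in the post.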

Training the Model

The model is trained using a standard teacher-forcing cross-entropy loss function. Essentially, the model is penalized if the sequence of actions it generates differs from the ground truth sequence derived from labeled documents.

Equation 1

Here, \(s_i\) represents the Global Context Stack and \(x\) represents the text segments. The goal is to maximize the probability of the correct action sequence \(y\).
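Given those definitions, the teacher-forcing objective presumably takes the standard negative log-likelihood form; this is a reconstruction consistent with the surrounding text, not copied from the paper:

```latex
\mathcal{L}(\theta) = -\sum_{i} \log P_{\theta}\left(y_i \mid x_i, s_i\right)
```

where \(y_i\) is the ground-truth action sequence at step \(i\), \(x_i\) the corresponding input segments, \(s_i\) the Global Context Stack, and \(\theta\) the model parameters. Minimizing this loss is equivalent to maximizing the probability of the correct action sequence.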

Experimental Results

The researchers tested SEG2ACT against several strong baselines, including TRACER (a text-only baseline) and multimodal methods like MTD (which use visual layout features).

They used two datasets:

  1. ChCatExt: A Chinese document dataset with headings and paragraphs.
  2. HierDoc: An English scientific dataset focusing on Table of Contents (heading) extraction.

Superior Performance

The results were impressive. On the ChCatExt dataset, SEG2ACT significantly outperformed the baselines.

Table 2: Overall performance on ChCatExt (Heading, Paragraph, Total nodes in F1-score and logical structure accuracy at the document level). TRACER* refers to our implemented results.

Look at the DocAcc (Document Accuracy) column in Table 2. This metric checks if the entire generated tree matches the ground truth perfectly.

  • Using the Baichuan-7B backbone, SEG2ACT achieved 63.69% accuracy, compared to 53.85% for the baseline TRACER.
  • It also scored higher on F1-scores for individual headings and paragraphs.
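DocAcc can be thought of as exact-match accuracy over whole trees. A toy rendering, with trees represented as action sequences (this is illustrative, not the paper’s evaluation code):

```python
def doc_accuracy(predicted_trees, gold_trees):
    """Fraction of documents whose generated structure matches the
    ground truth exactly (a toy version of the DocAcc metric)."""
    exact = sum(p == g for p, g in zip(predicted_trees, gold_trees))
    return exact / len(gold_trees)

print(doc_accuracy([["+", "*"], ["+", "="]],
                   [["+", "*"], ["+", "*"]]))  # 0.5
```

Because a single wrong action fails the whole document, DocAcc is a much stricter test than per-node F1.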

Beating Multimodal Models with Text Only

Perhaps the most surprising result came from the HierDoc dataset. Usually, one would expect models that “see” the document (using visual layout features) to perform better at understanding structure.

Table 3: Heading detection (HD) in F1-score and ToC in TEDS (%) of baselines and SEG2ACT on HierDoc.

As shown in Table 3, SEG2ACT (using only text) outperformed MTD and CMM (which use Text + Layout + Vision).

  • ToC extraction, measured in TEDS (a tree-edit-distance-based similarity): 96.3% vs 88.1%.

This suggests that with a strong enough language model and the Global Context Stack, semantic understanding can trump visual layout cues for logical structuring.

Transfer Learning Capabilities

A major pain point in document AI is that models trained on one type of document (e.g., invoices) usually fail on another (e.g., resumes). The researchers tested SEG2ACT’s ability to generalize using Zero-Shot and Few-Shot settings.

Table 4: Performance (F1-score of total nodes) on transfer learning experiments in zero-shot, few-shot and full-shot settings on three sub-corpora of ChCatExt.

Table 4 shows the results across different document types (Bid Announcements, Financial Announcements, Credit Ratings).

  • Zero-Shot: In the “FinAnn” category, SEG2ACT achieved an F1 score of 43.30, while the baseline TRACER managed only 11.39.
  • Few-Shot: With just a small amount of training data, SEG2ACT adapts very quickly, approaching full-shot performance.

This suggests that SEG2ACT isn’t just memorizing specific keywords; it is learning the abstract concept of logical structure, making it highly transferable.

Efficiency and Context Windows

Finally, does the “Multi-segment” strategy actually help?

Table 6: The F1-score of total nodes (inference time per document) of scaling the lengths of the input segment window and output action window for SEG2ACT on ChCatExt. Baseline refers to TRACER in Baichuan-7B.

Table 6 confirms that it does.

  1. Performance: Increasing the input window (\(w_I\)) from 1 to 3 improves the F1-score (from 91.56 to 92.63) because the model has more local context.
  2. Speed: Increasing the output action window (\(w_O\)) significantly reduces inference time. Generating 5 actions at once is nearly 5x faster than generating them one by one, with minimal loss in accuracy.

Conclusion

The SEG2ACT paper presents a shift in how we think about document structuring. Rather than relying on complex pipelines or heavy multimodal features, the authors show that action generation is a powerful paradigm.

By combining a generative language model with a Global Context Stack, SEG2ACT solves the problem of long-range dependencies that has plagued previous methods. It effectively remembers “where it is” in the document hierarchy, allowing it to construct accurate trees even for lengthy, complex documents.

Key Takeaways for Students:

  • Reframing the Task: Sometimes, changing a classification problem into a generation problem (sequence-to-sequence) unlocks better performance.
  • Memory Matters: In long sequences (like documents), standard attention mechanisms aren’t always enough. Explicit memory structures like the “Stack” can guide the model.
  • Less is More: You don’t always need visual features. A strong semantic understanding of text, guided by the right structural constraints, can outperform multimodal models.

As Large Language Models continue to evolve, methods like SEG2ACT that combine the raw power of LLMs with structured, logic-aware components will likely become the standard for processing the world’s unstructured data.