Data is the lifeblood of modern research and business analytics. Whether it’s tracking competitor prices, aggregating news, or building datasets for machine learning, the ability to extract structured data from the web—web scraping—is a critical skill.
However, anyone who has tried to build a web scraper knows the pain. Websites change structure, HTML tags are messy, and maintaining a scraper for hundreds of different sites is a logistical nightmare.
Traditionally, we have had two choices: spend hours manually coding rules for every single website, or pay a fortune to have Large Language Models (LLMs) parse every single page individually. Neither is scalable.
In this post, we are diving into AUTOSCRAPER, a new framework proposed by researchers from Fudan University and Alibaba. This paper introduces a “progressive understanding” agent that solves the scalability problem. Instead of doing the scraping itself, it uses LLMs to write the code that does the scraping, combining the intelligence of AI with the efficiency of traditional code.
The Web Scraping Dilemma
To understand why AUTOSCRAPER is necessary, we first need to look at the limitations of current methods. As illustrated in Figure 1 of the paper, there are two main paradigms for extracting data.

1. The Wrapper-Based Method
This is the traditional approach. A programmer (or an automated script) analyzes the HTML of a specific webpage and writes a “wrapper”—a set of rules (usually XPaths or CSS selectors) that says, “Go to this div, then inside this span, and grab the text.”
- Pros: Once written, these scripts are incredibly fast and cheap to run.
- Cons: They are brittle. If the website layout changes, the wrapper breaks. More importantly, they are not scalable. If you need to scrape 1,000 different e-commerce sites, you need to write 1,000 different wrappers.
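To make the trade-off concrete, here is what a minimal hand-written wrapper looks like with lxml. The page layout and the XPath rule are invented for illustration; a real project needs one such rule per site, and the rule breaks as soon as the markup changes.

```python
# A minimal hand-written "wrapper": a fixed XPath rule for one specific layout.
# The HTML and the XPath below are hypothetical, purely for illustration.
from lxml import html

PRODUCT_PAGE = """
<html><body>
  <div class="product">
    <span class="title">Mechanical Keyboard</span>
    <div class="pricing"><span class="price">$79.99</span></div>
  </div>
</body></html>
"""

tree = html.fromstring(PRODUCT_PAGE)
# The rule: go to the pricing div, then grab the text of its price span.
price = tree.xpath('//div[@class="pricing"]/span[@class="price"]/text()')
print(price)  # ['$79.99'] -- but rename one class and this silently returns []
```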
2. The Language-Agent-Based Method
With the rise of GPT-4 and other LLMs, a new method emerged. You simply feed the HTML to an LLM and ask, “What is the price of the item?”
- Pros: Highly adaptable. You don’t need to write code; the AI “understands” the page.
- Cons: Extremely expensive and slow. Sending the full HTML of every single product page to an API like GPT-4 is cost-prohibitive for large-scale data collection.
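For contrast, the language-agent paradigm in code is barely more than a prompt. In this sketch, `call_llm` is a placeholder for whatever chat-completion client you use (not a real API); the important part is that every single page costs a full LLM call.

```python
# Sketch of LLM-based extraction: ship the raw HTML to the model and ask.
# `call_llm` is a hypothetical stand-in for your LLM client of choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_price(page_html: str) -> str:
    prompt = (
        "Here is the HTML of a product page:\n"
        f"{page_html}\n\n"
        "What is the price of the item? Answer with the price only."
    )
    return call_llm(prompt)  # one slow, paid API call per page, forever
```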
The AUTOSCRAPER Solution
As shown in the bottom section of Figure 1, AUTOSCRAPER takes a hybrid approach. It acts as a manager. It uses an LLM to analyze a few pages and generate a reusable wrapper (a set of extraction rules). Once the wrapper is generated, you can run it on thousands of pages without needing the LLM anymore. It offers the intelligence of agents with the efficiency of wrappers.
The Core Framework
The goal of AUTOSCRAPER is to generate an Action Sequence. This isn't a single XPath; it is a sequence of XPath expressions (XPath is a query language for navigating XML and HTML documents) that progressively prunes the web page down to the exact data point needed.
The researchers define an action sequence \(\mathcal{A}_{seq}\) as a list of XPath expressions:

\[
\mathcal{A}_{seq} = [a_1, a_2, \ldots, a_n],
\]

where each \(a_i\) is executed in order, with the result of one step becoming the scope for the next.
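Executing such a sequence simply means applying each XPath to the node matched by the previous one. Here is a minimal sketch with lxml; the page snippet and XPaths are illustrative, not taken from the paper.

```python
# Run an action sequence: each XPath narrows the scope set by the previous one.
from lxml import html

def run_action_sequence(page_html: str, action_seq: list[str]):
    node = html.fromstring(page_html)
    for xpath in action_seq:
        matches = node.xpath(xpath)
        if not matches:      # the sequence fails: nothing to extract
            return None
        node = matches[0]    # prune: keep only the matched subtree
    return node

page = '<div id="stats"><table><tr><td class="ppg">31.4</td></tr></table></div>'
seq = ['//div[@id="stats"]', './/td[@class="ppg"]']
cell = run_action_sequence(page, seq)
print(cell.text if cell is not None else "no match")  # -> 31.4
```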
The framework operates in two distinct phases: Progressive Generation and Synthesis.

Phase 1: Progressive Generation
One of the biggest challenges for LLMs is the sheer size of HTML documents. A modern webpage can have thousands of lines of code, making it easy for an LLM to hallucinate or get lost in the nested div tags.
AUTOSCRAPER solves this by progressive understanding. It doesn’t try to guess the perfect XPath in one shot. Instead, it navigates the HTML structure using a DOM (Document Object Model) tree strategy.
Top-Down Traversal
The agent starts at the root of the HTML document and moves down, attempting to generate an XPath that locates the target information. If the generated path successfully extracts the data, great. But often the path is too broad or simply points to the wrong node.
Step-Back Operation
This is where the magic happens. If the generated XPath fails or points to the wrong node, the system performs a “step-back.” It moves up the DOM tree to a parent node that is more reliable, pruning the irrelevant parts of the HTML. It effectively says, “Let’s zoom out and try to find a better anchor point.”
The specific logic for this back-and-forth negotiation with the HTML is detailed in Algorithm 1:

By iteratively refining the scope (pruning the tree), the LLM is fed smaller, more relevant chunks of HTML, significantly increasing the accuracy of the final XPath generation.
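In spirit, the loop boils down to "propose an XPath, execute it, and either descend or step back." The sketch below captures that shape; `ask_llm_for_xpath` and the simple substring check are placeholders for the paper's actual prompts and verification logic, not its code.

```python
# Rough sketch of progressive generation with step-back (not the paper's code).
from lxml import etree, html

def ask_llm_for_xpath(node_html: str, target: str) -> str:
    """Hypothetical LLM call that proposes an XPath locating `target`."""
    raise NotImplementedError("plug in your LLM client here")

def _text(match) -> str:
    # xpath() can return elements or plain strings; normalize both to text
    return match.text_content() if hasattr(match, "text_content") else str(match)

def progressive_generate(page_html: str, target: str, max_steps: int = 10):
    root = html.fromstring(page_html)
    node, action_seq = root, []           # start at the root of the DOM tree
    for _ in range(max_steps):
        snippet = html.tostring(node, encoding="unicode")
        xpath = ask_llm_for_xpath(snippet, target)
        try:
            matches = node.xpath(xpath)
        except etree.XPathError:
            matches = []                  # the LLM produced an invalid XPath
        if matches and target in _text(matches[0]):
            action_seq.append(xpath)      # this step still "sees" the target
            if _text(matches[0]).strip() == target:
                return action_seq         # pruned down to exactly the target
            if hasattr(matches[0], "xpath"):
                node = matches[0]         # descend: continue inside this subtree
        else:
            # Step-back: the XPath missed, so zoom back out to the parent node
            # and let the LLM retry from a more reliable anchor point.
            if action_seq:
                action_seq.pop()
            parent = node.getparent()
            node = parent if parent is not None else root
    return None                           # no executable sequence was found
```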
Phase 2: Synthesis
A wrapper might work perfectly on Page A of a website but fail on Page B because of a slight layout variation (e.g., Page B has a discount banner that shifts the price location).
To ensure robustness, AUTOSCRAPER doesn’t trust a single page. It uses a Synthesis Module.
- It selects a small set of “seed” webpages (e.g., 3 different product pages).
- It generates candidate action sequences for all of them.
- It executes these sequences across the seed pages.
- It selects the sequence that achieves the highest success rate across all seed pages.
This ensures the final scraper is “generalizable” to the entire website, not just overfitting to a single example.
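A compact sketch of that selection step, reusing the `run_action_sequence` helper from the earlier sketch (candidate sequences, seed pages, and their expected values are supplied by the caller; none of this mirrors the paper's exact implementation):

```python
# Synthesis: score every candidate sequence on every seed page, keep the best.
def synthesize(candidates, seed_pages, expected_values):
    def success_rate(seq) -> float:
        hits = 0
        for page_html, expected in zip(seed_pages, expected_values):
            node = run_action_sequence(page_html, seq)  # helper sketched above
            if node is not None and node.text_content().strip() == expected:
                hits += 1
        return hits / len(seed_pages)
    # The winner is the sequence that generalizes best across the seed pages.
    return max(candidates, key=success_rate)
```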
Why Traditional Metrics Fail
In standard Information Extraction (IE) tasks, we measure success using Precision, Recall, and F1 scores based on the text extracted. However, for scraper generation, these metrics can be misleading.
If a generated scraper works on 90% of the data but crashes or returns nothing on the other 10%, it is fundamentally broken as a software tool. The researchers propose a new metric called Executability.

They classify the generated scrapers into categories:
- Correct: Perfect Precision, Recall, and F1.
- Un-executable: Fails to identify relevant instances (Recall = 0).
- Over-estimate: Extracts garbage where there should be nothing.
This shift in measurement is crucial because it prioritizes generating functional code over just getting the text right once.
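As a rough illustration, here is how the three outcomes above could be assigned from a generated scraper's precision and recall; this is a simplified reading of the taxonomy for exposition, not the paper's evaluation code.

```python
# Simplified classification of a generated scraper (illustrative only).
def classify_scraper(precision: float, recall: float) -> str:
    if precision == 1.0 and recall == 1.0:
        return "correct"         # perfect precision, recall, and F1
    if recall == 0.0:
        return "unexecutable"    # never identifies a relevant instance
    if precision < 1.0:
        return "over-estimate"   # extracts values where there should be nothing
    return "partially correct"   # executable, but misses some instances
```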
Experimental Results
The researchers tested AUTOSCRAPER against standard baselines like Chain-of-Thought (COT) and Reflexion (a method where LLMs reflect on their errors). They used multiple LLM backends, including GPT-3.5, GPT-4, and open-source models like Llama and Mixtral.
1. Superiority Over Baselines
The results were decisive. AUTOSCRAPER consistently generated more usable scrapers than the baselines.
Take a look at the comparison in Figure 4. Here, the task was to find James Harden’s average points.
- COT failed because it grabbed the “Assists” column or the raw “Total Points” instead of the average.
- Reflexion tried to correct itself but still struggled to distinguish between different statistical categories in the complex table.
- AUTOSCRAPER (green check) successfully navigated the structure to find the “PPG” (Points Per Game) column.

2. The Impact of Synthesis
How important is that second phase—checking the scraper against multiple seed pages? The ablation study below (Table 3) tells the story.

When the Synthesis module is removed (rows marked “- synthesis”), the “Correct” rate drops significantly, and the “Unexecutable” rate rises. For example, with GPT-4-Turbo, using Synthesis boosted the “Correct” rate from 65.31% to 71.56%.
3. Efficiency and Cost
This is the most practical finding for developers. While generating the scraper takes time upfront, it pays off when you have to scrape many pages.
The researchers modeled the time cost. Direct extraction (asking the LLM to scrape every page) scales linearly—every page costs a full LLM call, no matter how many pages you process. AUTOSCRAPER pays a one-time setup cost (generation), after which running the scraper on each page takes only a fraction of a second.

As shown in Table 6, the “break-even” point is surprisingly low. For the “Auto” domain, direct extraction (\(T_d\)) takes 8.27 seconds per page. The generated scraper takes only 0.30 seconds.
Calculating the threshold using their derived equation: the generated scraper wins once the number of pages \(n\) satisfies

\[
T_g + n \cdot T_e < n \cdot T_d \quad\Longrightarrow\quad n > \frac{T_g}{T_d - T_e},
\]

where \(T_g\) is the one-time generation cost, \(T_d\) the per-page cost of direct LLM extraction, and \(T_e\) the per-page cost of running the generated scraper.
The experiments show that if you need to scrape more than ~20 pages from a website, AUTOSCRAPER is faster (and cheaper) than using an LLM directly. Given that most scraping tasks involve thousands of pages, the efficiency gains are massive.
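Plugging the Table 6 numbers into that threshold, with the one-time generation cost \(T_g\) assumed purely for illustration (picked to be consistent with the ~20-page figure above):

```python
# Break-even check for the "Auto" domain. T_g is an assumed value, not reported here.
T_d = 8.27    # sec/page, direct LLM extraction (Table 6)
T_e = 0.30    # sec/page, running the generated scraper (Table 6)
T_g = 160.0   # sec, one-time scraper generation (assumed for illustration)

break_even = T_g / (T_d - T_e)
print(f"The generated scraper wins after ~{break_even:.0f} pages")  # ~20 pages
```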
The Role of Seed Websites
One interesting parameter explored was the number of seed websites used to generate the scraper. Does showing the LLM more examples help?

Figure 3 shows that as the number of seed websites increases (from 1 to 5), the “Correct” percentage (gray line) trends upward, and the “Unexecutable” rate drops. This confirms that giving the model a slightly broader view of the website’s variety helps it write more robust code.
Conclusion
AUTOSCRAPER represents a significant step forward in autonomous data collection. It moves us away from the binary choice of “brittle manual scripts” vs. “expensive AI extraction.”
By treating the LLM as a generator of tools rather than the tool itself, we can achieve:
- Scalability: Handle new websites without manual coding.
- Robustness: Use synthesis to ensure rules work across different pages.
- Efficiency: Run extraction at the speed of standard code, not the speed of an LLM.
For students and developers interested in web agents, this paper highlights the importance of structural understanding. It’s not enough for an AI to read the text; for tasks like scraping, the AI must understand the underlying skeleton (the HTML DOM) that holds the web together. As LLMs continue to improve at coding and reasoning, frameworks like AUTOSCRAPER will likely become the standard for automated information gathering.