Introduction

In the era of large language models and multimodal deep learning, data is the fuel that powers innovation. Researchers and students alike often rely on massive, publicly available datasets to benchmark new architectures. We assume these datasets are benign—collections of innocuous text and images curated for scientific progress. But what happens when we look closer?

The IIT-CDIP test collection is a behemoth in the document understanding field. Containing over 7 million scanned documents (roughly 40 million pages) from legal proceedings against tobacco companies, it serves as the parent source for several critical benchmarks, including RVL-CDIP (document classification), DocVQA (visual question answering), and FUNSD (form understanding).

However, a recent comprehensive audit of these datasets revealed a startling reality: they are leaking sensitive Personally Identifiable Information (PII). We are not talking about harmless metadata; we are talking about unredacted US Social Security Numbers (SSNs), home addresses, dates of birth, and detailed medical statuses.

This blog post explores a significant research paper that tackles this ethical and privacy nightmare head-on. The researchers did not just identify the problem; they developed a robust, modular pipeline to de-identify these documents. Their approach goes beyond simple black bars—which can degrade machine learning performance—and instead employs synthetic replacement. By swapping real sensitive data with realistic, visually augmented fake data, they aim to preserve the utility of these datasets while protecting individual privacy.

Figure 1: Left: snippets from RVL-CDIP documents showing sensitive personal information. Right: an example of document de-identification with synthetic replacement.

As shown in Figure 1, the difference is subtle to the eye but massive for privacy. On the left, real sensitive data (redacted in the image for safety) exposes individuals to identity theft. On the right, the data has been synthetically replaced, maintaining the document’s structure without the risk.

Background: The Scale of the Leak

To understand the solution, we must first understand the scope of the problem. The researchers investigated five specific datasets derived from the massive IIT-CDIP collection:

  1. RVL-CDIP: 400,000 images, used for classification.
  2. Tobacco3482: 3,482 images, used for classification.
  3. Tobacco800: 1,290 images, used for signature detection.
  4. FUNSD: 199 images, used for form understanding.
  5. DocVQA: 12,767 images, used for visual question answering.

The documents in these datasets date from the 1950s to the early 2000s. They are scanned, noisy, and often handwritten—a nightmare for standard Optical Character Recognition (OCR) and text mining tools.

Why Automated Tools Failed

You might ask, “Why not just run a PII detection tool like Amazon Comprehend or Microsoft Presidio?” The researchers asked this too. They tested four major off-the-shelf tools: Microsoft Presidio, Amazon Comprehend, Google DLP, and Microsoft Azure Language Service.

The results highlighted a significant gap in current technology.

Table 1: Sensitive personally-identifiable information (PII) entity categories found in the five datasets derived from IIT-CDIP. Not all entity categories are supported by off-the-shelf detectors.

As Table 1 illustrates, coverage is spotty. While most tools support US Social Security Numbers (SSN), many other critical categories—such as birth places, home addresses, and religious affiliation—are completely unsupported by these automated systems. A model cannot redact what it cannot define.
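To make that coverage gap concrete, here is a minimal sketch of pointing one of these tools, Microsoft Presidio, at OCR text from a scanned page. The sample text and the requested entity list are illustrative assumptions, not material from the paper; the key point is that only categories the tool actually defines (such as US_SSN) can be requested at all.

```python
# pip install presidio-analyzer  (also requires a spaCy model, e.g. en_core_web_lg)
from presidio_analyzer import AnalyzerEngine

# Illustrative OCR output from a scanned form; not taken from the datasets.
ocr_text = "Employee: John Doe   SSN 123-00-6789   Date of Birth: April 10, 1948"

analyzer = AnalyzerEngine()

# Only built-in entity types can be requested; categories like birth place or
# religious affiliation have no recognizer and therefore cannot be detected.
results = analyzer.analyze(
    text=ocr_text,
    entities=["US_SSN", "DATE_TIME", "PERSON"],
    language="en",
)

for r in results:
    print(r.entity_type, ocr_text[r.start:r.end], round(r.score, 2))
```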

Even for the categories they do support, the performance was inconsistent. The researchers measured document-level recall—essentially, “If a document contains an SSN, does the tool flag it?”
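The metric itself is simple. Here is a minimal sketch (not the authors' evaluation code) of document-level recall for a single entity category:

```python
def document_level_recall(has_ssn, flagged):
    """has_ssn: doc_id -> True if the document truly contains an SSN.
    flagged:  doc_id -> True if the tool reported at least one SSN in it."""
    positives = [doc for doc, truth in has_ssn.items() if truth]
    hits = sum(1 for doc in positives if flagged.get(doc, False))
    return hits / len(positives) if positives else 0.0

# Toy example: two of three SSN-bearing documents are caught -> recall 0.67
truth = {"doc1": True, "doc2": True, "doc3": True, "doc4": False}
preds = {"doc1": True, "doc2": False, "doc3": True, "doc4": False}
print(round(document_level_recall(truth, preds), 2))  # 0.67
```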

Table 2: Automated detection performance measured in document-level recall.

Table 2 shows the grim reality. While Microsoft Presidio and Google DLP performed admirably on SSNs (0.97 and 0.93 recall, respectively), Azure and Amazon missed roughly 30% of the documents containing valid SSNs.

The failures were often due to the specific nature of document layouts. These tools are primarily designed for born-digital text streams (like emails or chat logs), not noisy OCR outputs from 1980s scans.

Figure 2: Example SSN detection failures, apparently due to a limited context window (top: Google) and a missing context keyword (bottom: Amazon).

Figure 2 demonstrates these failure modes. In the top example, a limited context window prevented the model from linking the SSN to the individual. In the bottom example, the lack of a specific keyword (like “SSN:”) caused the tool to overlook a clearly visible social security number.

The Manual Audit

Given the limitations of automated tools, the research team undertook a massive manual inspection. A team of annotators reviewed thousands of documents. The findings were significant:

  • Total Documents with PII: Over 16,000.
  • SSNs Found: Over 2,400.
  • Other PII: Thousands of birth dates, home addresses, and marital statuses.

This confirmed that relying solely on automated scrubbing is insufficient for archival document datasets. A more robust, human-in-the-loop approach was necessary.

Core Method: De-Identification via Synthetic Replacement

The heart of this paper is the proposed de-identification pipeline. The goal was not just to remove information, but to replace it in a way that keeps the document useful for computer vision and Natural Language Processing (NLP) tasks.

If you simply delete text or put a black box over it, you change the visual features of the image. For a model trained to recognize document layouts (like LayoutLM), a black box is a foreign object that might confuse the model. The researchers proposed a synthetic replacement strategy.

1. Bounding Box Annotation

The first step utilized the manual annotations collected during the audit. The researchers identified the exact bounding box coordinates \((x, y, w, h)\) for every piece of sensitive text.

Figure 3: Distribution of redacted region ratios for sampled RVL-CDIP resume images.

Figure 3 shows the distribution of how much space this sensitive data takes up. For the vast majority of documents, the PII occupies less than 2% of the total page area. This is good news; it means we can surgically alter these small regions without changing the global structure of the document.
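Here is a minimal sketch of how such region annotations might be represented and how the redacted-area ratio in Figure 3 could be computed; the field names and pixel values are illustrative assumptions, not the authors' annotation schema.

```python
from dataclasses import dataclass

@dataclass
class PiiRegion:
    """One sensitive text span located on a document image."""
    category: str   # e.g. "SSN", "HOME_ADDRESS", "BIRTH_DATE"
    x: int          # top-left corner, in pixels
    y: int
    w: int          # box width, in pixels
    h: int          # box height, in pixels

def redacted_area_ratio(regions, page_w, page_h):
    """Fraction of the page covered by sensitive regions (overlaps ignored)."""
    return sum(r.w * r.h for r in regions) / (page_w * page_h)

# Toy example: two small fields on a letter-size scan at 300 dpi (2550 x 3300 px)
regions = [PiiRegion("SSN", 400, 520, 360, 40),
           PiiRegion("BIRTH_DATE", 400, 580, 300, 40)]
print(f"{redacted_area_ratio(regions, 2550, 3300):.2%}")  # ~0.31%, well under 2%
```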

2. Redaction Strategies

The researchers experimented with three distinct redaction styles to determine which best preserved data utility.

Figure 4: The three redaction approaches investigated: black (left), white (center), and pseudonymization with fake data (right).

As seen in Figure 4:

  1. Black Redaction: The standard “government classified” look. High contrast, very obvious, but introduces strong artificial edges into the image.
  2. White Redaction: Masking the text with white pixels. This makes the text disappear into the background. It is cleaner than black boxes but leaves gaps in the text flow.
  3. Pseudonymization (Synthetic Replacement): Replacing the sensitive entity with a fake, semantically equivalent entity.
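At the pixel level, the three styles differ only in how the bounding box is filled. Below is a minimal, deliberately simplified sketch using Pillow: the synthetic text is rendered in a plain default font here, whereas the paper's pipeline adds visual augmentation on top (covered in the next sections).

```python
from PIL import Image, ImageDraw

def redact(img, box, style, fake_text=None):
    """Apply one redaction style to a bounding box.
    box = (x, y, w, h); style is "black", "white", or "pseudo"."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    x, y, w, h = box
    if style == "black":
        draw.rectangle([x, y, x + w, y + h], fill=0)      # opaque black bar
    elif style == "white":
        draw.rectangle([x, y, x + w, y + h], fill=255)    # erase into the background
    elif style == "pseudo":
        draw.rectangle([x, y, x + w, y + h], fill=255)    # clear the original text
        draw.text((x + 2, y + 2), fake_text, fill=0)      # render the synthetic value
    return out

# Illustrative usage on a grayscale scan; the file name is a placeholder.
page = Image.open("resume_scan.png").convert("L")
pseudo = redact(page, (400, 520, 360, 40), "pseudo", fake_text="123-45-6789")
```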

3. The Synthetic Pipeline

The Pseudonymization approach is the most technically interesting. It involves two main steps: Generation and Augmentation.

Step A: Generation

Using libraries like Faker, the system generates plausible replacements.

  • If the PII is a date “April 10, 1948”, the system generates a random date like “May 14, 1947”.
  • If it is an SSN “123-00-6789”, it generates a valid-format fake SSN.
  • If it is a city “Chicago”, it pulls from a gazetteer of cities to pick “Seattle”.

This ensures that the semantics of the document remain intact. A resume still looks like a resume; it just belongs to a person who doesn’t exist.
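Here is a minimal sketch of the generation step with the Faker library, assuming one generator per PII category; the category names and the specific Faker providers are illustrative choices, not the paper's exact configuration.

```python
# pip install Faker
from faker import Faker

fake = Faker("en_US")
Faker.seed(42)  # make the replacements reproducible

# One generator of plausible fake values per PII category.
GENERATORS = {
    "SSN": fake.ssn,                         # valid-format, non-real SSN
    "BIRTH_DATE": lambda: fake.date_of_birth(minimum_age=30, maximum_age=90).strftime("%B %d, %Y"),
    "CITY": fake.city,                       # drawn from Faker's place names
    "HOME_ADDRESS": fake.street_address,
    "PERSON_NAME": fake.name,
}

def replacement_for(category: str) -> str:
    return GENERATORS[category]()

print(replacement_for("SSN"))         # e.g. "123-45-6789" (format-valid, fake)
print(replacement_for("BIRTH_DATE"))  # e.g. "May 14, 1947"
```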

Step B: Visual Augmentation

Simply typing “May 14, 1947” in Arial font onto a grainy scan from 1985 would look terrible. It would be an obvious artifact that computer vision models would latch onto. The replacement text needs to look as “noisy” as the original document.

Figure 5: Examples of noise seen in documents from RVL-CDIP.

Figure 5 highlights the challenge: the original documents suffer from scan lines, fading ink, blur, and warping. To match this, the researchers used Augraphy, a library designed to simulate document degradations, together with Albumentations, a general-purpose image augmentation library.

They applied specific transformations to the rendered fake text:

  • Ink Mottling: Simulating inconsistent toner application.
  • Rotation: Slight tilts to match the scan angle.
  • Blur/Noise: Gaussian blur and grain to match low-resolution scans.
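Here is a minimal sketch of that degradation step using Albumentations; the specific transforms and parameters are assumptions chosen to illustrate the idea, and the Augraphy effects (such as ink mottling) are omitted for brevity.

```python
# pip install albumentations opencv-python
import albumentations as A
import cv2

# Degradations applied to the rendered fake-text patch so it blends into the scan.
degrade = A.Compose([
    A.Rotate(limit=2, border_mode=cv2.BORDER_REPLICATE, p=1.0),  # slight tilt, like a skewed scan
    A.GaussianBlur(blur_limit=(3, 5), p=1.0),                    # soften crisp font edges
    A.GaussNoise(p=0.8),                                         # film-like grain
])

patch = cv2.imread("fake_text_patch.png")          # placeholder path for the rendered text
noisy = degrade(image=patch)["image"]
cv2.imwrite("fake_text_patch_degraded.png", noisy)
```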

Figure 6: Various augmentations of un-augmented text (upper left). We use augmentations for the pseudonymized data.

Figure 6 demonstrates the transformation. The top-left shows clean digital text. The surrounding examples show the text after augmentation—gritty, blurry, and imperfect.

The Final Result

When you combine accurate detection, semantic generation, and visual augmentation, you get a document that is effectively “healed.”

Figure 7: Examples of documents from RVL-CDIP pseudonymized by us. Our document pseudonymization method replaces real sensitive data with fake, augmented data.

In Figure 7, notice the “Home Address” and “Birthdate” fields. They contain fake data, but visually, they blend almost seamlessly into the surrounding document. This allows a machine learning model to process the document as a “resume” containing an “address” without exposing anyone’s real home location.

Experiments & Results: Does Redaction Break the Data?

The central hypothesis of this paper is that synthetic replacement preserves the “utility” of the data better than simple redaction. To prove this, the authors conducted intrinsic and extrinsic evaluations.

Experiment 1: Document Similarity (Intrinsic)

How similar is the redacted document to the original un-redacted version? Ideally, the distance between them should be near zero because the meaning and layout haven’t fundamentally changed.

The researchers used CLIP (ViT-32), a powerful vision-language model, to compute embeddings for the original documents and their redacted counterparts. They then calculated the Cosine Similarity and Euclidean Distance between them.
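Here is a minimal sketch of that comparison using the Hugging Face CLIP implementation, assuming the ViT-B/32 checkpoint (the "ViT-32" mentioned above); this is not the authors' evaluation code, and the file names are placeholders.

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path):
    """Return the CLIP image embedding for one document scan."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).squeeze(0)

original = embed("resume_original.png")
redacted = embed("resume_pseudonymized.png")

cosine = torch.nn.functional.cosine_similarity(original, redacted, dim=0).item()
euclidean = torch.dist(original, redacted).item()
print(f"cosine similarity: {cosine:.3f}, euclidean distance: {euclidean:.3f}")
```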

Figure 8: Comparison of different redaction methods. Redacting sensitive personal data using black redactions (left) typically causes the redacted image to be more dissimilar to the original.

Figure 8 provides a visual heatmap of the results. The black redaction (left) has a lower similarity score (0.964) and higher distance (2.831) compared to the synthetic replacement (right), which achieves near-perfect similarity (0.997).

This is quantitatively backed up by the distribution of scores across the dataset.

Figure 9: Similarity score distributions for three redaction types.

In Figure 9, look at the green line (“Pseudo” / Synthetic). It peaks sharply near 1.0, indicating that for most documents, the synthetic version is almost indistinguishable from the original in the embedding space. The blue line (Black redaction) has a much longer tail to the left, indicating significant deviation.

Figure 10: Distance score distributions for three redaction types.

Figure 10 tells the same story using distance (lower is better). The green curve hugs the left axis (near zero distance), while the blue curve (Black redaction) is spread out, showing that black boxes push the image representation further away from the original.

Experiment 2: Impact on Redaction Area

Does the amount of redacted text matter? Common sense suggests that the more text you change, the more the document changes.

Figure 11: The relationship between redacted region area and cosine similarity between original and redacted (synthetic-replacement) document pairs in CLIP embedding space.

Figure 11 confirms this negative correlation. As the ratio of redacted area increases (x-axis), the cosine similarity (y-axis) drops. However, because most PII occupies a tiny fraction of the page, the impact remains minimal for the majority of the dataset.

Experiment 3: Downstream Model Performance (Extrinsic)

The ultimate test is functional: If we train or test a classifier on these redacted documents, does it get confused?

The researchers used a DiT (Document Image Transformer) model fine-tuned on RVL-CDIP. They fed it the original and redacted versions of resumes and checked the model’s confidence scores.
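Here is a minimal sketch of that check, assuming the publicly available DiT checkpoint fine-tuned on RVL-CDIP hosted on the Hugging Face Hub (microsoft/dit-base-finetuned-rvlcdip); the paper's exact checkpoint and class names may differ, and the file names are placeholders.

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "microsoft/dit-base-finetuned-rvlcdip"   # assumed checkpoint, see note above
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt)

def confidence(path, label="resume"):
    """Softmax probability assigned to one RVL-CDIP class for a document image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    return probs[model.config.label2id[label]].item()   # class name assumed to be "resume"

delta = confidence("resume_original.png") - confidence("resume_pseudonymized.png")
print(f"confidence shift after pseudonymization: {delta:.4f}")
```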

Figure 12: The relationship between redacted region relative area and confidence score difference.

Figure 12 plots the change in confidence score (Delta) against the redacted area.

  • The Result: The confidence shift is negligible. The y-axis scale is tiny (0.000 to 0.004).
  • Interpretation: Even when PII is replaced, the model still confidently identifies the document as a “resume.”
  • Comparison: The researchers found that black redactions caused a larger drop in confidence (0.0036 mean difference) compared to synthetic replacement (0.0024).

This confirms that synthetic replacement is the superior method for maintaining data integrity for downstream Machine Learning tasks.

Conclusion & Implications

The work presented in “De-Identification of Sensitive Personal Data in Datasets Derived from IIT-CDIP” serves as a wake-up call and a roadmap for the AI community.

Key Takeaways:

  1. Legacy Data is Risky: We cannot assume old, scanned datasets are free of sensitive PII. Over 16,000 documents across the IIT-CDIP-derived datasets proved otherwise.
  2. Tools Aren’t Enough: Current off-the-shelf automated detectors have blind spots, particularly with noisy, vintage document scans and specific entity types like birth places.
  3. Synthesis Works: Replacing sensitive data with visually augmented synthetic data is a viable solution. It protects privacy without sacrificing the semantic and visual integrity of the data that modern ML models rely on.

The Broader Impact: This paper pushes the standard for “Responsible AI.” As we move toward training models on everything from medical records to financial invoices, privacy cannot be an afterthought. The technique of “hiding in plain sight”—using realistic fakes to mask real secrets—offers a promising path forward. It allows the scientific community to continue using valuable real-world data without compromising the safety of the individuals described in that data.

The researchers are releasing the redacted versions of these datasets, allowing the community to switch over to a safer standard without losing the benchmarks we’ve come to rely on. It is a crucial step in maturing the field of Document AI.