Introduction
If you have ever clicked “I Agree” on a privacy policy without reading a single word, you are in the overwhelming majority. These documents are notorious for being long, dense, and filled with complex legal jargon. However, for regulators and privacy advocates, these documents are the first line of defense in understanding how corporations handle our personal data.
In recent years, the landscape of digital privacy has shifted dramatically. Landmark regulations like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have forced companies to be more transparent. They are no longer just required to say that they collect data; they must now disclose specific categories of data, list specific consumer rights, and provide clear methods for users to exercise those rights.
Here lies the problem: while the laws have modernized, the tools we use to audit compliance have not. Most Natural Language Processing (NLP) tools designed to read privacy policies were trained on datasets from before these laws existed. They can identify general concepts, but they struggle to detect the specific legal disclosures mandated by the CCPA.
This gap is what the researchers behind C3PA (CCPA Privacy Policy Provision Annotations) set out to fill. In this post, we will explore how they built the first open, regulation-aware dataset of expert-annotated privacy policies. We will look at how they sourced the data, the rigorous process of annotating legal texts, and how this new dataset enables Machine Learning models to audit regulatory compliance with unprecedented accuracy.
The Background: Why Current Tools Fail
To understand the significance of C3PA, we first need to look at the “state of the art” before this paper. The most frequently used dataset for training privacy policy models is called OPP-115. Created in 2016, it contains 115 privacy policies annotated with a taxonomy of privacy concepts.
While OPP-115 was groundbreaking at the time, it predates the CCPA (introduced in 2018). The CCPA introduced a strict set of requirements for businesses dealing with Californian consumers. For example, a policy must explicitly state whether the company sells personal information, and it must describe the user’s “Right to Delete” that information.
Because older datasets like OPP-115 were “regulation-agnostic”—meaning they were just looking for general privacy descriptions rather than compliance with a specific law—they lack the granular labels needed today. You cannot train an AI to find a “Right to Non-Discrimination” clause if the AI has never seen one labeled before. This has left auditors and regulators without scalable tools to check if the thousands of companies operating online are actually following the law.
The C3PA Methodology
The creation of the C3PA dataset was a multi-stage process involving targeted data sourcing, expert legal annotation, and rigorous quality control.
1. Sourcing the Right Documents
The first challenge was finding privacy policies that were actually relevant. The CCPA doesn’t apply to every website on the internet; it applies to businesses that meet specific revenue thresholds or handle data from a large number of California residents. Randomly scraping the web would result in a noisy dataset.
The researchers targeted two specific groups:
- Registered Data Brokers (DB): These are companies whose primary business is selling data. In California, they are required to register with the Attorney General. By definition, they are subject to the CCPA.
- Popular Websites (WS): Using web traffic data, the researchers identified roughly 700 top websites that attract large numbers of Californian visitors and embed many third-party trackers.
Using a custom web crawler, they extracted the privacy policies from these organizations. After filtering out duplicates, third-party policies, and irrelevant pages, they arrived at a corpus of 411 unique privacy policies (241 from Data Brokers and 170 from Popular Websites).
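The deduplication step can be illustrated with a short sketch. This is not the authors’ crawler; it just shows one common approach, hashing whitespace-normalized text so near-identical copies collapse to one entry (`dedupe_policies` and the sample URLs are hypothetical):

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe_policies(policies):
    """Drop exact duplicates after normalization -- a simplified stand-in
    for the paper's filtering step, not the authors' actual pipeline."""
    seen, unique = set(), []
    for url, text in policies:
        digest = hashlib.sha256(normalize(text).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique

# hypothetical crawl results: two sites serve the same policy text
crawled = [
    ("a.com/privacy", "We collect your data.\n"),
    ("a.net/privacy", "We collect  your data."),   # duplicate after normalization
    ("b.com/privacy", "We sell personal information."),
]
unique = dedupe_policies(crawled)  # keeps 2 of the 3 policies
```

In practice the researchers also had to filter out third-party policies and irrelevant pages, which requires more than hashing, but the duplicate-collapsing idea is the same.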
2. The Annotation Scheme
To make the dataset “regulation-aware,” the researchers didn’t invent their own labels. Instead, they went directly to the source text of the CCPA (specifically section 1798.130(a)(5)). They extracted 12 specific disclosure mandates that every compliant policy must contain.

As shown in the table above, the labels (L1 through L12) map directly to legal requirements. These fall into a few categories:
- Updates: When was the policy last updated? (L1)
- Categories: What data is collected, sold, or shared? (L2-L4)
- Rights Descriptions: Does the text explain the consumer’s right to delete, correct, or know their data? (L5-L11)
- Methods: How does a user actually exercise these rights? (L12)
3. Expert Annotation with Legal Professionals
This is where C3PA differentiates itself from many crowdsourced datasets. Privacy policies are legal documents; interpreting them requires domain knowledge. The researchers hired six law students to perform the annotations. These students were already familiar with legal terminology and received specific training on CCPA regulations.
The annotators used a tool called Label Studio to highlight spans of text in the privacy policies that corresponded to the 12 labels.

The figure above shows the interface used by the law students. The HTML of the policies was cleaned to remove distractions (like headers and footers), allowing the annotators to focus entirely on the text provisions.
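The cleaning step can be sketched with Python’s standard-library HTML parser. This is a simplified illustration rather than the authors’ actual preprocessing code; `PolicyTextExtractor` is a hypothetical helper that drops common boilerplate containers before annotation:

```python
from html.parser import HTMLParser

class PolicyTextExtractor(HTMLParser):
    """Extract visible policy text, skipping boilerplate containers
    (nav, header, footer, script, style) -- a simplified take on the
    cleaning step described above, not the paper's exact pipeline."""
    SKIP = {"nav", "header", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = """<html><header>Site menu</header>
<p>You may request deletion of your personal information.</p>
<footer>© 2024</footer></html>"""
parser = PolicyTextExtractor()
parser.feed(page)
clean_text = " ".join(parser.chunks)
```

After cleaning, only the provision text remains for the annotators to label; the menu and footer never reach Label Studio.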
4. Quality Control and Agreement
Legal text is often ambiguous. One lawyer might think a sentence refers to the “Right to Delete,” while another interprets it as a general “Data Retention” policy. To ensure the dataset was reliable, the researchers implemented a strict quality control process.
They measured agreement using two metrics:
- Cohen’s Kappa: Did both annotators agree that a specific label (e.g., L5) appeared in the document?
- F1 Score: Did the annotators highlight the exact same span of text?
The F1 score is particularly unforgiving. It requires not just conceptual agreement, but precise alignment on which words constitute the disclosure.
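Cohen’s kappa corrects raw agreement for the agreement two raters would reach by chance alone. A minimal sketch for the document-level, per-label case (the rater vectors below are made-up examples, not data from the paper):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters: did each rater mark a given
    label (e.g. L5) as present (1) or absent (0) in each document?"""
    assert len(a) == len(b)
    n = len(a)
    # observed agreement
    po = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement from each rater's marginal rates
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# hypothetical judgments for one label across 8 policies
rater1 = [1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(rater1, rater2)  # well below the raw 75% agreement
```

Note how kappa (about 0.47 here) is much lower than the raw 75% agreement: when a label appears in most documents, two raters agree often just by chance, and kappa discounts that.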

As illustrated in the figure above, if Annotator 1 highlights words 2-6 and Annotator 2 highlights words 4-8, the overlap is only partial. The F1 score penalizes this mismatch, providing a rigorous measure of dataset quality.
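The example above can be worked through directly. Treating each highlighted span as a set of word indices, token-level F1 falls out of the overlap (a sketch of the idea, not the paper’s exact scoring code):

```python
def span_f1(span_a, span_b):
    """Token-level F1 between two highlighted spans, each a set of word
    indices. Equivalent to the Dice overlap: 2|A ∩ B| / (|A| + |B|)."""
    if not span_a and not span_b:
        return 1.0  # both annotators highlighted nothing: perfect agreement
    return 2 * len(span_a & span_b) / (len(span_a) + len(span_b))

# the example from the text: words 2-6 vs words 4-8 (inclusive)
annotator_1 = set(range(2, 7))   # words 2, 3, 4, 5, 6
annotator_2 = set(range(4, 9))   # words 4, 5, 6, 7, 8
f1 = span_f1(annotator_1, annotator_2)
```

Only three of the words overlap, so despite both annotators flagging the same general region, the F1 score is just 0.6: conceptual agreement with imprecise boundaries is penalized.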
The annotation process ran for several weeks. Initially, agreement was low as annotators struggled with the complexity of the policies. However, the team held weekly meetings to discuss disagreements and refine their understanding of the mandates.

The graph above demonstrates the learning curve. In Week 1, the F1 scores hovered around 0.45. By Week 4, performance peaked and then stabilized around 0.60–0.70. This improvement indicates that through iterative discussion, the legal experts aligned their interpretations, resulting in a high-quality, consistent dataset.
We can also break this down by specific mandate. Some disclosures are harder to agree on than others.

Notice in the figure above that L1 (Updated privacy policy) has very high agreement—dates are easy to spot. However, L2 (Categories of PI sold) remains lower. This reflects the real-world ambiguity of privacy policies: companies often use vague language to obscure whether they are actually “selling” data under the legal definition, making it difficult even for experts to pinpoint the disclosure.
Analyzing the C3PA Dataset
The final dataset consists of over 48,000 expert-labeled text segments across 411 policies. Analyzing this data reveals fascinating insights into how companies write these documents.
The “Spread” Problem
One of the most critical findings is the concept of “spread.” You might hope that a privacy policy lists all consumer rights in one neat section. The data shows otherwise.

Look at the “Spread” columns in the table above. For L1 (Updated privacy policy), the spread is around 60%. This means a reader (human or machine) needs to scan through roughly 60% of the document to find all mentions of the policy date. For L4 (Categories of PI collected), the spread is nearly 63%.
This confirms a major frustration for consumers: pertinent information is rarely consolidated. It is scattered throughout the “wall of text,” forcing readers to consume the entire document to understand their rights.
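One plausible way to operationalize “spread” is the distance between a label’s first and last mention, normalized by document length; the paper’s exact definition may differ. A sketch with made-up positions:

```python
def spread(label_positions, doc_length):
    """Fraction of the document a reader must scan to see every
    occurrence of a label: distance from first to last mention divided
    by document length (one plausible reading of the paper's metric)."""
    if not label_positions:
        return 0.0
    return (max(label_positions) - min(label_positions)) / doc_length

# hypothetical: segments where "last updated" dates appear in a
# 100-segment policy
positions = [3, 41, 63]
s = spread(positions, 100)  # reader must scan 60% of the document
```

A spread of 0 would mean every relevant disclosure sits in one place; values near 0.6, as reported for L1 and L4, mean the information is scattered across most of the document.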
Comparison with Legacy Datasets
The researchers performed a cross-dataset analysis to test whether older datasets could essentially “substitute” for C3PA. They trained classifiers on C3PA data and ran them against the older OPP-115 and APP-350 datasets to see whether those corpora contain any CCPA-relevant text.

The results were stark. As shown in the table above, virtually 0% of the segments in the older datasets corresponded to core CCPA rights like the “Right to Delete” (L5) or “Right to Correct” (L6).
This proves that previous tools are fundamentally blind to modern privacy rights. You cannot use an OPP-115 trained model to audit for CCPA compliance because the training data simply does not contain those concepts.
Experiments: Utility for Automated Audits
The ultimate goal of C3PA is to power automated compliance tools. To demonstrate this, the researchers trained BERT-based machine learning models using their new dataset.
They compared their C3PA-trained models against a model trained on the older OPP-115 dataset. Since OPP-115 doesn’t have labels for things like “Right to Delete,” they could only compare performance on the one overlapping concept: the collection of personal information (L4).
Model Performance
The researchers trained three variations of the C3PA model:
- Data Broker Model: Trained only on data broker policies.
- Website Model: Trained only on popular website policies.
- Combined Model: Trained on both.

The Combined Model achieved a macro F1 score of 67%, which is notably close to the human inter-annotator agreement. This suggests the model performs roughly on par with a trained legal annotator.
High performance was observed for distinct labels like L1 (Updates) and L11 (Non-discrimination), with F1 scores in the 90s. The model struggled more with complex concepts like L2 (Sale of PI), mirroring the difficulties human annotators faced.
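Macro F1, the metric quoted above, is the unweighted mean of per-label F1 scores, so rare mandates count just as much as common ones. A sketch with hypothetical per-label scores (not the paper’s actual numbers):

```python
def macro_f1(per_label_f1):
    """Macro F1: unweighted mean of per-label F1 scores, so a rare
    mandate weighs as much as a frequent one in the final number."""
    return sum(per_label_f1.values()) / len(per_label_f1)

# hypothetical per-label F1 scores, for illustration only
scores = {"L1": 0.92, "L11": 0.90, "L4": 0.75, "L2": 0.45}
m = macro_f1(scores)
```

This averaging is why one hard label like L2 can drag the overall score down even when several labels score in the 90s.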
The Head-to-Head Battle
When comparing the C3PA model against the legacy OPP model on the task of identifying “Categories of PI Collected” (L4), the difference was clear.

The legacy opp_model achieved an F1 score of 68% on the combined validation set. While respectable, the c3pa_combined_model (from the previous table) achieved an F1 score of 75% for the same task (L4).
This seven-point improvement might seem modest, but it is a meaningful gain by NLP standards. Furthermore, the C3PA model can predict 11 other legal mandates that the legacy model ignores entirely. This demonstrates that regulation-aware training data yields superior tools for compliance auditing.
Conclusion and Future Implications
The C3PA dataset represents a significant leap forward in the intersection of law and computer science. By moving away from generic privacy concepts and anchoring their dataset in the specific legal text of the CCPA, the researchers have created a resource that reflects the reality of the modern web.
Key Takeaways:
- Regulation Matters: We cannot rely on pre-2018 datasets to analyze post-2018 internet privacy. The legal landscape has changed, and our datasets must evolve with it.
- Experts are Essential: Accurately labeling legal documents requires domain expertise. The use of law students over general crowdsourcing provided a level of nuance that is essential for training reliable AI.
- Scalable Auditing is Possible: The strong performance of the BERT models trained on C3PA suggests that we can build automated auditors. These tools could scan thousands of websites to flag those that fail to disclose mandatory consumer rights, empowering regulators to enforce the law more effectively.
As more states and countries adopt privacy laws similar to the CCPA (such as Virginia, Colorado, and Utah), the methodology used to create C3PA serves as a blueprint. It paves the way for a future where privacy policies are not just static legal defenses for corporations, but machine-readable documents that ensure accountability and transparency for everyone.