Large Language Models (LLMs) are reshaping the internet. From generating news articles to writing code, the volume of machine-generated content is exploding. But this capability comes with a shadow side: hallucinations, bias, and the potential for mass-produced disinformation. If the web becomes flooded with AI-generated text, how can we trust what we read? Furthermore, if future AI models are trained on the output of today’s AI, we risk a feedback loop of degrading quality.
To solve this, researchers have turned to watermarking—embedding a hidden, detectable signature into the text generated by an AI.
However, there is a major bottleneck in current watermarking technology. The most effective methods require access to the model’s internal probability distributions (logits). If you are using an API from a provider like OpenAI or Google, you don’t get these logits, which makes it effectively impossible for third parties to watermark content and thereby verify its provenance. Moreover, existing watermarks are notoriously brittle; a simple paraphrase can often scrub the watermark clean.
In this post, we are diving into POSTMARK, a new research paper that proposes a clever solution. POSTMARK is a “blackbox” method, meaning it doesn’t need to see the model’s internals. It applies the watermark after the text is generated, and it is surprisingly robust against attacks.
The Problem with “Green” and “Red” Lists
To understand why POSTMARK is necessary, we first need to look at how standard watermarking works. The dominant approach (like the algorithm by Kirchenbauer et al.) intervenes during the generation process.
When an LLM generates the next word in a sentence, it calculates the probability for every word in its vocabulary. Standard watermarking algorithms use the previous word to seed a random number generator, which splits the vocabulary into a “Green List” and a “Red List.” The model is then forced (or strongly encouraged) to pick a word from the Green List.
To a human, the text looks normal. To a detector knowing the seed algorithm, the text has a statistically impossible number of “Green” words.
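To make this concrete, here is a minimal sketch of the green-list idea. This is not the authors’ exact implementation; the hashing scheme, green fraction, and bias value `DELTA` are illustrative assumptions:

```python
import hashlib
import random

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5   # half the vocabulary is "green" (illustrative)
DELTA = 2.0            # logit boost for green tokens (illustrative)

def green_list(prev_token_id: int, secret_key: str = "secret") -> set[int]:
    """Seed an RNG with the previous token (plus a secret key), then
    pseudo-randomly split the vocabulary into green and red halves."""
    digest = hashlib.sha256(f"{secret_key}:{prev_token_id}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    return set(ids[: int(VOCAB_SIZE * GREEN_FRACTION)])

def bias_logits(logits: list[float], prev_token_id: int) -> list[float]:
    """Nudge sampling toward green tokens by adding DELTA to their logits."""
    green = green_list(prev_token_id)
    return [l + DELTA if i in green else l for i, l in enumerate(logits)]
```

The detector replays the same seeding for each consecutive token pair and counts how many tokens landed in their green list; a count far above the ~50% chance level flags the text as watermarked.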
The Limitations:
- No Logit Access: You cannot implement this unless you own the model. If you are building an app on top of GPT-4, you can’t force GPT-4 to prioritize “Green List” words because you don’t control the sampling loop.
- Paraphrasing Vulnerability: These methods rely on specific sequences of tokens. If a user takes the output and asks another AI to “rewrite this paragraph,” the token sequence changes, the Green/Red ratio normalizes, and the watermark vanishes.
- Low Entropy Issues: In “low entropy” scenarios—like answering a factual question where there is only one right answer—the model cannot choose a Green word if the only correct word is on the Red list.
Enter POSTMARK: Watermarking by Rewriting
The researchers propose a completely different paradigm. Instead of watermarking during generation, POSTMARK is a post-hoc method. It takes the finished text from any model and rewrites it to inject a signature.
The core intuition is simple but powerful: The semantics (meaning) of a text should not change drastically after watermarking or paraphrasing.
Therefore, instead of basing the watermark on exact token sequences, POSTMARK bases it on the meaning of the text.
The Architecture
The POSTMARK pipeline consists of three distinct modules: an Embedder, a SecTable (Secret Table), and an Inserter.

As shown in Figure 1 above, the process works in a pipeline:
- Generation: An LLM generates the initial, unwatermarked text.
- Embedding: The text is passed through an Embedder (like OpenAI’s text-embedding-3-large). This converts the text into a dense vector representing its semantic meaning.
- Word Selection (The “Secret”): This is the cryptographic heart of the system. The system maintains a SecTable, which maps a vocabulary of words to random vector embeddings. The system compares the input text’s embedding against this table to find words that are “semantically close” to the text according to this secret mapping.
- Note: It doesn’t just pick random words. It selects words based on a calculation involving the text’s meaning and the secret table. This ensures that if the text is paraphrased (retaining its meaning), the system will likely select the same set of watermark words again.
- Insertion: The system creates a list of “watermark words” (e.g., “protective,” “drove,” “surprised”). It then passes the original text and this list to the Inserter (an instruction-following LLM) with a prompt: Rewrite this text to include these words.
The result is a coherent piece of text that carries a hidden payload of specific words determined by its own meaning.
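Here is a minimal sketch of steps 2–4, assuming the SecTable is stored as a word-to-vector dict and using an insertion ratio like the paper’s 12%. The scoring details and prompt wording are my assumptions, not the paper’s verbatim implementation:

```python
import numpy as np

def select_watermark_words(text_emb: np.ndarray,
                           sec_table: dict[str, np.ndarray],
                           ratio: float,
                           text_len: int) -> list[str]:
    """Score every SecTable word against the text embedding and keep the
    top-k as watermark words. Because a paraphrase has a very similar
    embedding, it yields a very similar word list."""
    words = list(sec_table)
    vecs = np.stack([sec_table[w] for w in words])              # (V, d)
    sims = (vecs @ text_emb) / (
        np.linalg.norm(vecs, axis=1) * np.linalg.norm(text_emb)
    )
    k = max(1, int(ratio * text_len))                           # e.g. ratio = 0.12
    return [words[i] for i in np.argsort(-sims)[:k]]

# The Inserter is just an instruction-following LLM given a prompt like:
INSERTER_PROMPT = (
    "Rewrite the following text so that it naturally incorporates all of "
    "these words, without changing its meaning:\n"
    "Words: {words}\n\nText:\n{text}"
)
```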
The Detection Mechanism
Detecting the watermark follows a similar logic. You don’t need the original text, but you do need the Embedder and the SecTable.
- Take the suspect text (Candidate Text).
- Generate its embedding.
- Use the SecTable to reconstruct the list of words that should be in the text if it were watermarked.
- Check how many of those predicted words actually appear in the text.
The paper defines a presence score, \(p\), to quantify this:
\[
p = \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} \mathbb{1}\big[\,\exists\, t \in T : \cos(e_w, e_t) \ge 0.7\,\big]
\]

where \(\mathcal{W}\) is the list of watermark words predicted by the SecTable, \(T\) is the set of words in the candidate text, and \(e_x\) denotes a word’s embedding.
Here, the score \(p\) is the fraction of predicted watermark words found in the text. To be robust against slight changes (like “run” vs “running”), the method doesn’t demand exact matches. It counts a word as “present” if a semantically similar word (cosine similarity \(\ge 0.7\)) exists in the text. If \(p\) exceeds a certain threshold, the text is flagged as watermarked.
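A sketch of that soft-matching check, assuming `embed` maps a single word to a vector (e.g., via the same Embedder):

```python
import numpy as np

def presence_score(text_words: list[str],
                   watermark_words: list[str],
                   embed,
                   threshold: float = 0.7) -> float:
    """Fraction of predicted watermark words 'present' in the text, where a
    word counts as present if any text word has cosine similarity >=
    threshold with it."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    text_vecs = [embed(w) for w in set(text_words)]
    hits = sum(
        1 for w in watermark_words
        if any(cos(embed(w), tv) >= threshold for tv in text_vecs)
    )
    return hits / len(watermark_words)
```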
Experimental Results: Robustness is Key
The primary claim of POSTMARK is robustness. If an attacker paraphrases the text, the specific words might change, but the semantic meaning usually stays the same. Because the watermark words are derived from the semantics, the detector should still look for the same words. Even if the paraphraser removes some, enough should remain to trigger detection.
The researchers compared POSTMARK against 8 baseline algorithms, including the standard “Green/Red list” method (KGW) and other recent techniques like Unigram and EXP.

Table 1 above presents the results. The key metric is TPR (True Positive Rate) at 1% FPR (False Positive Rate). This measures how often the system correctly detects a watermark while keeping false alarms very low.
Key Takeaways from the Data:
- Before Paraphrasing: Almost all methods perform well (high 90s).
- After Paraphrasing: This is where the carnage happens. Look at the LLAMA-3-8B-INST / OpenGen row. The standard KGW method drops to 41.4%. The EXP method crashes to 2.2%.
- POSTMARK’s Performance: On the same task, POSTMARK detects 91.8% of watermarked texts before paraphrasing (slightly lower than the logit-based methods) and generally retains a large advantage after paraphrasing. One caveat: the Table 1 entry for LLAMA-3-8B-INST / OpenGen shows POSTMARK at 1%, which appears to be an outlier or a configuration-specific issue discussed in the paper text. On MISTRAL-7B-INST, by contrast, POSTMARK holds strong at 89.8% after paraphrasing while the other methods struggle.
- Blackbox Comparison: The only other “Blackbox” baseline (which substitutes synonyms) is significantly weaker, dropping to ~20-25% after paraphrasing.
Entropy Independence
One subtle but critical finding is how POSTMARK handles “low entropy” models. Modern models trained with RLHF (Reinforcement Learning from Human Feedback) tend to be very confident and repetitive (low entropy).
Logit-based watermarks struggle here because the model refuses to pick “Green list” words if they aren’t the absolute best fit. POSTMARK, however, doesn’t care about the generation probability. It rewrites the text after the fact. This makes it ideal for the highly-aligned, instruction-tuned models we use today.
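A quick numerical illustration of why low entropy defeats logit biasing (the logits and bias value are made up for the example):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# A low-entropy next-token distribution: one token is overwhelmingly correct.
logits = [12.0, 2.0, 1.5, 1.0]   # token 0 is the factual answer
green = {1, 2}                   # suppose token 0 landed on the red list
delta = 2.0
biased = [l + delta if i in green else l for i, l in enumerate(logits)]

print(softmax(logits)[0])   # ~0.9999 before biasing
print(softmax(biased)[0])   # still ~0.999 -- the bias barely moves the sample
```

Either the bias is too weak to change the output (no watermark signal), or it is strong enough to force a wrong answer (quality loss). POSTMARK sidesteps the dilemma entirely.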
Modular and Open Source
One of the strengths of POSTMARK is its modularity. You can swap out the Embedder or the Inserter. While the main experiments used OpenAI’s GPT-4 and embeddings, the authors also tested an open-source configuration.

As shown in Table 2, using Llama-3-70B-Instruct as the Inserter and Nomic-Embed as the Embedder yields results comparable to the closed-source version (before/after-paraphrase TPRs of 100%/52.1%, versus 99.4%/59.4% for the closed-source setup). This means the method can be deployed completely offline, or by organizations that need to keep their data private.
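In code terms, the modularity amounts to two narrow interfaces that any embedding model or instruction-following LLM can satisfy. A sketch of how one might define them (my framing, not the paper’s code):

```python
from typing import Protocol
import numpy as np

class Embedder(Protocol):
    def embed(self, text: str) -> np.ndarray:
        """Map text to a dense semantic vector."""
        ...

class Inserter(Protocol):
    def rewrite(self, text: str, words: list[str]) -> str:
        """Rewrite `text` so it naturally contains every word in `words`."""
        ...

# Any pairing works: OpenAI's text-embedding-3-large + GPT-4, or
# Nomic-Embed + Llama-3-70B-Instruct for a fully offline deployment.
```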
But Does It Ruin the Text?
The biggest fear with watermarking is that it degrades the quality of the writing. Since POSTMARK actively inserts specific words, there is a risk of the text becoming awkward or forced.
To test this, the authors conducted extensive quality evaluations. First, let’s look at what kind of edits the system actually makes.

Table 3 shows that the Inserter (GPT-4) is quite sophisticated. It doesn’t just jam words in; it weaves them into the narrative.
- Clarification: It might add “of imprisonment” to clarify “sentences.”
- Metaphors: It might add a phrase like “wears an armor of resilience” to include the words “armor” and “resilience.”
- New Details: It sometimes adds plausible details to fit the words, which is a double-edged sword (discussed in the Factuality section below).
Automated Quality Metrics
The researchers used “LLM-as-a-Judge” (using GPT-4-Turbo) to compare the quality of watermarked text against unwatermarked text and other baselines.

In Table 6, “Soft Win Rate” indicates how often the judge preferred the watermarked text (or tied). POSTMARK achieves significantly higher win rates (64-74%) compared to baselines like Unigram (23-59%).
Table 7 highlights the trade-off: as you insert more words (a higher insertion ratio \(r\)), robustness goes up (detection is easier) but quality goes down. POSTMARK@12 (inserting a number of words equal to 12% of the text length) appears to be the sweet spot; a quick illustration of the notation follows below.
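For intuition on what the @12 notation means in practice (a trivial sketch; the exact rounding rule is my assumption):

```python
def n_watermark_words(text: str, ratio: float = 0.12) -> int:
    """How many words a POSTMARK@12-style insertion adds for a given text."""
    return max(1, round(ratio * len(text.split())))

# A 300-word article at r = 12% receives roughly 36 watermark words;
# POSTMARK@6 would insert about half as many, trading robustness for quality.
```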
Semantic Preservation
Does the meaning change? The authors calculated the cosine similarity between the original and watermarked text.

Table 4 shows extremely high similarity scores (~0.94 - 0.95), suggesting that while the wording changes, the core message remains intact.
Human Evaluation
Metrics are useful, but human preference is the gold standard. The study hired annotators to compare texts side-by-side.

Figure 2 reveals a surprising result: for both the strong (@12) and weak (@6) configurations, humans rated the watermarked text as “at least as good” as the original the vast majority of the time (the chart stacks wins and ties, which is why the combined totals exceed 100%). In many cases, the rewriting step by a strong model like GPT-4 actually improved coherence and interestingness.
Furthermore, can humans spot the watermark words?
- Answer: No.
- Annotators were asked to highlight “out-of-place” words. They achieved an F1 score of only 0.06. They simply could not distinguish inserted watermark words from the rest of the text.
The Hubness Problem
A technical detail worth noting is the distribution of watermark words. Are we inserting the same words over and over?

Figure 3 shows that while most words are rarely used, there are “hub words” (the bars on the far right) that get selected frequently. This is a known phenomenon in high-dimensional vector spaces. While not a dealbreaker, it suggests that future versions of the SecTable could be optimized to ensure a more even distribution of inserted vocabulary.
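You can reproduce the hubness effect with purely synthetic data. A small demo using random unit vectors as stand-ins for the SecTable and the text embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, n_texts = 256, 5_000, 2_000

# Random unit vectors standing in for the SecTable and for text embeddings.
table = rng.normal(size=(vocab_size, dim))
table /= np.linalg.norm(table, axis=1, keepdims=True)
texts = rng.normal(size=(n_texts, dim))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)

# Assign each "text" its single most similar table word, then count how
# often each word gets picked.
nearest = (texts @ table.T).argmax(axis=1)
counts = np.bincount(nearest, minlength=vocab_size)
print(f"words never selected: {(counts == 0).mean():.0%}")
print(f"most-selected word chosen {counts.max()} times")
```

Even with uniformly random vectors, a handful of “hub” vectors end up nearest to a disproportionate share of queries, mirroring the skew in Figure 3.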
Conclusion
POSTMARK represents a significant step forward in AI text detection. It shifts the paradigm from manipulating probability distributions (which requires owning the model) to semantic watermarking (which can be done by anyone).
Key Takeaways:
- Logit-Free: It works on APIs where you only get the text output.
- Paraphrase Robust: By anchoring the watermark to the meaning rather than the specific phrasing, it survives rewriting much better than current methods.
- High Quality: The rewriting step, when done by a capable LLM, preserves (and sometimes enhances) readability.
As we move toward a web dominated by AI generation, tools like POSTMARK offer a way to verify provenance without compromising the quality of the content or requiring the cooperation of closed-source AI giants. While no watermark is unbreakable, POSTMARK significantly raises the bar for anyone trying to strip attribution from AI-generated text.