Large Language Models (LLMs) are frozen in time. When a model like GPT-4 or Llama 2 finishes training, its knowledge of the world is locked to that specific moment. But the world doesn’t stop. Presidents change, companies merge, and scientific discoveries overturn old theories.
So, how do we keep these models up to date? The obvious answer is to retrain them, but that is prohibitively expensive and slow. This has given rise to a fascinating sub-field called Knowledge Editing. The goal is simple: surgically alter the model’s parameters (or its behavior) to inject a specific new fact without breaking everything else it knows.
Ideally, if the Prime Minister of the UK changes, we want to make one small “edit” so the model answers correctly, without forgetting who the Queen was or how to write Python code.
However, a new paper titled “AKEW: Assessing Knowledge Editing in the Wild” suggests that our current methods for testing these edits are a bit of an illusion. Current benchmarks are too clean, too structured, and too artificial. In the messy reality of the “wild”—where knowledge comes from news articles and Wikipedia entries rather than neat databases—our best editing methods often fall apart.
In this post, we will break down the AKEW framework, explore why current methods fail on practical data, and look at a new dataset that captures the complexity of real-world knowledge updates.
The Problem with Current Evaluations
To understand why AKEW is necessary, we first need to look at how researchers currently test knowledge editing.
Typically, evaluations rely on Structured Facts. These are simple triplets formatted as (Subject, Relation, Object).
- Example: (United Kingdom, head of government, Rishi Sunak)
Researchers take a dataset of these triplets, feed them into an editing algorithm (like ROME or MEMIT), and check if the model updates its answer.
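To make the setup concrete, here is a minimal sketch of what a structured edit request and its check might look like. The field names and the `model.generate` call are illustrative assumptions, not the actual ROME/MEMIT interface:

```python
# Hypothetical shape of a single structured edit request; field names are
# illustrative and do not match any specific ROME/MEMIT implementation.
edit_request = {
    "subject": "United Kingdom",
    "relation": "head of government",
    "target_new": "Rishi Sunak",
    # Prompt used to query the fact before and after the edit.
    "prompt": "The head of government of the United Kingdom is",
}

def check_edit(model, request: dict) -> bool:
    """Return True if the edited model now completes the prompt with the new target."""
    completion = model.generate(request["prompt"])  # `model.generate` is assumed
    return request["target_new"] in completion
```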
The problem? Real-world knowledge doesn’t arrive in triplets. It comes in Unstructured Text. When a new Prime Minister is elected, we don’t get a JSON file; we get a BBC news article or a biography update.
The authors of AKEW argue that by ignoring unstructured text, the community has been solving a simplified version of the problem. As illustrated below, there is a massive difference between editing a single triplet and ingesting a biography.

Figure 1 above perfectly encapsulates the AKEW approach. It proposes three levels of evaluation complexity:
- Structured Fact: The “easy” mode. A clean triplet.
- Unstructured Fact: The “hard” mode. A natural language paragraph containing the new information (and a lot of noise).
- Extracted Triplets: The “bridge” mode. Using algorithms to pull multiple triplets out of the text to help the model.
As you can see in the diagram, while methods might get the structured fact right (Green Check), they frequently fail when faced with the raw text or even the extracted triplets (Red Crosses).
AKEW: A Benchmark for Reality
The researchers introduced AKEW (Assessing Knowledge Editing in the Wild) to bridge this gap. This benchmark isn’t just a new test; it’s a comprehensive suite involving new datasets and new evaluation protocols.
1. The Three Editing Settings
The core innovation here is testing the same knowledge update in three different “shapes”:
- Structured Fact: This serves as the baseline. It uses a single isolated triplet. Most existing methods are optimized for this.
- Unstructured Fact: This uses a paragraph of text. For example, if the update is about Rishi Sunak becoming PM, the input is a biography of Sunak. This is much harder because the model (or the editing algorithm) has to figure out which part of the text matters.
- Extracted Triplets: Here, the researchers use an automated tool (like ChatGPT) to scan the unstructured text and pull out all relevant triplets. Instead of one triplet, you might get five or six that describe the subject. This tests if more structured data helps or hinders the editing process.
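To give a feel for how the Extracted Triplets setting can be produced, here is a rough sketch of an LLM-based extraction step. The paper uses ChatGPT for this; the prompt wording, the parser, and the `llm.complete` interface below are my assumptions, not the authors' code:

```python
# Illustrative triplet-extraction step; the exact prompt and parsing used in
# AKEW may differ. `llm.complete` is an assumed text-completion interface.
EXTRACTION_PROMPT = """Read the passage below and list every factual statement
it contains, one per line, in the form: subject ; relation ; object

Passage:
{passage}

Triplets:"""

def extract_triplets(llm, passage: str) -> list[tuple[str, str, str]]:
    """Turn unstructured text into (subject, relation, object) triplets via an LLM."""
    raw = llm.complete(EXTRACTION_PROMPT.format(passage=passage))
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets
```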
2. Creating the Data
One of the major criticisms raised in the paper is that previous benchmarks relied too heavily on "Counterfactuals": fake facts like "The iPhone 5 was produced by Iveco." While useful for checking whether an edit took effect, these don't reflect the complexity of real-world updates.
AKEW introduces three datasets to test these settings:
A. Counterfactual Updates (with a twist)
The authors took existing datasets (COUNTERFACT and MQUAKE-CF) and upgraded them. Since these datasets only had structured triplets, the authors had to generate corresponding unstructured texts.
They used an LLM to write “Wikipedia-style” paragraphs based on the fake facts.

As shown in Figure 2, if the fake fact is “iPhone 5 produced by Iveco,” the system generates a convincing (but factually wrong) paragraph describing this alternative reality. This ensures that the unstructured text matches the structured target exactly.
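An illustrative version of that generation prompt might look like the sketch below. The exact wording used in the paper may differ; this is only meant to show the idea of turning one fake triplet into a full paragraph:

```python
# Illustrative prompt for turning a counterfactual triplet into a
# "Wikipedia-style" paragraph; the wording used in the paper may differ.
GENERATION_PROMPT = (
    "Write a short Wikipedia-style paragraph about {subject} that treats the "
    'following statement as true: "{subject} {relation} {object}." '
    "Do not note that the statement is hypothetical."
)

prompt = GENERATION_PROMPT.format(
    subject="iPhone 5", relation="was produced by", object="Iveco"
)
```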
B. Real-World Updates (WIKIUPDATE)
This is the most exciting contribution. The authors built a brand new dataset called WIKIUPDATE based on actual historical changes.
The construction process, visualized below, is quite rigorous:

Figure 3 outlines the pipeline for creating WIKIUPDATE:
- Data Preparation: They scan Wikidata for triplets that have “Time Qualifiers” (start time, end time). This allows them to see things that change over time (like Heads of Government).
- Update Discovery: They look for instances where the “Object” changed after a specific date (April 2021). For example, changing from Boris Johnson to Rishi Sunak.
- Unstructured Formulation: They retrieve the actual Wikipedia summaries for the entities involved.
- Triplets Extraction: They extract all relevant knowledge from that text into triplets.
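As a minimal sketch of the Update Discovery step above, here is my reconstruction of the idea, assuming each Wikidata statement for a fixed (subject, relation) pair carries start/end time qualifiers. This is not the authors' pipeline code:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Statement:
    """One Wikidata statement for a fixed (subject, relation) pair."""
    obj: str             # e.g. "Boris Johnson"
    start: date          # "start time" qualifier
    end: Optional[date]  # "end time" qualifier; None if still current

CUTOFF = date(2021, 4, 1)  # only changes after April 2021 count as updates

def find_update(statements: list[Statement]) -> Optional[tuple[str, str]]:
    """Return (old_object, new_object) if the object changed after the cutoff."""
    ordered = sorted(statements, key=lambda s: s.start)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.start > CUTOFF:
            return prev.obj, curr.obj  # e.g. ("Boris Johnson", "Rishi Sunak")
    return None
```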
This dataset represents “Knowledge Editing in the Wild” because it uses real, messy, human-written text about events that actually happened.
How Do State-of-the-Art Methods Perform?
The authors tested several leading knowledge editing methods. These fall into two broad categories:
- Locate-Then-Edit (e.g., ROME, MEMIT): These methods try to find the specific neurons responsible for a fact and update them.
- In-Context Learning / Memory (e.g., IKE, MeLLo): These don’t change the model’s weights. Instead, they store the new fact in an external memory and retrieve it to use as context when answering questions (similar to RAG - Retrieval Augmented Generation).
They also tested standard Fine-Tuning (FT) and Low-Rank Adaptation (LoRA).
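To contrast the two families, here is a minimal sketch of the memory-based (in-context) approach, in the spirit of IKE/MeLLo but not their actual implementation. The `llm` and `retriever` interfaces are assumed:

```python
# Minimal sketch of a memory-based (in-context) editor, in the spirit of
# IKE/MeLLo but not their actual implementation. The model weights are never
# touched; new facts live in an external store and are retrieved at query time.
class MemoryEditor:
    def __init__(self, llm, retriever):
        self.llm = llm              # frozen language model (assumed `.complete(prompt)`)
        self.retriever = retriever  # assumed `.index(text)` / `.top_k(query, k)` interface
        self.memory: list[str] = []

    def edit(self, new_fact: str) -> None:
        """Store an update without modifying any weights."""
        self.memory.append(new_fact)
        self.retriever.index(new_fact)

    def answer(self, question: str) -> str:
        """Retrieve relevant updates and let the frozen model read them in context."""
        context = self.retriever.top_k(question, k=3)
        prompt = "New facts:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"
        return self.llm.complete(prompt)
```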
The Results: A Reality Check
The results were stark. When moving from structured triplets to unstructured text, performance collapsed for almost every method.

Let’s look closely at Table 2 (above). The numbers represent “Editing Accuracy”—the percentage of times the model successfully used the new knowledge.
- The “Struct” Column: Look at ROME or MEMIT on the CounterFact dataset. They score incredibly high (90%+). This confirms that on the easy, structured tasks, these methods work as advertised.
- The “Unstruct” Column: Now look at the Unstruct column.
- Fine-Tuning (FT) & LoRA: Their performance drops by 79% to 100%. They essentially fail to learn the update from raw text.
- ROME & MEMIT: These methods cannot even run on unstructured text directly because they require strict triplet inputs.
- The “Extract” Column: This is where we feed the methods triplets extracted from the text. You might expect this to fix the problem. It helps slightly for Fine-Tuning, but performance is still abysmal compared to the clean structured baseline.
Why Do They Fail?
The drop in performance highlights a critical weakness. Current editing methods are “brittle.” They rely on the input being precise and isolated.
Unstructured facts are complex. A news article or biography contains linguistic variety, noise, and multiple interconnected facts. When you try to edit a model using this data, the signal gets lost.
In-Context Learning (IKE) performs best. If you look at the IKE rows in Table 2, you’ll see it drops much less than the others (only about 8-20% drop on unstructured data). This makes sense: IKE effectively pastes the text into the prompt context. LLMs are very good at reading a paragraph and answering a question based on it. However, even IKE struggles when the retrieval step fails.
The Difficulty of Real-World Data
The WIKIUPDATE dataset (the real-world data) proved to be the hardest challenge of all.
Look at the WikiUpdate section in Table 2. Even IKE, which did okay on the fake data, sees a massive drop (up to 52%) on real-world unstructured text.
Why? Real-world text is longer and denser. The average length of a structured fact is roughly 10 tokens. The average length of an unstructured fact in WIKIUPDATE is nearly 200 tokens.
Error Analysis: Where does it go wrong?
The researchers dug deeper into why the methods failed, specifically looking at the “Extracted Triplets” setting. If we extract the facts automatically, why can’t the model learn them?

Table 4 provides a breakdown of errors for the MEMIT method.
- Triplet Error (22%): Sometimes the extraction tool just does a bad job. It might miss the key information or produce an ambiguous triplet.
- Editing Error (78%): The vast majority of failures happen during the edit. Even when the triplet is correct, the method fails to inject it properly.
This is likely because extracted triplets often come in groups. A biography yields 10 different facts about a person. Methods like MEMIT are designed to edit one specific association. When you flood them with 10 related updates at once, the “surgical” precision is lost, and the edit fails.
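One way to picture the mismatch: a single unstructured fact yields a whole batch of edit requests at once. The sketch below uses a hypothetical request format and illustrative triplets, simply to show the shape of the problem:

```python
# Hypothetical batch of edit requests produced from a single biography.
# A method tuned to inject one isolated association must now apply all of
# these at once without the edits interfering with each other.
extracted = [
    ("United Kingdom", "head of government", "Rishi Sunak"),
    ("Rishi Sunak", "member of political party", "Conservative Party"),
    ("Rishi Sunak", "educated at", "University of Oxford"),
    # ...often ten or more related triplets from one passage
]

batch_requests = [
    {"subject": s, "relation": r, "target_new": o} for s, r, o in extracted
]
```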
Conclusion and Future Directions
The AKEW paper serves as a necessary wake-up call for the Knowledge Editing community. We have become very good at updating models on clean, laboratory-grade data. But in the wild, data is messy, unstructured, and interconnected.
Key Takeaways:
- The Gap is Real: There is a massive performance disparity between editing with structured triplets vs. unstructured text.
- Extraction Isn’t Enough: Simply extracting triplets from text doesn’t solve the problem; current methods struggle to ingest multiple extracted facts simultaneously.
- Real-World Data is Harder: Synthetic counterfactuals (fake facts) are easier to edit than real-world updates, likely due to the complexity and length of real context.
What’s Next? The authors suggest two paths forward:
- For Locate-Then-Edit methods: We need algorithms that can handle batch updates of related facts. The “one triplet, one edit” paradigm is too limiting.
- For In-Context methods: We need better retrieval systems that don’t get confused by complex, conflicting real-world documents.
As we move toward LLMs that can update themselves by reading the news, benchmarks like AKEW will be the standard by which we measure success. We aren’t there yet, but identifying the problem is the first step toward solving it.