Introduction
Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with information. They can write poetry, code websites, and answer questions on a vast array of topics. However, for all their brilliance, they have a notorious flaw: “hallucination.” When an LLM doesn’t know a specific fact—or when that fact is obscure or outdated—it often makes things up with supreme confidence.
To combat this, researchers rely on Retrieval-Augmented Generation (RAG). The idea is simple: before the LLM answers a question, we retrieve relevant data from an external source (like a database or a document) and feed it to the LLM as context.
One of the most structured and reliable sources for this external data is a Knowledge Graph (KG). KGs store facts as triples (Subject, Relation, Object), such as (Larry Page, founder, Google). But here lies the problem: LLMs are trained on natural language text, not structured graph data.
How do we bridge the gap between a structured Knowledge Graph and a text-based LLM?
The naive approach is to simply dump the raw data into the prompt. However, recent research shows that this often overwhelms the model or misses the semantic nuance required for complex reasoning.
In this post, we will dive deep into a paper titled “CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering.” This research proposes a fascinating solution: instead of just retrieving data, we use a specialized “Rewriter” model to think step-by-step, extracting and translating graph data into a coherent narrative that the answering model can actually use.
We will explore how this method, CoTKR, uses Chain-of-Thought reasoning to rewrite knowledge, and how a novel training strategy called PAQAF aligns this rewriting process with the ultimate goal of getting the right answer.
Background: The Knowledge Rewriting Bottleneck
Before dissecting the new method, we need to understand the current landscape of Knowledge Graph Question Answering (KGQA).
When you ask a complex question like “Where did the founder of Google go to college?”, a KGQA system typically performs two steps:
- Retrieval: It searches the Knowledge Graph to find a “subgraph”—a cluster of entities and relationships relevant to “Google,” “Founder,” and “College.”
- Answering: It feeds this subgraph to an LLM to generate the final answer.
The critical bottleneck is Knowledge Rewriting (KR). This is the process of converting that retrieved subgraph into a text format the LLM can process.
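Conceptually, the whole pipeline is three calls, with KR sitting in the middle. Here is a minimal sketch, where the `retriever`, `rewriter`, and `qa_model` interfaces are assumptions for illustration rather than the paper's implementation:

```python
# A conceptual retrieve -> rewrite -> answer pipeline. The `retriever`,
# `rewriter`, and `qa_model` objects are assumed interfaces, not the
# paper's code.

def kgqa(question: str, kg, retriever, rewriter, qa_model) -> str:
    subgraph = retriever.retrieve(question, kg)         # step 1: retrieval
    context = rewriter.rewrite(question, subgraph)      # knowledge rewriting (KR)
    return qa_model.answer(question, context=context)   # step 2: answering
```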

As shown in Figure 1, there are three standard approaches, each with significant flaws:
- Simple Linear Concatenation (Triple): This method takes the raw data (Larry Page, attend, Stanford) and feeds it directly to the LLM.
  - The Problem: It lacks semantic coherence. It’s just a list of data points. For complex queries, this unstructured “data dump” forces the LLM to do too much processing, often leading to errors.
- KG-to-Text: This uses a model to turn every triple into a full sentence.
  - The Problem: It is incredibly verbose. As seen in the figure, it generates a wall of text containing redundant information (“Google is headquartered in Mountain View…”) that has nothing to do with the user’s question about college.
- Summary: This attempts to summarize the subgraph into a paragraph relevant to the question.
  - The Problem: While better, single-step summarization often misses the logic required for the answer. It might mention Larry Page but accidentally omit Sergey Brin because it tried to compress too much information too quickly.
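To make the three formats concrete, here is a hedged sketch of how each baseline might serialize the same toy subgraph (the prompt wording is an illustrative paraphrase, not the paper's template):

```python
# Toy subgraph for "Where did the founder of Google go to college?"
triples = [
    ("Google", "founder", "Larry Page"),
    ("Google", "headquarters location", "Mountain View"),
    ("Larry Page", "attend", "Stanford"),
]

# 1. Triple: simple linear concatenation of the raw triples.
triple_text = " ".join(f"({s}, {r}, {o})" for s, r, o in triples)

# 2. KG-to-Text: naively verbalize every triple, relevant or not.
kg_to_text = " ".join(f"The {r} of {s} is {o}." for s, r, o in triples)

# 3. Summary: one question-aware LLM call over the whole subgraph.
summary_prompt = (
    "Summarize the facts below with respect to the question.\n"
    "Question: Where did the founder of Google go to college?\n"
    f"Facts: {triple_text}"
)
```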
The authors of CoTKR identified that these methods fail because they are static. They don’t reason about what information is actually needed. They simply transform data.
The Solution: Chain-of-Thought Enhanced Knowledge Rewriting (CoTKR)
To solve the limitations of “single-step” rewriting, the researchers propose CoTKR. The core philosophy here is that rewriting knowledge shouldn’t be a one-shot task. It should be an iterative process of reasoning and extraction.
Inspired by the ReAct (Reasoning and Acting) paradigm, CoTKR treats knowledge rewriting as a dialogue between “thinking” and “doing.”
The Interleaved Process
Instead of converting the whole graph at once, CoTKR alternates between two specific operations:
- Reasoning: The model looks at the question and the graph and thinks, “What specific piece of information do I need to look for next?”
- Summarization: Based on that reasoning, the model extracts the specific relevant facts from the graph and writes them down.

Let’s look at Figure 2. The user asks: “Where did the founder of Google go to college?”
A standard model might just spit out every fact about Google. CoTKR, however, produces a Chain-of-Thought (CoT):
- Step 1 (Reasoning): “I need to know who the founders of Google are.”
- Step 1 (Knowledge): “The founders of Google are Sergey Brin and Larry Page.”
- Step 2 (Reasoning): “I need to know where Sergey Brin and Larry Page went to college.”
- Step 2 (Knowledge): “Sergey Brin went to the University of Maryland… Larry Page went to the University of Michigan and Stanford.”
By breaking the problem down, the resulting text is structured, logical, and contains exactly the evidence needed to answer the question—no more, no less.
The Mathematics of Iteration
We can formalize this process. Let \(X_{t-1}\) represent the history of reasoning and knowledge generated so far.

At any step \(t\), the Rewriter model (\(R\)) first looks at the question (\(q\)), the retrieved subgraph (\(G'\)), and the history (\(X_{t-1}\)) to generate a reasoning trace (\(x_{t,r}\)):

\[ x_{t,r} = R\left(q, G', X_{t-1}\right) \]
Next, using that reasoning trace as a guide, the model extracts the corresponding knowledge summary (\(x_{t,k}\)):

\[ x_{t,k} = R\left(q, G', X_{t-1}, x_{t,r}\right) \]

Both outputs are then appended to the history, \(X_t = X_{t-1} \oplus x_{t,r} \oplus x_{t,k}\).
This cycle repeats until the model determines it has sufficient information. The final output is a coherent narrative that guides the downstream QA model to the correct answer.
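Put together, the loop looks roughly like the sketch below. Note that in the paper a single fine-tuned Rewriter emits the interleaved reasoning and knowledge steps in one generation; the separate calls and the stopping signal here are simplifying assumptions:

```python
def cotkr_rewrite(question, subgraph, rewriter, max_steps=5):
    """Interleave reasoning (x_{t,r}) and summarization (x_{t,k}) steps."""
    history = []                                    # X_{t-1}
    for t in range(max_steps):
        # x_{t,r} = R(q, G', X_{t-1})
        reasoning = rewriter.generate(question, subgraph, history,
                                      mode="reason")
        if "enough information" in reasoning.lower():  # assumed stop signal
            break
        # x_{t,k} = R(q, G', X_{t-1}, x_{t,r})
        knowledge = rewriter.generate(question, subgraph,
                                      history + [reasoning], mode="summarize")
        history += [reasoning, knowledge]           # X_t
    return "\n".join(history)
```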
Training Strategy: PAQAF
One of the most innovative aspects of this paper is not just how the model works, but how it is trained.
If you want to train a model to rewrite knowledge graphs, how do you define “success”? Is a summary successful if it is grammatically correct? Not necessarily. A summary is only successful if it helps the QA model get the right answer.
The researchers introduce PAQAF: Preference Alignment from Question Answering Feedback. This strategy aligns the Rewriter’s output with the QA model’s needs.

As illustrated in Figure 3, the training process is split into two stages.
Stage 1: Supervised Fine-Tuning (SFT)
Initially, the model needs to learn the basic mechanics of rewriting. The researchers use ChatGPT to generate “gold standard” examples of reasoning and summarization. They then train a smaller, open-source model (like Llama-2) to mimic these examples using a standard Supervised Fine-Tuning loss:

\[ \mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{|y|} \log p_{\theta}\left(y_i \mid y_{<i}, q, G'\right) \]

where \(y\) is the ChatGPT-generated target rewrite and \(\theta\) denotes the Rewriter’s parameters.
This teaches the model how to rewrite, but it doesn’t teach it what the best rewrite looks like for a difficult question.
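Concretely, a single SFT instance might look like the following, with the prompt pairing the question with the linearized subgraph and the target being the ChatGPT-generated rewrite to imitate (field names and wording are illustrative, not the paper's template):

```python
# One SFT instance: prompt = question + linearized subgraph,
# target = the ChatGPT-generated interleaved rewrite.
sft_example = {
    "prompt": (
        "Question: Where did the founder of Google go to college?\n"
        "Triples: (Google, founder, Larry Page), "
        "(Google, founder, Sergey Brin), ..."
    ),
    "target": (
        "Reason 1: I need to know who the founders of Google are.\n"
        "Knowledge 1: The founders of Google are Sergey Brin and Larry Page.\n"
        "Reason 2: I need to know where Sergey Brin and Larry Page "
        "went to college.\n"
        "Knowledge 2: ..."
    ),
}
```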
Stage 2: Alignment via Feedback
This is where PAQAF shines. The goal is to bridge the “preference gap” between the Rewriter and the QA model.
- Sampling: The system generates multiple different rewrites (candidate representations) for the same question.
- Testing: Each rewrite is fed into the QA model.
- Feedback:
  - If a rewrite leads to the correct answer, it is marked as “Preferred” (\(k^+\)).
  - If a rewrite leads to an incorrect answer, it is marked as “Dispreferred” (\(k^-\)).
- Optimization: The Rewriter is then updated using Direct Preference Optimization (DPO).
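A minimal sketch of this sampling-and-labeling loop follows; the `rewriter.sample` and `qa_model.answer` interfaces are assumptions for illustration:

```python
def collect_preference_pairs(question, subgraph, gold_answers,
                             rewriter, qa_model, n_samples=4):
    """Sample candidate rewrites and label them by QA success."""
    preferred, dispreferred = [], []
    for _ in range(n_samples):
        rewrite = rewriter.sample(question, subgraph)        # temperature > 0
        response = qa_model.answer(question, context=rewrite)
        if any(g.lower() in response.lower() for g in gold_answers):
            preferred.append(rewrite)                        # k+
        else:
            dispreferred.append(rewrite)                     # k-
    # Every (k+, k-) combination becomes one DPO training pair.
    return [(kp, kn) for kp in preferred for kn in dispreferred]
```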
The DPO loss function effectively increases the probability of generating the helpful rewrite (\(k^{+}\)) and decreases the probability of the unhelpful one (\(k^{-}\)):

\[ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, k^{+}, k^{-})}\left[\log \sigma\left(\beta \log r(x, k^{+}) - \beta \log r(x, k^{-})\right)\right] \]
where \(r(x, k)\) represents the ratio of probabilities between the new policy and the frozen reference model:

\[ r(x, k) = \frac{\pi_{\theta}(k \mid x)}{\pi_{\text{ref}}(k \mid x)} \]
By optimizing for the QA model’s success rather than just linguistic quality, CoTKR ensures that the generated text is practically useful.
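For concreteness, the loss above can be computed from summed sequence log-probabilities. Below is a hedged PyTorch sketch of standard DPO, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss from summed sequence log-probabilities.

    policy_logp_* : log pi_theta(k | x) under the policy being trained
    ref_logp_*    : log pi_ref(k | x) under the frozen reference model
    """
    # log r(x, k) = log pi_theta(k | x) - log pi_ref(k | x)
    log_r_pos = policy_logp_pos - ref_logp_pos
    log_r_neg = policy_logp_neg - ref_logp_neg
    # -log sigma(beta * (log r(x, k+) - log r(x, k-)))
    return -F.logsigmoid(beta * (log_r_pos - log_r_neg)).mean()

# Toy usage with scalar log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.7]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```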
Experiments and Results
To validate their method, the researchers tested CoTKR against the baseline methods (Triple, KG-to-Text, and Summary) on two major benchmarks: GrailQA and GraphQuestions.
They evaluated performance using three metrics:
- Accuracy (Acc): Does the response contain at least one correct answer entity?
- Recall: What proportion of the correct answer entities are found?
- Exact Match (EM): Does the response contain all correct answer entities?
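These definitions translate directly into code. Below is a simple string-containment reading of the three metrics (the paper's exact matching procedure may differ):

```python
def evaluate_response(response: str, gold_answers: list[str]) -> dict:
    """Score a QA response against the gold answer entities."""
    found = [g for g in gold_answers if g.lower() in response.lower()]
    return {
        "accuracy": len(found) > 0,                      # at least one gold entity
        "recall": len(found) / len(gold_answers),        # fraction of gold entities
        "exact_match": len(found) == len(gold_answers),  # all gold entities present
    }

# evaluate_response("Square kilometer and square meter.",
#                   ["Square kilometer", "Square meter"])
# -> {"accuracy": True, "recall": 1.0, "exact_match": True}
```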
Overall Performance
The results were compelling. Table 1 below highlights the performance across different setups.

Key Takeaways from the Data:
- CoTKR Dominance: Across almost every combination of Rewriter (Llama-2, Llama-3, ChatGPT) and QA Model (ChatGPT, Mistral), CoTKR+PA (Preference Alignment) achieves the highest scores.
- The “Triple” Myth: A common belief in RAG research is that LLMs process raw triples better than text. The data partially supports this: Triple often outperforms naive text conversion (KG-to-Text). But it consistently loses to CoTKR, which suggests that natural language is the better format for LLMs, provided it is structured and reasoned correctly.
- Impact of Alignment: The jump from CoTKR to CoTKR+PA is significant, validating that training the rewriter based on QA feedback is crucial.
Robustness Across Retrieval Methods
A good rewriting method should work regardless of how the data was found. The researchers tested CoTKR with three different retrieval strategies:
- 2-Hop: Standard retrieval of neighbors within 2 hops in the graph.
- BM25: Keyword-based retrieval.
- GS (Ground Truth Subgraph): The ideal scenario where the retrieval is perfect.
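As an illustration of the first strategy, here is a minimal breadth-first sketch of 2-hop subgraph extraction over a list of triples (illustrative only, not the paper's retriever):

```python
from collections import deque

def two_hop_subgraph(topic_entities, triples, max_hops=2):
    """Collect triples reachable within `max_hops` of the topic entities."""
    frontier = deque((e, 0) for e in topic_entities)
    seen = set(topic_entities)
    subgraph = []
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for s, r, o in triples:
            if s == entity:
                subgraph.append((s, r, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append((o, depth + 1))
    return subgraph
```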

Figure 4 shows the results. Regardless of whether the retrieval quality is low (2-Hop) or perfect (GS), CoTKR consistently pushes the accuracy ceiling higher than the baselines.
Where Does CoTKR Win?
The researchers performed a comparative analysis to see exactly where CoTKR improves over the “Triple” method.

Figure 5 shows pie charts representing the shift in correctness. The green slices represent questions where the Triple method failed but CoTKR succeeded (“Incorrect -> Correct”). These slices are substantial, indicating that CoTKR can salvage difficult questions that raw data dumps cannot solve.
A Concrete Example
To truly appreciate the difference, let’s look at a specific example from the study (Table 8).
Question: “What is the unit of area that the measurement system that has an electric field strength units of volt per metre have?”
This is a convoluted question that requires hopping through multiple facts.

- Triple: Dumps raw data about units. The QA model answers only “Square meter,” missing “Square kilometer.”
- Summary: Gets confused by the volume of data and fails to connect the “volt per meter” requirement to the “area” requirement, resulting in a vague answer.
- CoTKR:
  - Reasoning: “I need to find the measurement system with… Volt per meter.”
  - Knowledge: Finds the International System of Units.
  - Reasoning: “I need to identify the unit of area within this system.”
  - Knowledge: Finds Square kilometer and Square meter.
Because CoTKR broke the problem down, the QA model correctly identifies both “Square kilometer” and “Square meter.”
Conclusion and Implications
The CoTKR paper presents a significant step forward in Retrieval-Augmented Generation. It addresses a fundamental disconnect in Knowledge Graph QA: the mismatch between structured data storage and unstructured LLM reasoning.
By implementing a Chain-of-Thought process within the rewriting stage, the system acts less like a simple translator and more like an investigator, gathering evidence step-by-step. Furthermore, the PAQAF training strategy introduces a feedback loop that ensures this investigator is actually helpful to the final decision-maker.
Key Takeaways:
- Interleaved Reasoning is King: Alternating between “thinking” and “retrieving” reduces noise and improves focus.
- Feedback Loops Matter: Training a module based on the success of the downstream task (QA) is more effective than training it in isolation.
- Natural Language > Raw Data: Contrary to some beliefs, LLMs prefer well-structured natural language over raw graph triples, provided the language is generated via robust reasoning.
As we move toward more complex AI systems, methods like CoTKR demonstrate that the quality of the “context” we feed our models is just as important as the models themselves. Future work may see these techniques applied beyond Knowledge Graphs, optimizing how LLMs consume tables, long documents, and code repositories.