Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized how we interact with information. They are incredibly capable, but they have well-documented flaws: they hallucinate (make things up) and their knowledge is static—cut off at their training date.
To fix this, the AI community has largely turned to Knowledge Graphs (KGs). By grounding an LLM’s responses in a structured database of facts (triples like (Apple, Headquartered_In, Cupertino)), we can theoretically get the best of both worlds: the reasoning of an LLM and the factual accuracy of a database. This field is known as Knowledge Graph Question Answering (KGQA).
But there is a hidden assumption in most KGQA research: we assume the Knowledge Graph is perfect.
In academic benchmarks, KGs are often treated as complete repositories of truth. If you ask a question, the answer is assumed to be somewhere in the graph. But in the real world, Knowledge Graphs are messy, sparse, and incomplete. What happens when the LLM looks for an answer in the graph, but the specific link it needs just isn’t there?
This brings us to a new research paper titled “Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering.” The researchers behind this work propose a novel method called Generate-on-Graph (GoG). It bridges the gap between structured data and LLM intuition, allowing systems to answer complex questions even when the database is missing critical facts.
The Problem with “Complete” Thinking
To understand why this paper is significant, we first need to look at how current systems fail.
Imagine you ask an AI: “What is the time zone of the area where Apple’s headquarters is located?”
In a perfect world (scenario (b) in Figure 1 below), the system looks at the Knowledge Graph. It traces a path:
- Apple Inc → Headquartered In → Cupertino
- Cupertino → Time Zone → Pacific Standard Time
The answer is found purely by traversing existing links.
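To make this concrete, here is a minimal sketch of pure graph traversal over a toy in-memory triple set (the entity and relation names are illustrative, not Freebase’s actual schema):

```python
# Toy triple store: a set of (subject, relation, object) facts.
KG = {
    ("Apple Inc", "headquartered_in", "Cupertino"),
    ("Cupertino", "time_zone", "Pacific Standard Time"),
    ("Apple Inc", "ceo", "Tim Cook"),
}

def lookup(subject: str, relation: str) -> list[str]:
    """Return every object linked to `subject` via `relation`."""
    return [o for (s, r, o) in KG if s == subject and r == relation]

# Two-hop traversal: Apple Inc -> headquarters -> time zone.
hq = lookup("Apple Inc", "headquartered_in")   # ['Cupertino']
tz = lookup(hq[0], "time_zone")                # ['Pacific Standard Time']
print(tz)
```

If the second triple is deleted, `lookup` simply returns an empty list, which is exactly the failure mode discussed next.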

However, look at scenario (c) in Figure 1. This represents Incomplete Knowledge Graph QA (IKGQA). Here, the link between “Cupertino” and “Pacific Standard Time” is missing from the database.
A standard KGQA system would hit a dead end at Cupertino and return “I don’t know.” An LLM acting alone (scenario (a)), meanwhile, might hallucinate the location of the headquarters entirely.
The researchers argue that to solve this, we need to stop treating the LLM and the KG as separate entities where one purely queries the other. Instead, we should treat the LLM as both an Agent (which directs the search) and a Knowledge Graph (which supplies missing facts).
The Landscape of LLM + KG
Before diving into the solution, it is helpful to categorize how we currently combine these technologies. As shown in Figure 2, there are two existing paradigms, plus the newly proposed method.

- Semantic Parsing (SP): The LLM acts as a translator. It converts a natural language question into a logical query (like SPARQL or SQL), which is executed against the database.
  - The Flaw: If the database is missing a link, the query returns nothing, and the LLM has no chance to intervene (see the sketch after this list).
- Retrieval Augmented (RA): The system retrieves relevant paths from the KG and feeds them into the LLM as context.
  - The Flaw: These methods rely heavily on finding a “reasoning path” in the graph. If the path is broken due to missing data, the retrieval step fails to provide the necessary context, and the LLM is left guessing.
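As a hedged illustration of the Semantic Parsing failure mode, the sketch below assumes an LLM has already translated our running question into a SPARQL query; the `ex:` predicates and the endpoint URL are placeholders, not the paper’s actual setup:

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

# Illustrative query produced by the LLM-as-translator.
query = """
PREFIX ex: <http://example.org/>
SELECT ?tz WHERE {
  ex:Apple_Inc ex:headquartered_in ?city .
  ?city        ex:time_zone        ?tz .
}
"""

endpoint = SPARQLWrapper("http://localhost:3030/kg/sparql")  # hypothetical endpoint
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

bindings = results["results"]["bindings"]
if not bindings:
    # If the (city, time_zone, ...) triple has been deleted, we land here:
    # the query silently returns an empty set and the pipeline gives up.
    print("No answer found")
```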
The Solution: Generate-on-Graph (GoG), shown in panel (c), takes a different approach. It acknowledges that while the KG is accurate, it is sparse; the LLM, while prone to hallucination, is dense with general knowledge. GoG creates a feedback loop where the LLM can generate new nodes and edges for the graph on the fly when it hits a dead end.
The Generate-on-Graph (GoG) Method
The core of the GoG method is a framework called Thinking-Searching-Generating. It transforms the LLM into an agent that iteratively explores the graph, but with a superpower: if it can’t find a fact, it attempts to “remember” it. (A minimal code sketch of this loop appears after the walkthrough below.)
Let’s break down the architecture step-by-step using the example: “What is the time zone of the area where Apple headquarters is located?”

1. Thinking (Planning)
The process begins with the LLM acting as a planner. It analyzes the question and decomposes it.
- Thought 1: “First, I need to find out where Apple’s headquarters are.”
- Decision: The agent decides it needs to search the Knowledge Graph for “Apple Inc.”
2. Searching (The Agent Role)
The LLM uses tools to execute SPARQL queries on the actual Knowledge Graph.
- Action: `Search[Apple Inc]`
- Result: The KG returns a subgraph of neighbors, including the triple (Apple Inc, headquarter, Cupertino).
- Filtering: The LLM filters out irrelevant noise (like (Apple Inc, CEO, Tim Cook)) because the current thought is about location.
3. Generating (The KG Role)
This is the pivotal contribution of the paper.
- Thought 2: “Okay, the headquarters are in Cupertino. Now I need the time zone of Cupertino.”
- Action: `Search[Cupertino]`
- Observation: The KG returns links like (Cupertino, located_in, California) and (Cupertino, adjoin, Palo Alto). Crucially, the triple for the time zone is missing.
In a standard system, the process dies here. But in GoG, the system recognizes the gap. It triggers the Generate action.
- Action: `Generate[Cupertino, timezone]`
- Process: The LLM looks at the retrieved context (Cupertino is in California, near Palo Alto) and uses its internal pre-trained weights to generate new factual triples.
- Result: The LLM generates (Cupertino, timezone, Pacific Standard Time).
4. Verifying and Finishing
The system doesn’t just blindly trust the generation. It performs a verification step (using the LLM) to check if the generated triple makes sense given the context. Once verified, this new triple is treated as a temporary part of the graph.
- Final Thought: “The time zone is Pacific Standard Time.”
- Action: `Finish[Pacific Standard Time]`
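Here is the promised minimal sketch of the Thinking-Searching-Generating loop. The action names mirror the paper’s `Search`/`Generate`/`Finish`, but the prompts, the parsing, and the `llm()` helper are simplified assumptions, not the authors’ implementation:

```python
import re

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (OpenAI, local model, etc.)."""
    raise NotImplementedError("plug in your model here")

# Toy triple store; the time-zone triple is deliberately missing (incomplete KG).
KG = {
    ("Apple Inc", "headquarter", "Cupertino"),
    ("Cupertino", "located_in", "California"),
    ("Cupertino", "adjoin", "Palo Alto"),
}

def search(entity: str) -> list[tuple]:
    """Agent role: retrieve the subgraph of triples touching `entity`."""
    return [t for t in KG if entity in (t[0], t[2])]

def generate(entity: str, relation: str, context: list[tuple]):
    """KG role: ask the LLM to fill in a missing triple from its own weights,
    conditioned on the retrieved context, then verify it with a second pass."""
    obj = llm(f"Known facts: {context}\n"
              f"Complete the triple ({entity}, {relation}, ?). Reply with the object only.")
    verdict = llm(f"Is ({entity}, {relation}, {obj}) plausible given {context}? yes/no")
    return (entity, relation, obj) if verdict.strip().lower() == "yes" else None

def answer(question: str, max_steps: int = 6) -> str:
    trajectory = question
    for _ in range(max_steps):
        # Thinking: the LLM reads the trajectory so far and picks the next action.
        step = llm(trajectory + "\nNext action (Search[e] / Generate[e, r] / Finish[a]):")
        if m := re.fullmatch(r"Search\[(.+)\]", step.strip()):
            obs = search(m.group(1))
        elif m := re.fullmatch(r"Generate\[(.+?),\s*(.+)\]", step.strip()):
            obs = generate(m.group(1), m.group(2), search(m.group(1)))
        elif m := re.fullmatch(r"Finish\[(.+)\]", step.strip()):
            return m.group(1)
        trajectory += f"\n{step}\nObservation: {obs}"
    return "unknown"
```

Note the design choice: `Generate` is only invoked after `Search` comes up short, and it is conditioned on whatever context was retrieved, which keeps the generation as grounded as possible.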
Experimental Setup: Breaking the Graphs
To prove this method works, the researchers couldn’t just use standard datasets—they had to break them first.
They took two popular benchmarks, WebQuestionsSP (WebQSP) and ComplexWebQuestions (CWQ), and created “Incomplete” versions (IKGQA datasets). They did this by identifying the crucial triples (the specific facts needed to answer each question) and randomly deleting them at different rates (20%, 40%, 60%, and 80%).
This creates a brutal test environment. If a method relies 100% on the graph (like Semantic Parsing), its performance should theoretically drop to near zero as the incompleteness rises.
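A hedged sketch of this corruption step, assuming each question comes annotated with its crucial answer-bearing triples (the function name and seeding scheme are illustrative; the paper’s preprocessing may differ):

```python
import random

def corrupt_kg(kg: set, crucial_triples: set, drop_rate: float, seed: int = 0) -> set:
    """Delete a fraction `drop_rate` of the question-crucial triples from the KG."""
    rng = random.Random(seed)
    n_drop = int(len(crucial_triples) * drop_rate)
    dropped = set(rng.sample(sorted(crucial_triples), n_drop))
    return kg - dropped

# Build the four incompleteness levels used in the experiments.
# incomplete_kgs = {rate: corrupt_kg(kg, crucial, rate) for rate in (0.2, 0.4, 0.6, 0.8)}
```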
The Results
The performance of GoG was compared against several baselines, including:
- Prompting methods: Standard IO, Chain-of-Thought (CoT).
- Semantic Parsing: ChatKBQA.
- Retrieval Augmented: Think-on-Graph (ToG), StructGPT.
The results, shown in Table 1, highlight the robustness of the GoG approach.

Key Takeaways from the Data:
- Dominance on Incomplete Graphs: Look at the IKG columns. On ComplexWebQuestions (CWQ), GoG achieves a score of 44.3%, significantly outperforming the previous state of the art, ToG (37.9%). GoG also leads on WebQSP (66.6% vs. 63.4%).
- Semantic Parsing Collapses: As predicted, methods like ChatKBQA fall apart on incomplete graphs (dropping from 76.5% to 39.3% on CWQ). They simply cannot function when the query returns NULL.
- Better Even on Complete Graphs: Surprisingly, GoG often performs better on the complete graph (CKG) as well. This is likely because its “Thinking” step allows for better planning and subgraph exploration than the beam-search approach used by competitors like ToG.
Impact of Missing Data
The researchers pushed the models to the breaking point by removing up to 80% of the crucial information.
The graph below (Figure 4) visualizes how GoG performs based on the number of “related triples” it retrieves during the generation phase.

This chart reveals an interesting dynamic: Context matters. GoG performs best when it has found some relevant information (like neighboring cities or state information) before it attempts to generate the missing link. If it has 0 context, performance is lower. If it has too much context (20+ triples), performance also dips, likely due to noise/distraction.
Why GoG Wins: Handling Complex Data
One of the subtle strengths of GoG is its ability to handle Compound Value Types (CVTs).
In Knowledge Graphs like Freebase, not everything is a simple Subject -> Predicate -> Object link. Sometimes you have an intermediate node representing an event. For example, “Brad Paisley’s Education” might be a node that links to “Institution: Belmont” and “Degree: Bachelor.”

Standard retrieval methods often get “stuck” at the CVT node because it’s not the final answer. GoG’s iterative “Thinking” step allows it to recognize these nodes and decide to search one hop further, effectively navigating complex graph structures that baffle other agents.
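A minimal sketch of that extra hop, using illustrative entity and relation names that loosely echo Freebase’s CVT style:

```python
# The education "event" is an intermediate CVT node, not an answer in itself.
KG = {
    ("Brad Paisley", "education", "CVT_01"),
    ("CVT_01", "institution", "Belmont"),
    ("CVT_01", "degree", "Bachelor"),
}

def lookup(subject: str, relation: str) -> list[str]:
    return [o for (s, r, o) in KG if s == subject and r == relation]

# One hop lands on the CVT node ...
cvt_nodes = lookup("Brad Paisley", "education")   # ['CVT_01']

# ... so a "thinking" agent recognizes the intermediate node and hops again.
institutions = [inst for cvt in cvt_nodes for inst in lookup(cvt, "institution")]
print(institutions)                               # ['Belmont']
```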
Limitations and Errors
No system is perfect. While GoG mitigates hallucination by grounding most steps in the graph, it re-introduces the risk during the Generate phase.

As seen in Figure 6, Hallucination (green) accounts for about 30-34% of errors. This happens when the LLM confidently generates a missing triple that is incorrect. For example, if the graph is missing the team a coach manages, the LLM might guess the most famous team associated with that coach, even if the question asks about a specific time period.
However, note the blue sections (“Decompose Error”). GoG is quite good at planning; logic errors are less frequent than hallucinations or simple matching errors.
Conclusion: The Future of Hybrid AI
The paper “Generate-on-Graph” represents a mature step forward in Neuro-Symbolic AI (the combination of neural networks and symbolic logic).
By accepting that Knowledge Graphs will always be incomplete, the researchers reframed the problem. We no longer need a perfect database to perform structured reasoning. Instead, we can use the Knowledge Graph as a reliable map, and the LLM as an intelligent scout that can fill in the blank spots on that map.
For students and researchers, the key takeaways are:
- Real-world robustness: Evaluation metrics should prioritize incomplete, messy data over sanitized academic benchmarks.
- Dual Roles: LLMs shouldn’t just be “interfaces” for databases. They should be active participants in constructing the data needed to answer a query.
- Iterative Reasoning: The “Thinking-Searching-Generating” loop is a powerful paradigm for handling complex, multi-hop questions.
As LLMs continue to evolve, methods like GoG will likely become the standard for interacting with enterprise data, where perfect data curation is an impossible dream.