Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized how we interact with information. They are incredibly capable, but they have well-documented flaws: they hallucinate (make things up) and their knowledge is static—cut off at their training date.

To fix this, the AI community has largely turned to Knowledge Graphs (KGs). By grounding an LLM’s responses in a structured database of facts (triples like (Apple, Headquartered_In, Cupertino)), we can theoretically get the best of both worlds: the reasoning of an LLM and the factual accuracy of a database. This field is known as Knowledge Graph Question Answering (KGQA).

But there is a hidden assumption in most KGQA research: we assume the Knowledge Graph is perfect.

In academic benchmarks, KGs are often treated as complete repositories of truth. If you ask a question, the answer is assumed to be somewhere in the graph. But in the real world, Knowledge Graphs are messy, sparse, and incomplete. What happens when the LLM looks for an answer in the graph, but the specific link it needs just isn’t there?

This brings us to a new research paper titled “Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering.” The researchers behind this work propose a novel method called Generate-on-Graph (GoG). It bridges the gap between structured data and LLM intuition, allowing systems to answer complex questions even when the database is missing critical facts.

The Problem with “Complete” Thinking

To understand why this paper is significant, we first need to look at how current systems fail.

Imagine you ask an AI: “What is the time zone of the area where Apple’s headquarters is located?”

In a perfect world (Scenario (b) in the figure below), the system looks at the Knowledge Graph. It traces a path:

  1. Apple Inc → Headquartered In → Cupertino
  2. Cupertino → Time Zone → Pacific Standard Time

The answer is found purely by traversing existing links.

Figure 1: Comparison between three Question Answering tasks. (a) LLM only QA: The model admits ignorance. (b) Knowledge Graph QA (KGQA): The model finds the full path in a complete graph. (c) Incomplete Knowledge Graph QA (IKGQA): The model combines graph data with internal knowledge to bridge the gap.

However, look at Scenario (c) in Figure 1. This represents Incomplete Knowledge Graph QA (IKGQA). Here, the link between “Cupertino” and “Pacific Standard Time” is missing from the database.

A standard KGQA system would hit a dead end at Cupertino and return “I don’t know.” Conversely, an LLM acting alone (Scenario (a)) has no grounding at all: it might admit ignorance or hallucinate the location of the headquarters entirely.

The researchers argue that to solve this, we need to stop treating the LLM and the KG as separate entities where one purely queries the other. Instead, we should treat the LLM as both an Agent (who directs the search) and a Knowledge Graph (who supplies missing facts).

The Landscape of LLM + KG

Before diving into the solution, it is helpful to categorize how we currently combine these technologies. As shown in Figure 2, there are generally two existing paradigms, and then the new proposed method.

Figure 2: Three paradigms for combining LLMs with KGs. (a) Semantic Parsing: converting text to query code. (b) Retrieval Augmented: Using the KG as context. (c) Generate-on-Graph: The proposed method where the LLM acts as both agent and knowledge source.

  1. Semantic Parsing (SP): The LLM acts as a translator. It converts a natural language question into a logical query (like SPARQL or SQL), which is then executed against the database.
  • The Flaw: If the database is missing a link, the query returns nothing, and the LLM has no chance to intervene (see the sketch just after this list).
  2. Retrieval Augmented (RA): The system retrieves relevant paths from the KG and feeds them into the LLM as context.
  • The Flaw: These methods rely heavily on finding a “reasoning path” in the graph. If the path is broken due to missing data, the retrieval step fails to provide the necessary context, and the LLM is left guessing.
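
To make the Semantic Parsing failure mode concrete, here is a minimal sketch using rdflib (the entity and relation names are illustrative, not the real Freebase schema). The query is perfectly well-formed, yet it returns nothing once a single hop is missing:

```python
# A valid SPARQL query over an incomplete graph: the parse is perfect,
# but the answer is unreachable.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Apple_Inc, EX.headquartered_in, EX.Cupertino))
# The (Cupertino, timezone, PST) triple is deliberately absent,
# simulating an incomplete KG.

query = """
PREFIX ex: <http://example.org/>
SELECT ?tz WHERE {
    ex:Apple_Inc ex:headquartered_in ?city .
    ?city ex:timezone ?tz .
}
"""

print(list(g.query(query)))  # [] -- the query silently returns nothing,
                             # and the LLM never gets a chance to step in.
```

This is exactly the dead end GoG is designed to escape.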

The Solution: Generate-on-Graph (GoG) (shown in panel c) takes a different approach. It acknowledges that while the KG is accurate, it is sparse. The LLM, while prone to hallucination, is dense with general knowledge. GoG creates a feedback loop where the LLM can generate new nodes and edges for the graph on the fly when it hits a dead end.

The Generate-on-Graph (GoG) Method

The core of the GoG method is a framework called Thinking-Searching-Generating. It transforms the LLM into an agent that iteratively explores the graph, but with a superpower: if it can’t find a fact, it attempts to “remember” it.

Let’s break down the architecture step-by-step using the example: “What is the time zone of the area where Apple headquarters is located?”

Figure 3: Comparison of three methods in solving IKGQA. While Semantic Parsing (a) and Path Retrieval (b) fail due to missing data or wrong paths, GoG (c) successfully bridges the gap by generating the missing ‘timezone’ triple.

1. Thinking (Planning)

The process begins with the LLM acting as a planner. It analyzes the question and decomposes it.

  • Thought 1: “First, I need to find out where Apple’s headquarters are.”
  • Decision: The agent decides it needs to search the Knowledge Graph for “Apple Inc.”

2. Searching (The Agent Role)

The LLM uses tools to execute SPARQL queries on the actual Knowledge Graph (a code sketch follows the list below).

  • Action: Search[Apple Inc]
  • Result: The KG returns a subgraph of neighbors. It finds a triple: (Apple Inc, headquarter, Cupertino).
  • Filtering: The LLM filters out irrelevant noise (like “Apple Inc, CEO, Tim Cook”) because the current thought is about location.
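
Here is a minimal sketch of what the Search action might look like against a SPARQL-queryable graph. The `search` helper and the LLM filtering step are illustrative stand-ins, not the paper’s implementation:

```python
# Hypothetical Search action: fetch the one-hop neighborhood of an entity,
# then let the LLM keep only triples relevant to the current thought.
# `kg` is assumed to be an rdflib.Graph.
def search(entity_uri: str, kg) -> list[tuple[str, str, str]]:
    """Return all outgoing triples of `entity_uri`."""
    query = f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o . }}"
    return [(entity_uri, str(p), str(o)) for p, o in kg.query(query)]

# Conceptually:
#   search(apple_uri, kg)
#     -> [(apple, "headquarter", "Cupertino"),
#         (apple, "CEO", "Tim Cook"), ...]
#   llm_filter(thought="where is Apple headquartered?", triples=...)
#     -> [(apple, "headquarter", "Cupertino")]   # noise discarded
```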

3. Generating (The KG Role)

This is the pivotal contribution of the paper.

  • Thought 2: “Okay, the headquarters are in Cupertino. Now I need the time zone of Cupertino.”
  • Action: Search[Cupertino]
  • Observation: The KG returns links like (Cupertino, located_in, California) and (Cupertino, adjoin, Palo Alto). Crucially, the triple for Time Zone is missing.

In a standard system, the process dies here. But in GoG, the system recognizes the gap. It triggers the Generate action, sketched in code after the list below.

  • Action: Generate[Cupertino, timezone]
  • Process: The LLM looks at the retrieved context (Cupertino is in California, near Palo Alto) and uses its internal pre-trained weights to generate new factual triples.
  • Result: The LLM generates: (Cupertino, timezone, Pacific Standard Time).
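
In code, the Generate action can be pictured as a single, context-grounded prompt. The wording below and the `call_llm` helper are assumptions for illustration; the paper uses its own few-shot prompts:

```python
# Sketch of the Generate action: ground the LLM in retrieved context,
# then ask it to emit the missing triple.
GENERATE_PROMPT = """\
Based on the known triples below, generate new factual triples of the
form (head, relation, tail) that satisfy the request. Only output
triples you are confident about.

Known triples:
{context}

Request: the {relation} of {head}
Generated triples:"""

def generate(head, relation, context_triples, call_llm):
    context = "\n".join(f"({h}, {r}, {t})" for h, r, t in context_triples)
    prompt = GENERATE_PROMPT.format(context=context, head=head, relation=relation)
    return call_llm(prompt)
    # e.g. "(Cupertino, timezone, Pacific Standard Time)"
```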

4. Verifying and Finishing

The system doesn’t just blindly trust the generation. It performs a verification step (using the LLM) to check if the generated triple makes sense given the context. Once verified, this new triple is treated as a temporary part of the graph.

  • Final Thought: “The time zone is Pacific Standard Time.”
  • Action: Finish[Pacific Standard Time].
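
Putting the four steps together, the whole Thinking-Searching-Generating cycle reduces to a simple control loop. Everything below (function names, the action vocabulary) is a schematic reading of the paper, not the authors’ code:

```python
# Schematic of the full GoG control loop. The helpers (llm_think, search,
# generate, verify) stand in for the paper's prompt-driven components.
def answer(question, kg, llm_think, search, generate, verify, max_steps=6):
    memory = []  # observations and verified generated triples so far
    for _ in range(max_steps):
        thought, action, arg = llm_think(question, memory)  # 1. Thinking
        if action == "Search":                              # 2. Agent role
            memory.extend(search(arg, kg))
        elif action == "Generate":                          # 3. KG role
            candidate = generate(arg, memory)  # arg = (head, relation)
            if verify(candidate, memory):                   # 4. Verifying
                memory.append(candidate)  # temporary addition to the graph
        elif action == "Finish":
            return arg  # e.g. "Pacific Standard Time"
    return None  # step budget exhausted without an answer
```

Note how the graph stays authoritative: generated triples enter memory only after verification, and only for the duration of the question.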

Experimental Setup: Breaking the Graphs

To prove this method works, the researchers couldn’t just use standard datasets—they had to break them first.

They took two popular benchmarks, WebQuestionsSP (WebQSP) and ComplexWebQuestions (CWQ), and created “Incomplete” versions (IKGQA datasets). They did this by identifying the crucial triples—the specific facts needed to answer a question—and randomly deleting them at different rates (20%, 40%, 60%, and 80%).

This creates a brutal test environment. If a method relies 100% on the graph (like Semantic Parsing), its performance should theoretically drop to near zero as the incompleteness rises.
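
As a rough illustration of this setup, here is one way such deletion could be implemented; the paper’s exact sampling procedure may differ:

```python
# Sketch of the dataset construction: drop a fraction of the "crucial"
# triples (those on the gold reasoning path for each question).
import random

def make_incomplete(crucial_triples, rate, seed=0):
    """Keep all but `rate` of the crucial triples (e.g. 0.4 for IKG-40%)."""
    rng = random.Random(seed)
    n_keep = int(len(crucial_triples) * (1 - rate))
    return rng.sample(crucial_triples, n_keep)
```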

The Results

The performance of GoG was compared against several baselines, including:

  • Prompting methods: standard input/output (IO) prompting and Chain-of-Thought (CoT).
  • Semantic Parsing: ChatKBQA.
  • Retrieval Augmented: Think-on-Graph (ToG), StructGPT.

The results, shown in Table 1, highlight the robustness of the GoG approach.

Table 1: The Hits@1 scores of different models over two datasets. CKG denotes Complete Knowledge Graph, and IKG-40% denotes a graph where 40% of crucial triples are deleted.

Key Takeaways from the Data:

  1. Dominance on Incomplete Graphs: Look at the IKG columns. On the ComplexWebQuestions (CWQ) dataset, GoG achieves a score of 44.3%, significantly outperforming the previous state-of-the-art, ToG (37.9%). On WebQSP, GoG also comes out ahead (66.6% vs 63.4%).
  2. Semantic Parsing Collapses: As predicted, methods like ChatKBQA fall apart on incomplete graphs (dropping from 76.5% to 39.3% on CWQ). They simply cannot function when the query returns NULL.
  3. Better even on Complete Graphs: Surprisingly, GoG often performs better on the complete graph (CKG) as well. This is likely because its “Thinking” step allows for better planning and subgraph exploration than the beam-search methods used by competitors like ToG.

Impact of Missing Data

The researchers pushed the models to the breaking point by removing up to 80% of the crucial information.

The chart below (Figure 4) visualizes how GoG performs based on the number of “related triples” it retrieves during the generation phase.

Figure 4: The Hits@1 scores of GoG with different numbers of related triples in the Generate Action. Performance generally peaks when the model has access to a moderate amount of context (5-10 triples).

This chart reveals an interesting dynamic: context matters. GoG performs best when it has found some relevant information (like neighboring cities or state information) before it attempts to generate the missing link. With no retrieved context, performance is lower; with too much (20+ triples), performance also dips, likely due to noise and distraction.

Why GoG Wins: Handling Complex Data

One of the subtle strengths of GoG is its ability to handle Compound Value Types (CVTs).

In Knowledge Graphs like Freebase, not everything is a simple Subject -> Predicate -> Object link. Sometimes you have an intermediate node representing an event. For example, “Brad Paisley’s Education” might be a node that links to “Institution: Belmont” and “Degree: Bachelor.”

Figure 5: An example of compound value types (CVTs). These complex structures confuse many path-retrieval algorithms.

Standard retrieval methods often get “stuck” at the CVT node because it’s not the final answer. GoG’s iterative “Thinking” step allows it to recognize these nodes and decide to search one hop further, effectively navigating complex graph structures that baffle other agents.
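
To see why these structures trip up one-hop retrieval, here is the Brad Paisley example written out as plain triples (the names are invented for illustration, not Freebase’s real schema):

```python
# Illustrative CVT structure: the answers sit two hops away, behind an
# intermediate event node with no meaning of its own.
cvt_triples = [
    ("Brad Paisley", "education", "CVT_node"),  # hop 1: lands on a CVT
    ("CVT_node", "institution", "Belmont"),     # hop 2: the real answers
    ("CVT_node", "degree", "Bachelor"),
]
# A retriever that expects the answer one hop away stalls at "CVT_node".
# GoG's Thinking step recognizes the intermediate node and issues a
# second Search on it, reaching "Belmont".
```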

Limitations and Errors

No system is perfect. While GoG mitigates hallucination by grounding most steps in the graph, it re-introduces the risk during the Generate phase.

Figure 6: The error proportions of GoG. ‘Hallucination’ remains a significant source of error, alongside ‘False Negatives’ (often due to exact string matching failures).

As seen in Figure 6, Hallucination (green) accounts for about 30-34% of errors. This happens when the LLM confidently generates a missing triple that is incorrect. For example, if the graph is missing the team a coach manages, the LLM might guess the most famous team associated with that coach, even if the question asks about a specific time period.

However, note the blue sections (“Decompose Error”). GoG is quite good at planning; logic errors are less frequent than hallucinations or simple matching errors.

Conclusion: The Future of Hybrid AI

The paper “Generate-on-Graph” represents a mature step forward in Neuro-Symbolic AI (the combination of neural networks and symbolic logic).

By accepting that Knowledge Graphs will always be incomplete, the researchers moved the goalposts. We no longer need a perfect database to perform structured reasoning. Instead, we can use the Knowledge Graph as a reliable map, and the LLM as an intelligent scout that can fill in the blank spots on that map.

For students and researchers, the key takeaways are:

  1. Real-world robustness: Evaluation should test methods on incomplete, messy data, not just on sanitized academic benchmarks.
  2. Dual Roles: LLMs shouldn’t just be “interfaces” for databases. They should be active participants in constructing the data needed to answer a query.
  3. Iterative Reasoning: The “Thinking-Searching-Generating” loop is a powerful paradigm for handling complex, multi-hop questions.

As LLMs continue to evolve, methods like GoG will likely become the standard for interacting with enterprise data, where perfect data curation is an impossible dream.