Large Language Models (LLMs) like GPT-4 are incredibly impressive. They can write poetry, debug code, and summarize history. But if you have ever used one for research or critical decision-making, you likely know their Achilles’ heel: hallucination. They can sound completely confident while being completely wrong.
This problem becomes even more acute when we move away from “mainstream” knowledge (like “Who is the President of the US?”) to “long-tail” knowledge (obscure facts, recent events, or specific database entries).
Researchers have tried to fix this by connecting LLMs to Knowledge Graphs (KGs)—structured databases of facts. This field is called Knowledge Graph Question Answering (KGQA). While this works great for simple trivia, it often breaks down when a question requires commonsense reasoning combined with facts.
In this post, we will take a deep dive into a fascinating paper titled “Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering”. The researchers introduce a framework called \(R^3\) (Right for Right Reasons). It is a methodology designed to force LLMs to explain their logic first and then prove their answers using hard data, making the reasoning process verifiable and significantly reducing hallucinations.
The Problem: When Facts Aren’t Enough
To understand why \(R^3\) is necessary, we need to distinguish between two types of questions: Factoid and Commonsense.
- Factoid Question: “In which city was Silvio Berlusconi’s first wife born?”
  - This is a lookup task. If your database (Knowledge Graph) has the links `(Silvio Berlusconi) -> (spouse) -> (Person A)` and `(Person A) -> (birthplace) -> (City B)`, you can answer it.
- Commonsense Question: “Do I need separate visas to see the Venus of Willendorf and attend the Olympics this summer?”
  - A simple database lookup fails here. The Knowledge Graph (KG) doesn’t have a triple that says `(Venus of Willendorf, visa_relationship, Olympics)`.
  - To answer this, you need facts (Where is the Venus of Willendorf? Where are the Olympics?) AND commonsense (If two locations are in the Schengen Area, do I need two visas?).
Current LLM-based approaches struggle here. If you ask an LLM directly, it might hallucinate that the Venus of Willendorf is in a different country, or it might apply the wrong visa rules. It acts as a “black box”—you get an answer, but you can’t verify if the logic holds up.
The Solution: The \(R^3\) Framework
The core idea behind Right for Right Reasons (\(R^3\)) is simple but powerful: Don’t let the LLM guess. Instead, force it to:
- State the general “rule” (axiom) it believes applies to the situation.
- Go into the database (KG) and find specific evidence to support that rule.
- If evidence is missing, figure out exactly what is missing and go look for it.
This turns the QA process into a verifiable tree search. Let’s break down the architecture.
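To make that control flow concrete, here is a minimal Python sketch of the tree search under stated assumptions: the two callables, `propose_axioms` and `ground_axiom`, are hypothetical stand-ins for an LLM call that proposes candidate rules and a routine that tries to ground a single rule against the KG. Only the branch-and-verify control flow reflects the description above.

```python
from typing import Callable, Iterable, Optional, Tuple

UNKNOWN = "i don't know"

def r3_tree_search(
    question: str,
    propose_axioms: Callable[[str], Iterable[str]],  # hypothetical: LLM proposes candidate rules
    ground_axiom: Callable[[str], str],              # hypothetical: returns "yes" / "no" / UNKNOWN
) -> Tuple[str, Optional[str]]:
    """Branch over candidate commonsense axioms and accept the first one that
    the Knowledge Graph evidence fully grounds."""
    for axiom in propose_axioms(question):
        verdict = ground_axiom(axiom)
        if verdict != UNKNOWN:
            return verdict, axiom  # a verifiable answer plus the rule that justified it
    return UNKNOWN, None           # no axiom could be grounded: admit ignorance
```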
The Workflow Overview
The diagram below illustrates the \(R^3\) process. It starts with a question, extracts entities, and then enters a reasoning loop.

As shown in Figure 1, the system doesn’t just jump to an answer. It splits the problem into branches. In the example provided (“Do I need separate visas…?”), the system explores different reasoning paths (axioms). One path might check if a specific visa covers both events. Another path checks if the host countries have open borders (Schengen Area). The system only “succeeds” when it finds hard facts in the Knowledge Graph that satisfy one of these logical paths.
Let’s walk through the specific steps that make this happen.
Step 1: Getting the Right Entities
Before we can reason, we need to know what we are talking about. Standard “Entity Linkers” (tools that find keywords in text and match them to database IDs) often fail with obscure or complex queries.
\(R^3\) uses a hybrid approach. It combines a traditional entity linker with the LLM itself. The LLM is asked to identify relevant entities that the linker might have missed.
\[
\mathcal{E}^q = \text{EL}(q) \cup \text{LLM}_E(q)
\]
This equation simply means the set of entities (\(\mathcal{E}^q\)) is the combination (\(\cup\)) of what the standard tool finds (\(\text{EL}\)) and what the LLM identifies (\(\text{LLM}_E\)). Once the entities are found, the system pulls the “1-hop neighborhood” (all immediate connections) from the Knowledge Graph.
\[
\mathcal{K}^q = \bigcup_{e \in \mathcal{E}^q} N_1(e), \quad \text{where } N_1(e) \text{ is the set of triples directly connected to entity } e
\]
This gives us a “raw pile of facts” (\(\mathcal{K}^q\)) related to the question.
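A minimal sketch of this step in Python is shown below. The three callables (`entity_linker`, `llm_entities`, `one_hop`) are hypothetical stand-ins for the off-the-shelf linker, the \(\text{LLM}_E\) prompt, and a KG client; only the union-and-expand logic reflects the description above.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def collect_evidence(
    question: str,
    entity_linker: Callable[[str], Iterable[str]],  # standard EL tool
    llm_entities: Callable[[str], Iterable[str]],   # LLM_E: entities the LLM spots
    one_hop: Callable[[str], Iterable[Triple]],     # triples incident to an entity
) -> Tuple[Set[str], Set[Triple]]:
    # E^q = EL(q) ∪ LLM_E(q): union of linker hits and LLM-identified entities
    entities = set(entity_linker(question)) | set(llm_entities(question))

    # K^q: union of 1-hop neighborhoods around the question entities
    facts: Set[Triple] = set()
    for entity in entities:
        facts.update(one_hop(entity))
    return entities, facts
```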
Step 2: Surfacing Commonsense Axioms
This is the most innovative part of the paper. Most systems let the LLM implicitly reason inside its hidden layers. \(R^3\) forces the LLM to output a Commonsense Axiom.
An axiom is a logical template that says: “If conditions X, Y, and Z are true, then the answer is A.”
For example, consider a question asking whether it would make sense for the Italian politician Virginia Raggi to have a quinceañera:
- The Axiom: “If Virginia Raggi is a girl from Latin America AND her age is near 15, THEN a quinceañera makes sense.”
The system formalizes this into a structure that looks like First-Order Logic:
\[
P_1 \wedge P_2 \wedge \dots \wedge P_n \wedge F_1 \wedge \dots \wedge F_m \;\Rightarrow\; \text{Answer}
\]
Here, \(P_i\) represents predicates (properties like “is a girl”) and \(F_i\) represents functions (like comparisons of age). By making the LLM write this down explicitly, we can check its work. If the LLM has a wrong assumption (e.g., thinking quinceañera is for 30-year-olds), we can see that error immediately in the axiom.
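As a rough illustration (not the paper’s exact schema), an axiom can be held as a small structured object: a list of conditions standing in for the \(P_i\) and \(F_i\), plus the conclusion they imply. The quinceañera example would then look like this:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Axiom:
    """A commonsense axiom: if every condition holds, the conclusion follows."""
    conditions: List[str]  # the P_i / F_i, rendered as checkable statements
    conclusion: str

quinceanera_axiom = Axiom(
    conditions=[
        "Virginia Raggi is a girl from Latin America",  # P_1 (a predicate)
        "Virginia Raggi's age is close to 15",          # F_1 (an age comparison)
    ],
    conclusion="A quinceañera for Virginia Raggi makes sense",
)
```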
Step 3: Sub-graph Pruning
The raw sub-graph we pulled in Step 1 might contain hundreds of irrelevant facts (e.g., the politician’s shoe size). Feeding all this to an LLM might confuse it or exceed its context window.
\(R^3\) uses a “Sub-graph Pruning” (SGP) module. It looks for facts that are semantically similar to the Axiom generated in Step 2.
\[
\mathcal{K}^q_{\text{pruned}} = \operatorname*{top\text{-}k}_{t \in \mathcal{K}^q} \, sim(t, \text{axiom}) \;\cup\; \text{LLM}_T(\mathcal{K}^q)
\]
It keeps the top-k facts based on vector similarity (\(sim\)) and also asks the LLM to pick out any other relevant facts (\(\text{LLM}_T\)). This ensures we have a clean, focused set of evidence.
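Here is a hedged sketch of the pruning idea, using cosine similarity over sentence embeddings. The `embed` callable is a placeholder for whatever embedding model is used, and the \(\text{LLM}_T\) selection step is omitted for brevity.

```python
import numpy as np
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def prune_subgraph(
    facts: List[Triple],
    axiom_text: str,
    embed: Callable[[str], np.ndarray],  # placeholder for a sentence-embedding model
    k: int = 10,
) -> List[Triple]:
    """Keep the k facts whose text is most similar to the axiom."""
    axiom_vec = embed(axiom_text)

    def score(triple: Triple) -> float:
        vec = embed(" ".join(triple))
        return float(np.dot(axiom_vec, vec) /
                     (np.linalg.norm(axiom_vec) * np.linalg.norm(vec)))

    return sorted(facts, key=score, reverse=True)[:k]
```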
Step 4: Fact-Grounded Answer Selection
Now the system tries to “ground” the axiom. It checks the Knowledge Graph facts against the conditions in the axiom.
- Axiom condition: “Is Virginia Raggi from Latin America?”
- KG Fact: `(Virginia Raggi, place_of_birth, Rome, Italy)`
The system compares the two. Since Rome is not in Latin America, the premise fails. The answer is derived strictly from this comparison.

The answer can be “True”, “False”, or importantly, “I don’t know.” If the facts aren’t there, the system admits ignorance rather than hallucinating.
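Conceptually, the grounding step walks over the axiom’s conditions and asks, for each one, whether the retrieved facts prove it, refute it, or say nothing. Below is a minimal sketch under that reading; `check_condition` is a hypothetical stand-in for the LLM judgment that compares a single condition against the pruned facts.

```python
from enum import Enum
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

class Verdict(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "i don't know"

def ground_axiom(
    conditions: Iterable[str],
    facts: Set[Triple],
    check_condition: Callable[[str, Set[Triple]], Verdict],  # e.g. an LLM judgment
) -> Verdict:
    missing_evidence = False
    for condition in conditions:
        verdict = check_condition(condition, facts)
        if verdict is Verdict.FALSE:
            return Verdict.FALSE     # a refuted premise settles the axiom
        if verdict is Verdict.UNKNOWN:
            missing_evidence = True  # candidate for Missing Evidence Identification
    return Verdict.UNKNOWN if missing_evidence else Verdict.TRUE
```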
Step 5: Iterative Multi-hop Reasoning
If the answer is “I don’t know,” it usually means we are missing a link in the chain. Perhaps we need to know which continent “Italy” is in.
\(R^3\) employs a “Missing Evidence Identification” (MEI) module. It looks at the unsatisfied condition and figures out what new entity to search for. It then expands the graph and repeats the process.

This creates a verifiable loop. The system hops from entity to entity, guided by the commonsense axiom, until it either proves/disproves the claim or runs out of steps.
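Putting Steps 4 and 5 together, the loop looks roughly like the sketch below. It reuses the `Verdict` and `ground_axiom` names from the previous sketch; `find_missing_entity` and `one_hop` are hypothetical stand-ins for the MEI module and the KG lookup, and the hop budget is an assumed parameter.

```python
def iterative_grounding(axiom, facts, check_condition,
                        find_missing_entity, one_hop, max_hops=2):
    """Expand the subgraph around missing evidence until the axiom is grounded
    or the hop budget is exhausted."""
    for _ in range(max_hops + 1):
        verdict = ground_axiom(axiom.conditions, facts, check_condition)
        if verdict is not Verdict.UNKNOWN:
            return verdict                             # proven or disproven by KG facts
        new_entity = find_missing_entity(axiom, facts) # MEI: what should we look up next?
        if new_entity is None:
            break
        facts = facts | set(one_hop(new_entity))       # hop: pull the new entity's neighborhood
    return Verdict.UNKNOWN                             # still unsupported: say "I don't know"
```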
Experimental Results
The researchers tested \(R^3\) against several strong baselines, including KAPING (a retrieval-augmented generation model), KGR (which retrofits claims to KGs), and standard Chain-of-Thought (CoT) prompting with GPT-3.5.
They used three distinct tasks to evaluate performance:
- Question Answering: Yes/No questions.
- Claim Verification: Checking if a statement is true.
- Preference Matching: A personalized recommendation task.
The “Long-Tail” Stress Test
A key contribution of this paper is how they stress-tested the models. They took standard datasets (like StrategyQA and CREAK) and modified them. They swapped out famous entities for obscure, “long-tail” entities.
For example, instead of asking about Abraham Lincoln, they might ask about a minor historical figure.

This modification (Table 3) is crucial because LLMs often memorize facts about famous people (Lincoln) during pre-training, masking their inability to actually reason using the Knowledge Graph. By using obscure entities, the researchers forced the models to rely on the provided data.
Accuracy and Hallucination
The results were compelling. Table 2 below shows the comparison across Question Answering and Claim Verification.

Key Takeaways from the Data:
- FActScore (Factual Precision): Look at the FActScore columns. \(R^3\) achieves nearly perfect scores (0.97 - 0.98), significantly higher than the baselines (which hover around 0.60 - 0.70). This means \(R^3\) almost never hallucinates facts.
- Long-Tail Robustness: When moving from “Original” (famous) to “Long-Tail” (obscure) queries, standard CoT (Chain of Thought) performance drops off a cliff. \(R^3\), however, maintains its high accuracy. This proves that \(R^3\) is actually reading the Knowledge Graph, while the other models were largely relying on memorization.
- Reasoning Score: Human evaluators found that \(R^3\)’s reasoning steps were logical and faithful to the data much more often than the competitors.
Personalized Preference Matching
The team also tested \(R^3\) on a “Preference Matching” task—imagine a user asking for a recipe that fits their taste but doesn’t violate their specific medical allergies stored in a personal Knowledge Graph.

As shown in Table 4, \(R^3\) achieved 57% accuracy compared to KAPING’s 44%. Even more telling is the “Accuracy of Reasons”—\(R^3\) provided the correct justification 70% of the time, while KAPING’s justifications were correct only 31.8% of the time. This highlights the value of the explicit axiom generation.
Why Every Component Matters
The researchers didn’t just stop at the final results; they analyzed why it worked. They performed ablation studies (removing parts of the system to see what breaks).
1. Entity Extraction: They found that using only a standard entity linker or only an LLM was insufficient. The combination (Union) used in \(R^3\) ensured they didn’t miss the starting point of the search.

2. The Power of Iteration: Does the tree search really help? The table below shows accuracy as the search depth increases.

At Depth 0 (no multi-hop search), accuracy is poor. As the model is allowed to “hop” (Depth 1 and 2), accuracy jumps significantly. This confirms that the iterative “Missing Evidence Identification” is vital for complex questions.
3. Pruning: Finally, they showed that smart pruning (semantic similarity) is better than just truncating the text. If you blindly cut off facts to fit the context window, you lose the answer.

Conclusion: Bridging the Gap
The “Right for Right Reasons” (\(R^3\)) framework represents a significant step forward in making AI reliable. By marrying the structured, factual nature of Knowledge Graphs with the flexible, commonsense reasoning of LLMs, we get the best of both worlds.
The verifiable nature of \(R^3\) is its strongest asset. In fields like healthcare, law, or finance, we cannot afford black-box guesses. We need systems that say: “I believe the answer is Yes, because Rule X applies, and I found Fact Y in the database to prove it.”
While \(R^3\) is more computationally intensive than a simple prompt (due to the iterative search and multiple LLM calls), it offers a blueprint for the future of robust AI: systems that don’t just know the answer, but understand why it is the answer.