Introduction

Imagine you read the sentence: “Mrs. Thompson gives her children some pasta.”

As a human, your brain instantly performs a feat of abstraction. You understand that “pasta” is a type of “food.” Because you know she is giving “food,” you can infer a consequence: “The children are well-fed.”

However, if you only viewed “pasta” as a physical object (an “entity”), the inference changes. If Mrs. Thompson gives an “entity,” the only safe inference is that the children “received an entity.” The nuance of feeding and being full is lost.

This ability to switch between concept levels—seeing “pasta” as food versus object versus carbohydrate—is called abstract inference. While humans do this effortlessly, Pre-trained Language Models (PLMs) struggle with it. They often get stuck on specific words or fail to grasp the hierarchical nature of concepts.

In this post, we’ll dive into a research paper that tackles this problem head-on. The researchers introduce HiCon-EG (Hierarchical Conceptual Entailment Graph), a framework designed to help AI models understand the “Concept Pyramid.” We will explore how they use complex sentence simplification, entropy-based concept selection, and graph learning to teach machines common sense.

Figure 1: After conceptualizing events, models can infer more information. For example, when pasta is conceptualized as food, we can infer that PersonY Be.full.

The Problem: Missing the Forest for the Trees

Current Natural Language Inference (NLI) methods often rely on Entailment Graphs. These are networks where nodes are verbs (predicates) and edges represent logical relationships (“sprint” entails “run”: if you sprint, you necessarily run).

While useful, traditional entailment graphs have two major blind spots:

  1. Polysemy: Words have multiple meanings. “Apple” can be a fruit or a tech company. Treating “Apple” as a single node confuses the model.
  2. Lack of Hierarchy: Existing models often map arguments to a limited set of fixed types (like just “Person” or “Location”). They miss the rich hierarchy of language. They don’t know that an “iPhone” is a “phone,” which is an “electronic device,” which is an “artifact.”

Without understanding these layers, models cannot make accurate abstract inferences. They miss the connection between “eating pasta” and “being full” because they don’t treat “pasta” specifically as “food.”

The Solution: HiCon-EG

The researchers propose the HiCon-EG framework. The core idea is to build a “conceptual pyramid” that organizes arguments hierarchically. This allows the model to learn entailment relations at different levels of granularity.

The process is broken down into four distinct steps, visualized below:

Figure 2: The summary of constructing the hierarchical concept entailment graph.

Let’s break down these steps to understand the engineering behind the framework.

Step 1: Complex-to-Simple Open Information Extraction (C2S-OIE)

Real-world text, especially from news sources, is messy and complex. A sentence might say: “Bob, who was the last student to leave the lab, forgot his keys.”

Standard extraction tools struggle with nested structures like this. They might extract “the student” as the one who forgot the keys, losing the link back to “Bob.”

To fix this, the authors introduce a Complex-to-Simple (C2S) approach. They use Large Language Models (LLMs) to decompose complex sentences into simple, atomic statements before processing them. The sentence above, for example, becomes two atomic facts: “Bob was the last student to leave the lab” and “Bob forgot his keys.”

The Hallucination Hurdle

Using LLMs directly for this task is risky because they can “hallucinate,” or make up words that weren’t in the original text.

Figure 3: An example of the results of LLM simplifying complex sentences showing hallucinations.

To solve this, the researchers used a distillation technique. They prompted an LLM (LLaMA2) to simplify sentences, filtered the results for high quality, and then trained a smaller, more stable BERT model to perform the simplification reliably.
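As one concrete illustration, a faithfulness filter for this pipeline could reject any simplification that introduces content words absent from the source. The heuristic below is our own sketch, not the paper's exact filtering rule:

```python
import re

# Small illustrative stopword list; a real filter would use a proper one.
STOPWORDS = {"a", "an", "the", "is", "was", "who", "to", "his", "her"}

def content_words(sentence: str) -> set:
    """Lowercase alphabetic tokens, minus the stopword list."""
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS

def is_faithful(complex_sentence: str, simple_sentence: str) -> bool:
    """Reject simplifications containing words absent from the source."""
    return content_words(simple_sentence) <= content_words(complex_sentence)

source = "Bob, who was the last student to leave the lab, forgot his keys."
print(is_faithful(source, "Bob forgot his keys."))            # True: keep
print(is_faithful(source, "Bob lost his wallet yesterday."))  # False: discard
```

Outputs that survive such a filter can then serve as silver-standard training data for the smaller model.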

The prompt used to teach the model looks like this:

Table 6: The C2S process prompt.

Once the sentences are simplified, the model identifies the arguments (subjects and objects) more accurately using Semantic Role Labeling (SRL).

Figure 5: An example of the sentence simplification dataset where the model marks arguments related to the verb.

Step 2: Hierarchical Concept Building

Now that the model has extracted the core arguments (e.g., “toast”, “house”), it needs to understand what they are.

The framework uses external knowledge bases—WordNet and Wikidata—to map each word to a hierarchy of concepts.

  • Toast \(\rightarrow\) Bread \(\rightarrow\) Food \(\rightarrow\) Entity
  • Toast \(\rightarrow\) Activity (as in “a toast at a wedding”)

This creates a Concept Pyramid. Instead of just seeing the word “toast,” the model now sees a list of potential categories it belongs to.
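As a rough illustration of this step, here is how hypernym chains could be pulled from WordNet with NLTK; a full pipeline would also need sense disambiguation and the Wikidata lookup, which this sketch skips:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def concept_chains(word):
    """One hypernym chain per noun sense, ordered word-first up to the root."""
    chains = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        path = synset.hypernym_paths()[0]  # root -> ... -> this synset
        chains.append([s.lemmas()[0].name() for s in reversed(path)])
    return chains

for chain in concept_chains("toast"):
    print(" -> ".join(chain))
# The bread sense climbs roughly: toast -> bread -> ... -> food -> ... -> entity
```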

Step 3: Entropy-Based Concept Selection

This is perhaps the most innovative part of the paper. We now have too many concepts. If we map “toast” to “Entity,” it’s too vague. If we map it to “whole wheat toast,” it’s too specific to be useful for general reasoning. We need the “Goldilocks” concept—just right.

The researchers use an Entropy-based approach to select the best concept.

The Intuition

If a verb like “eat” is frequently paired with “apple,” “banana,” and “bread,” the model looks for the common denominator.

  • “Entity” is common to all, but it’s also common to “car” and “planet” (which you don’t eat).
  • “Food” is common to all the eaten items and excludes cars.

The model calculates Entropy (\(H\)) to measure how “pure” or consistent a concept is for a specific verb role. A lower entropy means the concept is a strong predictor for that verb.

Equation 1: Entropy Calculation
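As a sketch of what this looks like (the paper's exact conditioning may differ), Shannon entropy over the verbs \(v\) observed with a concept \(c\) in a given argument role would be:

\[ H(c) = -\sum_{v} P(v \mid c) \log P(v \mid c) \]

If “Food” appears almost exclusively as the object of “eat,” this distribution is sharply peaked and \(H\) is low; “Entity,” which co-occurs with nearly every verb, has a near-uniform distribution and high entropy.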

The Depth Penalty

However, minimizing entropy alone isn’t enough. The most specific concept (e.g., “Granny Smith Apple”) has very low entropy, but it doesn’t generalize well. To prevent the model from picking concepts that are too specific or too broad, they introduce a Hierarchical Depth score (\(ds\)).

Equation 2: Hierarchical Depth Score
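One simple realization of such a score (an assumption for illustration, not necessarily the paper's exact formula) normalizes a concept's depth in the taxonomy:

\[ ds(c) = \frac{depth(c)}{depth_{\max}} \]

so concepts near the root score close to 0 and maximally specific leaves score close to 1.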

This term acts as a regularizer. It penalizes concepts that are too high up the tree (too abstract) or too deep (too specific), depending on the weighting.

The final objective function combines these two elements using a weighting parameter \(\lambda\):

Equation 3: Final Objective Function combining Entropy and Depth
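In a minimal form consistent with the description above, again as a sketch, the selected concept minimizes the weighted sum:

\[ c^{*} = \arg\min_{c} \left[ H(c) + \lambda \, ds(c) \right] \]

Entropy pulls the choice toward specific, consistent concepts, while the depth term pushes back against over-specialization.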

This formula allows the model to balance consistency (Entropy) with granularity (Depth), ensuring the selected concept is semantically accurate for the context.
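To make the selection concrete, here is a toy Python sketch that scores candidate concepts for “toast” using the hedged formulas above; the verb counts, depths, and \(\lambda\) are invented for illustration:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a verb-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_concept(candidates, lam=0.2, max_depth=10):
    """Pick the concept minimizing H(c) + lam * depth(c) / max_depth."""
    def score(candidate):
        _, verb_counts, depth = candidate
        return entropy(verb_counts) + lam * depth / max_depth
    return min(candidates, key=score)[0]

candidates = [
    # (concept, verbs seen with it as object, invented taxonomy depth)
    ("entity",            Counter({"eat": 5, "drive": 4, "orbit": 3}), 1),
    ("food",              Counter({"eat": 9, "serve": 3}),             4),
    ("whole_wheat_toast", Counter({"eat": 2, "burn": 1}),              8),
]
print(select_concept(candidates))  # "food": consistent and moderately specific
```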

Step 4: Learning Entailment Graphs

Finally, with clean, simplified sentences and accurately selected concepts, the model builds the entailment graph.

It links events based on Distributional Inclusion. If the set of contexts in which “PersonX gives PersonY Food” appears is a subset of the contexts in which “PersonY receives Food” appears, the model infers an entailment relationship:

PersonX give Food \(\models\) PersonY receive Food

Because the arguments are now “typed” correctly (e.g., we know we are talking about Food, not just Entity), the relationships in the graph are much more precise.
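A minimal sketch of that inclusion test is below; the context features and threshold are invented, and real entailment-graph learners typically use weighted distributional features rather than raw sets:

```python
def entails(premise_contexts, hypothesis_contexts, threshold=0.95):
    """Distributional inclusion: the premise entails the hypothesis if
    (nearly) all contexts of the premise also occur with the hypothesis."""
    if not premise_contexts:
        return False
    overlap = len(premise_contexts & hypothesis_contexts)
    return overlap / len(premise_contexts) >= threshold

give_food    = {"dinner_table", "school_lunch", "charity_drive"}
receive_food = {"dinner_table", "school_lunch", "charity_drive", "food_bank"}
print(entails(give_food, receive_food))  # True: give Food |= receive Food
```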

Experiments and Results

Does HiCon-EG actually improve the model’s ability to understand abstract concepts? The researchers tested this on several benchmarks.

Dataset Construction: ABSPYRAMID

They utilized the ABSPYRAMID benchmark, a dataset designed to test abstraction ability. They filtered and re-divided it to create a specific test set called ABS-HC (Hierarchical Concept) to rigorously test their method.

Table 2: Statistical results of the ABSPYRAMID dataset.

Abstraction Detection Performance

The primary task was Abstraction Detection: determining whether a premise entails a hypothesis at a higher level of abstraction (e.g., Premise: People are described using behaviors \(\rightarrow\) Hypothesis: People are described using traits).

Figure 6: Examples of the ABSPYRAMID dataset premise and hypothesis.

The results were compelling. HiCon-EG, when applied to standard models like RoBERTa and DeBERTa, consistently improved performance.

On the Levy/Holt dataset (a standard entailment benchmark), HiCon-EG also delivered strong results, outperforming baseline models in Accuracy and AUC (Area Under the Curve).

Table 9: The results of HiCon-EG on the Levy/Holt dataset.

Conceptualized Commonsense Reasoning

One of the most exciting applications of this work is in Commonsense Reasoning. Can the model understand social dynamics and cause-and-effect?

Using the AbstractATOMIC dataset, the researchers compared HiCon-EG against state-of-the-art models, including massive LLMs like LLaMA-2 (13B).

Table 7: Performance on AbstractATOMIC. HiCon-EG outperforms mainstream LLMs.

As shown in the table above, HiCon-EG (applied to DeBERTa-v3-large) achieved the highest Accuracy (84.53%) and AUC (80.15%), beating even the much larger 13-billion-parameter LLaMA model. This suggests that structured knowledge (via the entailment graph) can be more powerful than raw scale.

Human Evaluation of Concept Selection

To verify that the “Entropy + Depth” formula was picking the right concepts, human annotators checked the output. They categorized the selected concepts as Coarse, Medium, or Fine-grained.

Figure 4: Human evaluation results of hierarchical concept selection.

The chart above shows that by adjusting the \(\lambda\) parameter (the balance between entropy and depth), the model’s behavior changes predictably. At \(\lambda=0.2\), the model achieves a “sweet spot,” selecting moderate granularity concepts (the grey line) most of the time—similar to how GPT-4 performs, but with a specialized, more efficient mechanism.

Conclusion and Implications

The HiCon-EG paper highlights a crucial limitation in current AI: bigger isn’t always smarter. While Large Language Models have vast amounts of data, they often lack the structured understanding of how concepts relate to one another hierarchically.

By explicitly modeling these relationships—breaking sentences down, building concept pyramids, and mathematically selecting the right level of abstraction—HiCon-EG allows models to:

  1. Generalize better: Understanding that “pasta” is “food” allows inferences that apply to all food, not just pasta.
  2. Reason more accurately: It avoids false equivalencies caused by polysemous words.
  3. Perform efficiently: It achieves state-of-the-art results using smaller base models (like DeBERTa) compared to massive LLMs.

This work serves as a reminder that inference is the bridge between raw data and true conceptual understanding. By teaching models to climb the “Concept Pyramid,” we bring them one step closer to human-like common sense.